Week 12 — Text (11/8–12)
The second midterm is on Tuesday, November 9, 2021.
This week we are going to talk about unstructured data, particularly text.
🧐 Content Overview
Midterm B, November 9
Quiz 12, November 11
🚩 Midterm B
The second midterm will be held in class (9 AM) on Tuesday, November 9, 2021. It follows the same structure and rules as Midterm A, and covers material through Week 11.
🎥 Unstructured Data
This video introduces the week and describes the key ideas of extracting features from unstructured data.
🎥 Unicode and Encodings
In this video, I describe Unicode and text encodings.
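As a small illustration of why encodings matter, the sketch below encodes one string two ways, decodes it with the wrong codec, and normalizes two spellings of the same word. The sample string and variable names are made up for this example and do not come from the video.

```python
import unicodedata

# A string is a sequence of Unicode code points (escapes used so the
# example does not depend on how this file itself is saved).
text = "na\u00efve caf\u00e9"

# Encoding turns code points into bytes; the byte count depends on the encoding.
utf8_bytes = text.encode("utf-8")      # non-ASCII letters take 2 bytes each
latin1_bytes = text.encode("latin-1")  # every character takes 1 byte

print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong codec does not fail here; it silently garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# The same visible word can be stored as different code point sequences.
# Normalizing to one form (NFC here) makes string comparison reliable.
decomposed = "nai\u0308ve"             # 'i' followed by a combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)
print(decomposed == "na\u00efve", composed == "na\u00efve")
```

Normalizing text to a single form before tokenizing keeps visually identical words from being counted as different terms.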
🎥 The Text Processing Pipeline
This video discusses the basic steps of text processing, beginning with tokenization. The result is a document/term matrix, possibly normalized.
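The pipeline can be sketched with the standard library alone. This is a minimal illustration, not the method used in the video: the two documents, the regular-expression tokenizer, and the choice of L2 normalization are all assumptions made for the example.

```python
import re
from collections import Counter

# Two toy documents (invented for this example).
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Step 1: tokenize each document and count its terms.
counts = [Counter(tokenize(d)) for d in docs]

# Step 2: build a shared, sorted vocabulary so every document uses the same columns.
vocab = sorted(set().union(*counts))

# Step 3: the document/term matrix has one row per document, one column per term.
dtm = [[c[term] for term in vocab] for c in counts]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of all zeros are left unchanged)."""
    norm = sum(x * x for x in row) ** 0.5
    return [x / norm for x in row] if norm else row

# Step 4 (optional): normalize so long and short documents are comparable.
normalized = [l2_normalize(row) for row in dtm]

print(vocab)
for row in dtm:
    print(row)
```

In practice a library vectorizer usually handles these steps, along with extras such as stop-word removal, but the structure of the result is the same.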
🎥 Vectors and Similarity
This video describes the concept of a vector representation, and how to compute the similarity between two documents.
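One common similarity measure for document vectors is cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below assumes that measure and uses made-up term-count vectors; the video may present other measures as well.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over the same seven-term vocabulary.
doc_a = [1, 0, 0, 1, 1, 1, 2]
doc_b = [1, 1, 1, 0, 0, 0, 2]
doc_c = [0, 0, 0, 0, 0, 0, 3]

print(cosine_similarity(doc_a, doc_b))  # documents sharing some terms
print(cosine_similarity(doc_a, doc_a))  # a document compared with itself
print(cosine_similarity(doc_b, doc_c))  # documents sharing only one term
```

Because the measure depends only on the angle between vectors, a document and a copy of it repeated twice score as identical, which is usually what we want when comparing texts of different lengths.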
🎥 Classifying Text
This video introduces text classification and the use of a naïve Bayes classifier based on term frequencies.
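To make the idea concrete, here is a minimal sketch of a multinomial naïve Bayes classifier built from term counts. It assumes add-one (Laplace) smoothing and a simple regular-expression tokenizer, and the training messages and labels are invented for illustration; none of these details are taken from the video.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(labeled_docs, alpha=1.0):
    """Estimate log priors and smoothed log term likelihoods for each class."""
    class_docs = Counter()              # number of documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = sum(class_docs.values())
    model = {}
    for label in class_docs:
        # alpha smoothing keeps unseen (class, term) pairs from getting probability 0
        denom = sum(term_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / n_docs),
            "log_likelihood": {
                t: math.log((term_counts[label][t] + alpha) / denom) for t in vocab
            },
        }
    return model

def predict(model, text):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in tokenize(text):
            # Terms never seen during training are skipped.
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Tiny invented training set.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for tuesday", "ham"),
    ("lunch on tuesday with the team", "ham"),
]
model = train(training)
print(predict(model, "free prize money"))
print(predict(model, "agenda for the team meeting"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together, which matters once documents contain more than a handful of terms.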
📓 Spam Filter Example
The Spam Filter Example demonstrates tokenization and classification with text.
🚩 Week 12 Quiz
The Week 12 quiz is on Canvas.