Week 12 — Text (11/7–11)#


The second midterm is this week.

This week we are going to talk about unstructured data, particularly text.

🧐 Content Overview#

Elements:

  • 🎥 Unstructured Data

  • 🎥 Unicode and Encodings

  • 🎥 Text Processing Pipeline

  • 🎥 Vectors and Similarity

  • 🎥 Classifying Text

This week has 1h2m of video and no assigned readings. This week’s videos are available in a Panopto folder.

📅 Deadlines#

  • Midterm B, November 12

  • No quiz due to midterm.

🚩 Midterm B#

The second midterm will be released in the evening on Wednesday, November 9, 2022, and is due at midnight on Saturday, November 12, 2022. It follows the same structure and rules as Midterm A and covers material through Week 11.

🎥 Unstructured Data#

This video introduces the week and describes the key ideas of extracting features from unstructured data.

🎥 Unicode and Encodings#

In this video, I describe Unicode and text encodings.
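As a small illustration of the distinction between text and its encoded bytes, here is a minimal sketch using only the Python standard library. The sample string and variable names are my own choices for this page, not taken from the video.

```python
import unicodedata

# A str is a sequence of Unicode code points; escapes keep the example unambiguous.
text = "na\u00efve caf\u00e9"  # "naïve café"

# Encoding turns code points into bytes; different encodings give different bytes.
utf8_bytes = text.encode("utf-8")      # 2 bytes each for the two accented letters
latin1_bytes = text.encode("latin-1")  # 1 byte per character

print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong encoding does not fail here, but it garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# The same visible character can have more than one code point sequence.
# Normalization makes such strings comparable.
composed = unicodedata.normalize("NFC", "e\u0301")   # 'e' + combining acute accent
decomposed = unicodedata.normalize("NFD", "\u00e9")  # precomposed 'é'
print(len(composed), len(decomposed))
```

The practical upshot is to decode bytes with the encoding they were written in, and to normalize text before comparing or counting it.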


🎥 The Text Processing Pipeline#

This video discusses the basic steps of text processing, beginning with tokenization. The result is a document/term matrix, possibly normalized.
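The steps can be sketched in plain Python as follows. This is only an illustration under my own assumptions: the two sample documents, the simple regular-expression tokenizer, and the choice of L2 normalization are hypothetical, and a real pipeline would typically use a library vectorizer instead.

```python
import re
from collections import Counter

# Hypothetical toy corpus for illustration.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Count term occurrences in each document.
counts = [Counter(tokenize(d)) for d in docs]

# The vocabulary is the sorted set of all terms seen in any document.
vocab = sorted(set().union(*counts))

# Document/term matrix: one row per document, one column per vocabulary term.
dtm = [[c[term] for term in vocab] for c in counts]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of all zeros are left alone)."""
    norm = sum(x * x for x in row) ** 0.5
    return [x / norm for x in row] if norm else row

normalized = [l2_normalize(row) for row in dtm]

print(vocab)
for row in dtm:
    print(row)
```

Normalizing each row keeps long documents from dominating later comparisons simply because they contain more words.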


🎥 Vectors and Similarity#

This video describes the concept of a vector representation, and how to compute the similarity between two documents.
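One common way to compare two document vectors is cosine similarity; I am assuming that measure here for illustration, and the sample documents are made up. A minimal sketch over sparse term-count vectors:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents, tokenized by whitespace for simplicity.
doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the cat lay on the rug".split())
doc_c = Counter("stock prices fell sharply".split())

print(cosine_similarity(doc_a, doc_b))  # shares several terms with doc_a
print(cosine_similarity(doc_a, doc_c))  # shares no terms with doc_a
```

Because the measure divides by the vector lengths, it depends on the mix of terms in each document rather than on how long the documents are.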

🎥 Classifying Text#

This video introduces classifying text, and the use of a naïve Bayes classifier based on term frequencies.
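To make the idea concrete, here is a rough sketch of a multinomial naïve Bayes classifier over term counts. The training messages, the `spam`/`ham` labels, the whitespace tokenizer, and the add-one (Laplace) smoothing are all assumptions of mine for illustration; the video and the course notebooks may set things up differently.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-class document counts, per-class term counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def predict(text, model):
    """Return the class with the highest log posterior score."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        # Log prior: fraction of training documents in this class.
        score = math.log(class_docs[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore terms never seen in training
            # Add-one smoothing keeps unseen class/term pairs from zeroing the score.
            score += math.log((term_counts[label][tok] + 1) / (total_terms + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(training)

print(predict("free money", model))
print(predict("meeting tomorrow", model))
```

Working with sums of logarithms rather than products of probabilities avoids numeric underflow when documents contain many terms.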


📓 Spam Filter Example#

The Spam Filter Example demonstrates tokenization and classification with text.

📩 Assignment 6#

Assignment 6 is available and is due November 20.