Week 12 — Text (11/8–12)


The second midterm is on Tuesday, November 9, 2021.

This week we are going to talk about unstructured data, particularly text.

🧐 Content Overview

Element Length

🎥 Unstructured Data


🎥 Unicode and Encodings


🎥 Text Processing Pipeline


🎥 Vectors and Similarity


🎥 Classifying Text


This week has 1h2m of video and no assigned readings. This week’s videos are available in a Panopto folder and as a podcast.

📅 Deadlines

  • Midterm B, November 9

  • Quiz 12, November 11

🚩 Midterm B

The second midterm will be held in class (9 AM) on Tuesday, November 9, 2021. It follows the same structure and rules as Midterm A and covers material through Week 11.

🎥 Unstructured Data

This video introduces the week and describes the key ideas of extracting features from unstructured data.

🎥 Unicode and Encodings

In this video, I describe Unicode and text encodings.
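
As a small illustration of these ideas (a sketch, not taken from the video), the snippet below encodes one string in two encodings, shows what happens when bytes are decoded with the wrong encoding, and compares composed and decomposed Unicode forms. The sample string and variable names are chosen here for illustration only.

```python
import unicodedata

# "naïve café", written with escapes so the composed code points are explicit.
text = "na\u00efve caf\u00e9"

# Encoding turns a str (a sequence of code points) into bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# The same text occupies a different number of bytes in each encoding.
n_code_points = len(text)
n_utf8 = len(utf8_bytes)
n_utf16 = len(utf16_bytes)

# Decoding UTF-8 bytes as Latin-1 does not raise an error; it silently
# produces garbled text (mojibake).
garbled = utf8_bytes.decode("latin-1")

# Normalization: the decomposed form (NFD) stores each accent as a separate
# combining code point, so it is longer and compares unequal to the composed
# form until both are normalized the same way.
decomposed = unicodedata.normalize("NFD", text)
recomposed = unicodedata.normalize("NFC", decomposed)

print(n_code_points, n_utf8, n_utf16)
print(garbled)
print(decomposed == text, recomposed == text)
```

In practice this is why it helps to decode bytes with a known encoding and normalize text before comparing or tokenizing it.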

🎥 The Text Processing Pipeline

This video discusses the basic steps of text processing, beginning with tokenization. The result is a document-term matrix, optionally normalized.
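
The pipeline can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions (a regular-expression tokenizer, raw counts, and row-sum normalization); the two sample documents and the `tokenize` helper are hypothetical, and the video may use a library vectorizer instead.

```python
import re
from collections import Counter

# Two toy documents (made up for this example).
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Count how often each term occurs in each document.
counts = [Counter(tokenize(d)) for d in docs]

# The sorted vocabulary fixes the column order of the matrix.
vocab = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per term.
dtm = [[c[term] for term in vocab] for c in counts]

# Optional normalization: scale each row to sum to 1 (relative frequencies).
dtm_norm = [[v / sum(row) for v in row] for row in dtm]

print(vocab)
for row in dtm:
    print(row)
```

Other normalizations, such as scaling each row to unit length or applying TF-IDF weights, slot into the same final step.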


🎥 Vectors and Similarity

This video describes the concept of a vector representation, and how to compute the similarity between two documents.
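
One common similarity measure for document vectors is cosine similarity, the dot product of two vectors divided by the product of their lengths. The sketch below assumes that measure; the function name and the two term-count vectors are hypothetical examples, not taken from the video.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Term-count vectors for two documents over the same 7-term vocabulary.
doc_a = [1, 0, 0, 1, 1, 1, 2]
doc_b = [1, 1, 1, 0, 0, 0, 2]

print(round(cosine_similarity(doc_a, doc_b), 3))
```

Because the measure depends only on direction, a long document and a short one with the same term proportions score as identical.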

🎥 Classifying Text

This video introduces classifying text, and the use of a naïve Bayes classifier based on term frequencies.
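
To make the idea concrete, here is a from-scratch sketch of a multinomial naïve Bayes classifier over term counts. It assumes whitespace tokenization and add-one (Laplace) smoothing, and it skips terms never seen in training; the training messages and the `fit` and `predict` helpers are hypothetical, and the course materials may rely on a library implementation instead.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set of (text, label) pairs.
train = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

def fit(examples, alpha=1.0):
    """Collect per-class document counts and term counts."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.split()
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab, alpha

def predict(model, text):
    """Return the class with the highest log posterior score."""
    class_docs, term_counts, vocab, alpha = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(term_counts[label].values())
        # Start from the log prior, then add one log likelihood per token.
        score = math.log(class_docs[label] / n_docs)
        for tok in text.split():
            if tok not in vocab:
                continue  # ignore terms never seen in training
            p = (term_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "win a prize"))
print(predict(model, "monday meeting"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen term from driving a class probability to zero.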

📓 Spam Filter Example

The Spam Filter Example demonstrates tokenization and classification with text.

🚩 Week 12 Quiz

The Week 12 quiz is on Canvas.

📩 Assignment 6

Assignment 6 is available and is due November 21.