Week 12 — Text

This week's videos are also available in a Panopto playlist.

Unstructured Data

This video introduces the week and describes the key ideas of extracting features from unstructured data.

Video

Slides

In this video, I describe Unicode and text encodings.

Video

Slides
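
To make the encoding ideas concrete, here is a minimal sketch using only the Python standard library. The sample string is my own illustration, not taken from the video. It shows that one string produces different byte sequences under different encodings, that decoding with the wrong codec garbles the text, and that Unicode normalization can reconcile two spellings of the same character.

```python
import unicodedata

# Illustrative sample text (an assumption, not from the course materials).
text = "naïve café"

# The same characters become different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")      # ï and é take two bytes each
latin1_bytes = text.encode("latin-1")  # every character takes one byte

# Decoding bytes with the wrong codec does not fail here; it silently garbles.
garbled = utf8_bytes.decode("latin-1")
roundtrip = utf8_bytes.decode("utf-8")

# "ï" can be one code point, or "i" followed by a combining diaeresis.
decomposed = "nai\u0308ve"
composed = unicodedata.normalize("NFC", decomposed)

print(len(text), len(utf8_bytes), len(latin1_bytes))
print(garbled)
print(decomposed == "naïve", composed == "naïve")
```

The practical lesson is to always know (or state) the encoding when reading text files, and to normalize before comparing strings.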

This video discusses the basic steps of text processing, beginning with tokenization. The result is a document-term matrix, whose rows may optionally be normalized.

Video

Slides
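
The text-processing steps can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions: the two documents are made up, the tokenizer is a basic lowercase-and-regex rule, and the normalization shown is unit L2 length (one option among several). In practice a library class such as scikit-learn's `CountVectorizer` does this work.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Made-up example documents (an assumption for illustration).
docs = ["The cat sat on the mat.", "The dog chased the cat!"]

tokenized = [tokenize(d) for d in docs]

# The vocabulary fixes the column order of the matrix.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per term, each cell a term count.
counts = [Counter(toks) for toks in tokenized]
dtm = [[c[term] for term in vocab] for c in counts]

def l2_normalize(row):
    """Scale a row to unit Euclidean length; all-zero rows are left unchanged."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

dtm_normalized = [l2_normalize(row) for row in dtm]

print(vocab)
for row in dtm:
    print(row)
```

Normalizing the rows keeps long documents from dominating later comparisons simply because they contain more words.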

The Week 12 quiz is on Blackboard.

This video describes how documents can be represented as vectors and how to compute the similarity between two documents.

Video

Slides
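
As a small illustration of comparing documents as vectors, the sketch below computes cosine similarity, which I am assuming as the similarity measure since it is a common choice for term-count vectors. The vectors are hypothetical term counts over a shared seven-term vocabulary.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over the same vocabulary.
doc_a = [1, 0, 0, 1, 1, 1, 2]
doc_b = [1, 1, 1, 0, 0, 0, 2]
doc_c = [0, 1, 1, 0, 0, 0, 0]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)

print(round(sim_ab, 3), round(sim_ac, 3))
```

Because cosine similarity depends only on direction, a document and a copy of it repeated twice score 1.0, and documents with no terms in common score 0.0.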

This video introduces text classification and the use of a naïve Bayes classifier based on term frequencies.

Video

Slides
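
To show how a term-frequency naïve Bayes classifier works, here is a minimal from-scratch sketch. It assumes a multinomial model with add-one (Laplace) smoothing, and the training messages, labels, and function names are made up for illustration. It is not the code from the course notebook; a library implementation such as scikit-learn's `MultinomialNB` is the usual choice for real work.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(docs, labels, alpha=1.0):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab, alpha

def predict(model, doc):
    """Return the class with the highest log prior plus summed log term likelihoods."""
    class_counts, term_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for tok in tokenize(doc):
            if tok not in vocab:
                continue  # ignore terms never seen in training
            # Smoothed estimate of P(term | class).
            p = (term_counts[label][tok] + alpha) / (total_terms + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data (an assumption for illustration).
train_docs = ["win cash now", "free prize win", "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train(train_docs, train_labels)
print(predict(model, "win free cash"))
print(predict(model, "noon meeting"))
```

Working in log space avoids numeric underflow from multiplying many small probabilities, and the smoothing term keeps a single unseen word from driving a class probability to zero.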

The Spam Filter Example demonstrates tokenization and classification with text.

Assignment 5 is due November 11, 2020.

The second midterm will be released at 5 PM on Wednesday, November 11.

Assignment 6 is available and is due November 22, 2020.