Week 12 — Text

This week's videos are also available in a Panopto playlist.

Unstructured Data

This video introduces the week and describes the key ideas behind extracting features from unstructured data.

Unicode and Encodings

In this video, I describe Unicode and text encodings.
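As a rough illustration of why encodings matter (not taken from the video), the sketch below encodes one string two ways and shows what happens when bytes are decoded with the wrong encoding. The sample word and the variable names are chosen here purely for demonstration.

```python
import unicodedata

# "naïve" written with an escape so the source file's own encoding does not matter.
text = "na\u00efve"

# The same 5 characters take a different number of bytes in different encodings.
utf8 = text.encode("utf-8")      # the i-diaeresis becomes two bytes: 0xC3 0xAF
latin1 = text.encode("latin-1")  # the i-diaeresis becomes one byte: 0xEF
print(len(text), len(utf8), len(latin1))

# Decoding UTF-8 bytes as Latin-1 does not fail; it silently produces mojibake.
mojibake = utf8.decode("latin-1")
print(mojibake)

# Unicode can also represent the same visible text with different code points.
# NFD splits the accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", text)
print(len(decomposed), decomposed == text)

# Normalizing both sides to one form (here NFC) makes comparison reliable.
print(unicodedata.normalize("NFC", decomposed) == text)
```

Normalizing text to a single form before tokenizing is one way to keep visually identical words from being counted as different terms.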

Resources

The Text Processing Pipeline

This video discusses the basic steps of text processing, beginning with tokenization. The result is a document/term matrix, possibly normalized.
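The pipeline described above might be sketched as follows, using only the standard library. The regular-expression tokenizer, the function names, the choice of L2 row normalization, and the two sample sentences are all assumptions made for this example; the video may use different tools or a different normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def document_term_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def normalize_rows(matrix):
    """Scale each row to unit Euclidean (L2) length; all-zero rows are left alone."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        result.append([x / norm for x in row] if norm > 0 else list(row))
    return result


# Made-up sample documents.
docs = ["The cat sat on the mat.", "The dog sat."]
vocab, matrix = document_term_matrix(docs)
normalized = normalize_rows(matrix)
print(vocab)
print(matrix)
```

Each row of the matrix is one document and each column is one term, so documents of different lengths end up with vectors of the same dimension.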

Resources

Week 12 Quiz

The Week 12 quiz is on Blackboard.

Vectors and Similarity

This video describes how documents can be represented as vectors and how to compute the similarity between two documents.
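One common similarity measure for document vectors is cosine similarity; the sketch below assumes that measure, which may or may not be the one used in the video. The function name and the sample vectors are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros, since the angle is undefined.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up term-count vectors over a shared four-term vocabulary.
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same term proportions as doc_a, but twice as long
doc_c = [0, 0, 3, 1]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because cosine similarity depends only on direction, a document and a longer document with the same term proportions score as nearly identical, while documents with no terms in common score zero.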

Classifying Text

This video introduces text classification and the use of a naïve Bayes classifier based on term frequencies.
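A minimal sketch of a naïve Bayes text classifier built on term counts is shown below. It assumes the multinomial variant with add-one (Laplace) smoothing, which is a common formulation but not necessarily the exact one in the video. The function names, the `alpha` parameter, and the tiny training set are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log term likelihoods for each class.

    docs is a list of token lists; labels is a parallel list of class names.
    """
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {term for counts in term_counts.values() for term in counts}

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_lik": {
                term: math.log((term_counts[label][term] + alpha) / denom)
                for term in vocab
            },
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log posterior score.

    Terms never seen in training are skipped.
    """
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for term in tokens:
            if term in params["log_lik"]:
                score += params["log_lik"][term]
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up, already-tokenized training messages.
train_docs = [
    ["win", "cash", "now"],
    ["cheap", "cash", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train(train_docs, train_labels)
print(predict(model, ["cash", "offer", "now"]))
print(predict(model, ["lunch", "meeting"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.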

Resources

Spam Filter Example

The Spam Filter Example demonstrates tokenization and classification with text.

Assignment 5

Assignment 5 is due November 11, 2020.

Midterm 2

The second midterm will be released at 5 PM on Wednesday, November 11.

Assignment 6

Assignment 6 is available and is due November 22, 2020.