Week 12 — Text
This week's videos are also available in a Panopto playlist.
Unstructured Data
This video introduces the week and describes the key ideas of extracting features from unstructured data.
Unicode and Encodings
In this video, I describe Unicode and text encodings.
Resources
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Python Unicode HOWTO
- Twitter thread on the politics of emoji decomposition
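As a small illustration of the ideas in this video, the sketch below uses only the Python standard library to show how one string maps to different byte sequences under different encodings, and how the same visible text can be stored in composed or decomposed form. The sample word and variable names are my own choices for illustration and are not taken from the video.

```python
import unicodedata

# "naïve" written with an escape so the composed form (U+00EF) is unambiguous.
text = "na\u00efve"

# Encoding turns a string of code points into bytes; the length depends on the encoding.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
print(len(text), len(utf8_bytes), len(utf16_bytes))

# Decoding bytes with the wrong encoding yields garbled text instead of an error.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# NFC keeps "ï" as one code point; NFD splits it into "i" plus a combining diaeresis.
composed = unicodedata.normalize("NFC", text)
decomposed = unicodedata.normalize("NFD", text)
print(len(composed), len(decomposed))

# The two forms look identical but do not compare equal until normalized the same way.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Normalizing text to a single form before tokenizing avoids treating visually identical strings as different terms.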
The Text Processing Pipeline
This video discusses the basic steps of text processing, beginning with tokenization. The result is a document-term matrix, with one row per document and one column per term, possibly normalized.
Resources
- CountVectorizer
- TfidfVectorizer
- NLTK
- Stanza (formerly StanfordNLP)
Week 12 Quiz
The Week 12 quiz is on Blackboard.
Vectors and Similarity
This video describes the concept of a vector representation and how to compute the similarity between two documents.
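One common similarity measure for document vectors is cosine similarity, the dot product of two vectors divided by the product of their lengths. The sketch below assumes that measure (the video may use a different one) and uses hypothetical term-count vectors over a shared four-term vocabulary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(a @ b / denom)

# Hypothetical term-count vectors over the same vocabulary.
doc_a = [2, 1, 0, 1]
doc_b = [4, 2, 0, 2]   # same term proportions as doc_a, twice as long
doc_c = [0, 0, 3, 0]   # shares no terms with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Because the measure depends only on direction, a document and a longer document with the same term proportions are treated as maximally similar, while documents with no terms in common score zero.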
Classifying Text
This video introduces text classification and the use of a naïve Bayes classifier based on term frequencies.
Resources
- KNeighborsClassifier
- MultinomialNB (Naïve Bayes)
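A minimal sketch of this approach, assuming scikit-learn: a `CountVectorizer` produces term counts and a `MultinomialNB` classifier learns from them. The tiny training set, its labels, and the variable names are invented for illustration and are far too small for a real classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages and labels.
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday with the team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline tokenizes and counts terms, then fits naïve Bayes on the counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Words never seen in training (such as "your") are ignored at prediction time.
new_texts = ["claim your free money", "agenda for the team lunch"]
predictions = model.predict(new_texts)
print(list(predictions))
```

Swapping `MultinomialNB()` for `KNeighborsClassifier()` in the pipeline would give a nearest-neighbor classifier over the same term-count features.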
Spam Filter Example
The Spam Filter Example demonstrates tokenizing and classifying text.
Assignment 5
Assignment 5 is due November 11, 2020.
Midterm 2
The second midterm will be released at 5 PM on Wednesday, November 11.
Assignment 6
Assignment 6 is available and is due November 22, 2020.