Week 12 — Text (11/7–11)#
Attention
The second midterm is this week.
This week we are going to talk about unstructured data, particularly text.
🧐 Content Overview#
Element | Length |
---|---|
🎥 Unstructured Data | 2m57s |
🎥 Unicode and Encodings | 28m45s |
🎥 The Text Processing Pipeline | 17m36s |
🎥 Vectors and Similarity | 6m58s |
🎥 Classifying Text | 6m9s |
This week has 1h2m of video and 0 words of assigned readings. This week’s videos are available in a Panopto folder.
📅 Deadlines#
Midterm B, November 12
No quiz due to midterm.
🚩 Midterm B#
The second midterm will be released in the evening on **Wednesday, November 9**, and is due at **midnight, Saturday, November 12, 2022**. It follows the same structure and rules as Midterm A, and covers material through Week 11.
🎥 Unstructured Data#
This video introduces the week and describes the key ideas of extracting features from unstructured data.
- Hi again. Now we're going to talk about how to start processing unstructured data, particularly text data.
- In this video, we're going to introduce the idea of processing unstructured text and unstructured data modeling, and understand a little bit about the different types of features we can try to extract from unstructured data.
- So far in this class, we've been dealing primarily with tabular data: fully structured data of various forms, where for each instance we have an equivalent set of features. Some feature values might be missing, but we've got all of these features.
- As we discussed earlier in the class, there are additional types of data. Semi-structured data has defined fields, but they can vary significantly in terms of what fields we get for what objects; unstructured data doesn't have defined fields.
- These aren't mutually exclusive. You might have some unstructured data, like text or images, that has structured metadata that comes with it. You can treat the metadata just like we treat other structured data. But this week we're going to be talking about how to start working with the unstructured content itself.
- To get features from data, there are a few things we can do. We can use raw features: if we've got text, we can use words. We might say the words are features, and the value associated with a word is the number of times it appears. If we've got image data, we've got pixels.
- We can also get features that readily convert to a tabular form, like size or length: the number of characters in a document is a feature.
- We can also use additional models. There are sentiment inference models that will try to determine whether a sentence or some text has positive or negative sentiment, and other models that we can apply in order to compute additional features from some of our raw data.
- This week we're going to be processing unstructured text using words as features. The result will be tabular — we'll have a column for each word — and we'll then use these word features to build classifiers.
- There's an example notebook in this week's material that walks you through using the topics that we're learning in the videos to build a classifier for email spam.
- So to wrap up: unstructured data doesn't come with nice data-frame-style features. You don't get rows and columns that are all set up so you can do nice math on the columns. We're going to directly use the content as a large feature space, using words as features: we have as many features as there are different words in the content. We'll also see how to derive a smaller set of features from the data.
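As a small illustration of these kinds of raw features, here is a minimal sketch (the example sentence and feature choices are mine, not from the lecture) that computes a length feature and word-count features from a piece of text:

```python
from collections import Counter

text = "the cat sat on the mat"

# A simple tabular-style feature: document length in characters.
n_chars = len(text)

# Raw word features: each distinct word is a feature whose value is
# the number of times it appears in the document.
word_counts = Counter(text.split())

print(n_chars)       # 22
print(word_counts)   # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```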
🎥 Unicode and Encodings#
In this video, I describe Unicode and text encodings.
Important Python code:

- `str.encode` and `bytes.decode`
- `unicodedata.normalize`
- `str.casefold`
- `locale.strcoll` and `locale.strxfrm`
- In this video I'm going to talk with you about Unicode and encodings: how is text actually represented and stored?
- We talked a while ago about encodings for different kinds of data, and we said we'd come back to text. We've also seen in some of our initial processing that occasionally we need to provide an encoding. What's up with all of that?
- The learning outcomes for this video are to understand how text is stored in memory and in files, to encode and decode text for storage and communication, and to distinguish between characters, code points, and bytes.
- So we have some input text. Say we have the first sentence of the book The Trial by Franz Kafka: "Someone must have slandered Josef K., for one morning, without having done anything truly wrong, he was arrested." This is a sentence of the English language.
- It's a sequence of words. It's a sequence of characters. And we store it as a sequence of bytes encoding those characters. A computer has to store everything in bytes — we've talked a little about how floating-point numbers are stored; here, the text is stored in bytes.
- For a long time in U.S.-based computing, we stored text in what was called ASCII, the American Standard Code for Information Interchange. It assigned a number in the range 0 to 127 (seven bits) to all the English letters, a number of symbols, and control codes.
- You'll find the digits in the range 48 to 57; uppercase letters go from 65 to 90, lowercase from 97 to 122, and so on. It assigned these codes so text could be stored in bytes, using just the lower seven bits of each byte, leaving the upper bit free for other control purposes.
- ASCII was able to store written English text using most of the characters and punctuation in common use. But there are a few problems with this. For one, the whole world doesn't speak English, and we need to be able to encode a lot of different languages.
- For languages that use the Latin or Cyrillic alphabets, this can often be done with extended code pages. Latin-1 extends ASCII: the numbers 0 through 127 are ASCII, and 128 through 255 — within a byte, this is the difference of whether the first (high-order) bit is set or not — are used for characters with Western European accents and some control characters. Latin-2 does the same for Eastern European languages.
- Windows-1252 combines both: it's compatible with Latin-1, but it uses a range of characters that Latin-1 reserves for control codes to store the Eastern European accents.
- KOI8-R uses the upper 128 characters for Russian Cyrillic characters, which lets you store Russian and English in KOI8-R.
- But many languages, such as Chinese, Korean, and Japanese, have many more characters than can fit in a byte, even if the entire byte were allocated to them. And so you use a multi-byte character set, which often uses two bytes per character — such as Big5 for traditional Chinese, and GB 2312, the Chinese national standard for simplified Chinese (it also covers Russian).
- Now, given a sequence of bytes that's text, in order to properly interpret it, you have to know the encoding. So that's one problem: you need to know the encoding in order to decode the text properly.
- Also, it's very hard to mix encodings, and therefore languages. If you've got text that's encoded in GB 2312, you can't write Turkish in it. So if you've got a document that's trying to have parts in Chinese and parts in Turkish, you have to have a way to switch back and forth between code pages within a single document, which gets very complex.
- And so Unicode was developed to try to unify these.
- Unicode is an international standard for text representation in which text is represented as a sequence of what are called code points. You can basically think of a Unicode text string as an array of code points. These code points are then encoded into bytes.
- So here, again, as we saw with other data types, there's this distinction between data and its encodings. We actually have three levels, because you've got characters, code points, and then the bytes that are used to actually store the code points.
- We have universal encodings that can encode all valid Unicode. And, as I said, we've got the other encodings that have historically been used: Latin-1, US-ASCII, KOI8-R. They have been redefined as encodings for subsets of Unicode.
- So if you want to process KOI8-R text in a modern program, what you do is decode the KOI8-R into Unicode, process the Unicode, and then re-encode into whatever encoding you need.
- Now, not all encodings can accept all characters. If you try to encode Russian text into Windows-1252, it will fail, because Windows-1252 doesn't have the Cyrillic alphabet.
- Unicode is intended to cover all non-fictional human languages. Its goals include any characters used to write any human language spoken today, and it includes ancient languages that are no longer spoken — for example, it has Phoenician characters.
- It also includes a lot of things like math symbols, the International Phonetic Alphabet, emoji, and many other symbols.
- You'll sometimes find people using these for various strange effects, like getting italics in a tweet or something. What that's actually doing is grabbing math characters and using those. If you want to see this sometime, use your computer's assistive technology to read such content aloud, and you'll hear what it's doing under the hood. It winds up being very, very difficult for visually impaired users to engage with, because the read-aloud is going to be completely mangled.
- But you've got a lot of different symbols across the space that are being used for human communication.
- Unicode specifically does not include fictional languages: you're not going to find Klingon or Elvish characters in Unicode. There are portions of Unicode that are reserved for private use by software, and there are people who have mapped Klingon or Elvish characters into those code points and then developed fonts capable of rendering them, but they're not part of the Unicode standard.
- To store all of this, Python has two string types. `str` is a text string: when you write a string with single or double quotes, this is what you get. A text string is a sequence of code points.
- So `s[7]` is going to return the eighth character, or code point, of `s` — and it returns it as a string of a single character. The `ord` function will then get you the numeric code point from one of these single-character strings.
- `bytes` is a byte string, a sequence of 8-bit bytes. If you subscript a byte string `bs` — say `bs[7]` — you're going to get the eighth byte in the string (indexing starts from zero).
- You can convert between them with the `encode` and `decode` functions. `encode` takes a text string and encodes it into bytes; `decode` takes a byte string and decodes it into a `str`. For both of these you need to provide the name of the codec that you want to use: UTF-8, ASCII, Latin-1, Big5, whatever codec it is.
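As a minimal sketch of this round trip (the example strings are mine):

```python
text = "Josef K."

# Encode a text string (str) into bytes with a named codec.
data = text.encode("utf-8")
print(data)                    # b'Josef K.'

# Decode bytes back into a text string.
assert data.decode("utf-8") == text

# ord() gives the numeric code point of a one-character string.
print(ord("J"))                # 74 (hex 4A)

# Encoding fails if the codec cannot represent a character:
# Windows-1252 has no Cyrillic letters.
try:
    "Привет".encode("cp1252")
except UnicodeEncodeError:
    print("cannot encode Cyrillic in Windows-1252")
```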
- Pandas also provides string types. Historically, in pandas 1.x — and it's still supported and still the default — you would use an object series that stores Python strings, either `bytes` or `str` objects; pandas can work with both.
- Pandas also now has a dedicated string type. You can convert an object-typed series storing strings to the pandas string type, and you can also use the string dtype when you're loading a CSV file, and it will load the column as a pandas string column.
- Both of these families have access to string-specific methods through the `.str` accessor — `.str.`whatever — which works on both object-style string series and the new string data type.
- So, as I said, to encode and decode in Python you have `bytes.decode` and `str.encode`; a pandas series also provides encode and decode operations off of this `.str` accessor.
- So you've got these strings, and they're sequences of code points. How do those relate to characters?
- In the simple form, the code points record the numeric values used for different characters. So if I write "Josef K", the name of our character from the sentence at the beginning, it's going to be recorded as a sequence of code points: 4A for J, 6F for o, 73 for s, 65 for e, and so on. These are hexadecimal numbers, so that a byte is exactly two digits.
- Any character that's in Latin-1 — in terms of letters, that's the English letters and the Western European accents — has a code point that is the same as its Latin-1 encoding number. So the ASCII numbers are the Unicode code points for ASCII characters, and for the Western European accents, the Latin-1 byte value is the Unicode code point.
- So o with an umlaut (ö) is F6; u with an umlaut (ü) is FC. And so we get these accented characters being stored this way.
- But code points and characters don't necessarily have a one-to-one relationship. If we have "Mötley Crüe", we can encode the ö as F6, the character "o with diaeresis". But we can also store that same character as two code points: 6F, which is o, and 308, which is "combining diaeresis".
- What that does: when a renderer sees that sequence, it sees the o, and then it says, OK, combining diaeresis — and it puts the diaeresis on top of the o.
- There are a lot of combining characters that can add accents, and they can stack on top of each other: you can stick, say, a diaeresis and a tilde and a hat on, or whatever. If you sometimes see people writing text that has stacked accents coming off the tops and bottoms of the letters, that's supposed to look like Lovecraftian horror, they're doing it with these combining characters — just stacking more and more combining characters on top of each other.
- Combining characters can also, in some cases, modify the shape of a character, and they can assemble complex emoji. There are some other use cases as well.
- But code points can have these relationships: a code point can modify code points that come before it (or, on rare occasions, after it).
- So when you ask a string for its length, that's going to give you the number of code points, which is not the same as the number of characters. If we had "Mötley Crüe" stored the decomposed way and you called `len` on it, you're going to get 13 instead of 11, even though it's only going to print eleven characters — you have 13 code points.
- This does let you put accents on characters that don't commonly have them. For example, the diaeresis is usually only applied to vowels. So if you want Spın̈al Tap, you need to use n, which is 6E, and you need 308, the combining diaeresis. There are a few human languages that do use this character, but a precomposed version has not been added to Unicode — and they don't need one, because you can make it out of the raw n and the combining character.
- One complex use for these is complex emoji. Emoji can have a number of modifiers and layers on top of them.
- If you have the "woman running: dark skin tone" emoji, the skin tone is actually applied as a Unicode combining modifier. That emoji is actually made up of five code points in sequence: the runner, the modifier for dark skin tone, a joiner character, the character for female, and then another character which means "display as emoji rather than text". It puts all of those together to get a woman running with dark skin tone.
- The politics of this — in terms of what's the default, and what's the default for different emoji — are interesting and worth contemplating.
- But this idea of being able to combine code points allows you to build up arbitrarily modified characters. Many of the more complex emoji are defined by these combinations, rather than defining a separate code point for every possible configuration of a particular emoji.
- So the key point here is that code points are not equal to characters: a character can require multiple code points.
- This becomes relevant when you, say, compute the string length. If you want to line strings up, there will be edge cases where it's wrong. The only way to truly tell how wide a string is going to be in display is to ask a rendering engine to render it in a font and measure how wide it is.
- Also, a character can be encoded multiple ways: as we saw, the o with diaeresis can be either F6, or 6F followed by 308. What this means is that simple string equality will sometimes fail — you can have equivalent strings that are not equal. We're going to see in a little bit normalization techniques that let you deal with that.
- There's no semantic difference between the two ways of encoding the o with diaeresis, but when you compare the strings code point by code point, they're not going to be equal.
- You also have some things such as Roman numerals. There's a code point for Roman numeral one, and it's visually indistinguishable from the Latin capital letter I. It's in there for compatibility reasons: we had all of these different encodings, encoding text in different ways, before everything was unified in Unicode. GB 2312 had characters for Roman numerals, and so to translate it into Unicode, they had to be able to map those into code points.
- This kind of code point is called a compatibility code point, and it's not distinguishable from the Latin letter I, except that it's a different code point.
- So there's a Unicode database that tracks many properties of code points. It identifies which ones are these compatibility code points, and what they're equivalent to. It tracks the type: it marks, for each code point, whether it's a letter, a number, punctuation, a combiner, upper or lower case. There are a lot of different properties in the Unicode database. The code points themselves are also divided into blocks, like the Latin-1 Supplement.
- The Python `unicodedata` module is a starting point for getting access to this database and making inquiries about different characters.
- Another thing we can do — and the `unicodedata` module gives you a function for this — is normalize strings. Unicode defines some normal forms.
- NFD, "normal form decomposed", always uses the combining characters. So if you have an F6, it's going to translate that into a 6F and a 308.
- NFC, "normal form composed", uses single code points whenever possible, so it goes the other way around: if it sees a 6F followed by a 308, it's going to turn it into an F6.
- Then there are also the NFK variants — NFKD and NFKC — which do the same thing, but also normalize compatibility characters, so that Roman numeral one is going to turn into a Latin capital letter I.
- Once you've normalized strings, it's a lot more likely that two semantically equivalent strings are also going to be equal, because you don't have the issue that they're different just because the ö was encoded differently. Once they're normalized, you can compare them more directly.
- If you're using Unicode for file names, you can look up the file correctly, even if the name was typed in, or was initially encoded in a different encoding.
- `unicodedata` provides a `normalize` method, so I can normalize "Mötley Crüe" to NFD and it will normalize it to the decomposed normal form. Pandas has normalization with `.str.normalize`.
- Normalization is super useful, as I said, in order to make sure that "Mötley" is only going to be encoded one way, even if you've got files coming from different systems that might have defaulted differently as to whether they use composed or decomposed characters.
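Here is a minimal sketch of these normal forms in action (the example strings are mine):

```python
import unicodedata

composed = "M\u00f6tley Cr\u00fce"        # ö and ü as single code points
decomposed = "Mo\u0308tley Cru\u0308e"    # o/u plus U+0308 combining diaeresis

print(composed == decomposed)             # False — different code points
print(len(composed), len(decomposed))     # 11 13 — same printed characters

# Normalizing both to the same form makes them directly comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# NFKC also folds compatibility characters: Ⅰ (U+2160) becomes I.
print(unicodedata.normalize("NFKC", "\u2160"))                # I
```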
- There are some other subtleties to Unicode as well. The `lower` method converts a string to lower case, but there can still be distinctions that don't make a semantic difference. There's another function, `casefold`, that converts to lower case and goes farther to eliminate case distinctions.
- For example, the German letter ß (sharp s) will become two s's. It's equivalent, but `lower` is not going to convert it, because it's already a lowercase letter. `casefold` will convert it down to the two s's, so that there are absolutely no case distinctions left.
- There are also other rules that are locale-specific. For example, if you want to sort strings, and you want to do it correctly, the rules for doing that depend on the language, not just the characters.
- So your computer has something called a locale. You can change locales for a process; each process has a locale that it's running in. The locale defines a number of things — it can define your date formats and things like that — but it also defines sort orders.
- The Python `locale` module gives you the `strcoll` function, which allows you to compare two strings based on their locale-specific sort order. It also gives you a `strxfrm` function which, given a string, will convert it into another string such that doing a normal lexicographic, code-point-by-code-point comparison on the resulting strings will sort them in the correct order.
- So if you're going to sort a lot, it can be useful to, say, make another column of your data frame that has the `strxfrm` versions of your strings, and then use that as your sort key.
- One thing to be aware of, though, when you're doing that: your program is going to have different results based on the locale in which it's running. So, for reproducibility, you might need to specify a particular locale to run your program in.
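A minimal sketch of locale-aware sorting (the locale name and word list are mine; which locales are installed varies by system):

```python
import locale

# Assumes a German locale is installed; this raises locale.Error otherwise.
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")

words = ["Zebra", "Äpfel", "Apfel"]

# Naive code-point sorting puts Ä after Z (its code point is larger).
print(sorted(words))                        # ['Apfel', 'Zebra', 'Äpfel']

# Locale-aware sorting treats Ä like A.
print(sorted(words, key=locale.strxfrm))    # ['Apfel', 'Äpfel', 'Zebra']
```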
- So, as I said earlier, there are some universal encodings that can encode all Unicode characters. UTF-8 is probably the most common, particularly for storing files on disk. If you have a choice, store your files in UTF-8.
- Code points take between one and four bytes each. ASCII is valid UTF-8: for code points in the range 0 through 127, the code point number is the same as the ASCII number, and they also get stored in one byte in UTF-8. So ASCII just is, immediately, UTF-8. Anything outside the ASCII range gets encoded in two to four bytes.
- UTF-16 uses two bytes per basic character: any character inside what's called the Basic Multilingual Plane — which covers most currently spoken human languages, but doesn't include emoji or mathematical symbols — will use two bytes. Characters outside of it will use four bytes.
- A note on UTF-16: it requires you to know the byte order, because some computing systems are what's called big-endian, where the higher-order byte comes before the lower-order byte in memory, and others are little-endian, where it's the other way around. (Intel processors work in little-endian.) So there's a two-byte sequence that goes at the beginning of the file to mark its byte order.
- UTF-8 is a great default in general, and it's a particularly great default for English and for European languages, because a lot of them will take one to two bytes per character. But if you're storing text that's written in Chinese or Japanese, most of those characters are going to require three bytes in UTF-8 but only two bytes in UTF-16, so UTF-16 can be a more compact format for storing text in those languages.
- Then there's UCS-4, which stores four bytes per character. It also requires a byte order. UCS-4 is just a sequence of four-byte values — an array of int32s or uint32s — and each one stores the code point directly. There is no special encoding.
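To make the size trade-offs concrete, here is a small sketch (example strings mine) comparing encoded lengths; the `-le` suffix fixes the byte order so no byte-order mark is added:

```python
s_en = "hello"
s_zh = "你好"        # two Chinese characters
s_emoji = "🏃"       # outside the Basic Multilingual Plane

for s in (s_en, s_zh, s_emoji):
    print(len(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")))
# hello: 5 code points, 5 bytes in UTF-8, 10 in UTF-16
# 你好:   2 code points, 6 bytes in UTF-8,  4 in UTF-16
# 🏃:     1 code point,  4 bytes in UTF-8,  4 in UTF-16 (surrogate pair)
```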
- Internally, what Python does is store strings in the most compact form it can, from the following list.
- If the string is valid Latin-1 — that is, if its code points are all in the range 0 to 255 — it will store it as Latin-1, with one byte per character.
- If they aren't, but all of the characters in the string are in the Basic Multilingual Plane, then it's stored in UCS-2, which is two bytes per character but can only store text in the BMP.
- And if it can't fit in UCS-2, then Python stores it in UCS-4, so it's an array of int32s.
- This is transparent to you as a Python programmer: when you're writing Python, you do not see the difference in how the string is stored, unless you go get the size of the object. `len` is always going to return the number of code points. Python is just storing the string in the most compact form it can. If you're writing a C extension module, then you'll see the difference between the different storage formats of a string.
- Many other systems use UTF-16. Java, JavaScript, and the Windows and Mac APIs natively use UTF-16, which for legacy reasons is often treated as UCS-2.
- UCS-2 was used in the 90s, when there were fewer than 65,536 code points in Unicode and you could store any of them in just two bytes.
- For characters in the Basic Multilingual Plane, there's no difference between UTF-16 and UCS-2; UTF-16 extends UCS-2 with the ability to say "this character uses an extra pair of bytes", in order to encode the additional characters.
- So if you ask Java or JavaScript to give you the length of a string, it's actually going to give you the number of two-byte pairs — UTF-16 code units — which will be the number of code points if everything's in the Basic Multilingual Plane. But as soon as you've got emoji or mathematical symbols in your string, it's going to be wrong.
- Also, the `charAt` function that gives you an individual code point is going to give you the UTF-16 code unit; if you need the actual code point for a character that's in the upper parts of Unicode,
- you're going to need another function.
- All this said, when you get some text, there are two first steps for processing it.
- The first is to decode the text: convert the bytes into text that you can then process as Unicode.
- Then we may want to normalize the text to put it into a consistent form. A Unicode normal form is often useful. We might want to strip accents — this is going to lose some meaning, but it can reduce distinctions between words in ways that can occasionally be useful for processing. We might want to case-fold it, so that our text is all lower case and we don't have to deal with mixed-case text, if we don't want to make distinctions between cases.
- Some points that you need to remember when processing text (I've also linked to a reading that provides some additional material):
- You always need to know what your encoding is when you're getting text from the outside world, or when you're writing text back out. A modern Mac or Linux system is going to default to UTF-8; a Windows system still usually defaults to code page 1252. You need to decode bytes in order to get a text string, which is a sequence of code points.
- But even once you have your code points, they don't necessarily map one-to-one to characters. The length of a string does not give you the printed number of characters, because you can have combiners that make multiple code points display as one character, the zero-width joiner doesn't show up, etc.
- You can use normalization to improve the comparability of strings, so that, say, you always use combiners, or you never use combiners where precomposed characters exist.
- Case folding is a tool that allows you to make strings comparable case-insensitively, by removing distinctions between upper and lower case.
- You have to pay attention to locale as well: if you want to sort things correctly for your current human language, you need to do so in a way that's based on the specific locale, because different ones have different sorting rules.
- If you have a single-byte encoding of text and it's not UTF-8 (not all byte sequences are valid UTF-8), then Latin-1 is a common guess, and decoding as Latin-1 will always succeed. If it's wrong, some of your accents and other non-English characters will be incorrect — but if you're just trying to get the data read in so you can start looking at it, Latin-1 is common. You can also try different encodings and pick the one that gets you meaningful text in your language, but unfortunately there's not a foolproof way to detect the encoding used for a piece of text.
- So, to wrap up: a string is not a simple data type. Text is encoded both for on-disk storage and for in-memory processing. You've got multiple layers: we have characters; characters are represented with code points; code points are encoded into bytes. There aren't the one-to-one relationships that we might expect going up and down the stack, but this enables us to represent a wide array of written human language in a very flexible fashion.
🎥 The Text Processing Pipeline#
This video discusses the basic steps of text processing, beginning with tokenization. The result is a document/term matrix, possibly normalized.
- In this video, we're going to talk about more of the text processing pipeline: how do we actually get from text that we've decoded to features and vectors that we can use to do computation?
- The learning outcomes for this video are for you to understand the different stages of text processing, to be able to extract a vector representation from text, and to know some of the things that are possible beyond what we will discuss in class.
- So we've got our input, the sentence that we saw in the previous video: "Someone must have slandered Josef K." And as output we want to get some features from it as vectors, and maybe scores or decisions for some process.
- First, before we get into how to do that, I want to define a few terms. A corpus is a set of textual data to analyze, and a document is a single document within the corpus — a corpus is made of documents. In the spam filter example I'll give you, our corpus is a bunch of emails, and each email is a document. A term is a single word within a document, and we can think of the set of all terms as the vocabulary of the corpus.
- So we first decode and normalize the text, as we talked about in the encoding and decoding video. Then we tokenize it, or split it into individual words. We may do some filtering on those tokens, and we might further process them into some kind of normal form. Then we convert the output of those steps — a sequence of tokens — into a vector, which we may also further normalize, and which we can use for various types of processing.
- So that's the pipeline we're going to be talking through. The previous video covered decoding and normalization. Sometimes we might delay some of the normalization until after tokenization, but often we'll do it first, so that we've got the text decoded into code points, we've put them in a normal form, we've maybe lowercased or case-folded everything, and we might even have stripped accents in that process as well. The output of that is normalized text that's ready to be tokenized.
- In tokenization, we want to split text into individual tokens, and these tokens often are words. So we want to grab the words: someone, must, have, slandered, Josef, K. Scikit-learn's default tokenizer treats a sequence of alphanumeric characters as a token — it actually requires there to be at least two, so it will not extract "K" as a token.
- So we split this text into individual tokens, and we're still at a sequence of tokens, but this becomes the basis for basically any further analysis.
- The next thing we might do is remove stop words. Common utility words like "and" or "or" are often not useful, depending on the task.
- If you're doing an information retrieval task — you're searching for text — finding documents that include the word "and" is probably not useful, because almost every document is going to include the word "and" or "or". Or if you want to do document classification, the vast majority of your documents are going to have these words, so they're not very useful for distinguishing between one class of documents and another.
- We can remove them either with a list of stop words (scikit-learn includes one) or by using a frequency-based analysis, where we remove the words that appear the most often.
- Whether or not you do this depends on the task, and it also depends on the other analysis you're going to do. If you're going to be doing an analysis that recovers sentence structure, then you need these words. But if you're doing information retrieval, you may well not need them; and if you're doing classification, they're often not useful for the classification task, and so they're worth removing.
- Then, after we've tokenized and maybe removed our stop words, we can do something called stemming or lemmatization — again, whether or not you do this depends on the task.
- The idea here is that words have multiple forms in English: change, changes, changed, changing. These are different forms of the same word; they're different tenses of the same verb.
- For some purposes, especially search purposes — a lot of this pipeline has its origins in search, though it's used for a lot of other purposes as well — there's no difference: if you want to search for "change" in the corpus, you probably want to find documents that say "changed" or "changing".
- There are two ways of doing this. Stemming is a relatively simple technique that transforms words into common stems. The result is not necessarily a word: a stemming algorithm may well transform all of these forms of the word "change" into "chang". But what stemming does give you is that different forms of the same word go to the same stem, so if you stem, all of the different forms result in identical tokens. For English, Snowball is a common stemming algorithm. Scikit-learn does not provide any stemming or lemmatization; to do that, you're going to need to go to another package.
- Lemmatization accomplishes a similar purpose, but it transforms words into what's called the lemma form, which is an actual word — the root form of the word. So these will all most likely become "change" in your lemmatizer. Lemmatization is more computationally intensive, but the upside is that it gives you actual words rather than stems.
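As a quick illustration, here is a sketch of stemming and lemmatization using NLTK, one package that provides them (the word list is mine, and the WordNet lemmatizer needs its data downloaded once first):

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
# One-time setup: import nltk; nltk.download("wordnet")

words = ["change", "changes", "changed", "changing"]

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in words])
# ['chang', 'chang', 'chang', 'chang'] — identical stems, not real words

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# ['change', 'change', 'change', 'change'] — actual words
```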
- Once we've got our tokens — we've dropped the ones we don't want to use, and we've stemmed or lemmatized if that's appropriate for our task — we then want to vectorize. We want to create what's called a term vector, or term frequency vector. This vector has one dimension for every term in the entire vocabulary, and its values are the number of times each term appears in the document.
- So in our example text, "Someone must have slandered Josef K.", we're going to have a one for "someone", a one for "must", a one for "have", a one for "slandered", a one for "josef".
- The vector is also sparse, because each document is probably only going to use a fraction of the terms, and so we'll use a sparse representation that doesn't actually store all of the zeroes. But mathematically, this vector has an entry for every term, and if a term never appears in the document, it gets a zero.
- Scikit-learn provides a class called `CountVectorizer` that gets us to this stage. It tokenizes the text; it can do transformations, like lowercasing; and it can decode into Unicode if necessary. The `fit` method learns the vocabulary, and `transform` turns an array of texts — a one-dimensional array where each element is a text string — into a sparse matrix.
- This is called a document–term matrix: the rows are documents, and the columns are terms. It's not recording the zeroes — the empty entries are zeros — and, say, in document one, term one appears five times, term three appears two times, and term four appears three times.
- So you get this sparse matrix, where each document is a row and each column is a word. Basically, now our words are features, and their values are how frequently the words appear.
- So, in our pipeline, here's what `CountVectorizer` can do for you: it decodes and lowercases (if you tell it to strip accents, it's also going to normalize your Unicode); it then tokenizes your text — its default tokenization strategy treats any sequence of two or more alphanumeric characters as a word; and once it's tokenized, it vectorizes, and you get back these values that are the term frequencies: how often did each word appear in each document?
- This is what `CountVectorizer` gives you. It's super useful; you're going to use it as the basis for some of your processing in the assignment.
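A minimal sketch of CountVectorizer on a toy corpus (the documents are my own examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Someone must have slandered Josef K.",
    "Josef K. was arrested one morning.",
]

vec = CountVectorizer()            # lowercases and tokenizes by default
X = vec.fit_transform(corpus)      # sparse document-term matrix

print(vec.get_feature_names_out())
# ['arrested' 'have' 'josef' 'morning' 'must' 'one' 'slandered' 'someone' 'was']
print(X.toarray())                 # note 'K' is dropped: tokens need 2+ characters
# [[0 1 1 0 1 0 1 1 0]
#  [1 0 1 1 0 1 0 0 1]]
```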
- Now, for some purposes, there are a few problems with using raw term frequencies.
- The first is that all words are equally important. If the word "have" appears ten times and the word "hippopotamus" appears ten times, it makes no distinction — even though, for a lot of purposes, the fact that a document says "hippopotamus", a far less common word, is probably more meaningful for distinguishing it from other documents than the fact that it has "have" in it, since almost every document has "have".
- There are two pieces to the solution. One is that we weight words differently: we're going to give uncommon words — words that don't appear in very many documents — higher weight. This winds up being broadly useful, because if you're trying to retrieve a document, the words that are more unique to that document are probably more useful for trying to retrieve it. If we're trying to classify a document, the words that are more unique are probably also going to be more useful for distinguishing one class of documents from another.
- As another example: if you have a collection of scientific reports, and you're trying to distinguish between zoology and botany, the word "have" is probably not useful, because both classes of documents probably have it, whereas "hippopotamus" is probably much more common in zoology documents than in botany documents.
- The other thing that we're going to do is normalize these vectors, so that, rather than just using absolute counts — where a longer document simply has higher values everywhere — every document's vector has the same magnitude. This means long documents don't have substantially higher values than short documents just by virtue of their length.
- That makes it easier to compare documents. The first part — giving more weight to less common words — is done through something called term frequency–inverse document frequency (TF-IDF).
- The idea here is that we compute a feature value by multiplying the term frequency tf(t, d) by the inverse document frequency idf(t).
- The way scikit-learn computes the inverse document frequency: if N is the number of documents, and df(t) is the number of documents that contain the term t, we take the log of the number of documents divided by the number of documents with the term.
- It's called the *inverse* document frequency because, rather than taking df(t) / N — the fraction of documents that have the term — we take the inverse of that fraction, N / df(t), and take its log: idf(t) = log(N / df(t)).
- Scikit-learn adds ones to both the numerator and the denominator in order to prevent problems. If you've got a term that doesn't appear in any of the documents — CountVectorizer isn't going to give you one of those terms, because a word won't be in the vocabulary if it's not in any document, but in some other cases we have a vocabulary that comes from somewhere else — adding a one makes sure you never divide by zero and never try to take the log of zero.
- Another common variant is to take log(N / (1 + df(t))), so that you just have the one-plus in the denominator and not in the numerator. That little change is not going to make a significant difference in terms of performance; it's a subtle difference in how it's implemented, but not a significant computational difference.
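For reference, the smoothed formula that scikit-learn's TfidfVectorizer uses by default (its smooth_idf option) is the following; the +1 outside the log keeps terms that occur in every document from being zeroed out entirely:

$$
\mathrm{idf}(t) = \ln\!\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1
$$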
- Think about what you have to do to make the TF-IDF value go up. A term can appear more frequently in the document: that increases the tf, and the value goes up. You can also increase the TF-IDF value by having the term appear in fewer documents.
- The fewer documents the term appears in, the larger N / df(t) is: if it doesn't appear in very many documents, it's going to be a large number — say a million over 50, or a million over five hundred — and the log of that is going to be high. So you can have a much higher TF-IDF value if the term does not appear in very many documents.
- That reflects the fact that if a term is more unique to this document, it's more useful for distinguishing this document from other documents.
- That's the key insight here: less common words are more useful for describing what makes this document different from other documents.
- If two documents share the same uncommon words, that's really useful for saying they're probably similar. If they share the same words, but those words are super common, the documents are probably not very similar.
- We then normalize our TF-IDF vector to be what's called a unit vector, which means we just divide the vector by its L2 norm.
- If you use the `TfidfVectorizer` instead of the `CountVectorizer` in scikit-learn, it's going to do everything `CountVectorizer` does, and then it's going to apply TF-IDF and normalize to unit vectors by default.
- You can change some of that — for example, have it not do the L2 normalization — but if you use the `TfidfVectorizer`, it's going to give you unit vectors that apply the TF-IDF adjustment to down-weight common terms: vectors that are useful for characterizing documents in terms of how frequently they use relatively uncommon words.
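A sketch of swapping in TfidfVectorizer, on the same toy corpus as the earlier example:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Someone must have slandered Josef K.",
    "Josef K. was arrested one morning.",
]

# Counts, then TF-IDF weighting, then L2 normalization, in one step.
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)

# Every document's row is a unit vector by default.
print(np.linalg.norm(X.toarray(), axis=1))   # [1. 1.]
```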
- So now we have our whole text processing pipeline for getting to a term vector. We decode and normalize. We tokenize. We remove stop words, and we may do some other token filtering as well, like throwing out numbers. You can stem or lemmatize. Both the stop word removal and the stemming or lemmatization are optional: you don't have to do them, but you can. This is the template for the pipeline; different applications and different tasks will use different steps of it.
- Then we vectorize the text, and we normalize the resulting vectors, and we get these nice vector representations of our documents. The result is that each document gets a vector that assigns scores to terms for their relevance to that document.
- This is a very high-dimensional vector — ten thousand, a hundred thousand dimensions — so if you stick it into a linear model, that's a large number of coefficients. But it gives you a vector representation of the text that we can then use for additional tasks: processing, classifying, clustering, etc.
- A few alternatives to some of the design decisions in this pipeline. First, we don't have to tokenize into single words. We can tokenize into what are called n-grams, such as bigrams or trigrams: a bigram is a pair of words that appear next to each other. You can also do character-level bigrams, so pairs of characters. You can also do what's called a skip-gram: pairs of words that appear in proximity, but where there might be another word between them.
- You can also do sentence-level analysis, rather than just splitting into words and putting the words into a vector. (That's called "bag of words", by the way — it's called a bag-of-words model because we just take the document and throw all the words into a bag; we don't have the relationships between them anymore.) We can split the text not only into words but into sentences; we can tag words by part of speech; we can analyze relationships between words. Various natural language processing toolkits give you tools for doing that; StanfordNLP (now Stanza) is one that gives you the ability to do those word relationships.
- Currently there's a lot of work in NLP — natural language processing, the field that studies all of this. If you want to learn a lot more about this processing pipeline and the things you can do with text, the classes you want to take are NLP and IR (information retrieval).
- Transformer architectures are deep neural networks that can deal with sequences of words. They have significant power for modeling how language works and what's going on in text, but that comes with significant processing cost.
- The Python package NLTK gives you a lot of tools. It doesn't give you the full sentence analyzer, but it gives you a lot more sophisticated tools than scikit-learn — such as part-of-speech tagging, stemming, and lemmatization. You can also integrate it with scikit-learn: you can replace the scikit-learn `CountVectorizer` or `TfidfVectorizer` with a vectorizer that uses NLTK to do the analysis, if you need more sophisticated text processing as part of your application.
- So to conclude: term vectors treat each word as a feature for which documents have scores, and we often normalize those scores so that uncommon words have higher weight and longer documents don't just have larger vector values. We then use these vectors downstream in order to do a variety of computations.
Resources#
Stanza (formerly StanfordNLP)
🎥 Vectors and Similarity#
This video describes the concept of a vector representation, and how to compute the similarity between two documents.
- So now that we have some term vectors, what can we do with them? The learning outcomes for this video are for you to understand the value of vector representations of instances, and to compute similarities between vectors.
- We've talked about vectors from time to time throughout this course, and you may have seen them in a linear algebra or statistics class, but a vector is just an array of numbers: a vector x with values x1, x2, ..., xn, in which case we say that it has n dimensions. This can be the row or column of a matrix.
- The "length" of a vector is usually not its dimensionality; the length is usually the L2 norm, the square root of the sum of the squares of the vector's values.
- An instance's features form an array of values — a vector. The row of a data frame is a vector, and we can treat it as one; we can treat a column as a vector too.
- But with the features you've been seeing so far, unless we standardize, the features often have different scales and different meanings, and it's not necessarily meaningful to relate them to each other. One of the things that's a little different about the vectors we get out of text is that they're more homogeneous: each dimension corresponds to a term, but they all mean the same kind of thing — the number of times that term appears, or the relevance of that term to the document.
- Both heterogeneous and homogeneous vectors can be thought of as giving data points a position in some kind of a vector space.
- So suppose we've got a two-dimensional vector space with vectors A and B. We can think of each of them either as a point in space out at the end of the arrow, or as a line from the origin to that point. Say A is at (2, 3) and B is at (4, 1), using the standard (x, y) notation.
- We can compute a variety of things. We can compute C = A − B, which is a vector that would go from B to A. We can compute the distance between them, which is the length of that vector — the Euclidean length.
- We call it the Euclidean length because it's the length of the arrow: a multidimensional generalization of the Pythagorean theorem. If you want the length of an arrow like B, you compute the square root of the sum of the squares of its two sides, which are four and one: square them, sum them, take the square root, and you get √17, the length of that vector — the length of that arrow.
- You can also compute what's called an inner product. Take two vectors with the same dimensionality — they're both n-dimensional vectors. If you go across the dimensions, multiply the two vectors' values for each dimension, and sum all of that up, you get what's called the dot product.
- For these particular vectors, it's going to be two times four plus three times one, which is eleven: the inner product of these two vectors is 11.
- You can also think about the angle between them — they have an angle θ. Often what we do with that angle is compute its cosine; if we want the angle itself, we can compute the arccosine of that cosine.
- The cosine between two vectors is their inner product divided by the product of their lengths. Here, that would be eleven divided by √17 times the length of A, which is the square root of four plus nine, √13.
- This angle is a common measure of object similarity, and we can do this with any kind of vector — for any data points with vector representations, you can compute the cosine similarity between them. It just happens to be especially useful for these kinds of homogeneous vectors, like text vectors.
- It looks at how related two vectors are, and it has a useful relationship with correlation: if the two vectors have a mean of zero, then it's equivalent to the Pearson correlation between them.
- It's very common, though, to do this with TF-IDF vectors. A TF-IDF vector is always going to have positive values — the vectors are all in that upper-right quadrant — but the cosine gives you a very common similarity measure between documents. You can also take the distance between two documents in the vector space as the relationship between them.
- A couple of useful tricks. First, if the two vectors are unit vectors, then the inner product is the cosine, because both of the lengths are one — the denominator is one times one — so you can just compute an inner product. This is one reason that normalizing TF-IDF vectors to unit vectors is useful: oftentimes, if we're going to compute cosine similarities, we normalize to unit vectors first, and then we get very efficient cosine similarities, because we just have to compute inner products between vectors.
- This is also a reason why it's sometimes useful to normalize a set of feature values for an instance to a unit vector — the unit vector normalization that we talked about in the video on transformations and normalization. Unit vector normalization is almost never something we want to do to a column in one of our data frames; we only want to do it within a row, to the feature values of a particular instance — either all of them or some subset of them.
- Second, if we have a matrix M whose rows are all unit vectors, then we can use a matrix multiplication to compute the pairwise similarities between all rows of M in a very optimized fashion. We've been talking about how to lean hard into vectorized computations, and being able to use a matrix–matrix multiply to compute a large number of similarities in one go is a very good example of that kind of thinking.
- SciPy provides sparse matrix representations that are super useful for this: if you use `CountVectorizer` or `TfidfVectorizer`, you're going to get a SciPy sparse matrix out, which doesn't spend space storing all the zeroes.
- So to conclude: vector representations allow us to do a lot of useful computations, and one of those is that we can compute the similarity between two documents by taking the cosine between their vectors — their TF-IDF vectors. This gives us a powerful tool: we can compute the similarity, distance, or difference between documents by using their bag-of-words (possibly normalized) term frequency vectors.
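A minimal sketch of these computations, using the vectors A and B from the example above:

```python
import numpy as np

A = np.array([2.0, 3.0])
B = np.array([4.0, 1.0])

print(np.linalg.norm(B))       # 4.1231... = sqrt(17), the Euclidean length
print(A @ B)                   # 11.0, the inner (dot) product

# Cosine similarity: inner product over the product of the lengths.
cos = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos)                     # 0.7399...

# With unit vectors, the inner product *is* the cosine.
a = A / np.linalg.norm(A)
b = B / np.linalg.norm(B)
print(a @ b)                   # same 0.7399...

# Rows as unit vectors: M @ M.T yields all pairwise cosine similarities.
M = np.stack([a, b])
print(M @ M.T)
```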
🎥 Classifying Text#
This video introduces classifying text, and the use of a naïve Bayes classifier based on term frequencies.
- In this video, we're going to talk about how to classify text. The learning outcomes are to be able to use vector space representations to classify text, and to classify into more than two classes.
- There are a lot of reasons you might want to classify text. You might want to distinguish between spam and not spam. You might want to identify news categories — you're going to be doing that in the assignment. You might want to separate news from opinion, or identify fraud, or any of a lot of other reasons to classify texts into different kinds.
- We're going to be using term vectors as our input for this process. There's been work lately on using more sophisticated representations, but we're going to start with our term vectors.
- A really simple way to do it is to use what's called a k-nearest-neighbors classifier. What it does is find the k data points that are closest to a given point in the vector space.
- So suppose we've got some vector space with some data points of one class, and some data points of another class, and a new point comes in that we want to classify. What the classifier is going to do is find the closest data points — maybe these three — and take a little vote. Inside this neighborhood, say two of them are purple and one of them is red, so the vote is going to be purple: k is three, and of the three nearest neighbors, two were purple and one was red, so it classifies our new data point as purple.
- It uses some metric; often this metric is the Euclidean distance between the two vectors. It also easily extends to more than two classes, because you can have three, four, or five classes and just see which class is the most common in the neighborhood.
- This works fine with TF-IDF vectors: you can use the `TfidfVectorizer` and the scikit-learn `KNeighborsClassifier`, put those two things in a pipeline, and you get a working classifier — one that can work well on a variety of tasks, using the term vectors extracted from your text in order to do classification.
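A sketch of that pipeline (the toy training data is mine):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["win cash now", "cheap pills online", "meeting at noon",
               "lunch tomorrow?", "free money win", "project status update"]
train_labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# TF-IDF vectors feeding a 3-nearest-neighbors vote.
knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(train_texts, train_labels)

print(knn.predict(["win free cash", "status meeting tomorrow"]))
# expected: ['spam' 'ham']
```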
- Another way is to use what's called a naive Bayes classifier. Here the goal is to estimate a probability — in this case, the probability that a message is spam given the document's text.
- We can use Bayes' theorem: if we have the probability of a particular document's text given spam — the probability that, if we set out to write a spam, these words would appear in it — and the probability of writing a spam at all, then, dividing by a normalizing constant, we can compute the probability of spam given a document.
- There are a couple of things that go into this. First, we have data: we have a bunch of labeled documents from which we can learn how terms relate to spam. And second, we make what's called the naive Bayes assumption, which assumes that terms are independent of each other.
- This assumption is false, but it is very useful, because it gives us a computationally simple way to do the classification that also happens to work remarkably well.
- The naive Bayes assumption says that if I'm going to write a spam, and in my spam I say "apple", the frequency of any other word occurring is independent of that: the fact that I said "apple" in my spam has no influence on the probability that I'm also going to say "cash" in my spam.
- And since the terms are independent, we can take the product, over the terms in the document, of the probability of each term given spam.
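In symbols (notation mine): Bayes' theorem, plus the independence assumption over the terms $t$ of a document $d$, gives

$$
P(\text{spam} \mid d) = \frac{P(d \mid \text{spam})\, P(\text{spam})}{P(d)},
\qquad
P(d \mid \text{spam}) \approx \prod_{t \in d} P(t \mid \text{spam})
$$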
- So what we do is: for spam, we have a probability distribution over terms, and for not-spam, we have another probability distribution over terms. We can learn those probability distributions from labeled data, just by counting how common each term is in spams and in non-spams. Our emails are labeled as spam or not; we count how often each term appears in each class, and we use that to derive the probability of using that term if I'm going to write a spam, or if I'm going to write a non-spam email.
- So we do that, and we learn our prior. We can either set the prior to be even, or we can learn it from the data — this is sometimes called empirical Bayes. Scikit-learn learns it from the data by default: if 70 percent of our emails are spam, then the prior probability of spam will be 0.7 and the probability of not-spam will be 0.3. It does that automatically.
- Then, given a new document, we compute this probability of spam, and we also compute the probability of not-spam. We use our prior, we use the term probabilities from the data, and we can compute the normalizing constant by computing these two probabilities.
- This works well on raw counts. You don't want to do TF-IDF before this — you can, and it mostly works, but to really compute the probabilities correctly, it should work on your plain term frequency vectors: how often each word appears. It's going to naturally learn which words are more useful for distinguishing between spam and not spam.
- Scikit-learn implements this in the `MultinomialNB` classifier class.
- So, to wrap up: both kNN and naive Bayes can predict multiple classes from vector data, such as term frequency vectors from text. If you want to compute multi-class accuracy, it's the same as binary accuracy: it's just the fraction of the time the classifier was correct.
- The spam filter example notebook demonstrates naive Bayes with the `CountVectorizer` that we've been talking about.
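A sketch of naive Bayes on the same toy data as the kNN example, using counts rather than TF-IDF:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win cash now", "cheap pills online", "meeting at noon",
               "lunch tomorrow?", "free money win", "project status update"]
train_labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# Raw term counts feeding a multinomial naive Bayes model.
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(train_texts, train_labels)

print(nb.predict(["free cash"]))          # expected: ['spam']
print(nb.predict_proba(["free cash"]))    # P(ham), P(spam) for the message
```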
📓 Spam Filter Example#
The Spam Filter Example demonstrates tokenization and classification with text.
📩 Assignment 6#
Assignment 6 is available and is due November 20.