Week 12 — Text (11/7–11)#
Attention
The second midterm is this week.
This week we are going to talk about unstructured data, particularly text.
🧐 Content Overview#
Element | Length |
---|---|
🎥 Unstructured Data | 2m57s |
🎥 Unicode and Encodings | 28m45s |
🎥 The Text Processing Pipeline | 17m36s |
🎥 Vectors and Similarity | 6m58s |
🎥 Classifying Text | 6m9s |
This week has 1h2m of video and 0 words of assigned readings. This week’s videos are available in a Panopto folder.
📅 Deadlines#
Midterm B, November 12
No quiz due to midterm.
🚩 Midterm B#
The second midterm will be released in the evening on **Wednesday, November 9**, and is due at **midnight, Saturday, November 12, 2022**. It follows the same structure and rules as Midterm A, and covers material through Week 11.
🎥 Unstructured Data#
This video introduces the week and describes the key ideas of extracting features from unstructured data.
- Hi again. Now we're going to talk about how to start processing unstructured data, particularly text data.
- In this video, we're going to introduce the idea of processing unstructured text and unstructured data modeling, and understand a little bit about the different types of features we can try to extract from unstructured data.
- So far in this class, we've been dealing primarily with tabular data: fully structured data of various forms, where for each instance we have an equivalent set of features. Some feature values might be missing, but we've got all of these features.
- As we discussed earlier in the class, there are additional types of data. Semi-structured data has defined fields, but they can vary significantly in terms of what fields we get for what objects; unstructured data doesn't have defined fields.
- These aren't mutually exclusive. You might have some unstructured data, like text or images, that has structured metadata that comes with it. You can treat the metadata just like we treat other structured data. But this week we're going to be talking about how to start working with the unstructured content itself.
- To get features from data, there are a few things we can do. We can use raw features: if we've got text, we can use words. We might say the words are features, and the value associated with a word is the number of times it appears. If we've got image data, we've got pixels.
- We can also get features that readily convert to a tabular form, like size or length: the number of characters in a document is a feature.
- We can also use additional models. There are sentiment inference models that will try to determine whether a sentence or some text has positive or negative sentiment, and other models that we can apply in order to compute additional features from some of our raw data.
- This week we're going to be processing unstructured text using words as features. The result will be tabular — we'll have a column for each word — and we'll then use these word features to build classifiers.
- There's an example notebook in this week's material that walks you through using the topics that we're learning in the videos to build a classifier for email spam.
- So to wrap up: unstructured data doesn't come with nice data-frame-style features. You don't get rows and columns that are all set up so you can do nice math on the columns. We're going to directly use the content as a large feature space, using words as features: we have as many features as there are different words in the content. We'll also see how to derive a smaller set of features from the data.
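As a small illustration of these kinds of raw features, here is a minimal sketch (the example sentence and feature choices are mine, not from the lecture) that computes a length feature and word-count features from a piece of text:

```python
from collections import Counter

text = "the cat sat on the mat"

# A simple tabular-style feature: document length in characters.
n_chars = len(text)

# Raw word features: each distinct word is a feature whose value is
# the number of times it appears in the document.
word_counts = Counter(text.split())

print(n_chars)       # 22
print(word_counts)   # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```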
🎥 Unicode and Encodings#
In this video, I describe Unicode and text encodings.
Important Python code:

- `str.encode` and `bytes.decode`
- `unicodedata.normalize`
- `str.casefold`
- `locale.strcoll` and `locale.strxfrm`
- In this video I'm going to talk with you about Unicode and encodings: how is text actually represented and stored?
- We talked a while ago about encodings for different kinds of data, and we said we'd come back to text. We've also seen in some of our initial processing that occasionally we need to provide an encoding. What's up with all of that?
- The learning outcomes for this video are to understand how text is stored in memory and in files, to encode and decode text for storage and communication, and to distinguish between characters, code points, and bytes.
- So we have some input text. Say we have the first sentence of the book The Trial by Franz Kafka: "Someone must have slandered Josef K., for one morning, without having done anything truly wrong, he was arrested." This is a sentence of the English language.
- It's a sequence of words. It's a sequence of characters. And we store it as a sequence of bytes encoding those characters. A computer has to store everything in bytes — we've talked a little about how floating-point numbers are stored; here, the text is stored in bytes.
- For a long time in U.S.-based computing, we stored text in what was called ASCII, the American Standard Code for Information Interchange. It assigned a number in the range 0 to 127 (seven bits) to all the English letters, a number of symbols, and control codes.
- You'll find the digits in the range 48 to 57; uppercase letters go from 65 to 90, lowercase from 97 to 122, and so on. It assigned these codes so text could be stored in bytes, using just the lower seven bits of each byte, leaving the upper bit free for other control purposes.
- ASCII was able to store written English text using most of the characters and punctuation in common use. But there are a few problems with this. For one, the whole world doesn't speak English, and we need to be able to encode a lot of different languages.
- For languages that use the Latin or Cyrillic alphabets, this can often be done with extended code pages. Latin-1 extends ASCII: the numbers 0 through 127 are ASCII, and 128 through 255 — within a byte, this is the difference of whether the first (high-order) bit is set or not — are used for characters with Western European accents and some control characters. Latin-2 does the same for Eastern European languages.
- Windows-1252 combines both: it's compatible with Latin-1, but it uses a range of characters that Latin-1 reserves for control codes to store the Eastern European accents.
- KOI8-R uses the upper 128 characters for Russian Cyrillic characters, which lets you store Russian and English in KOI8-R.
- But many languages, such as Chinese, Korean, and Japanese, have many more characters than can fit in a byte, even if the entire byte were allocated to them. And so you use a multi-byte character set, which often uses two bytes per character — such as Big5 for traditional Chinese, and GB 2312, the Chinese national standard for simplified Chinese (it also covers Russian).
- Now, given a sequence of bytes that's text, in order to properly interpret it, you have to know the encoding. So that's one problem: you need to know the encoding in order to decode the text properly.
- Also, it's very hard to mix encodings, and therefore languages. If you've got text that's encoded in GB 2312, you can't write Turkish in it. So if you've got a document that's trying to have parts in Chinese and parts in Turkish, you have to have a way to switch back and forth between code pages within a single document, which gets very complex.
- And so Unicode was developed to try to unify these.
- Unicode is an international standard for text representation in which text is represented as a sequence of what are called code points. You can basically think of a Unicode text string as an array of code points. These code points are then encoded into bytes.
- So here, again, as we saw with other data types, there's this distinction between data and its encodings. We actually have three levels, because you've got characters, code points, and then the bytes that are used to actually store the code points.
- We have universal encodings that can encode all valid Unicode. And, as I said, we've got the other encodings that have historically been used: Latin-1, US-ASCII, KOI8-R. They have been redefined as encodings for subsets of Unicode.
- So if you want to process KOI8-R text in a modern program, what you do is decode the KOI8-R into Unicode, process the Unicode, and then re-encode into whatever encoding you need.
- Now, not all encodings can accept all characters. If you try to encode Russian text into Windows-1252, it will fail, because Windows-1252 doesn't have the Cyrillic alphabet.
- Unicode is intended to cover all non-fictional human languages. Its goals include any characters used to write any human language spoken today, and it includes ancient languages that are no longer spoken — for example, it has Phoenician characters.
- It also includes a lot of things like math symbols, the International Phonetic Alphabet, emoji, and many other symbols.
- You'll sometimes find people using these for various strange effects, like getting italics in a tweet or something. What that's actually doing is grabbing math characters and using those. If you want to see this sometime, use your computer's assistive technology to read such content aloud, and you'll hear what it's doing under the hood. It winds up being very, very difficult for visually impaired users to engage with, because the read-aloud is going to be completely mangled.
- But you've got a lot of different symbols across the space that are being used for human communication.
- Unicode specifically does not include fictional languages: you're not going to find Klingon or Elvish characters in Unicode. There are portions of Unicode that are reserved for private use by software, and there are people who have mapped Klingon or Elvish characters into those code points and then developed fonts capable of rendering them, but they're not part of the Unicode standard.
- To store all of this, Python has two string types. `str` is a text string: when you write a string with single or double quotes, this is what you get. A text string is a sequence of code points.
- So `s[7]` is going to return the eighth character, or code point, of `s` — and it returns it as a string of a single character. The `ord` function will then get you the numeric code point from one of these single-character strings.
- `bytes` is a byte string, a sequence of 8-bit bytes. If you subscript a byte string `bs` — say `bs[7]` — you're going to get the eighth byte in the string (indexing starts from zero).
- You can convert between them with the `encode` and `decode` functions. `encode` takes a text string and encodes it into bytes; `decode` takes a byte string and decodes it into a `str`. For both of these you need to provide the name of the codec that you want to use: UTF-8, ASCII, Latin-1, Big5, whatever codec it is.
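As a minimal sketch of this round trip (the example strings are mine):

```python
text = "Josef K."

# Encode a text string (str) into bytes with a named codec.
data = text.encode("utf-8")
print(data)                    # b'Josef K.'

# Decode bytes back into a text string.
assert data.decode("utf-8") == text

# ord() gives the numeric code point of a one-character string.
print(ord("J"))                # 74 (hex 4A)

# Encoding fails if the codec cannot represent a character:
# Windows-1252 has no Cyrillic letters.
try:
    "Привет".encode("cp1252")
except UnicodeEncodeError:
    print("cannot encode Cyrillic in Windows-1252")
```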
- Pandas also provides string types. Historically, in pandas 1.x — and it's still supported and still the default — you would use an object series that stores Python strings, either `bytes` or `str` objects; pandas can work with both.
- Pandas also now has a dedicated string type. You can convert an object-typed series storing strings to the pandas string type, and you can also use the string dtype when you're loading a CSV file, and it will load the column as a pandas string column.
- Both of these families have access to string-specific methods through the `.str` accessor — `.str.`whatever — which works on both object-style string series and the new string data type.
- So, as I said, to encode and decode in Python you have `bytes.decode` and `str.encode`; a pandas series also provides encode and decode operations off of this `.str` accessor.
- So you've got these strings, and they're sequences of code points. How do those relate to characters?
- In the simple form, the code points record the numeric values used for different characters. So if I write "Josef K", the name of our character from the sentence at the beginning, it's going to be recorded as a sequence of code points: 4A for J, 6F for o, 73 for s, 65 for e, and so on. These are hexadecimal numbers, so that a byte is exactly two digits.
- Any character that's in Latin-1 — in terms of letters, that's the English letters and the Western European accents — has a code point that is the same as its Latin-1 encoding number. So the ASCII numbers are the Unicode code points for ASCII characters, and for the Western European accents, the Latin-1 byte value is the Unicode code point.
- So o with an umlaut (ö) is F6; u with an umlaut (ü) is FC. And so we get these accented characters being stored this way.
- But code points and characters don't necessarily have a one-to-one relationship. If we have "Mötley Crüe", we can encode the ö as F6, the character "o with diaeresis". But we can also store that same character as two code points: 6F, which is o, and 308, which is "combining diaeresis".
- What that does: when a renderer sees that sequence, it sees the o, and then it says, OK, combining diaeresis — and it puts the diaeresis on top of the o.
- There are a lot of combining characters that can add accents, and they can stack on top of each other: you can stick, say, a diaeresis and a tilde and a hat on, or whatever. If you sometimes see people writing text that has stacked accents coming off the tops and bottoms of the letters, that's supposed to look like Lovecraftian horror, they're doing it with these combining characters — just stacking more and more combining characters on top of each other.
- Combining characters can also, in some cases, modify the shape of a character, and they can assemble complex emoji. There are some other use cases as well.
- But code points can have these relationships: a code point can modify code points that come before it (or, on rare occasions, after it).
- So when you ask a string for its length, that's going to give you the number of code points, which is not the same as the number of characters. If we had "Mötley Crüe" stored the decomposed way and you called `len` on it, you're going to get 13 instead of 11, even though it's only going to print eleven characters — you have 13 code points.
- This does let you put accents on characters that don't commonly have them. For example, the diaeresis is usually only applied to vowels. So if you want Spın̈al Tap, you need to use n, which is 6E, and you need 308, the combining diaeresis. There are a few human languages that do use this character, but a precomposed version has not been added to Unicode — and they don't need one, because you can make it out of the raw n and the combining character.
- One complex use for these is complex emoji. Emoji can have a number of modifiers and layers on top of them.
- If you have the "woman running: dark skin tone" emoji, the skin tone is actually applied as a Unicode combining modifier. That emoji is actually made up of five code points in sequence: the runner, the modifier for dark skin tone, a joiner character, the character for female, and then another character which means "display as emoji rather than text". It puts all of those together to get a woman running with dark skin tone.
- The politics of this — in terms of what's the default, and what's the default for different emoji — are interesting and worth contemplating.
- But this idea of being able to combine code points allows you to build up arbitrarily modified characters. Many of the more complex emoji are defined by these combinations, rather than defining a separate code point for every possible configuration of a particular emoji.
- So the key point here is that code points are not equal to characters: a character can require multiple code points.
- This becomes relevant when you, say, compute the string length. If you want to line strings up, there will be edge cases where it's wrong. The only way to truly tell how wide a string is going to be in display is to ask a rendering engine to render it in a font and measure how wide it is.
- Also, a character can be encoded multiple ways: as we saw, the o with diaeresis can be either F6, or 6F followed by 308. What this means is that simple string equality will sometimes fail — you can have equivalent strings that are not equal. We're going to see in a little bit normalization techniques that let you deal with that.
- There's no semantic difference between the two ways of encoding the o with diaeresis, but when you compare the strings code point by code point, they're not going to be equal.
- You also have some things such as Roman numerals. There's a code point for Roman numeral one, and it's visually indistinguishable from the Latin capital letter I. It's in there for compatibility reasons: we had all of these different encodings, encoding text in different ways, before everything was unified in Unicode. GB 2312 had characters for Roman numerals, and so to translate it into Unicode, they had to be able to map those into code points.
- This kind of code point is called a compatibility code point, and it's not distinguishable from the Latin letter I, except that it's a different code point.
- So there's a Unicode database that tracks many properties of code points. It identifies which ones are these compatibility code points, and what they're equivalent to. It tracks the type: it marks, for each code point, whether it's a letter, a number, punctuation, a combiner, upper or lower case. There are a lot of different properties in the Unicode database. The code points themselves are also divided into blocks, like the Latin-1 Supplement.
- The Python `unicodedata` module is a starting point for getting access to this database and making inquiries about different characters.
- Another thing we can do — and the `unicodedata` module gives you a function for this — is normalize strings. Unicode defines some normal forms.
- NFD, "normal form decomposed", always uses the combining characters. So if you have an F6, it's going to translate that into a 6F and a 308.
- NFC, "normal form composed", uses single code points whenever possible, so it goes the other way around: if it sees a 6F followed by a 308, it's going to turn it into an F6.
- Then there are also the NFK variants — NFKD and NFKC — which do the same thing, but also normalize compatibility characters, so that Roman numeral one is going to turn into a Latin capital letter I.
- Once you've normalized strings, it's a lot more likely that two semantically equivalent strings are also going to be equal, because you don't have the issue that they're different just because the ö was encoded differently. Once they're normalized, you can compare them more directly.
- If you're using Unicode for file names, you can look up the file correctly, even if the name was typed in, or was initially encoded in a different encoding.
- `unicodedata` provides a `normalize` method, so I can normalize "Mötley Crüe" to NFD and it will normalize it to the decomposed normal form. Pandas has normalization with `.str.normalize`.
- Normalization is super useful, as I said, in order to make sure that "Mötley" is only going to be encoded one way, even if you've got files coming from different systems that might have defaulted differently as to whether they use composed or decomposed characters.
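Here is a minimal sketch of these normal forms in action (the example strings are mine):

```python
import unicodedata

composed = "M\u00f6tley Cr\u00fce"        # ö and ü as single code points
decomposed = "Mo\u0308tley Cru\u0308e"    # o/u plus U+0308 combining diaeresis

print(composed == decomposed)             # False — different code points
print(len(composed), len(decomposed))     # 11 13 — same printed characters

# Normalizing both to the same form makes them directly comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# NFKC also folds compatibility characters: Ⅰ (U+2160) becomes I.
print(unicodedata.normalize("NFKC", "\u2160"))                # I
```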
- There are some other subtleties to Unicode as well. The `lower` method converts a string to lower case, but there can still be distinctions that don't make a semantic difference. There's another function, `casefold`, that converts to lower case and goes farther to eliminate case distinctions.
- For example, the German letter ß (sharp s) will become two s's. It's equivalent, but `lower` is not going to convert it, because it's already a lowercase letter. `casefold` will convert it down to the two s's, so that there are absolutely no case distinctions left.
- There are also other rules that are locale-specific. For example, if you want to sort strings, and you want to do it correctly, the rules for doing that depend on the language, not just the characters.
- So your computer has something called a locale. You can change locales for a process; each process has a locale that it's running in. The locale defines a number of things — it can define your date formats and things like that — but it also defines sort orders.
- The Python `locale` module gives you the `strcoll` function, which allows you to compare two strings based on their locale-specific sort order. It also gives you a `strxfrm` function which, given a string, will convert it into another string such that doing a normal lexicographic, code-point-by-code-point comparison on the resulting strings will sort them in the correct order.
- So if you're going to sort a lot, it can be useful to, say, make another column of your data frame that has the `strxfrm` versions of your strings, and then use that as your sort key.
- One thing to be aware of, though, when you're doing that: your program is going to have different results based on the locale in which it's running. So, for reproducibility, you might need to specify a particular locale to run your program in.
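A minimal sketch of locale-aware sorting (the locale name and word list are mine; which locales are installed varies by system):

```python
import locale

# Assumes a German locale is installed; this raises locale.Error otherwise.
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")

words = ["Zebra", "Äpfel", "Apfel"]

# Naive code-point sorting puts Ä after Z (its code point is larger).
print(sorted(words))                        # ['Apfel', 'Zebra', 'Äpfel']

# Locale-aware sorting treats Ä like A.
print(sorted(words, key=locale.strxfrm))    # ['Apfel', 'Äpfel', 'Zebra']
```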
- So, as I said earlier, there are some universal encodings that can encode all Unicode characters. UTF-8 is probably the most common, particularly for storing files on disk. If you have a choice, store your files in UTF-8.
- Code points take between one and four bytes each. ASCII is valid UTF-8: for code points in the range 0 through 127, the code point number is the same as the ASCII number, and they also get stored in one byte in UTF-8. So ASCII just is, immediately, UTF-8. Anything outside the ASCII range gets encoded in two to four bytes.
- UTF-16 uses two bytes per basic character: any character inside what's called the Basic Multilingual Plane — which covers most currently spoken human languages, but doesn't include emoji or mathematical symbols — will use two bytes. Characters outside of it will use four bytes.
- A note on UTF-16: it requires you to know the byte order, because some computing systems are what's called big-endian, where the higher-order byte comes before the lower-order byte in memory, and others are little-endian, where it's the other way around. (Intel processors work in little-endian.) So there's a two-byte sequence that goes at the beginning of the file to mark its byte order.
- UTF-8 is a great default in general, and it's a particularly great default for English and for European languages, because a lot of them will take one to two bytes per character. But if you're storing text that's written in Chinese or Japanese, most of those characters are going to require three bytes in UTF-8 but only two bytes in UTF-16, so UTF-16 can be a more compact format for storing text in those languages.
- Then there's UCS-4, which stores four bytes per character. It also requires a byte order. UCS-4 is just a sequence of four-byte values — an array of int32s or uint32s — and each one stores the code point directly. There is no special encoding.
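To make the size trade-offs concrete, here is a small sketch (example strings mine) comparing encoded lengths; the `-le` suffix fixes the byte order so no byte-order mark is added:

```python
s_en = "hello"
s_zh = "你好"        # two Chinese characters
s_emoji = "🏃"       # outside the Basic Multilingual Plane

for s in (s_en, s_zh, s_emoji):
    print(len(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")))
# hello: 5 code points, 5 bytes in UTF-8, 10 in UTF-16
# 你好:   2 code points, 6 bytes in UTF-8,  4 in UTF-16
# 🏃:     1 code point,  4 bytes in UTF-8,  4 in UTF-16 (surrogate pair)
```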
- Internally, what Python does is store strings in the most compact form it can, from the following list.
- If the string is valid Latin-1 — that is, if its code points are all in the range 0 to 255 — it will store it as Latin-1, with one byte per character.
- If they aren't, but all of the characters in the string are in the Basic Multilingual Plane, then it's stored in UCS-2, which is two bytes per character but can only store text in the BMP.
- And if it can't fit in UCS-2, then Python stores it in UCS-4, so it's an array of int32s.
- This is transparent to you as a Python programmer: when you're writing Python, you do not see the difference in how the string is stored, unless you go get the size of the object. `len` is always going to return the number of code points. Python is just storing the string in the most compact form it can. If you're writing a C extension module, then you'll see the difference between the different storage formats of a string.
- Many other systems use UTF-16. Java, JavaScript, and the Windows and Mac APIs natively use UTF-16, which for legacy reasons is often treated as UCS-2.
- UCS-2 was used in the 90s, when there were fewer than 65,536 code points in Unicode and you could store any of them in just two bytes.
- For characters in the Basic Multilingual Plane, there's no difference between UTF-16 and UCS-2; UTF-16 extends UCS-2 with the ability to say "this character uses an extra pair of bytes", in order to encode the additional characters.
- So if you ask Java or JavaScript to give you the length of a string, it's actually going to give you the number of two-byte pairs — UTF-16 code units — which will be the number of code points if everything's in the Basic Multilingual Plane. But as soon as you've got emoji or mathematical symbols in your string, it's going to be wrong.
- Also, the `charAt` function that gives you an individual code point is going to give you the UTF-16 code unit; if you need the actual code point for a character that's in the upper parts of Unicode,
- you're going to need another function.
- All this said, when you get some text, there are two first steps for processing it.
- The first is to decode the text: convert the bytes into text that you can then process as Unicode.
- Then we may want to normalize the text to put it into a consistent form. A Unicode normal form is often useful. We might want to strip accents — this is going to lose some meaning, but it can reduce distinctions between words in ways that can occasionally be useful for processing. We might want to case-fold it, so that our text is all lower case and we don't have to deal with mixed-case text, if we don't want to make distinctions between cases.
- Some points that you need to remember when processing text (I've also linked to a reading that provides some additional material):
- You always need to know what your encoding is when you're getting text from the outside world, or when you're writing text back out. A modern Mac or Linux system is going to default to UTF-8; a Windows system still usually defaults to code page 1252. You need to decode bytes in order to get a text string, which is a sequence of code points.
- But even once you have your code points, they don't necessarily map one-to-one to characters. The length of a string does not give you the printed number of characters, because you can have combiners that make multiple code points display as one character, the zero-width joiner doesn't show up, etc.
- You can use normalization to improve the comparability of strings, so that, say, you always use combiners, or you never use combiners where precomposed characters exist.
- Case folding is a tool that allows you to make strings comparable case-insensitively, by removing distinctions between upper and lower case.
- You have to pay attention to locale as well: if you want to sort things correctly for your current human language, you need to do so in a way that's based on the specific locale, because different ones have different sorting rules.
- If you have a single-byte encoding of text and it's not UTF-8 (not all byte sequences are valid UTF-8), then Latin-1 is a common guess, and decoding as Latin-1 will always succeed. If it's wrong, some of your accents and other non-English characters will be incorrect — but if you're just trying to get the data read in so you can start looking at it, Latin-1 is common. You can also try different encodings and pick the one that gets you meaningful text in your language, but unfortunately there's not a foolproof way to detect the encoding used for a piece of text.
- So, to wrap up: a string is not a simple data type. Text is encoded both for on-disk storage and for in-memory processing. You've got multiple layers: we have characters; characters are represented with code points; code points are encoded into bytes. There aren't the one-to-one relationships that we might expect going up and down the stack, but this enables us to represent a wide array of written human language in a very flexible fashion.
🎥 The Text Processing Pipeline#
This video discusses the basic steps of text processing, beginning with tokenization. The result is a document/term matrix, possibly normalized.
- In this video, we're going to talk about more of the text processing pipeline: how do we actually get from text that we've decoded to features and vectors that we can use to do computation?
- The learning outcomes for this video are for you to understand the different stages of text processing, to be able to extract a vector representation from text, and to know some of the things that are possible beyond what we will discuss in class.
- So we've got our input, the sentence that we saw in the previous video: "Someone must have slandered Josef K." And as output we want to get some features from it as vectors, and maybe scores or decisions for some process.
- First, before we get into how to do that, I want to define a few terms. A corpus is a set of textual data to analyze, and a document is a single document within the corpus — a corpus is made of documents. In the spam filter example I'll give you, our corpus is a bunch of emails, and each email is a document. A term is a single word within a document, and we can think of the set of all terms as the vocabulary of the corpus.
- So we first decode and normalize the text, as we talked about in the encoding and decoding video. Then we tokenize it, or split it into individual words. We may do some filtering on those tokens, and we might further process them into some kind of normal form. Then we convert the output of those steps — a sequence of tokens — into a vector, which we may also further normalize, and which we can use for various types of processing.
- So that's the pipeline we're going to be talking through. The previous video covered decoding and normalization. Sometimes we might delay some of the normalization until after tokenization, but often we'll do it first, so that we've got the text decoded into code points, we've put them in a normal form, we've maybe lowercased or case-folded everything, and we might even have stripped accents in that process as well. The output of that is normalized text that's ready to be tokenized.
- In tokenization, we want to split text into individual tokens, and these tokens often are words. So we want to grab the words: someone, must, have, slandered, Josef, K. Scikit-learn's default tokenizer treats a sequence of alphanumeric characters as a token — it actually requires there to be at least two, so it will not extract "K" as a token.
- So we split this text into individual tokens, and we're still at a sequence of tokens, but this becomes the basis for basically any further analysis.
- The next thing we might do is remove stop words. Common utility words like "and" or "or" are often not useful, depending on the task.
- If you're doing an information retrieval task — you're searching for text — finding documents that include the word "and" is probably not useful, because almost every document is going to include the word "and" or "or". Or if you want to do document classification, the vast majority of your documents are going to have these words, so they're not very useful for distinguishing between one class of documents and another.
- We can remove them either with a list of stop words (scikit-learn includes one) or by using a frequency-based analysis, where we remove the words that appear the most often.
- Whether or not you do this depends on the task, and it also depends on the other analysis you're going to do. If you're going to be doing an analysis that recovers sentence structure, then you need these words. But if you're doing information retrieval, you may well not need them; and if you're doing classification, they're often not useful for the classification task, and so they're worth removing.
- Then, after we've tokenized and maybe removed our stop words, we can do something called stemming or lemmatization — again, whether or not you do this depends on the task.
- The idea here is that words have multiple forms in English: change, changes, changed, changing. These are different forms of the same word; they're different tenses of the same verb.
- For some purposes, especially search purposes — a lot of this pipeline has its origins in search, though it's used for a lot of other purposes as well — there's no difference: if you want to search for "change" in the corpus, you probably want to find documents that say "changed" or "changing".
- There are two ways of doing this. Stemming is a relatively simple technique that transforms words into common stems. The result is not necessarily a word: a stemming algorithm may well transform all of these forms of the word "change" into "chang". But what stemming does give you is that different forms of the same word go to the same stem, so if you stem, all of the different forms result in identical tokens. For English, Snowball is a common stemming algorithm. Scikit-learn does not provide any stemming or lemmatization; to do that, you're going to need to go to another package.
- Lemmatization accomplishes a similar purpose, but it transforms words into what's called the lemma form, which is an actual word — the root form of the word. So these will all most likely become "change" in your lemmatizer. Lemmatization is more computationally intensive, but the upside is that it gives you actual words rather than stems.
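As a quick illustration, here is a sketch of stemming and lemmatization using NLTK, one package that provides them (the word list is mine, and the WordNet lemmatizer needs its data downloaded once first):

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
# One-time setup: import nltk; nltk.download("wordnet")

words = ["change", "changes", "changed", "changing"]

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in words])
# ['chang', 'chang', 'chang', 'chang'] — identical stems, not real words

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# ['change', 'change', 'change', 'change'] — actual words
```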
- Once we've got our tokens — we've dropped the ones we don't want to use, and we've stemmed or lemmatized if that's appropriate for our task — we then want to vectorize. We want to create what's called a term vector, or term frequency vector. This vector has one dimension for every term in the entire vocabulary, and its values are the number of times each term appears in the document.
- So in our example text, "Someone must have slandered Josef K.", we're going to have a one for "someone", a one for "must", a one for "have", a one for "slandered", a one for "josef".
- The vector is also sparse, because each document is probably only going to use a fraction of the terms, and so we'll use a sparse representation that doesn't actually store all of the zeroes. But mathematically, this vector has an entry for every term, and if a term never appears in the document, it gets a zero.
- Scikit-learn provides a class called `CountVectorizer` that gets us to this stage. It tokenizes the text; it can do transformations, like lowercasing; and it can decode into Unicode if necessary. The `fit` method learns the vocabulary, and `transform` turns an array of texts — a one-dimensional array where each element is a text string — into a sparse matrix.
- This is called a document–term matrix: the rows are documents, and the columns are terms. It's not recording the zeroes — the empty entries are zeros — and, say, in document one, term one appears five times, term three appears two times, and term four appears three times.
- So you get this sparse matrix, where each document is a row and each column is a word. Basically, now our words are features, and their values are how frequently the words appear.
- So, in our pipeline, here's what `CountVectorizer` can do for you: it decodes and lowercases (if you tell it to strip accents, it's also going to normalize your Unicode); it then tokenizes your text — its default tokenization strategy treats any sequence of two or more alphanumeric characters as a word; and once it's tokenized, it vectorizes, and you get back these values that are the term frequencies: how often did each word appear in each document?
- This is what `CountVectorizer` gives you. It's super useful; you're going to use it as the basis for some of your processing in the assignment.
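A minimal sketch of CountVectorizer on a toy corpus (the documents are my own examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Someone must have slandered Josef K.",
    "Josef K. was arrested one morning.",
]

vec = CountVectorizer()            # lowercases and tokenizes by default
X = vec.fit_transform(corpus)      # sparse document-term matrix

print(vec.get_feature_names_out())
# ['arrested' 'have' 'josef' 'morning' 'must' 'one' 'slandered' 'someone' 'was']
print(X.toarray())                 # note 'K' is dropped: tokens need 2+ characters
# [[0 1 1 0 1 0 1 1 0]
#  [1 0 1 1 0 1 0 0 1]]
```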
- Now, for some purposes, there are a few problems with using raw term frequencies.
- The first is that all words are equally important. If the word "have" appears ten times and the word "hippopotamus" appears ten times, it makes no distinction — even though, for a lot of purposes, the fact that a document says "hippopotamus", a far less common word, is probably more meaningful for distinguishing it from other documents than the fact that it has "have" in it, since almost every document has "have".
- There are two pieces to the solution. One is that we weight words differently: we're going to give uncommon words — words that don't appear in very many documents — higher weight. This winds up being broadly useful, because if you're trying to retrieve a document, the words that are more unique to that document are probably more useful for trying to retrieve it. If we're trying to classify a document, the words that are more unique are probably also going to be more useful for distinguishing one class of documents from another.
- As another example: if you have a collection of scientific reports, and you're trying to distinguish between zoology and botany, the word "have" is probably not useful, because both classes of documents probably have it, whereas "hippopotamus" is probably much more common in zoology documents than in botany documents.
- The other thing that we're going to do is normalize these vectors, so that, rather than just using absolute counts — where a longer document simply has higher values everywhere — every document's vector has the same magnitude. This means long documents don't have substantially higher values than short documents just by virtue of their length.
- That makes it easier to compare documents. The first part — giving more weight to less common words — is done through something called term frequency–inverse document frequency (TF-IDF).
- The idea here is that we compute a feature value by multiplying the term frequency tf(t, d) by the inverse document frequency idf(t).
- The way scikit-learn computes the inverse document frequency: if N is the number of documents, and df(t) is the number of documents that contain the term t, we take the log of the number of documents divided by the number of documents with the term.
- It's called the *inverse* document frequency because, rather than taking df(t) / N — the fraction of documents that have the term — we take the inverse of that fraction, N / df(t), and take its log: idf(t) = log(N / df(t)).
- Scikit-learn adds ones to both the numerator and the denominator in order to prevent problems. If you've got a term that doesn't appear in any of the documents — CountVectorizer isn't going to give you one of those terms, because a word won't be in the vocabulary if it's not in any document, but in some other cases we have a vocabulary that comes from somewhere else — adding a one makes sure you never divide by zero and never try to take the log of zero.
- Another common variant is to take log(N / (1 + df(t))), so that you just have the one-plus in the denominator and not in the numerator. That little change is not going to make a significant difference in terms of performance; it's a subtle difference in how it's implemented, but not a significant computational difference.
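For reference, the smoothed formula that scikit-learn's TfidfVectorizer uses by default (its smooth_idf option) is the following; the +1 outside the log keeps terms that occur in every document from being zeroed out entirely:

$$
\mathrm{idf}(t) = \ln\!\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1
$$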
- Think about what you have to do to make the TF-IDF value go up. A term can appear more frequently in the document: that increases the tf, and the value goes up. You can also increase the TF-IDF value by having the term appear in fewer documents.
- The fewer documents the term appears in, the larger N / df(t) is: if it doesn't appear in very many documents, it's going to be a large number — say a million over 50, or a million over five hundred — and the log of that is going to be high. So you can have a much higher TF-IDF value if the term does not appear in very many documents.
- That reflects the fact that if a term is more unique to this document, it's more useful for distinguishing this document from other documents.
- That's the key insight here: less common words are more useful for describing what makes this document different from other documents.
- If two documents share the same uncommon words, that's really useful for saying they're probably similar. If they share the same words, but those words are super common, the documents are probably not very similar.
- We then normalize our TF-IDF vector to be what's called a unit vector, which means we just divide the vector by its L2 norm.
- If you use the `TfidfVectorizer` instead of the `CountVectorizer` in scikit-learn, it's going to do everything `CountVectorizer` does, and then it's going to apply TF-IDF and normalize to unit vectors by default.
- You can change some of that — for example, have it not do the L2 normalization — but if you use the `TfidfVectorizer`, it's going to give you unit vectors that apply the TF-IDF adjustment to down-weight common terms: vectors that are useful for characterizing documents in terms of how frequently they use relatively uncommon words.
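A sketch of swapping in TfidfVectorizer, on the same toy corpus as the earlier example:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Someone must have slandered Josef K.",
    "Josef K. was arrested one morning.",
]

# Counts, then TF-IDF weighting, then L2 normalization, in one step.
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)

# Every document's row is a unit vector by default.
print(np.linalg.norm(X.toarray(), axis=1))   # [1. 1.]
```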
- So now we have our whole text processing pipeline for getting to a term vector. We decode and normalize. We tokenize. We remove stop words, and we may do some other token filtering as well, like throwing out numbers. You can stem or lemmatize. Both the stop word removal and the stemming or lemmatization are optional: you don't have to do them, but you can. This is the template for the pipeline; different applications and different tasks will use different steps of it.
- Then we vectorize the text, and we normalize the resulting vectors, and we get these nice vector representations of our documents. The result is that each document gets a vector that assigns scores to terms for their relevance to that document.
- This is a very high-dimensional vector — ten thousand, a hundred thousand dimensions — so if you stick it into a linear model, that's a large number of coefficients. But it gives you a vector representation of the text that we can then use for additional tasks: processing, classifying, clustering, etc.
- A few alternatives to some of the design decisions in this pipeline. First, we don't have to tokenize into single words. We can tokenize into what are called n-grams, such as bigrams or trigrams: a bigram is a pair of words that appear next to each other. You can also do character-level bigrams, so pairs of characters. You can also do what's called a skip-gram: pairs of words that appear in proximity, but where there might be another word between them.
- You can also do sentence-level analysis, rather than just splitting into words and putting the words into a vector. (That's called "bag of words", by the way — it's called a bag-of-words model because we just take the document and throw all the words into a bag; we don't have the relationships between them anymore.) We can split the text not only into words but into sentences; we can tag words by part of speech; we can analyze relationships between words. Various natural language processing toolkits give you tools for doing that; StanfordNLP (now Stanza) is one that gives you the ability to do those word relationships.
- Currently there's a lot of work in NLP — natural language processing, the field that studies all of this. If you want to learn a lot more about this processing pipeline and the things you can do with text, the classes you want to take are NLP and IR (information retrieval).
- Transformer architectures are deep neural networks that can deal with sequences of words. They have significant power for modeling how language works and what's going on in text, but that comes with significant processing cost.
- The Python package NLTK gives you a lot of tools. It doesn't give you the full sentence analyzer, but it gives you a lot more sophisticated tools than scikit-learn — such as part-of-speech tagging, stemming, and lemmatization. You can also integrate it with scikit-learn: you can replace the scikit-learn `CountVectorizer` or `TfidfVectorizer` with a vectorizer that uses NLTK to do the analysis, if you need more sophisticated text processing as part of your application.
- So to conclude: term vectors treat each word as a feature for which documents have scores, and we often normalize those scores so that uncommon words have higher weight and longer documents don't just have larger vector values. We then use these vectors downstream in order to do a variety of computations.
Resources#
Stanza (formerly StanfordNLP)
🎥 Vectors and Similarity#
This video describes the concept of a vector representation, and how to compute the similarity between two documents.
- So now that we have some term vectors, what can we do with them? The learning outcomes for this video are for you to understand the value of vector representations of instances, and to compute similarities between vectors.
- We've talked about vectors from time to time throughout this course, and you may have seen them in a linear algebra or statistics class, but a vector is just an array of numbers: a vector x with values x1, x2, ..., xn, in which case we say that it has n dimensions. This can be the row or column of a matrix.
- The "length" of a vector is usually not its dimensionality; the length is usually the L2 norm, the square root of the sum of the squares of the vector's values.
- An instance's features form an array of values — a vector. The row of a data frame is a vector, and we can treat it as one; we can treat a column as a vector too.
- But with the features you've been seeing so far, unless we standardize, the features often have different scales and different meanings, and it's not necessarily meaningful to relate them to each other. One of the things that's a little different about the vectors we get out of text is that they're more homogeneous: each dimension corresponds to a term, but they all mean the same kind of thing — the number of times that term appears, or the relevance of that term to the document.
- Both heterogeneous and homogeneous vectors can be thought of as giving data points a position in some kind of a vector space.
- So suppose we've got a two-dimensional vector space with vectors A and B. We can think of each of them either as a point in space out at the end of the arrow, or as a line from the origin to that point. Say A is at (2, 3) and B is at (4, 1), using the standard (x, y) notation.
- We can compute a variety of things. We can compute C = A − B, which is a vector that would go from B to A. We can compute the distance between them, which is the length of that vector — the Euclidean length.
- We call it the Euclidean length because it's the length of the arrow: a multidimensional generalization of the Pythagorean theorem. If you want the length of an arrow like B, you compute the square root of the sum of the squares of its two sides, which are four and one: square them, sum them, take the square root, and you get √17, the length of that vector — the length of that arrow.
- You can also compute what's called an inner product. Take two vectors with the same dimensionality — they're both n-dimensional vectors. If you go across the dimensions, multiply the two vectors' values for each dimension, and sum all of that up, you get what's called the dot product.
- For these particular vectors, it's going to be two times four plus three times one, which is eleven: the inner product of these two vectors is 11.
- You can also think about the angle between them — they have an angle θ. Often what we do with that angle is compute its cosine; if we want the angle itself, we can compute the arccosine of that cosine.
- The cosine between two vectors is their inner product divided by the product of their lengths. Here, that would be eleven divided by √17 times the length of A, which is the square root of four plus nine, √13.
- This angle is a common measure of object similarity, and we can do this with any kind of vector — for any data points with vector representations, you can compute the cosine similarity between them. It just happens to be especially useful for these kinds of homogeneous vectors, like text vectors.
- It looks at how related two vectors are, and it has a useful relationship with correlation: if the two vectors have a mean of zero, then it's equivalent to the Pearson correlation between them.
- It's very common, though, to do this with TF-IDF vectors. A TF-IDF vector is always going to have positive values — the vectors are all in that upper-right quadrant — but the cosine gives you a very common similarity measure between documents. You can also take the distance between two documents in the vector space as the relationship between them.
- A couple of useful tricks. First, if the two vectors are unit vectors, then the inner product is the cosine, because both of the lengths are one — the denominator is one times one — so you can just compute an inner product. This is one reason that normalizing TF-IDF vectors to unit vectors is useful: oftentimes, if we're going to compute cosine similarities, we normalize to unit vectors first, and then we get very efficient cosine similarities, because we just have to compute inner products between vectors.
- This is also a reason why it's sometimes useful to normalize a set of feature values for an instance to a unit vector — the unit vector normalization that we talked about in the video on transformations and normalization. Unit vector normalization is almost never something we want to do to a column in one of our data frames; we only want to do it within a row, to the feature values of a particular instance — either all of them or some subset of them.
- Second, if we have a matrix M whose rows are all unit vectors, then we can use a matrix multiplication to compute the pairwise similarities between all rows of M in a very optimized fashion. We've been talking about how to lean hard into vectorized computations, and being able to use a matrix–matrix multiply to compute a large number of similarities in one go is a very good example of that kind of thinking.
- SciPy provides sparse matrix representations that are super useful for this: if you use `CountVectorizer` or `TfidfVectorizer`, you're going to get a SciPy sparse matrix out, which doesn't spend space storing all the zeroes.
- So to conclude: vector representations allow us to do a lot of useful computations, and one of those is that we can compute the similarity between two documents by taking the cosine between their vectors — their TF-IDF vectors. This gives us a powerful tool: we can compute the similarity, distance, or difference between documents by using their bag-of-words (possibly normalized) term frequency vectors.
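A minimal sketch of these computations, using the vectors A and B from the example above:

```python
import numpy as np

A = np.array([2.0, 3.0])
B = np.array([4.0, 1.0])

print(np.linalg.norm(B))       # 4.1231... = sqrt(17), the Euclidean length
print(A @ B)                   # 11.0, the inner (dot) product

# Cosine similarity: inner product over the product of the lengths.
cos = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos)                     # 0.7399...

# With unit vectors, the inner product *is* the cosine.
a = A / np.linalg.norm(A)
b = B / np.linalg.norm(B)
print(a @ b)                   # same 0.7399...

# Rows as unit vectors: M @ M.T yields all pairwise cosine similarities.
M = np.stack([a, b])
print(M @ M.T)
```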
🎥 Classifying Text#
This video introduces classifying text, and the use of a naïve Bayes classifier based on term frequencies.
- In this video, we're going to talk about how to classify text. The learning outcomes are to be able to use vector space representations to classify text, and to classify into more than two classes.
- There are a lot of reasons you might want to classify text. You might want to distinguish between spam and not spam. You might want to identify news categories — you're going to be doing that in the assignment. You might want to separate news from opinion, or identify fraud, or any of a lot of other reasons to classify texts into different kinds.
- We're going to be using term vectors as our input for this process. There's been work lately on using more sophisticated representations, but we're going to start with our term vectors.
- A really simple way to do it is to use what's called a k-nearest-neighbors classifier. What it does is find the k data points that are closest to a given point in the vector space.
- So suppose we've got some vector space with some data points of one class, and some data points of another class, and a new point comes in that we want to classify. What the classifier is going to do is find the closest data points — maybe these three — and take a little vote. Inside this neighborhood, say two of them are purple and one of them is red, so the vote is going to be purple: k is three, and of the three nearest neighbors, two were purple and one was red, so it classifies our new data point as purple.
- It uses some metric; often this metric is the Euclidean distance between the two vectors. It also easily extends to more than two classes, because you can have three, four, or five classes and just see which class is the most common in the neighborhood.
- This works fine with TF-IDF vectors: you can use the `TfidfVectorizer` and the scikit-learn `KNeighborsClassifier`, put those two things in a pipeline, and you get a working classifier — one that can work well on a variety of tasks, using the term vectors extracted from your text in order to do classification.
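A sketch of that pipeline (the toy training data is mine):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["win cash now", "cheap pills online", "meeting at noon",
               "lunch tomorrow?", "free money win", "project status update"]
train_labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# TF-IDF vectors feeding a 3-nearest-neighbors vote.
knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(train_texts, train_labels)

print(knn.predict(["win free cash", "status meeting tomorrow"]))
# expected: ['spam' 'ham']
```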
- Another way is to use what's called a naive Bayes classifier. Here the goal is to estimate a probability — in this case, the probability that a message is spam given the document's text.
- We can use Bayes' theorem: if we have the probability of a particular document's text given spam — the probability that, if we set out to write a spam, these words would appear in it — and the probability of writing a spam at all, then, dividing by a normalizing constant, we can compute the probability of spam given a document.
- There are a couple of things that go into this. First, we have data: we have a bunch of labeled documents from which we can learn how terms relate to spam. And second, we make what's called the naive Bayes assumption, which assumes that terms are independent of each other.
- This assumption is false, but it is very useful, because it gives us a computationally simple way to do the classification that also happens to work remarkably well.
- The naive Bayes assumption says that if I'm going to write a spam, and in my spam I say "apple", the frequency of any other word occurring is independent of that: the fact that I said "apple" in my spam has no influence on the probability that I'm also going to say "cash" in my spam.
- And since the terms are independent, we can take the product, over the terms in the document, of the probability of each term given spam.
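In symbols (notation mine): Bayes' theorem, plus the independence assumption over the terms $t$ of a document $d$, gives

$$
P(\text{spam} \mid d) = \frac{P(d \mid \text{spam})\, P(\text{spam})}{P(d)},
\qquad
P(d \mid \text{spam}) \approx \prod_{t \in d} P(t \mid \text{spam})
$$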
- So what we do is: for spam, we have a probability distribution over terms, and for not-spam, we have another probability distribution over terms. We can learn those probability distributions from labeled data, just by counting how common each term is in spams and in non-spams. Our emails are labeled as spam or not; we count how often each term appears in each class, and we use that to derive the probability of using that term if I'm going to write a spam, or if I'm going to write a non-spam email.
- So we do that, and we learn our prior. We can either set the prior to be even, or we can learn it from the data — this is sometimes called empirical Bayes. Scikit-learn learns it from the data by default: if 70 percent of our emails are spam, then the prior probability of spam will be 0.7 and the probability of not-spam will be 0.3. It does that automatically.
- Then, given a new document, we compute this probability of spam, and we also compute the probability of not-spam. We use our prior, we use the term probabilities from the data, and we can compute the normalizing constant by computing these two probabilities.
- This works well on raw counts. You don't want to do TF-IDF before this — you can, and it mostly works, but to really compute the probabilities correctly, it should work on your plain term frequency vectors: how often each word appears. It's going to naturally learn which words are more useful for distinguishing between spam and not spam.
- Scikit-learn implements this in the `MultinomialNB` classifier class.
- So, to wrap up: both kNN and naive Bayes can predict multiple classes from vector data, such as term frequency vectors from text. If you want to compute multi-class accuracy, it's the same as binary accuracy: it's just the fraction of the time the classifier was correct.
- The spam filter example notebook demonstrates naive Bayes with the `CountVectorizer` that we've been talking about.
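A sketch of naive Bayes on the same toy data as the kNN example, using counts rather than TF-IDF:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win cash now", "cheap pills online", "meeting at noon",
               "lunch tomorrow?", "free money win", "project status update"]
train_labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# Raw term counts feeding a multinomial naive Bayes model.
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(train_texts, train_labels)

print(nb.predict(["free cash"]))          # expected: ['spam']
print(nb.predict_proba(["free cash"]))    # P(ham), P(spam) for the message
```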
📓 Spam Filter Example#
The Spam Filter Example demonstrates tokenization and classification with text.
📩 Assignment 6#
Assignment 6 is available and is due November 20.