Week 13 — Unsupervised (11/14–18)#

In this week, we are going to talk more about unsupervised learning — learning without labels. We are not going to have time to investigate these techniques very deeply, but I want you to know about them, and you are experimenting with them in Assignment 6.

This week’s content is lighter, since we just had a large assignment and a midterm, and another assignment is due on Sunday.

🧐 Content Overview#

Element	Length
🎥 No Supervision	2m51s
🎥 Decomposing Matrices	17m22s
🎥 Clustering	6m56s
🎥 Vectors and Spaces	7m27s
🎥 Information and Entropy	10m31s

This week has 0h45m of video and 0 words of assigned readings. This week’s videos are available in a Panopto folder.

📅 Deadlines#

Quiz 13, November 17
Assignment 6, November 20

🎥 No Supervision#

In this video, we review the idea of supervised learning and contrast it with unsupervised learning.

Video (2m51s)

Slides

But in this video, I'm going to introduce you to the idea of unsupervised learning.
This week we're going to learn about the difference between supervised and unsupervised learning.
You're going to learn how to project data into lower dimensional spaces with matrix factorization and to cluster data points.
So so far, we've been focusing on learning what we're trying to predict a label.
We have a categorical label where we're trying to predict. And this is classification we're trying to classify as spam, not spam fraud.
There's a couple of the examples we've been using.
We can have a continuous label we're trying to predict, in which case we call it, in which case it's three regression.
But this is called we can also try to predict ordinal variables, et cetera.
But this is all called supervised learning. Where we have the key idea here is we have a ground truth for the outcome.
We have observed outcomes for our training data. This is sometimes called a supervisions signal.
And we're trying to learn to predict these known outcomes.
That's that's the heart of what it means to do supervised learning. But.
We can do things without having a supervision signal.
And some of the things we can do without access to a supervision signals, we can try to group instances together what's called clustering,
where we try to find related groups and clustering and multiclass classification are related.
Because if you've got multiple class labels, then you're trying to divide them into that.
Clustering is where you're trying to buy the Met, but you don't have the class labels.
You can try to learn vector spaces for items in order to say,
learn the relationship between items in some cases, also to learn the relationships between features of items.
There's also a middle ground called self supervised learning where you don't have labels in the sense that we use them in supervised learning,
but you extract something that looks like a label from the data and use that as a supervisions signal.
Word and beddings are one example of of self supervised learning.
So why do we want to do this unsupervised learning? There's a few reasons.
One is that it can be useful as a data exploration tool.
If you can find clusters in the data, then that can help guide where you your investigation to understand what's going on in your data.
It can help to reduce data complexity for either visualization or for subsequent learning tasks.
You can use them as inputs into other models and sometimes it's all we have.
We don't have access to labels and we're trying to make sense of our data source. Unsupervised learning techniques can be helpful in order to do that.
So to wrap up unsupervised learning learns patterns from data.
We don't have labels available. It's useful for grouping items together, exploration and as input into other models.

🎥 Decomposing Matrices#

This video introduces the idea of matrix decomposition, which we can use to reduce the dimensionality of data points.

Video (17m22s)

Slides

Blow in this video. I want to introduce the idea of Matrix decompositions.
There's a couple of notebooks that go with this to demonstrate the concepts more and to give you some additional readings in this.
This video is going to explain what's going on. So goal here is to view matrix multiplication and decompose a matrix into a lower rank approximation.
So if you've taken a linear algebra class, you may have seen a matrix.
We have a it's a two matrix is just a two dimensional array of numbers.
We say it's dimension is M by N Rose always go first. When we're notating matrices, this is also the convention used by NUM Pi.
And so it is. It's Rose R and dimensional row vectors.
So that's a row vector. It's columns. R m dimensional column vectors have a column vector there.
We can also compute its transpose if we swapped the rows and columns. Vampyre exposes that as the exact capital T operation.
But a matrix is it's this two dimensional array of numbers. We can do a few things with them.
We can add them together. We can subtract them. One of those things we can do is we can multiply them.
So if we have two matrices A and B and A is M by K and B is K by N.
This is important. The inner dimensions of the two matrices have to match.
Then we can compute the Matrix product and it's gonna be M times and so for multiplying two matrices.
Unlike multiplication of of scalars.
Matrix multiplication is not commutative. You can't switch B get the same result.
You have to have the same the same matrix or the same dimensionality on the inside.
And what you get as the result is the dimensionality of the outside.
So what it's defined as is is CIJ Row, Row I column J is defined by the SARM.
Across the row of A and down the column of B of the pairwise items.
So it is it is the DOT product of Roe A.
And Column B. Or and of column Jay of of Matrixx Base.
You compute. You compute the DOT product. So see what C is, is it is the dot product of every row of A with every column of B num pi.
You can compute this with the A and B operation. That's the Python Matrix multiplication operator.
So this is a fundamental operation for matrices that you can multiply them together.
We also can have what we call sparse matrices and a matrix is sparse.
If most of its values are zero, that's what it means mathematically for it to be sparse.
Computationally a sparse matrix is a matrix with a zero. Values are not stored.
And so Saipov provide the sparse Matrix class that we can a number of sparse matrix classes that we can use.
The number high ENDI array is a dense matrix. Data frame is also a dense matrix.
S data frame.
It's very serious, can't be sparse, but if we need to do sparse computations, we can use this Sipi that sparse package to give us sparse matrices.
This is what Saikat learned does under the hood.
When you do, when you tokenized text with its with its count vector Dreiser or its TFI D.F. Vector riser,
its giving use Sibai sparse matrices as a result.
Now one of the things we can do with another thing we can do with a matrix is do what's called the dimensionality reduction.
And this follows from a theorem that if we have a matrix and intown by hand,
then we can compute a decompositions into the multiplication of three matrices P Sigma and Q Transpose.
And this gives us we can break down any matrix into this, Sipi provides us with functions in order to compute this decompensating given an ax.
It will compute piece Sigmon Q or Q Transpose. We can also then truncate this.
So Sigma is what's called the singular values.
We can truncate this matrix, only keep the K largest ones and set the rest of zero or just cut them out so that.
To so that we can get we get a narrower P and a narrower cure, a shorter Q transpose.
And this gives us an approximation of the original Matrix X.
There's a few useful properties. So the rows of P rows of P correspond to rows of X rows correspond.
So what Pier gives us is a K dimensional. Representation of rows of X, if X has a lot of columns.
This is super useful because if case more than no columns of X, then we get the smaller,
more compact representation of the original rows of the Matrix.
Also, it preserves distance things that are things are approximately as far apart in the in P as they are in the original X.
So why do we want to do this? One reason is for a compact representation. As I said, we get this k dimensional representation of our values x.
It can be useful to remove noise. I'll talk more about that in a little bit.
It can be useful for plotting high dimensional data to show relationships. If you've got 50 columns of X.
You can plot just like two columns of it. Or you can take the SDD to find columns that are particularly hard to project it into another.
Another vector space that you can show it in two dimensions that maximize the maximize the
amount or the extent to which the data points you can be spread out in those two dimensions.
It can also improve our ability to compute distances and fine and then it can provide.
It finds relationship and it can be helpful for finding relationships between features.
So if we have correlated features, we have multiple features that are partially measuring a similar thing.
They're correlated with each other. Principal component analysis is an application of matrix composition that allows us to find those relationships
and combine those correlated things and extract non correlated components out of these correlated observations.
So how do you actually do this? So Saikat learn provide a truncated SPDM class.
It's a transformer. If you call fit, it learns Kute transpose.
If you call transformatory turns the rows of P for the instances you pass in the anthraces,
you pass and don't have to be the same instances that you gave to fit.
Fit. Transform does the whole thing at once. I'm giving you example Kobel that you see this in action.
And then there's also the SEVIS function inside PI that computes the SVOD of a sparse matrix.
So one of the applications of SFD, as I said, is something called principal component analysis.
If you mean center, your features are you can standardize them. But if you mean center your features and then you compute the S.V. D.
What you get is the columns of P or what we call principle components.
So columns zero. Is the position of the data point along a vector that has a maximum variance and you're over?
You can go over to Q and find that vector in the original of the original data space.
And so what it does is it finds you've got this you've got this data in space and it finds it aligned through the data.
That explains more variance over a long which is not that it explains more
variance along which there is more variance than there is along any other line.
You could draw through that data point through that that space. That can be the axis at axis and a vector space.
And then if you see projekt all of your points onto that line, then you can find another line that explains most of the remaining.
The more of the remaining variance than any other.
So here I have I have data projected in two dimensional space, it's actually three dimensional data.
There's some correlation. We get this line here that runs through it.
And this line, if you go along so that the variance along the axis,
there's a fair amount of variance is a fair amount of variance along why there's more variance along this line.
So it gives just this line that this is the line through which there's most of the variance.
We could transform the data. So this line is now our X X-axis.
And then we could look at where's the where's the rest of the variance? So we can see here.
Here I'm showing the vectors the first and the second principle components.
The first one is along this line. I showed you the first place. It can go either way.
PCI does not guarantee which direction the sign is going to go. It does the same flip.
You can point the arrow the other direction, but you've got this this vector here that lets us.
Is this line along which there's more variance than any other?
But then there's this second line and it's orthogonal to the first and it's OK, where's the next chunk of variance?
What direction do I go to find the next amount of variance? You and The Notebook was online that generated these plots along with the simulation.
You can play with a little bit. So why do we want to use this? There's a few different useful use cases.
One is to compress and genoise our data. So as I said, we truncate we can truncate DVD.
We keep the K largest singular values. This means that P and Q are much smaller than pretty liste.
P is much smaller than X.
Then the result is that when we multiply them back together, it approximates X and it is it is the best rank approximation to the rank.
The rank of a matrix is basically a measure of how complex it is.
What it is, is it's the number of non-zero values in the singular value decomposition.
And so if we zero out the smallest values, what we get is.
If least squares error is our measure. Of how good an approximation of the original matrix is.
There is no better approximation than the truncated SPDM.
Another thing that happens is if there's noise in X, if X is some strong signal and a bunch of noise.
The largest singular values and singular vectors are probably going to pick up the signal and not the noise.
Always both for the most part. And so if you add the noise will be learned in the smaller vectors.
And so if you drop the smaller vectors, then you're dropping a bunch of the noise.
And so it can be useful to clean up data for some various purposes.
If X is sparsely observed,
you can use this also to impute values if you're careful about how you set up the composition because that he got but this you have to have
the full matrix if you're careful about how you set it up or use an alternative means of learning one that can deal with missing data.
Then you can multiply them back together to predict what the values you weren't able to observe of X are.
Really useful technique for imputing that data.
And for filling in unobserved values.
This is how a lot of recommender systems work. Actually, if we observe your preference for some movies,
we can use a singular value decomposition or a derivative of it in order to fill back in and estimate your preference for the movies you haven't seen.
And then if X is the document term Matrix or the Roeser documents and the columns or terms and we take the SFD,
this is what's called latent semantic analysis or latent semantic indexing.
And it's a way for understanding.
What we call the topics in a corpus, because these these dimensions and in the reduced dimensionality space, the metric, the.
This inner vector space and talk of another video about what a little bit more about what that means.
They correspond theoretically to different kinds of topics.
And so if that document becomes represented rather than the words, it becomes represented as a vector over topics.
And each document is a mixture of these topics and words correspond to topics as well.
The model there is that a document produces a word or contains a word because the document is about topic and the word is relevant to the topic.
And so you learn these topics.
And it lets you compare documents even if they don't have as many words in common because you can establish this in enemies, OK?
These words are on the same topic. Then if I use some of them, I'm on that topic.
And another document uses other ones that it's on that topic. And we can learn the topic relationship by doing a matrix, the composition.
Another one is for visualization. So low dimensional vectors can be visualized.
And I show this in the example notebooks.
But if we take an SVOD, then we can you say the first two columns of the SFD to visualize our data points in a space.
The space is not human interpretable. But let's see how spread out the points are.
We can also use it to get better neighborhood. So one of the problems, there's a couple of problems with high dimensional spaces.
We're trying to compute distances.
One is that distance is more expensive to compute because the more dimensions you have, the more compute you need to do.
But also, as the dimensionality of a space increases, the number of features,
the number of columns in your in your Matrix point start to look about the same distance from each other.
It's called the cursive dimensionality. Decomposed matrices can help with this.
So doing SFD can help make either a K and then classifier or a commune's clustering approach work better if you work on the.
If you do the K and N or the K means clustering, which we're going to talk about in the next video.
On top of. The transformed data using an SVOD, it can sometimes be more effective than if you just use it on its own.
The fourth case is the model categorical interaction. So if we want to models say the likelihoods of words to appear together,
like what's the likelihood that apple and fish appear within three words of each other in a sentence?
We can think about this as a probability, but there's N squared of them because we have no probability for every pair of words.
That's a lot to learn if we were going to learn.
If you want to learn a matrix that maps the probability between every pair of words in the English language, that's a very, very large matrix.
So instead, what we can do is we can learn of reduced dimensionality, space, and we usually don't do this by actually taking the NCD.
We do it with with approximation method that just directly optimize these vectors.
But we can learn vectors for words so that.
You basically using a logistic model of the probabilities so that the DOT products between them is the law gods of the two words appearing together.
And so words that appear together are going to have similar vectors.
Words that appear far apart are going to have very different vectors. And this is called a word embedding.
This is what a word embedding does. Like word the vac glove. These various word embedded.
This is what they do. And more sophisticated versions of this are at the heart of a lot of machine learning models.
So a lot of neural architecture is a lot of deep learning.
Models have various embedding and all in embedding is it's a vector representation of something.
And they're often done through these kinds of dimensionality, reduction techniques or approximations of them,
so that you get you get these vectors, these low dimensional vectors that are in a space like they're 10 dimensional vector.
And the 10 dimensions don't mean anything. They're just dimensions that are useful for explaining this.
This this instance is relationship to whatever we're trying to do with it.
And so they take you a long ways and a lot of machine learning.
And then they're the core piece of a lot of different models.
So to wrap up Matrix decompositions, which is also called a matrix factorization or dimensionality reduction,
breaks a high dimensional matrix into a low dimensional one. And it's useful for compressing data.
You've got a more compact representation. It's useful for making it more well behaved numerically.
We can compute better distances. We compute distances more efficiently, can reduce noise in the data.
There's a lot of different purposes for which decomposing data into this lower dimensional space is super useful.

Resources#

The next notebook
The PCADemo, demonstrating the PCA plots
numpy.ndarray
scipy.sparse
scipy.linalg.svd()
scipy.sparse.linalg.svds()
sklearn.decomposition.TruncatedSVD
sklearn.decomposition.PCA

📓 Movie Decomposition#

The Movie Decomposition notebook demonstrates matrix decomposition with movie data.

🎥 Clustering#

This video introduces the concept of clustering, another useful unsupervised learning technique.

Video (6m56s)

Slides

This video I want to introduce clustering, so learning outcomes are for.
To understand the idea of clustering and to interpret the results of clustering with K means.
So the idea of clustering is to group things together. So if we want to find groups in our data points, but we don't know what the groups are,
many clustering techniques require us to know how many groups there are. But we don't know what the groups are.
If we did, we would just use a multiclass classifier to find them. We want to find them from the data.
This is what we call clustering. So there's a couple of different kinds of clustering in terms of the membership of the clusters.
One is mixed membership where a point can be in more than one cluster.
And it has a different degree of affinity for the different clusters matrix factorization we can see as a kind of mixed membership clustering.
Where do the values in the decomposed and the lower dimensional space are?
How strongly the matrix is associated? The data point is associated with that cluster, but single membership clustering we want to find.
We want to find clusters. And we want to put each point in one cluster. So we might have movie types, different types of movies.
Want to put each movie in a different type. These might align with genres. They might align with something else.
So the idea one technique is to do it based on what we call centroid and the centroid is just the center of a cluster.
And so to do this, we typically need a distance function between two data points, between two vectors.
Often this is the Euclidean distance, but we have to define the vector space properly.
We need to do the feature engineering, have the features appropriately normalized and standardized so that the distance between them.
The distance between their vectors actually reflects how far apart the vectors are, the instances are with regards to our clustering goal.
If the distance does not relay, if it isn't so, that more similar items in terms of what we what we hope the cluster is going to uncover,
those more similar items need to have a smaller distance between each other than they do their distance to a not a less similar right along again,
along whatever it is that we hope the clustering is going to uncover.
We can do clustering on on dimensioned after dimensionality reductions that we can get.
We can get our ah. We can work in a lower dimensional space and sometimes that'll make where our distances be better behaved.
So the goal is to find the centroid of these clusters. And then what we'll do is we'll when an item comes in, we'll find which of our clusters.
So we have 10 clusters. We're gonna fi compare it. We're gonna measure its distance from the centers of all of the clusters.
And we're gonna say it's in the closest one. And so the K means algorithm does this by.
So we tell you how many clusters we want. We want five clusters, 10 clusters.
And it picks ten points. And says, these are my cluster centers.
And then figures out what cluster all of the data points are in.
And now that it's got all the data points clustered, it uses the it takes each cluster and recompute the new center.
It takes all the data points, computes the center of that set of points. And that's the new cluster center.
It then does this again because then you move the cluster center.
It might be that some points on the edge between it and another cluster switch clusters.
And then once you've switched clusters, you compute the center centroid again and you repeat this several times until what we call convergence.
And so this is a this is an example. We've seen a couple of others of what's called an iterative method,
iterative method as a method where you start somewhere and you incrementally improve your result.
So we start with some cluster centers, cluster the data points, move the centers to reflect the data points.
Try again. And convergence basically means it stops moving.
We've rerun another round of it. And our cluster centers haven't moved very much.
This does require us to know. Okay. It can't figure out how many clusters there are supposed to be.
And they're optimizations that can improve it various ways, particularly picking a like the simple way to do it.
If you pick K points just completely at random. There are more sophisticated ways to pick those points that can result in better clustering behavior.
So to do this, since I can't learn the K means class or do K means clustering fit learns, the cluster centers predict,
will map a data point to a cluster no super give it predict with some data and it will give you numbers, cluster numbers, cluster centers.
If you look if you go into and get the cluster centers attribute out of the escolar an object, that's the center centroid.
If your clusters of other clustering algorithms and Saikat learn of a similar interface.
Now we've got these clusters. How do we see how well they work?
Well, look at them like the purpose here is we want to uncover data, uncover connections and groupings in the data.
But we don't have labels, so one thing you really have to do with clustering is just look at it.
Do the clusters seem to be finding coherent? Do they seem to be finding coherent sets of things that we're clustering?
If you do have labels, sometimes we will have labels and we can like we have labels for a little bit of data.
We can use it to compare clustering behaviors, clustering systems also, or cluster clustering results.
Also, it can be useful when we're experimenting with clustering techniques to cluster data
where we do have the labels to see how good a job it's doing at recovering labels.
And we do have them to get some idea of how it might do. And we don't have them.
And then there are some quality scores.
There's a score called silhouette that compares the distances within a cluster to the distances between items and items and the closest other cluster.
And if things tend to be closer to each other than they are to other clusters, then you've got a better clustering.
These can be used to compare clustering, but there's not an absolute quality value like,
oh, a silhouette of point five means they've got a good clustering. No clustering is a really, really evaluating.
Clustering is a really imprecise thing. But the basic it basically is the clustering useful for what you're trying to do with it.
So to wrap up clustering allows us to identify groups of items in the data. These clusters may or may not make sense.
You have to look at them really. Cluster quality depends on a number of things.
Your features and your metric are super important because if you don't have a feature space and a
metric such that things that are similar to each other are close together in the on your metric,
then clustering is not going to be able to find the relationships you're looking for. Also, the cluster counters superimportant.
If there are eight natural groups and you try to find five clusters, clusters might not work so well.
Now the natural gropings and the cluster count do not necessarily need a map.
Sometimes you can get good cluster rings with an extra cluster or not having quite as many clusters.

Resources#

sklearn.cluster.KMeans

📓 Clustering Example#

The clustering example notebook shows how to use the KMeans class.

🎥 Vector Spaces#

This video talks about vector spaces and transforms.

Video (7m27s)

Slides

This video, I want to talk a little bit more about vector spaces, we've talked about them a little,
but we're going to talk about the concept in a little more detail now.
Want to formally introduce more formally introduced the concept of vector space and understand the idea of a vector space transformation.
We're only going to scratch the surface for a lot more. I recommend that you read a good linear algebra book.
So remember, vector is a vector is there's a sequence of numbers, an array, basically a one dimensional array.
So X X one X two X and that's the vector real's to the end is an end dimensional vector space.
In the real numbers, you vector spaces over other things to integers over.
Complex numbers over weird things, but it's an n dimensional vector space and we can do a few operations, the vector.
We can add, subtract, we can multiply by a scalar. That's a real number, we can compute the inner product.
You cannot multiply multiplying two vectors, just I'm going to multiply defector's together is not an operation.
If you both apply to vectors and know what you're actually getting is the pairwise
multiplication like it multiplies the elements together if they're compatible sizes.
That is not actually a linear algebra operation. There is the inner product, which is the sum of the element Y's products.
And then there's a distance, which is the inner product of a subtraction with itself. And so we can have a matrix.
So we've got a sample exit rows and instance, each instance is a row vector.
As a row of this matrix X, we can do all the vector things with these rows.
Matrix, a matrix is, as we said in an earlier video. Is this this two dimensional array?
Of numbers. It's a collection of row vectors and it's a collection of Cullom vectors.
It's also a linear map from one vector space to another. And the other one might be the same the the the same vector space in terms like it
might be n by n subtle map are an N dimensional vector to another N dimensional vector.
But it's had some transformations applied to it. There's a bunch of things we can do with matrices.
We can add them, multiply them by either a scalar or by a compatible matrix or vector.
We saw that earlier. You can transpose them, et cetera. There's a number of special matrices.
So we have column vectors which are m by one, remember, we always have rows first.
So M rose by one column is an M by one column vector.
One row by N columns is an it is a one by N row vector.
We can have a square matrix where the two dimensions are the same.
We can have a diagonal matrix that only had that zero everywhere except the diagonal.
So you've got your big matrix. It's got. The diagonal is non-zero.
All of this is zero. You can have an identity matrix, which is a diagonal matrix where all the non-zero values are one.
You can have a triangular matrix where either the upper right. Or the lower left corner is non-zero and the other side is zero.
So it's the everything above and to the right of the diagonal is zero for a lower triangular matrix.
Everything down the left of the diagonal is zero for an upper triangular matrix.
Also a symmetric matrix, which is a square matrix where it's equal to its transpose.
So the top right corner is equal to the top bottom left corner.
You flip it. You flip the rows in the column and you get the same Matrix backout. You can also have what's called an orthogonal matrix,
where if a transpose times A is equal to the identity matrix, then you have an orthogonal matrix.
Matrix vector multiplication is super useful operation.
So if we've got an M byan matrix and we've gotten N dimensional column vector, then we can compute Y equals X and we can.
And this is going to be an M dimensional column vector.
And what we've done here is we have mapped X into another vector space or we've transformed it.
And even if even if the even if a a square. So it's from R and R n.
What we're doing we can apply this can transform the vector so that it's in it's in the same space.
But it is its relationships to other vectors have changed.
And it's effectively a. In the different organization of the same space for lack of a better term.
I'm trying to avoid getting deep into the linear algebra terms like so like change of basis and things.
Because I'm trying to give you the intuition for it.
And linear algebra class is gonna really eat either a class or a textbook or an online course is going
to help you shore up a lot of the details that you're going to need to dove deeper into linear algebra.
So multiplying by a matrix can give us a bunch of different transform.
We can reduce dimensionality. We can basically project, project and do other transformations.
So a projection is when you just strip off vectors. So if we have X.
So if we've got one seven. The projection of this under the first dimension is just one.
But you can also do some additional transformations at your besides just projection to get it down to a lower dimensional space.
One of the things you can do is translate. So if we've got here. We can translate it.
We've got Vector's. And we just shift them, same relationship to each other.
They're just moved. We can scale them so that they're going to this vector is going to.
Get a vector here. We can scale it.
You can skew the space. You can also rotate within the space and any combination of these, you can do any linear transformation.
And actually this is what it means for something to be a linear transformation of linear
transformation is the transformation you can express through a matrix multiplication.
But also, we have linear systems, so linear systems are written as matrix vector operations.
We can solve this for Beda. So why is X better than Beda is equal is X inverse times Y.
If we want ordinarily, this is this is the direct exact solution to the linear equations.
But if they don't have a solution, we can get the least squared solution by solving a different system.
Multiply. X transpo solve, X transpose, Y equal to X transpose, x beda.
I missed it x there, I just wrote it in.
Now one particular note though is so I wrote a matrix inverse here matrix and versus an operation but usually don't actually want to perform it.
Matrix and Versus are almost always used for solving a system.
Linear equation solving the system is usually a better solution than actually inverting a matrix.
So wrap up vectors represent data points in a vector space. These can be manipulated and transform, particularly by multiplying them by a matrix.
I recommend that you can salties Lynge some linear algebra learning resources to learn a lot more.

Resources#

Linear Algebra Done Right by Sheldon Axler
Handbook of Linear Algebra (terse and comprehensive reference)

🎥 Information and Entropy#

This video introduces the idea of entropy as a way to quantify information. It’s something I want to make sure you’ve seen at least once by the end of the class.

Video (10m31s)

Slides

Hello. In this video, I want to introduce the idea of information theory and particularly the concept of entropy
that we use to measure information or to measure the complexity of a distribution.
So learning outcomes for this video are for you to understand the relationship of entropy and uncertainty and to be
able to use entropy to measure the amount of information in a signal or the complexity of a probability distribution.
So want to start by thinking about what does it mean to quantify information? If I tell you Apple, how much information did I just give?
You might seem like a kind of abstract question, but it turns out to be a question that we can work on answering.
Would the amount of information you would say that I gave you by saying Apple Change, if you knew in advance that I would say an English word?
Would it change if you knew that I would say the name of fruit? Maybe the name of a regional fruit or that I would name the fruit in the picture?
The idea we want to get out of here is if you knew already that I was being named the fruit in the picture,
then me telling you Apple gives you less information than it would if your previous knowledge is that I would give you the name of our fruit.
It's the key principle here is that information is the resolution of uncertainty.
If we want to start to try to quantify information or talk about how much information is in a message or is in a signal or is in a possible message,
what we want to do is we want to measure the amount of uncertainty that it resolves.
If you already know that I'm going to say Apple, when I tell you Apple, I'm giving you no information.
But if you know that I'm going to name a fruit,
then I give you the amount of information communicated by the fact that I say Apple is based on the space of possible fruits that I could give you.
If you know that I'm going to say the name of a fruit that grows in the Pacific Northwest.
Then, you know, I'm not going to say banana. I might say Apple.
I might say blackberry or huckleberry, but bananas and tropical fruits are not going to be in the sets.
You have more information, which means that there's less you have more background knowledge,
which means that there's less uncertainty that's resolved by me communicating the message Apple,
which means there's less information in what I'm communicating. We measure this in bits and the fundamental a bit one bit.
It's like the bit on your computer. If we flip a fair coin, so the probability of heads and probability of tails are equal, they're both point five.
If you know, I'm going to flip a fair coin, I flip it and I tell you heads, that communicates one bit of information.
And this is computed with the formula. The probability which is zero point five.
Times the log, which is negative one logs of values that are less than one are always negative, so.
Zero point five times negative, one, zero point five times negative, one push that all through and you get one.
This is one bit of information and we call this the entropy of flipping a fair coin.
The entropy of this probability distribution, the probability distribution characterizes the uncertainty.
There's two outcomes. Both are equally likely. We have no one from.
All we know is it's going to be one of these two outcomes. Both are equally likely.
We have no information about which one it's going to be. Uncertainty that's being resolved,
is that which coin of these two equally likely possible or which which of these two equally likely possibilities occurred?
That's the information we can receive. We call that one bit of information. One bit is which one of two equally likely possibilities.
The general version of the entropy formula looks like this. It's the sum over the possible messages of the probability of that message.
Times the log base, two of the probability of that message.
And this is the result of this is in bits.
You can do it with a natural log in which case it's called gnats, but usually we use the base to log for computing entropy.
There's a minus sign before it because probabilities are between zero and one.
All of our log probabilities are going to be negative, so we stick a minus sign in front of the whole thing.
And now we get we get positive number of bits.
And what it measures is it measures the expected number of bits that's needed to encode a message that resolves this uncertainty.
Using an efficient encoding and then efficient encoding takes advantage of the fact if the outcomes are not equally
likely in efficient encoding takes advantage of that fact to use shorter encoding for more common messages.
And so if you if you have this kind of an efficient encoding, what's the expected number of bits that are required to come in to write down a message
or encode a message that resolves the uncertainty in this probability distribution?
So if we're going to apply this, we have a fair die each six outcomes.
Each one is equally likely than the entropy of this. Die is two point five eight five.
You can run this computation through Python with your calculator to confirm that.
Now, if we have an unfair die where the probability is proportional to a VAT to its value,
so rolling a six is six times more likely than rolling a one.
We can represent this with the probability is going to be the probability of six is equal to six over twenty one.
Probability of one is equal to one over twenty one. The entropy of this distribution is two point three, nine eight.
It's lower so that the uniform distribution over a set of outcomes is the maximum entropy distribution over those outcomes.
Because if if some outcomes are more common than others, we can use shorter sequences to encode those.
For example, we could use one four to encode the most common value of six.
Or we could use zero one to encode a five. And if we do this kind of an encoding,
more common outcomes are shorter so we can decrease the average number of bits it requires to encode a message
that resolves the uncertainty of which die outcome is going to happen based on this shared background knowledge.
So a couple of a few of the key points here, so entropy measures the distribution,
and we're treating this distribution as a measure of uncertainty or of a measure of certainty or uncertainty or a measure of belief.
It quantifies how, how likely. Prior to receiving the message,
it quantifies how likely we think that how would likely different messages are based on our current knowledge is quantifies the uncertainty.
Someone received the message. We know how much uncertainty was being reduced.
This is a very Bayesian perspective on on probability, and it approximates the information requirements.
We usually do not have a full, accurate probabilistic model that completely and accurately describes our current state of knowledge.
We have an approximation, usually that approximation is as higher entropy than the actual full model will be,
but we have this approximation and we can use that then to approximate the information requirements.
The actual exact, precise amount of information is very, very difficult to compute.
But if we have a distribution that does a reasonable job of characterizing our prior knowledge,
we can use that to estimate the amount of uncertainty reduced. We can use that to quantify information we can use.
Though entropy is a building block, particularly as the measure of a complexity of a distribution,
we can use it or the uninformative ness of a distribution. We can use it to build a number of others, a number of other constructs.
And I'm not going to give you the definitions of all these in this video. My purpose here is just to introduce you to the idea of entropy.
You can go read a lot more about these if you'd like, you're going to see them in other classes,
but we can talk about the conditional entropy, which is if I know the value of X, then how much information does Y?
How much uncertainty does why reduce, which is different than if I didn't know the value of X. We could talk about the mutual information,
how much the two variables tell us about each other. We there's can I give you the discrete formula?
You can also apply information theory to continuous random variables. We can talk about transmission rates.
So far, we've talked about bits. We can talk about the number of bits per second that you can transmit over over a communication channel.
Information theory originated in or out of the early information theory.
Work was done in the context of quantifying how much information you can transmit over various kinds of radio channels and things.
We can talk about the difference between two distributions so far, these have talked about different Spacelab events,
but we have the same space of events, but two different distributions.
We can talk about the difference between them with either what's called the cold back leave,
their divergence that measures how different they are or the cross entropy, which we can derive from the K.L. divergence.
We can use these for a variety of things, we can use them to measure the complexity, the distribution,
the uncertainty that's going to be reduced by observing a draw from that distribution, we can use it to measure the differences in distribution.
This can be good for optimization. If we have some empirical data and we want to learn a distribution that approximates it,
we can do that by minimizing the scale divergence or minimizing the cross entropy to get our our our distribution as
close as possible to say the empirical data that we're observing or another distribution that we have access to.
So to wrap up information reduces uncertainty, and we measure information by measuring the amount of uncertainty that it reduces.
There are a number then of derived metrics that allow us to compare distributions,
compare or allow us to see how much observing one outcome informs us about another outcome,
and we measure all of this through a framework called entropy.

Resources#

An Introduction to Information Theory: Symbols, Signals & Noise by John R. Pierce
Entropy (information theory) on Wikipedia

🚩 Week 13 Quiz#

Take the Week 13 quiz on Canvas.

📓 Practice: SVD on Paper Abstracts#

The Week 13 Exercise notebook demonstrates latent semantic analysis on paper abstracts and has an exercise to classify text into new or old papers.

It requires the chi-papers.csv file, which is derived from the HCI Bibliography. It is the abstracts from papers published at the CHI conference (the primary conference for human-computer interaction) over a period of nearly 40 years.

If you want to see how to create this file, see the Fetch CHI Papers example.

📩 Assignment 6#

Assignment 6 is due November 20.

CS 533 Fall 2022

Week 13 — Unsupervised (11/14–18)

Contents

Week 13 — Unsupervised (11/14–18)#

🧐 Content Overview#

📅 Deadlines#

🎥 No Supervision#

🎥 Decomposing Matrices#

Resources#

📓 Movie Decomposition#

🎥 Clustering#

Resources#

📓 Clustering Example#

🎥 Vector Spaces#

Resources#

🎥 Information and Entropy#

Resources#

🚩 Week 13 Quiz#

📓 Practice: SVD on Paper Abstracts#

📩 Assignment 6#