Now, we want to make a matrix. But this matrix would be 64K by 73K, which is:
64000*73000*8/(1024*1024)
35644.53125
That’s about 35 GB (the figure above is in MiB). My desktop has that much memory, but it’s still a lot.
Let’s make a sparse matrix. We are going to start by making a matrix in coordinate format: a set of triples of the form (i, j, x), where i is the row number, j is the column number, and x is the value stored at that position.
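To make coordinate format concrete, here is a minimal sketch on made-up numbers (not the tag data). The triples are held as three parallel arrays and handed to SciPy’s coo_matrix; the names rows, cols, vals and toy_coo are only for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy triples (row, column, value); these numbers are invented for illustration.
rows = np.array([0, 0, 2])
cols = np.array([1, 3, 2])
vals = np.array([4.0, 1.0, 7.0])

# A 3x4 matrix with only three stored entries; every other cell is implicitly 0.
toy_coo = coo_matrix((vals, (rows, cols)), shape=(3, 4))
print(toy_coo.toarray())
```

Only the three listed cells take up storage, which is the whole point when most of the matrix is empty.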
In order to do this, though, we need identifiers that are contiguous and zero-based. Fortunately, this is exactly what the Pandas Index data structure does — it maps between index keys (of whatever type and value range) and contiguous, zero-based identifiers.
The movies frame had a movie index, since we indexed it by movie ID.
If it didn’t find a movie, it would have returned a negative index.
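As a rough sketch of that lookup, assuming it is done with the index’s get_indexer method, here is a three-movie stand-in for the movies frame (toy_movies is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the movies frame, indexed by movie ID.
toy_movies = pd.DataFrame(
    {"title": ["Star Wars: Episode IV - A New Hope (1977)",
               "Pulp Fiction (1994)",
               "Matrix, The (1999)"]},
    index=pd.Index([260, 296, 2571], name="movieId"),
)

# get_indexer maps keys to zero-based positions; keys not in the index get -1.
positions = toy_movies.index.get_indexer([2571, 260, 999999])
print(positions)
```

The last ID is not in the index, so its position comes back as -1 rather than raising an error.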
Now, let’s make an index for the tags. We’ll start by creating one for the unique tags, and we’ll use np.unique so the tags are in sorted order, and we’ll lowercase them first:
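A minimal sketch of that step on made-up tag records follows; the frame toy_tags and its column names are assumptions, and the real tags frame may be laid out differently.

```python
import numpy as np
import pandas as pd

# Hypothetical tag records; the real frame's column names may differ.
toy_tags = pd.DataFrame({
    "movieId": [260, 260, 296, 2571, 2571],
    "tag": ["Sci-Fi", "sci-fi", "Dark Comedy", "Action", "sci-fi"],
})

# Lowercase first so 'Sci-Fi' and 'sci-fi' collapse into one tag.
lower_tags = toy_tags["tag"].str.lower()

# np.unique returns the distinct values in sorted order.
toy_tag_idx = pd.Index(np.unique(lower_tags.to_numpy()))

# Zero-based column position of each tag record.
tag_pos = toy_tag_idx.get_indexer(lower_tags)
print(toy_tag_idx)
print(tag_pos)
```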
Now we are ready to make a sparse matrix. We’re going to store the matrix in Compressed Sparse Row (CSR) form, using csr_matrix; its constructor can take a matrix in COO (coordinate) form, which is what we just created. Not every movie ID is used, so we’re going to specify the size of the resulting matrix; some rows will be all 0.
We’re also going to take the log of the counts (plus one), to reduce skew while preserving 0s (since log(0 + 1) = 0).
So let’s make it, and call it mt_mat for “movie tag matrix”:
<62423x65464 sparse matrix of type '<class 'numpy.float64'>'
with 463518 stored elements in Compressed Sparse Row format>
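The construction described above might look roughly like the following sketch. All of the inputs (toy_movie_idx, toy_tag_idx, toy_counts and its column names) are hypothetical stand-ins for the real data, so treat this as one plausible shape for the code rather than the exact cell that produced the output above.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical inputs: a movie index, a tag index, and per-(movie, tag) counts.
toy_movie_idx = pd.Index([260, 296, 2571, 4306], name="movieId")
toy_tag_idx = pd.Index(["action", "dark comedy", "sci-fi"])
toy_counts = pd.DataFrame({
    "movieId": [260, 260, 296, 2571],
    "tag": ["sci-fi", "action", "dark comedy", "sci-fi"],
    "count": [9, 3, 5, 7],
})

# Translate IDs and tags into contiguous, zero-based row and column numbers.
rows = toy_movie_idx.get_indexer(toy_counts["movieId"])
cols = toy_tag_idx.get_indexer(toy_counts["tag"])

# log(count + 1) reduces skew; cells that are never stored stay at 0.
vals = np.log1p(toy_counts["count"].to_numpy())

# Give the shape explicitly so movies without tags still get an (all-zero) row.
toy_mt_mat = csr_matrix((vals, (rows, cols)),
                        shape=(len(toy_movie_idx), len(toy_tag_idx)))
print(repr(toy_mt_mat))
```

Movie 4306 has no tag records in the toy data, so its row exists but stores nothing.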
Woot! We now have a sparse matrix whose rows are movies, columns are tags, and values are the log of one plus the number of times that tag has been applied to that movie.
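To get a feel for the savings, one can add up the three arrays a CSR matrix keeps and compare with the dense size. The sketch below uses a random toy matrix with arbitrary sizes, not mt_mat itself.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A random toy matrix standing in for the movie-tag matrix (sizes are arbitrary).
rng = np.random.default_rng(0)
n_rows, n_cols, n_entries = 1000, 2000, 2000
rows = rng.integers(0, n_rows, size=n_entries)
cols = rng.integers(0, n_cols, size=n_entries)
toy = csr_matrix((np.ones(n_entries), (rows, cols)), shape=(n_rows, n_cols))

# CSR keeps three arrays: stored values, their column indices, and row pointers.
sparse_bytes = toy.data.nbytes + toy.indices.nbytes + toy.indptr.nbytes

# A dense array would need one 8-byte float per cell.
dense_bytes = toy.shape[0] * toy.shape[1] * toy.dtype.itemsize

print(f"sparse: {sparse_bytes / 1024:.1f} KiB, dense: {dense_bytes / 1024:.1f} KiB")
```

The same three-array sum can be taken on mt_mat to see how far below 35 GB the real matrix lands.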
We could have done that with a custom transformer.
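Now let’s train a TruncatedSVD that will project our tag matrix into 5 dimensions. Remember that it will learn two matrices: P, which holds the feature scores for each movie, and Q, which holds the tag weights for each feature.
fit stores Q in the SVD object but throws away P; P is the result of a transform. If we call fit_transform, it will store Q and return P. Let’s do it: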
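Here is a minimal sketch of that call on a small random matrix. The names toy_mat, toy_svd and toy_P_df, and the made-up movie IDs, are assumptions for illustration; wrapping P in a data frame with columns MD0 to MD4 is a guess at how a frame like mt_P_df could be built.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the movie-tag matrix: 30 movies x 12 tags of log-counts.
rng = np.random.default_rng(0)
toy_movie_ids = pd.Index(np.arange(100, 130), name="movieId")  # made-up IDs
toy_mat = csr_matrix(np.log1p(rng.poisson(0.5, size=(30, 12))).astype(float))

toy_svd = TruncatedSVD(n_components=5, random_state=0)

# fit_transform stores Q on the object and returns P.
toy_P = toy_svd.fit_transform(toy_mat)   # one row per movie, one column per feature
toy_Q = toy_svd.components_              # one row per feature, one column per tag

# Label the rows with movie IDs so they can be joined to titles later.
toy_P_df = pd.DataFrame(toy_P, index=toy_movie_ids,
                        columns=[f"MD{i}" for i in range(5)])
print(toy_P_df.head())
```

Calling toy_svd.transform on new rows afterwards reuses the stored Q to project them into the same 5 dimensions.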
We can see that the first dimension (0) has a point that is close to a right angle; an SVD gives us orthogonal dimensions.
Now, let’s see what movies have the highest 0 dimension. Since we have a data frame, we can use nlargest:
mt_P_df.nlargest(5,'MD0')
movieId         MD0         MD1         MD2         MD3         MD4
260       22.462252   -8.466508   16.765136   -7.705190   -5.162488
296       18.786882    5.663466   -6.104237   13.731377    1.411248
79132     15.794102   -8.038224   -1.889724   -1.827832    5.055682
2959      15.044625   -1.071897   -9.625266    3.079433    3.910763
2571      15.021538  -10.273176    4.950140   -1.015959    3.954582
But bare movie IDs aren’t meaningful. Let’s join with movies to get titles:
mt_P_df.nlargest(5,'MD0').join(movies['title'])
movieId         MD0         MD1         MD2         MD3         MD4  title
260       22.462252   -8.466508   16.765136   -7.705190   -5.162488  Star Wars: Episode IV - A New Hope (1977)
296       18.786882    5.663466   -6.104237   13.731377    1.411248  Pulp Fiction (1994)
79132     15.794102   -8.038224   -1.889724   -1.827832    5.055682  Inception (2010)
2959      15.044625   -1.071897   -9.625266    3.079433    3.910763  Fight Club (1999)
2571      15.021538  -10.273176    4.950140   -1.015959    3.954582  Matrix, The (1999)
What about the second dimension?
mt_P_df.nlargest(5,'MD1').join(movies['title'])
movieId         MD0         MD1         MD2         MD3         MD4  title
356       14.526421   11.795361    1.283543   -5.822479   -5.868638  Forrest Gump (1994)
1197       7.738679    6.938436    3.866916   -1.371170    0.029446  Princess Bride, The (1987)
4306       4.914259    6.918061    4.411453   -0.466859    1.774408  Shrek (2001)
46578      5.090094    6.496026    0.156534    1.678212    3.066294  Little Miss Sunshine (2006)
4886       3.782889    6.223932    3.952054    0.045252    1.820286  Monsters, Inc. (2001)
Up to you whether you think these have a meaningful relationship.
Another interesting question: what are the top tags for the first dimension?
I’m going to take a slightly different approach this time. I’m going to use np.argsort, which returns indexes corresponding to the sorted order of a NumPy array. The last ones will be the largest values. If I argsort a row of the mt_svd.components_ matrix, it will sort the tags by that feature; I can use the tag IDs of the highest-valued tags to get the tag names from the tag index.
Let’s go!
tag_idx[np.argsort(mt_svd.components_[0,:])[-5:]]
Index(['funny', 'based on a book', 'atmospheric', 'sci-fi', 'action'], dtype='object')
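Because argsort sorts in ascending order, the last tag in that output is the one with the largest weight. A tiny sketch with invented weights (toy_tag_idx and toy_weights are not from the real model) shows the ordering and how to flip it:

```python
import numpy as np
import pandas as pd

toy_tag_idx = pd.Index(["action", "atmospheric", "funny", "sci-fi"])
toy_weights = np.array([0.7, 0.1, -0.2, 0.4])  # made-up weights for one feature

# Positions ordered from the smallest weight to the largest.
order = np.argsort(toy_weights)

# The last two positions are the two largest, still in ascending order.
top2 = toy_tag_idx[order[-2:]]

# Reverse first if you want the largest weight listed first.
top2_desc = toy_tag_idx[order[::-1][:2]]

print(list(top2))       # ['sci-fi', 'action']
print(list(top2_desc))  # ['action', 'sci-fi']
```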
This means that movies with these tags will score highly on this feature.
Since there is a popularity component to our data (people add more tags to popular movies), there is a very good chance that the first feature will effectively be popularity.
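One way to probe that hunch is to correlate the first feature with how heavily each movie is tagged. The sketch below does this on synthetic data that is deliberately built so that popularity drives the tag counts; it only illustrates the check and says nothing about the real matrix, where the same comparison would be run against mt_mat and the fitted P.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic data (an assumption): each movie has a popularity level, and every
# tag count is Poisson-distributed with a mean proportional to that popularity.
rng = np.random.default_rng(42)
n_movies, n_tags = 200, 50
popularity = rng.gamma(shape=1.0, scale=2.0, size=n_movies)
counts = rng.poisson(0.5 * popularity[:, None] * np.ones((1, n_tags)))
toy_mat = csr_matrix(np.log1p(counts).astype(float))

svd = TruncatedSVD(n_components=2, random_state=0)
P = svd.fit_transform(toy_mat)

# Total (log-scaled) tagging activity per movie, as a rough popularity proxy.
tag_volume = np.asarray(toy_mat.sum(axis=1)).ravel()

# Correlation between the first feature and the popularity proxy.
corr = np.corrcoef(P[:, 0], tag_volume)[0, 1]
print(f"correlation of feature 0 with tag volume: {corr:.3f}")
```

A correlation close to 1 in absolute value would suggest the first feature is mostly measuring how much a movie gets tagged.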