Indexing¶
This notebook discusses Pandas indexing operations, and pitfalls.
Setup¶
First our Python modules:
import numpy as np
import pandas as pd
import scipy.linalg as spla
import matplotlib.pyplot as plt
import seaborn as sns
And then the movie data, as usual. Movies:
movies = pd.read_csv('hetrec2011-ml/movies.dat', delimiter='\t', encoding='latin1', na_values=['\\N'])
movies.head()
id | title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | ... | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy story | 114709 | Toy story (juguetes) | http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... | 1995 | toy_story | 9.0 | 73.0 | 73.0 | ... | 100.0 | 8.5 | 17.0 | 17.0 | 0.0 | 100.0 | 3.7 | 102338.0 | 81.0 | http://content7.flixster.com/movie/10/93/63/10... |
1 | 2 | Jumanji | 113497 | Jumanji | http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... | 1995 | 1068044-jumanji | 5.6 | 28.0 | 13.0 | ... | 46.0 | 5.8 | 5.0 | 2.0 | 3.0 | 40.0 | 3.2 | 44587.0 | 61.0 | http://content8.flixster.com/movie/56/79/73/56... |
2 | 3 | Grumpy Old Men | 107050 | Dos viejos gruñones | http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... | 1993 | grumpy_old_men | 5.9 | 36.0 | 24.0 | ... | 66.0 | 7.0 | 6.0 | 5.0 | 1.0 | 83.0 | 3.2 | 10489.0 | 66.0 | http://content6.flixster.com/movie/25/60/25602... |
3 | 4 | Waiting to Exhale | 114885 | Esperando un respiro | http://ia.media-imdb.com/images/M/MV5BMTczMTMy... | 1995 | waiting_to_exhale | 5.6 | 25.0 | 14.0 | ... | 56.0 | 5.5 | 11.0 | 5.0 | 6.0 | 45.0 | 3.3 | 5666.0 | 79.0 | http://content9.flixster.com/movie/10/94/17/10... |
4 | 5 | Father of the Bride Part II | 113041 | Vuelve el padre de la novia (Ahora también abu... | http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... | 1995 | father_of_the_bride_part_ii | 5.3 | 19.0 | 9.0 | ... | 47.0 | 5.4 | 5.0 | 1.0 | 4.0 | 20.0 | 3.0 | 13761.0 | 64.0 | http://content8.flixster.com/movie/25/54/25542... |
5 rows × 21 columns
And ratings:
ratings = pd.read_csv('hetrec2011-ml/user_ratedmovies-timestamps.dat', delimiter='\t', encoding='latin1')
ratings.head()
userID | movieID | rating | timestamp | |
---|---|---|---|---|
0 | 75 | 3 | 1.0 | 1162160236000 |
1 | 75 | 32 | 4.5 | 1162160624000 |
2 | 75 | 110 | 4.0 | 1162161008000 |
3 | 75 | 160 | 2.0 | 1162160212000 |
4 | 75 | 163 | 4.0 | 1162160970000 |
And let's load tags:
tags = pd.read_csv('hetrec2011-ml/tags.dat', delimiter='\t', encoding='latin1')
tags.head()
id | value | |
---|---|---|
0 | 1 | earth |
1 | 2 | police |
2 | 3 | boxing |
3 | 4 | painter |
4 | 5 | whale |
What Is an Index?¶
An index is what we use to locate rows or columns in a Pandas data structure. (While conceptually we can think of a data frame as a dictionary of columns, the name-to-column mapping is actually done with an index.)
An index is two things in one data structure:
- An array of key values (given a position, we can get the key value)
- A hash map to look up row position(s) for a given key value
Indexes keys can be any (immutable) data type. Pandas has special support for indexing over datetime64
objects, for time series.
Range Index¶
When we load the data from the CSV file, it by default it is loaded with a RangeIndex
, in which the keys and the positions are the same. These indexes are very memory-efficient, because they don't actually need to store key values.
Let's see the movie index:
movies.index
RangeIndex(start=0, stop=10197, step=1)
With a range index, loc
and iloc
do the same thing (for rows):
movies.loc[5, ['id', 'title']]
id 6 title Heat Name: 5, dtype: object
movies.iloc[5, 0:2]
id 6 title Heat Name: 5, dtype: object
Setting an Index¶
The set_index
method makes a column into the index. It will replace the current index; if the current index is a range index, that is usually fine.
There are two versions:
- By default, it returns a new data frame with the new index, and leaves the original unchanged. The frame shares underlying data with the original, so it doesn't take up that much memory.
- With
inplace=True
, it modifies the data frame to have the new index.
Both are useful. Be very careful with inplace
in a function; the caller probably doesn't expect the data frame to be modified!
Let's index the movies
frame by movie ID:
movies = movies.set_index('id')
movies.index
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... 64997, 64999, 65006, 65011, 65037, 65088, 65091, 65126, 65130, 65133], dtype='int64', name='id', length=10197)
The new index is also reflected in .info()
, and the id
column is gone:
movies.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 10197 entries, 1 to 65133 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 10197 non-null object 1 imdbID 10197 non-null int64 2 spanishTitle 10197 non-null object 3 imdbPictureURL 10016 non-null object 4 year 10197 non-null int64 5 rtID 9886 non-null object 6 rtAllCriticsRating 9967 non-null float64 7 rtAllCriticsNumReviews 9967 non-null float64 8 rtAllCriticsNumFresh 9967 non-null float64 9 rtAllCriticsNumRotten 9967 non-null float64 10 rtAllCriticsScore 9967 non-null float64 11 rtTopCriticsRating 9967 non-null float64 12 rtTopCriticsNumReviews 9967 non-null float64 13 rtTopCriticsNumFresh 9967 non-null float64 14 rtTopCriticsNumRotten 9967 non-null float64 15 rtTopCriticsScore 9967 non-null float64 16 rtAudienceRating 9967 non-null float64 17 rtAudienceNumRatings 9967 non-null float64 18 rtAudienceScore 9967 non-null float64 19 rtPictureURL 9967 non-null object dtypes: float64(13), int64(2), object(5) memory usage: 1.9+ MB
We now have an index that lets us up by 64-bit integer keys. We can now look up Heat by movie ID:
movies.loc[6]
title Heat imdbID 113277 spanishTitle Heat imdbPictureURL http://ia.media-imdb.com/images/M/MV5BMTM1NDc4... year 1995 rtID 1068182-heat rtAllCriticsRating 7.7 rtAllCriticsNumReviews 58 rtAllCriticsNumFresh 50 rtAllCriticsNumRotten 8 rtAllCriticsScore 86 rtTopCriticsRating 7.2 rtTopCriticsNumReviews 17 rtTopCriticsNumFresh 14 rtTopCriticsNumRotten 3 rtTopCriticsScore 82 rtAudienceRating 3.9 rtAudienceNumRatings 42785 rtAudienceScore 92 rtPictureURL http://content9.flixster.com/movie/26/80/26809... Name: 6, dtype: object
iloc
still returns this at position 5:
movies.iloc[5]
title Heat imdbID 113277 spanishTitle Heat imdbPictureURL http://ia.media-imdb.com/images/M/MV5BMTM1NDc4... year 1995 rtID 1068182-heat rtAllCriticsRating 7.7 rtAllCriticsNumReviews 58 rtAllCriticsNumFresh 50 rtAllCriticsNumRotten 8 rtAllCriticsScore 86 rtTopCriticsRating 7.2 rtTopCriticsNumReviews 17 rtTopCriticsNumFresh 14 rtTopCriticsNumRotten 3 rtTopCriticsScore 82 rtAudienceRating 3.9 rtAudienceNumRatings 42785 rtAudienceScore 92 rtPictureURL http://content9.flixster.com/movie/26/80/26809... Name: 6, dtype: object
The reset_index
method will turn the index back into a column, and put a range index back in place:
movies.reset_index().info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10197 entries, 0 to 10196 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10197 non-null int64 1 title 10197 non-null object 2 imdbID 10197 non-null int64 3 spanishTitle 10197 non-null object 4 imdbPictureURL 10016 non-null object 5 year 10197 non-null int64 6 rtID 9886 non-null object 7 rtAllCriticsRating 9967 non-null float64 8 rtAllCriticsNumReviews 9967 non-null float64 9 rtAllCriticsNumFresh 9967 non-null float64 10 rtAllCriticsNumRotten 9967 non-null float64 11 rtAllCriticsScore 9967 non-null float64 12 rtTopCriticsRating 9967 non-null float64 13 rtTopCriticsNumReviews 9967 non-null float64 14 rtTopCriticsNumFresh 9967 non-null float64 15 rtTopCriticsNumRotten 9967 non-null float64 16 rtTopCriticsScore 9967 non-null float64 17 rtAudienceRating 9967 non-null float64 18 rtAudienceNumRatings 9967 non-null float64 19 rtAudienceScore 9967 non-null float64 20 rtPictureURL 9967 non-null object dtypes: float64(13), int64(3), object(5) memory usage: 1.6+ MB
movies.index.name
'id'
We could change the name if we wanted to, by assigning to it:
movies.index.name = 'movieId'
But we won't actually do that right now.
Arrays and Keys¶
The index is also an array - we can get the key at a certain position:
movies.index[5]
6
To go the other way — find the location or position for a key value — we can use the get_loc
method:
movies.index.get_loc(6)
5
If we look for a nonexistent key, it fails:
movies.index.get_loc(30481080)
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 2645 try: -> 2646 return self._engine.get_loc(key) 2647 except KeyError: pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() KeyError: 30481080 During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-17-937b631c964d> in <module> ----> 1 movies.index.get_loc(30481080) ~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 2646 return self._engine.get_loc(key) 2647 except KeyError: -> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key)) 2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance) 2650 if indexer.ndim > 1 or indexer.size > 1: pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() KeyError: 30481080
This is a KeyError
, meaning we tried to look up a key that does not exist.
Internally, the index uses a NumPy ndarray
to store the keys. We can access this array as .values
:
movies.index.values
array([ 1, 2, 3, ..., 65126, 65130, 65133], dtype=int64)
Index Characteristics¶
There are a few useful characteristics an index can have. It may be unique, or it may have duplicates. This is a unique index:
movies.index.is_unique
True
movies.index.has_duplicates
False
An index can also be monotonic, or sorted; the is_monotonic
accessor is a shortcut for is_monotonic_increasing
and returns True
if the index is sorted in nondecreasing order:
movies.index.is_monotonic
True
Since this index is both monotonic and unique, it is sorted in strictly increasing order.
Since it is monotonic, we can also use ranges for key-based lookup:
movies.loc[6:10]
title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | rtAllCriticsNumRotten | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
6 | Heat | 113277 | Heat | http://ia.media-imdb.com/images/M/MV5BMTM1NDc4... | 1995 | 1068182-heat | 7.7 | 58.0 | 50.0 | 8.0 | 86.0 | 7.2 | 17.0 | 14.0 | 3.0 | 82.0 | 3.9 | 42785.0 | 92.0 | http://content9.flixster.com/movie/26/80/26809... |
7 | Sabrina | 47437 | Sabrina | http://ia.media-imdb.com/images/M/MV5BMTYyNDM1... | 1954 | 1018047-sabrina | 7.4 | 31.0 | 28.0 | 3.0 | 90.0 | 7.2 | 5.0 | 5.0 | 0.0 | 100.0 | 3.8 | 12812.0 | 87.0 | http://content7.flixster.com/movie/10/93/36/10... |
8 | Tom and Huck | 112302 | Tom y Huck | http://ia.media-imdb.com/images/M/MV5BMTUxNDYz... | 1995 | tom_and_huck | 4.2 | 8.0 | 2.0 | 6.0 | 25.0 | 0.0 | 2.0 | 1.0 | 1.0 | 50.0 | 2.7 | 2649.0 | 45.0 | http://content9.flixster.com/movie/26/16/26169... |
9 | Sudden Death | 114576 | Sudden Death: muerte súbita | http://ia.media-imdb.com/images/M/MV5BMTcwMTU2... | 1995 | 1068470-sudden_death | 5.2 | 32.0 | 17.0 | 15.0 | 53.0 | 5.6 | 9.0 | 5.0 | 4.0 | 55.0 | 2.6 | 3626.0 | 40.0 | http://content8.flixster.com/movie/27/91/27912... |
10 | GoldenEye | 113189 | GoldenEye | http://ia.media-imdb.com/images/M/MV5BNTE1OTEx... | 1995 | goldeneye | 6.8 | 41.0 | 33.0 | 8.0 | 80.0 | 6.2 | 11.0 | 7.0 | 4.0 | 63.0 | 3.4 | 28260.0 | 78.0 | http://content9.flixster.com/movie/26/66/26669... |
Note that range-based indexing in .loc
includes both ends, unlike slices in almost any other part of Python.
Unsorted Indexes: While we might expect slice-based key indexing to get things in the range, that is not what it does. It locates the positions of the start and end keys, and returns everything between them. If the index is not sorted, this is usually not useful. Only use key slices on sorted indexes.
Hierarchical Indexing¶
Pandas supports hierarchical indexes, with MultiIndex
.
Create a hierarchical index by passing multiple columns to set_index
.
For example, let's make a series that indexes ratings by user and item IDs:
idx_ratings = ratings.set_index(['userID', 'movieID'])
idx_ratings
rating | timestamp | ||
---|---|---|---|
userID | movieID | ||
75 | 3 | 1.0 | 1162160236000 |
32 | 4.5 | 1162160624000 | |
110 | 4.0 | 1162161008000 | |
160 | 2.0 | 1162160212000 | |
163 | 4.0 | 1162160970000 | |
... | ... | ... | ... |
71534 | 44555 | 4.0 | 1196647538000 |
46578 | 4.0 | 1196647004000 | |
48516 | 4.5 | 1196646826000 | |
61075 | 5.0 | 1223625365000 | |
62049 | 4.5 | 1223625490000 |
855598 rows × 2 columns
We see on the left that there are two levels to the index: userID
and movieID
. We can see this with .names
:
idx_ratings.index.names
FrozenList(['userID', 'movieID'])
The multi-level index lets us look up values by just the outermost index, getting all rows with that outer index value. For example, to get all ratings by user 190:
idx_ratings.loc[190]
rating | timestamp | |
---|---|---|
movieID | ||
1 | 4.5 | 1057778398000 |
2 | 4.0 | 1053457414000 |
7 | 2.0 | 1049823937000 |
32 | 3.0 | 1151336652000 |
95 | 2.0 | 1078019174000 |
... | ... | ... |
54001 | 4.5 | 1185219688000 |
54259 | 4.0 | 1217875029000 |
58559 | 4.5 | 1217258754000 |
59315 | 4.0 | 1210627067000 |
64497 | 3.0 | 1230991072000 |
296 rows × 2 columns
Since we are looking up one user, it drops the user level from the index. If we looked up multiple users, it would not:
idx_ratings.loc[[75, 190]]
rating | timestamp | ||
---|---|---|---|
userID | movieID | ||
75 | 3 | 1.0 | 1162160236000 |
32 | 4.5 | 1162160624000 | |
110 | 4.0 | 1162161008000 | |
160 | 2.0 | 1162160212000 | |
163 | 4.0 | 1162160970000 | |
... | ... | ... | ... |
190 | 54001 | 4.5 | 1185219688000 |
54259 | 4.0 | 1217875029000 | |
58559 | 4.5 | 1217258754000 | |
59315 | 4.0 | 1210627067000 | |
64497 | 3.0 | 1230991072000 |
351 rows × 2 columns
Multi-indexes are one place where Python makes a distinction between lists and tuples. If we look up (75, 32)
, it will look up user 75's rating for item 32:
idx_ratings.loc[(75, 32)]
rating 4.500000e+00 timestamp 1.162161e+12 Name: (75, 32), dtype: float64
We can index by a list of tuples:
idx_ratings.loc[[(75, 32), (190, 58559)]]
rating | timestamp | ||
---|---|---|---|
userID | movieID | ||
75 | 32 | 4.5 | 1162160624000 |
190 | 58559 | 4.5 | 1217258754000 |
So we have the following principle:
- The index locator can be a key, or a list or array of keys
- A key can be a single value, looking up in the outermost index level; or a tuple of values, looking up by multiple levels simultaneously
Be careful: hierarchical indexes are not as useful as they look. They are great for lookups, and joins, but they often have problems too.
In particular, if you have a large data frame with a hiearchical index, there are several operations — including groupby — that will cause memory explosion. If you call melt
with multiple ID columns, it will turn them into a hierarchical index.
I usually use hierarchical indexes for two purpose:
- Storing data for multidimensional lookup, like the user-and-item ratings
- Joining on mulitiple simultaneous columns
In both cases, once I have the data I need, I usually strip away the multi-index before moving forward (with reset_index
, or with a join or lookup operation that leaves the data frame with a normal one-level index) to reduce the risk of unexpected memory explosion.
Sorting and Realigning¶
We can rearrange data frames based on their indexes.
Sorting¶
One useful operation to realign a data frame (or series) with respect to its index is to sort the index. Our index is already sorted by ID, but what if we wanted to use IMDB IDs?
imdb_movies = movies.reset_index().set_index('imdbID')
imdb_movies
id | title | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | rtAllCriticsNumRotten | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
imdbID | ||||||||||||||||||||
114709 | 1 | Toy story | Toy story (juguetes) | http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... | 1995 | toy_story | 9.0 | 73.0 | 73.0 | 0.0 | 100.0 | 8.5 | 17.0 | 17.0 | 0.0 | 100.0 | 3.7 | 102338.0 | 81.0 | http://content7.flixster.com/movie/10/93/63/10... |
113497 | 2 | Jumanji | Jumanji | http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... | 1995 | 1068044-jumanji | 5.6 | 28.0 | 13.0 | 15.0 | 46.0 | 5.8 | 5.0 | 2.0 | 3.0 | 40.0 | 3.2 | 44587.0 | 61.0 | http://content8.flixster.com/movie/56/79/73/56... |
107050 | 3 | Grumpy Old Men | Dos viejos gruñones | http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... | 1993 | grumpy_old_men | 5.9 | 36.0 | 24.0 | 12.0 | 66.0 | 7.0 | 6.0 | 5.0 | 1.0 | 83.0 | 3.2 | 10489.0 | 66.0 | http://content6.flixster.com/movie/25/60/25602... |
114885 | 4 | Waiting to Exhale | Esperando un respiro | http://ia.media-imdb.com/images/M/MV5BMTczMTMy... | 1995 | waiting_to_exhale | 5.6 | 25.0 | 14.0 | 11.0 | 56.0 | 5.5 | 11.0 | 5.0 | 6.0 | 45.0 | 3.3 | 5666.0 | 79.0 | http://content9.flixster.com/movie/10/94/17/10... |
113041 | 5 | Father of the Bride Part II | Vuelve el padre de la novia (Ahora también abu... | http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... | 1995 | father_of_the_bride_part_ii | 5.3 | 19.0 | 9.0 | 10.0 | 47.0 | 5.4 | 5.0 | 1.0 | 4.0 | 20.0 | 3.0 | 13761.0 | 64.0 | http://content8.flixster.com/movie/25/54/25542... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
960731 | 65088 | Bedtime Stories | Más allá de los sueños | http://ia.media-imdb.com/images/M/MV5BMjA5Njk5... | 2008 | bedtime_stories | 4.4 | 104.0 | 26.0 | 78.0 | 25.0 | 4.7 | 26.0 | 6.0 | 20.0 | 23.0 | 3.5 | 108877.0 | 63.0 | http://content6.flixster.com/movie/10/94/33/10... |
25464 | 65091 | Manhattan Melodrama | El enemigo público número 1 | http://ia.media-imdb.com/images/M/MV5BMTUyODE3... | 1934 | manhattan_melodrama | 7.0 | 12.0 | 10.0 | 2.0 | 83.0 | 0.0 | 4.0 | 2.0 | 2.0 | 50.0 | 3.7 | 344.0 | 71.0 | http://content9.flixster.com/movie/66/44/64/66... |
1024715 | 65126 | Choke | Choke | http://ia.media-imdb.com/images/M/MV5BMTMxMDI4... | 2008 | choke | 5.6 | 135.0 | 73.0 | 62.0 | 54.0 | 4.9 | 26.0 | 8.0 | 18.0 | 30.0 | 3.3 | 13893.0 | 55.0 | http://content6.flixster.com/movie/10/85/09/10... |
959337 | 65130 | Revolutionary Road | Revolutionary Road | http://ia.media-imdb.com/images/M/MV5BMTI2MzY2... | 2008 | revolutionary_road | 6.7 | 194.0 | 133.0 | 61.0 | 68.0 | 6.9 | 36.0 | 25.0 | 11.0 | 69.0 | 3.5 | 46044.0 | 70.0 | http://content8.flixster.com/movie/10/88/40/10... |
212579 | 65133 | Blackadder Back & Forth | Blackadder Back & Forth | http://ia.media-imdb.com/images/M/MV5BMjA5MjU4... | 1999 | blackadder-back-forth | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content7.flixster.com/movie/34/10/69/34... |
10197 rows × 20 columns
Now let's sort and show again:
imdb_movies.sort_index(inplace=True)
imdb_movies
id | title | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | rtAllCriticsNumRotten | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
imdbID | ||||||||||||||||||||
439 | 6990 | The Great Train Robbery | Asalto y robo al tren | http://ia.media-imdb.com/images/M/MV5BMTk3OTQy... | 1903 | the-great-train-robbery1903 | 7.6 | 6.0 | 6.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content7.flixster.com/movie/10/87/70/10... |
4972 | 7065 | The Birth of a Nation | El nacimiento de una nación | http://ia.media-imdb.com/images/M/MV5BMjI2MzU4... | 1915 | birth_of_a_nation | 8.0 | 31.0 | 31.0 | 0.0 | 100.0 | 0.0 | 4.0 | 4.0 | 0.0 | 100.0 | 3.3 | 1249.0 | 58.0 | http://content9.flixster.com/movie/10/88/58/10... |
6864 | 7243 | Intolerance: Love's Struggle Throughout the Ages | Intolerancia | http://ia.media-imdb.com/images/M/MV5BODEyNDUy... | 1916 | 1010644-intolerance | 7.8 | 23.0 | 22.0 | 1.0 | 95.0 | 0.0 | 3.0 | 3.0 | 0.0 | 100.0 | 3.8 | 1397.0 | 77.0 | http://content6.flixster.com/movie/31/14/31140... |
8133 | 8511 | The Immigrant | Charlot emigrante | http://ia.media-imdb.com/images/M/MV5BMjExNDE5... | 1917 | immigrant | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content7.flixster.com/movie/10/85/69/10... |
8395 | 48374 | Otets Sergiy | El padre Sergio | NaN | 1917 | otets-sergiy-father-sergius | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content7.flixster.com/movie/10/91/43/10... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1229827 | 4260 | Jonas Brothers: The 3D Concert Experience | Jonas Brothers: The 3D Concert Experience | http://ia.media-imdb.com/images/M/MV5BMTU3NzE4... | 2009 | jonas_brothers_3d | 4.2 | 71.0 | 16.0 | 55.0 | 22.0 | 4.6 | 22.0 | 7.0 | 15.0 | 31.0 | 3.3 | 74109.0 | 64.0 | http://content7.flixster.com/movie/10/92/39/10... |
1233247 | 757 | Dung che sai duk | Este contraveneno del Oeste | http://ia.media-imdb.com/images/M/MV5BMTc0NzMx... | 1994 | ashes_of_time_redux | 6.6 | 69.0 | 54.0 | 15.0 | 78.0 | 6.7 | 18.0 | 14.0 | 4.0 | 77.0 | 3.6 | 2166.0 | 71.0 | http://content8.flixster.com/movie/10/93/19/10... |
1233381 | 63698 | Üç maymun | Tres monos | http://ia.media-imdb.com/images/M/MV5BMTMwNjM3... | 2008 | three_monkeys | 6.7 | 56.0 | 43.0 | 13.0 | 76.0 | 7.7 | 13.0 | 12.0 | 1.0 | 92.0 | 3.6 | 1239.0 | 72.0 | http://content8.flixster.com/movie/11/13/29/11... |
1275532 | 4457 | Bianca Beauchamp All Access 2: Rubberized | Bianca Beauchamp All Access 2: Rubberized | http://ia.media-imdb.com/images/M/MV5BMTI1ODQ1... | 2008 | bianca-beauchamp-all-access-2-rubberized | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content6.flixster.com/movie/10/88/44/10... |
1349938 | 418 | Being Human | Being Human | http://ia.media-imdb.com/images/M/MV5BMjA1MzIw... | 2008 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10197 rows × 20 columns
Now the index reports that it's sorted too:
imdb_movies.index.is_monotonic
True
Reindexing¶
Another operation on an index is a reindex. When we reindex a data frame or series, we give it an array of key values, and it rearranges elements to correspond to the new array. It will insert missing values for keys that don't exist in the original frame.
So let's reindex the movie frame (we are not passing inplace
, so it's going to return a new frame):
movies.reindex([1, 6, 101, 2038401])
title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | rtAllCriticsNumRotten | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
1 | Toy story | 114709.0 | Toy story (juguetes) | http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... | 1995.0 | toy_story | 9.0 | 73.0 | 73.0 | 0.0 | 100.0 | 8.5 | 17.0 | 17.0 | 0.0 | 100.0 | 3.7 | 102338.0 | 81.0 | http://content7.flixster.com/movie/10/93/63/10... |
6 | Heat | 113277.0 | Heat | http://ia.media-imdb.com/images/M/MV5BMTM1NDc4... | 1995.0 | 1068182-heat | 7.7 | 58.0 | 50.0 | 8.0 | 86.0 | 7.2 | 17.0 | 14.0 | 3.0 | 82.0 | 3.9 | 42785.0 | 92.0 | http://content9.flixster.com/movie/26/80/26809... |
101 | Bottle Rocket | 115734.0 | Ladrón que roba a otro ladrón | http://ia.media-imdb.com/images/M/MV5BMTk1ODI3... | 1996.0 | bottle_rocket | 6.6 | 42.0 | 33.0 | 9.0 | 78.0 | 6.1 | 9.0 | 5.0 | 4.0 | 55.0 | 3.7 | 16217.0 | 80.0 | http://content6.flixster.com/movie/26/73/26732... |
2038401 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
There is no movie 2038401, so it fills in NaN
everywhere.
Alignment¶
Another useful operation is align
.
It reindexes two series so that their indexes match, and we can then perform more operations on them.
When you do vectorized operations on two series, Pandas first aligns their indexes and then does the operation on the corresponding values.
For example, we can treat users' ratings as item-dimensional vectors, and compute the cosine as a measure of how similar their ratings are.
$$\mathrm{cos}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|_2 \|\vec{x}\|_2}$$First, let's get the rating vectors for two users:
u1_rates = idx_ratings.loc[75, 'rating']
u2_rates = idx_ratings.loc[190, 'rating']
Let's multiply their ratings - Pandas will automatically align them:
u12_mult = u1_rates * u2_rates
u12_mult.describe()
count 22.000000 mean 13.534091 std 6.139400 min 2.250000 25% 12.062500 50% 15.375000 75% 17.500000 max 22.500000 Name: rating, dtype: float64
u12_mult
movieID 1 NaN 2 NaN 3 NaN 7 NaN 32 13.5 ... 54001 NaN 54259 NaN 58559 NaN 59315 NaN 64497 NaN Name: rating, Length: 329, dtype: float64
There are 22 values that aren't NaN
- those are the values where both users have a rating. This is because the alignment fills in NaN, and NaN times a rating value is NaN.
To finish the cosine, we can compute the sum of this, and the norms of the two vectors:
np.sum(u12_mult) / (spla.norm(u1_rates) * spla.norm(u2_rates))
0.19577120060037112
The scipy.linalg.norm
function computes the Euclidean norm $\| \cdot \|_2$.
But what does this alignment do? We can do it ourselves with .align
. First let's look at our vectors:
u1_rates
movieID 3 1.0 32 4.5 110 4.0 160 2.0 163 4.0 165 4.5 173 3.5 296 5.0 353 3.5 420 2.0 589 4.0 653 3.0 832 4.5 920 0.5 996 4.5 1036 4.0 1127 3.5 1215 4.5 1233 4.0 1304 2.5 1370 4.0 1374 4.0 1485 4.0 1527 4.5 1917 2.5 2011 2.0 2054 1.5 2058 4.0 2490 4.0 2571 4.5 2640 3.0 2688 3.0 2700 4.5 2762 3.5 2959 4.5 3258 1.5 3793 3.0 3889 3.0 3994 3.5 4993 3.5 5107 3.0 5833 2.5 5952 3.5 6213 4.0 6225 0.5 6333 4.0 7000 3.5 7007 4.5 7153 3.5 32029 4.5 32587 5.0 33437 3.5 39052 3.5 45431 3.5 45722 4.5 Name: rating, dtype: float64
u2_rates
movieID 1 4.5 2 4.0 7 2.0 32 3.0 95 2.0 ... 54001 4.5 54259 4.0 58559 4.5 59315 4.0 64497 3.0 Name: rating, Length: 296, dtype: float64
The align
method returns a tuple of the aligned series:
u1a, u2a = u1_rates.align(u2_rates)
u1a
movieID 1 NaN 2 NaN 3 1.0 7 NaN 32 4.5 ... 54001 NaN 54259 NaN 58559 NaN 59315 NaN 64497 NaN Name: rating, Length: 329, dtype: float64
u2a
movieID 1 4.5 2 4.0 3 NaN 7 2.0 32 3.0 ... 54001 4.5 54259 4.0 58559 4.5 59315 4.0 64497 3.0 Name: rating, Length: 329, dtype: float64
They now have the same index, and the values are put in place. Missing values are filled in with NaN
. These vectors are now ready to be multipled, or have other operations applied!
Index Gotchas¶
Indexes do not have to be unique - the same key can be in multiple rows.
I don't recommend using non-unique indexes. They often aren't terribly useful, and they very quickly get slow.
As described above, hierarchical indexes also have problems, particularly with groupby
. They can eat an incredible amount of memory, send your system deep into swap, and slow it to a crawl.
Learning More¶
- The textbook
- Indexing and Selecting Data in the Pandas user guide
- MultiIndex / Advanced Indexing in the Pandas user guide