Indexing#

This notebook discusses Pandas indexing operations, and pitfalls.

Setup#

First our Python modules:

import numpy as np
import pandas as pd
import scipy.linalg as spla
import matplotlib.pyplot as plt
import seaborn as sns

And then the movie data, as usual. Movies:

movies = pd.read_csv('hetrec2011-ml/movies.dat', delimiter='\t', encoding='latin1', na_values=['\\N'])
movies.head()
id title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh ... rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
0 1 Toy story 114709 Toy story (juguetes) http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... 1995 toy_story 9.0 73.0 73.0 ... 100.0 8.5 17.0 17.0 0.0 100.0 3.7 102338.0 81.0 http://content7.flixster.com/movie/10/93/63/10...
1 2 Jumanji 113497 Jumanji http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... 1995 1068044-jumanji 5.6 28.0 13.0 ... 46.0 5.8 5.0 2.0 3.0 40.0 3.2 44587.0 61.0 http://content8.flixster.com/movie/56/79/73/56...
2 3 Grumpy Old Men 107050 Dos viejos gruñones http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... 1993 grumpy_old_men 5.9 36.0 24.0 ... 66.0 7.0 6.0 5.0 1.0 83.0 3.2 10489.0 66.0 http://content6.flixster.com/movie/25/60/25602...
3 4 Waiting to Exhale 114885 Esperando un respiro http://ia.media-imdb.com/images/M/MV5BMTczMTMy... 1995 waiting_to_exhale 5.6 25.0 14.0 ... 56.0 5.5 11.0 5.0 6.0 45.0 3.3 5666.0 79.0 http://content9.flixster.com/movie/10/94/17/10...
4 5 Father of the Bride Part II 113041 Vuelve el padre de la novia (Ahora también abu... http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... 1995 father_of_the_bride_part_ii 5.3 19.0 9.0 ... 47.0 5.4 5.0 1.0 4.0 20.0 3.0 13761.0 64.0 http://content8.flixster.com/movie/25/54/25542...

5 rows × 21 columns

And ratings:

ratings = pd.read_csv('hetrec2011-ml/user_ratedmovies-timestamps.dat', delimiter='\t', encoding='latin1')
ratings.head()
userID movieID rating timestamp
0 75 3 1.0 1162160236000
1 75 32 4.5 1162160624000
2 75 110 4.0 1162161008000
3 75 160 2.0 1162160212000
4 75 163 4.0 1162160970000

And let’s load tags:

tags = pd.read_csv('hetrec2011-ml/tags.dat', delimiter='\t', encoding='latin1')
tags.head()
id value
0 1 earth
1 2 police
2 3 boxing
3 4 painter
4 5 whale

What Is an Index?#

An index is what we use to locate rows or columns in a Pandas data structure. (While conceptually we can think of a data frame as a dictionary of columns, the name-to-column mapping is actually done with an index.)

An index is two things in one data structure:

  • An array of key values (given a position, we can get the key value)

  • A hash map to look up row position(s) for a given key value

Indexes keys can be any (immutable) data type. Pandas has special support for indexing over datetime64 objects, for time series.

Range Index#

When we load the data from the CSV file, it by default it is loaded with a RangeIndex, in which the keys and the positions are the same. These indexes are very memory-efficient, because they don’t actually need to store key values. Let’s see the movie index:

movies.index
RangeIndex(start=0, stop=10197, step=1)

With a range index, loc and iloc do the same thing (for rows):

movies.loc[5, ['id', 'title']]
id          6
title    Heat
Name: 5, dtype: object
movies.iloc[5, 0:2]
id          6
title    Heat
Name: 5, dtype: object

Setting an Index#

The set_index method makes a column into the index. It will replace the current index; if the current index is a range index, that is usually fine.

There are two versions:

  • By default, it returns a new data frame with the new index, and leaves the original unchanged. The frame shares underlying data with the original, so it doesn’t take up that much memory.

  • With inplace=True, it modifies the data frame to have the new index.

Both are useful. Be very careful with inplace in a function; the caller probably doesn’t expect the data frame to be modified!

Let’s index the movies frame by movie ID:

movies = movies.set_index('id')
movies.index
Int64Index([    1,     2,     3,     4,     5,     6,     7,     8,     9,
               10,
            ...
            64997, 64999, 65006, 65011, 65037, 65088, 65091, 65126, 65130,
            65133],
           dtype='int64', name='id', length=10197)

The new index is also reflected in .info(), and the id column is gone:

movies.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10197 entries, 1 to 65133
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   title                   10197 non-null  object 
 1   imdbID                  10197 non-null  int64  
 2   spanishTitle            10197 non-null  object 
 3   imdbPictureURL          10016 non-null  object 
 4   year                    10197 non-null  int64  
 5   rtID                    9886 non-null   object 
 6   rtAllCriticsRating      9967 non-null   float64
 7   rtAllCriticsNumReviews  9967 non-null   float64
 8   rtAllCriticsNumFresh    9967 non-null   float64
 9   rtAllCriticsNumRotten   9967 non-null   float64
 10  rtAllCriticsScore       9967 non-null   float64
 11  rtTopCriticsRating      9967 non-null   float64
 12  rtTopCriticsNumReviews  9967 non-null   float64
 13  rtTopCriticsNumFresh    9967 non-null   float64
 14  rtTopCriticsNumRotten   9967 non-null   float64
 15  rtTopCriticsScore       9967 non-null   float64
 16  rtAudienceRating        9967 non-null   float64
 17  rtAudienceNumRatings    9967 non-null   float64
 18  rtAudienceScore         9967 non-null   float64
 19  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(2), object(5)
memory usage: 1.9+ MB

We now have an index that lets us up by 64-bit integer keys. We can now look up Heat by movie ID:

movies.loc[6]
title                                                                  Heat
imdbID                                                               113277
spanishTitle                                                           Heat
imdbPictureURL            http://ia.media-imdb.com/images/M/MV5BMTM1NDc4...
year                                                                   1995
rtID                                                           1068182-heat
rtAllCriticsRating                                                      7.7
rtAllCriticsNumReviews                                                   58
rtAllCriticsNumFresh                                                     50
rtAllCriticsNumRotten                                                     8
rtAllCriticsScore                                                        86
rtTopCriticsRating                                                      7.2
rtTopCriticsNumReviews                                                   17
rtTopCriticsNumFresh                                                     14
rtTopCriticsNumRotten                                                     3
rtTopCriticsScore                                                        82
rtAudienceRating                                                        3.9
rtAudienceNumRatings                                                  42785
rtAudienceScore                                                          92
rtPictureURL              http://content9.flixster.com/movie/26/80/26809...
Name: 6, dtype: object

iloc still returns this at position 5:

movies.iloc[5]
title                                                                  Heat
imdbID                                                               113277
spanishTitle                                                           Heat
imdbPictureURL            http://ia.media-imdb.com/images/M/MV5BMTM1NDc4...
year                                                                   1995
rtID                                                           1068182-heat
rtAllCriticsRating                                                      7.7
rtAllCriticsNumReviews                                                   58
rtAllCriticsNumFresh                                                     50
rtAllCriticsNumRotten                                                     8
rtAllCriticsScore                                                        86
rtTopCriticsRating                                                      7.2
rtTopCriticsNumReviews                                                   17
rtTopCriticsNumFresh                                                     14
rtTopCriticsNumRotten                                                     3
rtTopCriticsScore                                                        82
rtAudienceRating                                                        3.9
rtAudienceNumRatings                                                  42785
rtAudienceScore                                                          92
rtPictureURL              http://content9.flixster.com/movie/26/80/26809...
Name: 6, dtype: object

The reset_index method will turn the index back into a column, and put a range index back in place:

movies.reset_index().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB

Index Properties and Operations#

We can do quite a few things with an Index.

First, it has a name:

movies.index.name
'id'

We could change the name if we wanted to, by assigning to it:

movies.index.name = 'movieId'

But we won’t actually do that right now.

Arrays and Keys#

The index is also an array - we can get the key at a certain position:

movies.index[5]
6

To go the other way — find the location or position for a key value — we can use the get_loc method:

movies.index.get_loc(6)
5

If we look for a nonexistent key, it fails:

movies.index.get_loc(30481080)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 30481080

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-17-937b631c964d> in <module>
----> 1 movies.index.get_loc(30481080)

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 30481080

This is a KeyError, meaning we tried to look up a key that does not exist.

Internally, the index uses a NumPy ndarray to store the keys. We can access this array as .values:

movies.index.values
array([    1,     2,     3, ..., 65126, 65130, 65133], dtype=int64)

Index Characteristics#

There are a few useful characteristics an index can have. It may be unique, or it may have duplicates. This is a unique index:

movies.index.is_unique
True
movies.index.has_duplicates
False

An index can also be monotonic, or sorted; the is_monotonic accessor is a shortcut for is_monotonic_increasing and returns True if the index is sorted in nondecreasing order:

movies.index.is_monotonic
True

Since this index is both monotonic and unique, it is sorted in strictly increasing order.

Since it is monotonic, we can also use ranges for key-based lookup:

movies.loc[6:10]
title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
id
6 Heat 113277 Heat http://ia.media-imdb.com/images/M/MV5BMTM1NDc4... 1995 1068182-heat 7.7 58.0 50.0 8.0 86.0 7.2 17.0 14.0 3.0 82.0 3.9 42785.0 92.0 http://content9.flixster.com/movie/26/80/26809...
7 Sabrina 47437 Sabrina http://ia.media-imdb.com/images/M/MV5BMTYyNDM1... 1954 1018047-sabrina 7.4 31.0 28.0 3.0 90.0 7.2 5.0 5.0 0.0 100.0 3.8 12812.0 87.0 http://content7.flixster.com/movie/10/93/36/10...
8 Tom and Huck 112302 Tom y Huck http://ia.media-imdb.com/images/M/MV5BMTUxNDYz... 1995 tom_and_huck 4.2 8.0 2.0 6.0 25.0 0.0 2.0 1.0 1.0 50.0 2.7 2649.0 45.0 http://content9.flixster.com/movie/26/16/26169...
9 Sudden Death 114576 Sudden Death: muerte súbita http://ia.media-imdb.com/images/M/MV5BMTcwMTU2... 1995 1068470-sudden_death 5.2 32.0 17.0 15.0 53.0 5.6 9.0 5.0 4.0 55.0 2.6 3626.0 40.0 http://content8.flixster.com/movie/27/91/27912...
10 GoldenEye 113189 GoldenEye http://ia.media-imdb.com/images/M/MV5BNTE1OTEx... 1995 goldeneye 6.8 41.0 33.0 8.0 80.0 6.2 11.0 7.0 4.0 63.0 3.4 28260.0 78.0 http://content9.flixster.com/movie/26/66/26669...

Note that range-based indexing in .loc includes both ends, unlike slices in almost any other part of Python.

Unsorted Indexes: While we might expect slice-based key indexing to get things in the range, that is not what it does. It locates the positions of the start and end keys, and returns everything between them. If the index is not sorted, this is usually not useful. Only use key slices on sorted indexes.

Hierarchical Indexing#

Pandas supports hierarchical indexes, with MultiIndex. Create a hierarchical index by passing multiple columns to set_index.

For example, let’s make a series that indexes ratings by user and item IDs:

idx_ratings = ratings.set_index(['userID', 'movieID'])
idx_ratings
rating timestamp
userID movieID
75 3 1.0 1162160236000
32 4.5 1162160624000
110 4.0 1162161008000
160 2.0 1162160212000
163 4.0 1162160970000
... ... ... ...
71534 44555 4.0 1196647538000
46578 4.0 1196647004000
48516 4.5 1196646826000
61075 5.0 1223625365000
62049 4.5 1223625490000

855598 rows × 2 columns

We see on the left that there are two levels to the index: userID and movieID. We can see this with .names:

idx_ratings.index.names
FrozenList(['userID', 'movieID'])

The multi-level index lets us look up values by just the outermost index, getting all rows with that outer index value. For example, to get all ratings by user 190:

idx_ratings.loc[190]
rating timestamp
movieID
1 4.5 1057778398000
2 4.0 1053457414000
7 2.0 1049823937000
32 3.0 1151336652000
95 2.0 1078019174000
... ... ...
54001 4.5 1185219688000
54259 4.0 1217875029000
58559 4.5 1217258754000
59315 4.0 1210627067000
64497 3.0 1230991072000

296 rows × 2 columns

Since we are looking up one user, it drops the user level from the index. If we looked up multiple users, it would not:

idx_ratings.loc[[75, 190]]
rating timestamp
userID movieID
75 3 1.0 1162160236000
32 4.5 1162160624000
110 4.0 1162161008000
160 2.0 1162160212000
163 4.0 1162160970000
... ... ... ...
190 54001 4.5 1185219688000
54259 4.0 1217875029000
58559 4.5 1217258754000
59315 4.0 1210627067000
64497 3.0 1230991072000

351 rows × 2 columns

Multi-indexes are one place where Python makes a distinction between lists and tuples. If we look up (75, 32), it will look up user 75’s rating for item 32:

idx_ratings.loc[(75, 32)]
rating       4.500000e+00
timestamp    1.162161e+12
Name: (75, 32), dtype: float64

We can index by a list of tuples:

idx_ratings.loc[[(75, 32), (190, 58559)]]
rating timestamp
userID movieID
75 32 4.5 1162160624000
190 58559 4.5 1217258754000

So we have the following principle:

  • The index locator can be a key, or a list or array of keys

  • A key can be a single value, looking up in the outermost index level; or a tuple of values, looking up by multiple levels simultaneously

Be careful: hierarchical indexes are not as useful as they look. They are great for lookups, and joins, but they often have problems too. In particular, if you have a large data frame with a hiearchical index, there are several operations — including groupby — that will cause memory explosion. If you call melt with multiple ID columns, it will turn them into a hierarchical index.

I usually use hierarchical indexes for two purpose:

  • Storing data for multidimensional lookup, like the user-and-item ratings

  • Joining on mulitiple simultaneous columns

In both cases, once I have the data I need, I usually strip away the multi-index before moving forward (with reset_index, or with a join or lookup operation that leaves the data frame with a normal one-level index) to reduce the risk of unexpected memory explosion.

Sorting and Realigning#

We can rearrange data frames based on their indexes.

Sorting#

One useful operation to realign a data frame (or series) with respect to its index is to sort the index. Our index is already sorted by ID, but what if we wanted to use IMDB IDs?

imdb_movies = movies.reset_index().set_index('imdbID')
imdb_movies
id title spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
imdbID
114709 1 Toy story Toy story (juguetes) http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... 1995 toy_story 9.0 73.0 73.0 0.0 100.0 8.5 17.0 17.0 0.0 100.0 3.7 102338.0 81.0 http://content7.flixster.com/movie/10/93/63/10...
113497 2 Jumanji Jumanji http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... 1995 1068044-jumanji 5.6 28.0 13.0 15.0 46.0 5.8 5.0 2.0 3.0 40.0 3.2 44587.0 61.0 http://content8.flixster.com/movie/56/79/73/56...
107050 3 Grumpy Old Men Dos viejos gruñones http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... 1993 grumpy_old_men 5.9 36.0 24.0 12.0 66.0 7.0 6.0 5.0 1.0 83.0 3.2 10489.0 66.0 http://content6.flixster.com/movie/25/60/25602...
114885 4 Waiting to Exhale Esperando un respiro http://ia.media-imdb.com/images/M/MV5BMTczMTMy... 1995 waiting_to_exhale 5.6 25.0 14.0 11.0 56.0 5.5 11.0 5.0 6.0 45.0 3.3 5666.0 79.0 http://content9.flixster.com/movie/10/94/17/10...
113041 5 Father of the Bride Part II Vuelve el padre de la novia (Ahora también abu... http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... 1995 father_of_the_bride_part_ii 5.3 19.0 9.0 10.0 47.0 5.4 5.0 1.0 4.0 20.0 3.0 13761.0 64.0 http://content8.flixster.com/movie/25/54/25542...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
960731 65088 Bedtime Stories Más allá de los sueños http://ia.media-imdb.com/images/M/MV5BMjA5Njk5... 2008 bedtime_stories 4.4 104.0 26.0 78.0 25.0 4.7 26.0 6.0 20.0 23.0 3.5 108877.0 63.0 http://content6.flixster.com/movie/10/94/33/10...
25464 65091 Manhattan Melodrama El enemigo público número 1 http://ia.media-imdb.com/images/M/MV5BMTUyODE3... 1934 manhattan_melodrama 7.0 12.0 10.0 2.0 83.0 0.0 4.0 2.0 2.0 50.0 3.7 344.0 71.0 http://content9.flixster.com/movie/66/44/64/66...
1024715 65126 Choke Choke http://ia.media-imdb.com/images/M/MV5BMTMxMDI4... 2008 choke 5.6 135.0 73.0 62.0 54.0 4.9 26.0 8.0 18.0 30.0 3.3 13893.0 55.0 http://content6.flixster.com/movie/10/85/09/10...
959337 65130 Revolutionary Road Revolutionary Road http://ia.media-imdb.com/images/M/MV5BMTI2MzY2... 2008 revolutionary_road 6.7 194.0 133.0 61.0 68.0 6.9 36.0 25.0 11.0 69.0 3.5 46044.0 70.0 http://content8.flixster.com/movie/10/88/40/10...
212579 65133 Blackadder Back & Forth Blackadder Back & Forth http://ia.media-imdb.com/images/M/MV5BMjA5MjU4... 1999 blackadder-back-forth 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 http://content7.flixster.com/movie/34/10/69/34...

10197 rows × 20 columns

Now let’s sort and show again:

imdb_movies.sort_index(inplace=True)
imdb_movies
id title spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
imdbID
439 6990 The Great Train Robbery Asalto y robo al tren http://ia.media-imdb.com/images/M/MV5BMTk3OTQy... 1903 the-great-train-robbery1903 7.6 6.0 6.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 http://content7.flixster.com/movie/10/87/70/10...
4972 7065 The Birth of a Nation El nacimiento de una nación http://ia.media-imdb.com/images/M/MV5BMjI2MzU4... 1915 birth_of_a_nation 8.0 31.0 31.0 0.0 100.0 0.0 4.0 4.0 0.0 100.0 3.3 1249.0 58.0 http://content9.flixster.com/movie/10/88/58/10...
6864 7243 Intolerance: Love's Struggle Throughout the Ages Intolerancia http://ia.media-imdb.com/images/M/MV5BODEyNDUy... 1916 1010644-intolerance 7.8 23.0 22.0 1.0 95.0 0.0 3.0 3.0 0.0 100.0 3.8 1397.0 77.0 http://content6.flixster.com/movie/31/14/31140...
8133 8511 The Immigrant Charlot emigrante http://ia.media-imdb.com/images/M/MV5BMjExNDE5... 1917 immigrant 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 http://content7.flixster.com/movie/10/85/69/10...
8395 48374 Otets Sergiy El padre Sergio NaN 1917 otets-sergiy-father-sergius 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 http://content7.flixster.com/movie/10/91/43/10...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1229827 4260 Jonas Brothers: The 3D Concert Experience Jonas Brothers: The 3D Concert Experience http://ia.media-imdb.com/images/M/MV5BMTU3NzE4... 2009 jonas_brothers_3d 4.2 71.0 16.0 55.0 22.0 4.6 22.0 7.0 15.0 31.0 3.3 74109.0 64.0 http://content7.flixster.com/movie/10/92/39/10...
1233247 757 Dung che sai duk Este contraveneno del Oeste http://ia.media-imdb.com/images/M/MV5BMTc0NzMx... 1994 ashes_of_time_redux 6.6 69.0 54.0 15.0 78.0 6.7 18.0 14.0 4.0 77.0 3.6 2166.0 71.0 http://content8.flixster.com/movie/10/93/19/10...
1233381 63698 Üç maymun Tres monos http://ia.media-imdb.com/images/M/MV5BMTMwNjM3... 2008 three_monkeys 6.7 56.0 43.0 13.0 76.0 7.7 13.0 12.0 1.0 92.0 3.6 1239.0 72.0 http://content8.flixster.com/movie/11/13/29/11...
1275532 4457 Bianca Beauchamp All Access 2: Rubberized Bianca Beauchamp All Access 2: Rubberized http://ia.media-imdb.com/images/M/MV5BMTI1ODQ1... 2008 bianca-beauchamp-all-access-2-rubberized 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 http://content6.flixster.com/movie/10/88/44/10...
1349938 418 Being Human Being Human http://ia.media-imdb.com/images/M/MV5BMjA1MzIw... 2008 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10197 rows × 20 columns

Now the index reports that it’s sorted too:

imdb_movies.index.is_monotonic
True

Reindexing#

Another operation on an index is a reindex. When we reindex a data frame or series, we give it an array of key values, and it rearranges elements to correspond to the new array. It will insert missing values for keys that don’t exist in the original frame.

So let’s reindex the movie frame (we are not passing inplace, so it’s going to return a new frame):

movies.reindex([1, 6, 101, 2038401])
title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
id
1 Toy story 114709.0 Toy story (juguetes) http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... 1995.0 toy_story 9.0 73.0 73.0 0.0 100.0 8.5 17.0 17.0 0.0 100.0 3.7 102338.0 81.0 http://content7.flixster.com/movie/10/93/63/10...
6 Heat 113277.0 Heat http://ia.media-imdb.com/images/M/MV5BMTM1NDc4... 1995.0 1068182-heat 7.7 58.0 50.0 8.0 86.0 7.2 17.0 14.0 3.0 82.0 3.9 42785.0 92.0 http://content9.flixster.com/movie/26/80/26809...
101 Bottle Rocket 115734.0 Ladrón que roba a otro ladrón http://ia.media-imdb.com/images/M/MV5BMTk1ODI3... 1996.0 bottle_rocket 6.6 42.0 33.0 9.0 78.0 6.1 9.0 5.0 4.0 55.0 3.7 16217.0 80.0 http://content6.flixster.com/movie/26/73/26732...
2038401 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

There is no movie 2038401, so it fills in NaN everywhere.

Alignment#

Another useful operation is align. It reindexes two series so that their indexes match, and we can then perform more operations on them. When you do vectorized operations on two series, Pandas first aligns their indexes and then does the operation on the corresponding values.

For example, we can treat users’ ratings as item-dimensional vectors, and compute the cosine as a measure of how similar their ratings are.

\[\mathrm{cos}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|_2 \|\vec{x}\|_2}\]

First, let’s get the rating vectors for two users:

u1_rates = idx_ratings.loc[75, 'rating']
u2_rates = idx_ratings.loc[190, 'rating']

Let’s multiply their ratings - Pandas will automatically align them:

u12_mult = u1_rates * u2_rates
u12_mult.describe()
count    22.000000
mean     13.534091
std       6.139400
min       2.250000
25%      12.062500
50%      15.375000
75%      17.500000
max      22.500000
Name: rating, dtype: float64
u12_mult
movieID
1         NaN
2         NaN
3         NaN
7         NaN
32       13.5
         ... 
54001     NaN
54259     NaN
58559     NaN
59315     NaN
64497     NaN
Name: rating, Length: 329, dtype: float64

There are 22 values that aren’t NaN - those are the values where both users have a rating. This is because the alignment fills in NaN, and NaN times a rating value is NaN.

To finish the cosine, we can compute the sum of this, and the norms of the two vectors:

np.sum(u12_mult) / (spla.norm(u1_rates) * spla.norm(u2_rates))
0.19577120060037112

The scipy.linalg.norm function computes the Euclidean norm \(\| \cdot \|_2\).

But what does this alignment do? We can do it ourselves with .align. First let’s look at our vectors:

u1_rates 
movieID
3        1.0
32       4.5
110      4.0
160      2.0
163      4.0
165      4.5
173      3.5
296      5.0
353      3.5
420      2.0
589      4.0
653      3.0
832      4.5
920      0.5
996      4.5
1036     4.0
1127     3.5
1215     4.5
1233     4.0
1304     2.5
1370     4.0
1374     4.0
1485     4.0
1527     4.5
1917     2.5
2011     2.0
2054     1.5
2058     4.0
2490     4.0
2571     4.5
2640     3.0
2688     3.0
2700     4.5
2762     3.5
2959     4.5
3258     1.5
3793     3.0
3889     3.0
3994     3.5
4993     3.5
5107     3.0
5833     2.5
5952     3.5
6213     4.0
6225     0.5
6333     4.0
7000     3.5
7007     4.5
7153     3.5
32029    4.5
32587    5.0
33437    3.5
39052    3.5
45431    3.5
45722    4.5
Name: rating, dtype: float64
u2_rates
movieID
1        4.5
2        4.0
7        2.0
32       3.0
95       2.0
        ... 
54001    4.5
54259    4.0
58559    4.5
59315    4.0
64497    3.0
Name: rating, Length: 296, dtype: float64

The align method returns a tuple of the aligned series:

u1a, u2a = u1_rates.align(u2_rates)
u1a
movieID
1        NaN
2        NaN
3        1.0
7        NaN
32       4.5
        ... 
54001    NaN
54259    NaN
58559    NaN
59315    NaN
64497    NaN
Name: rating, Length: 329, dtype: float64
u2a
movieID
1        4.5
2        4.0
3        NaN
7        2.0
32       3.0
        ... 
54001    4.5
54259    4.0
58559    4.5
59315    4.0
64497    3.0
Name: rating, Length: 329, dtype: float64

They now have the same index, and the values are put in place. Missing values are filled in with NaN. These vectors are now ready to be multipled, or have other operations applied!

Index Gotchas#

Indexes do not have to be unique - the same key can be in multiple rows.

I don’t recommend using non-unique indexes. They often aren’t terribly useful, and they very quickly get slow.

As described above, hierarchical indexes also have problems, particularly with groupby. They can eat an incredible amount of memory, send your system deep into swap, and slow it to a crawl.

Learning More#