Indexing¶

This notebook discusses Pandas indexing operations, and pitfalls.

Setup¶

First our Python modules:

In [58]:

import numpy as np
import pandas as pd
import scipy.linalg as spla
import matplotlib.pyplot as plt
import seaborn as sns

And then the movie data, as usual. Movies:

In [2]:

movies = pd.read_csv('hetrec2011-ml/movies.dat', delimiter='\t', encoding='latin1', na_values=['\\N'])
movies.head()

Out[2]:

	id	title	imdbID	spanishTitle	imdbPictureURL	year	rtID	rtAllCriticsRating	rtAllCriticsNumReviews	rtAllCriticsNumFresh	...	rtAllCriticsScore	rtTopCriticsRating	rtTopCriticsNumReviews	rtTopCriticsNumFresh	rtTopCriticsNumRotten	rtTopCriticsScore	rtAudienceRating	rtAudienceNumRatings	rtAudienceScore	rtPictureURL
0	1	Toy story	114709	Toy story (juguetes)	http://ia.media-imdb.com/images/M/MV5BMTMwNDU0...	1995	toy_story	9.0	73.0	73.0	...	100.0	8.5	17.0	17.0	0.0	100.0	3.7	102338.0	81.0	http://content7.flixster.com/movie/10/93/63/10...
1	2	Jumanji	113497	Jumanji	http://ia.media-imdb.com/images/M/MV5BMzM5NjE1...	1995	1068044-jumanji	5.6	28.0	13.0	...	46.0	5.8	5.0	2.0	3.0	40.0	3.2	44587.0	61.0	http://content8.flixster.com/movie/56/79/73/56...
2	3	Grumpy Old Men	107050	Dos viejos gruñones	http://ia.media-imdb.com/images/M/MV5BMTI5MTgy...	1993	grumpy_old_men	5.9	36.0	24.0	...	66.0	7.0	6.0	5.0	1.0	83.0	3.2	10489.0	66.0	http://content6.flixster.com/movie/25/60/25602...
3	4	Waiting to Exhale	114885	Esperando un respiro	http://ia.media-imdb.com/images/M/MV5BMTczMTMy...	1995	waiting_to_exhale	5.6	25.0	14.0	...	56.0	5.5	11.0	5.0	6.0	45.0	3.3	5666.0	79.0	http://content9.flixster.com/movie/10/94/17/10...
4	5	Father of the Bride Part II	113041	Vuelve el padre de la novia (Ahora también abu...	http://ia.media-imdb.com/images/M/MV5BMTg1NDc2...	1995	father_of_the_bride_part_ii	5.3	19.0	9.0	...	47.0	5.4	5.0	1.0	4.0	20.0	3.0	13761.0	64.0	http://content8.flixster.com/movie/25/54/25542...

5 rows × 21 columns

And ratings:

In [3]:

ratings = pd.read_csv('hetrec2011-ml/user_ratedmovies-timestamps.dat', delimiter='\t', encoding='latin1')
ratings.head()

Out[3]:

	userID	movieID	rating	timestamp
0	75	3	1.0	1162160236000
1	75	32	4.5	1162160624000
2	75	110	4.0	1162161008000
3	75	160	2.0	1162160212000
4	75	163	4.0	1162160970000

And let's load tags:

In [4]:

tags = pd.read_csv('hetrec2011-ml/tags.dat', delimiter='\t', encoding='latin1')
tags.head()

Out[4]:

	id	value
0	1	earth
1	2	police
2	3	boxing
3	4	painter
4	5	whale

What Is an Index?¶

An index is what we use to locate rows or columns in a Pandas data structure. (While conceptually we can think of a data frame as a dictionary of columns, the name-to-column mapping is actually done with an index.)

An index is two things in one data structure:

An array of key values (given a position, we can get the key value)
A hash map to look up row position(s) for a given key value

Indexes keys can be any (immutable) data type. Pandas has special support for indexing over datetime64 objects, for time series.

Range Index¶

When we load the data from the CSV file, it by default it is loaded with a RangeIndex, in which the keys and the positions are the same. These indexes are very memory-efficient, because they don't actually need to store key values. Let's see the movie index:

In [5]:

movies.index

Out[5]:

RangeIndex(start=0, stop=10197, step=1)

With a range index, loc and iloc do the same thing (for rows):

In [8]:

movies.loc[5, ['id', 'title']]

Out[8]:

id          6
title    Heat
Name: 5, dtype: object

In [10]:

movies.iloc[5, 0:2]

Out[10]:

id          6
title    Heat
Name: 5, dtype: object

Setting an Index¶

The set_index method makes a column into the index. It will replace the current index; if the current index is a range index, that is usually fine.

There are two versions:

By default, it returns a new data frame with the new index, and leaves the original unchanged. The frame shares underlying data with the original, so it doesn't take up that much memory.
With inplace=True, it modifies the data frame to have the new index.

Both are useful. Be very careful with inplace in a function; the caller probably doesn't expect the data frame to be modified!

Let's index the movies frame by movie ID:

In [11]:

movies = movies.set_index('id')
movies.index

Out[11]:

Int64Index([    1,     2,     3,     4,     5,     6,     7,     8,     9,
               10,
            ...
            64997, 64999, 65006, 65011, 65037, 65088, 65091, 65126, 65130,
            65133],
           dtype='int64', name='id', length=10197)

The new index is also reflected in .info(), and the id column is gone:

In [25]:

movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10197 entries, 1 to 65133
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   title                   10197 non-null  object 
 1   imdbID                  10197 non-null  int64  
 2   spanishTitle            10197 non-null  object 
 3   imdbPictureURL          10016 non-null  object 
 4   year                    10197 non-null  int64  
 5   rtID                    9886 non-null   object 
 6   rtAllCriticsRating      9967 non-null   float64
 7   rtAllCriticsNumReviews  9967 non-null   float64
 8   rtAllCriticsNumFresh    9967 non-null   float64
 9   rtAllCriticsNumRotten   9967 non-null   float64
 10  rtAllCriticsScore       9967 non-null   float64
 11  rtTopCriticsRating      9967 non-null   float64
 12  rtTopCriticsNumReviews  9967 non-null   float64
 13  rtTopCriticsNumFresh    9967 non-null   float64
 14  rtTopCriticsNumRotten   9967 non-null   float64
 15  rtTopCriticsScore       9967 non-null   float64
 16  rtAudienceRating        9967 non-null   float64
 17  rtAudienceNumRatings    9967 non-null   float64
 18  rtAudienceScore         9967 non-null   float64
 19  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(2), object(5)
memory usage: 1.9+ MB

We now have an index that lets us up by 64-bit integer keys. We can now look up Heat by movie ID:

In [12]:

movies.loc[6]

Out[12]:

title                                                                  Heat
imdbID                                                               113277
spanishTitle                                                           Heat
imdbPictureURL            http://ia.media-imdb.com/images/M/MV5BMTM1NDc4...
year                                                                   1995
rtID                                                           1068182-heat
rtAllCriticsRating                                                      7.7
rtAllCriticsNumReviews                                                   58
rtAllCriticsNumFresh                                                     50
rtAllCriticsNumRotten                                                     8
rtAllCriticsScore                                                        86
rtTopCriticsRating                                                      7.2
rtTopCriticsNumReviews                                                   17
rtTopCriticsNumFresh                                                     14
rtTopCriticsNumRotten                                                     3
rtTopCriticsScore                                                        82
rtAudienceRating                                                        3.9
rtAudienceNumRatings                                                  42785
rtAudienceScore                                                          92
rtPictureURL              http://content9.flixster.com/movie/26/80/26809...
Name: 6, dtype: object

iloc still returns this at position 5:

In [13]:

movies.iloc[5]

Out[13]:

title                                                                  Heat
imdbID                                                               113277
spanishTitle                                                           Heat
imdbPictureURL            http://ia.media-imdb.com/images/M/MV5BMTM1NDc4...
year                                                                   1995
rtID                                                           1068182-heat
rtAllCriticsRating                                                      7.7
rtAllCriticsNumReviews                                                   58
rtAllCriticsNumFresh                                                     50
rtAllCriticsNumRotten                                                     8
rtAllCriticsScore                                                        86
rtTopCriticsRating                                                      7.2
rtTopCriticsNumReviews                                                   17
rtTopCriticsNumFresh                                                     14
rtTopCriticsNumRotten                                                     3
rtTopCriticsScore                                                        82
rtAudienceRating                                                        3.9
rtAudienceNumRatings                                                  42785
rtAudienceScore                                                          92
rtPictureURL              http://content9.flixster.com/movie/26/80/26809...
Name: 6, dtype: object

The reset_index method will turn the index back into a column, and put a range index back in place:

In [24]:

movies.reset_index().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB

Index Properties and Operations¶

We can do quite a few things with an Index.

First, it has a name:

In [14]:

movies.index.name

Out[14]:

'id'

We could change the name if we wanted to, by assigning to it:

movies.index.name = 'movieId'

But we won't actually do that right now.

Arrays and Keys¶

The index is also an array - we can get the key at a certain position:

In [15]:

movies.index[5]

Out[15]:

To go the other way — find the location or position for a key value — we can use the get_loc method:

In [16]:

movies.index.get_loc(6)

Out[16]:

If we look for a nonexistent key, it fails:

In [17]:

movies.index.get_loc(30481080)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 30481080

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-17-937b631c964d> in <module>
----> 1 movies.index.get_loc(30481080)

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 30481080

This is a KeyError, meaning we tried to look up a key that does not exist.

Internally, the index uses a NumPy ndarray to store the keys. We can access this array as .values:

In [18]:

movies.index.values

Out[18]:

array([    1,     2,     3, ..., 65126, 65130, 65133], dtype=int64)

Index Characteristics¶

There are a few useful characteristics an index can have. It may be unique, or it may have duplicates. This is a unique index:

In [19]:

movies.index.is_unique

Out[19]:

True

In [20]:

movies.index.has_duplicates

Out[20]:

False

An index can also be monotonic, or sorted; the is_monotonic accessor is a shortcut for is_monotonic_increasing and returns True if the index is sorted in nondecreasing order:

In [22]:

movies.index.is_monotonic

Out[22]:

True

Since this index is both monotonic and unique, it is sorted in strictly increasing order.

Since it is monotonic, we can also use ranges for key-based lookup:

In [23]:

movies.loc[6:10]

Out[23]:

	title	imdbID	spanishTitle	imdbPictureURL	year	rtID	rtAllCriticsRating	rtAllCriticsNumReviews	rtAllCriticsNumFresh	rtAllCriticsNumRotten	rtAllCriticsScore	rtTopCriticsRating	rtTopCriticsNumReviews	rtTopCriticsNumFresh	rtTopCriticsNumRotten	rtTopCriticsScore	rtAudienceRating	rtAudienceNumRatings	rtAudienceScore	rtPictureURL
id
6	Heat	113277	Heat	http://ia.media-imdb.com/images/M/MV5BMTM1NDc4...	1995	1068182-heat	7.7	58.0	50.0	8.0	86.0	7.2	17.0	14.0	3.0	82.0	3.9	42785.0	92.0	http://content9.flixster.com/movie/26/80/26809...
7	Sabrina	47437	Sabrina	http://ia.media-imdb.com/images/M/MV5BMTYyNDM1...	1954	1018047-sabrina	7.4	31.0	28.0	3.0	90.0	7.2	5.0	5.0	0.0	100.0	3.8	12812.0	87.0	http://content7.flixster.com/movie/10/93/36/10...
8	Tom and Huck	112302	Tom y Huck	http://ia.media-imdb.com/images/M/MV5BMTUxNDYz...	1995	tom_and_huck	4.2	8.0	2.0	6.0	25.0	0.0	2.0	1.0	1.0	50.0	2.7	2649.0	45.0	http://content9.flixster.com/movie/26/16/26169...
9	Sudden Death	114576	Sudden Death: muerte súbita	http://ia.media-imdb.com/images/M/MV5BMTcwMTU2...	1995	1068470-sudden_death	5.2	32.0	17.0	15.0	53.0	5.6	9.0	5.0	4.0	55.0	2.6	3626.0	40.0	http://content8.flixster.com/movie/27/91/27912...
10	GoldenEye	113189	GoldenEye	http://ia.media-imdb.com/images/M/MV5BNTE1OTEx...	1995	goldeneye	6.8	41.0	33.0	8.0	80.0	6.2	11.0	7.0	4.0	63.0	3.4	28260.0	78.0	http://content9.flixster.com/movie/26/66/26669...

Note that range-based indexing in .loc includes both ends, unlike slices in almost any other part of Python.

Unsorted Indexes: While we might expect slice-based key indexing to get things in the range, that is not what it does. It locates the positions of the start and end keys, and returns everything between them. If the index is not sorted, this is usually not useful. Only use key slices on sorted indexes.

Hierarchical Indexing¶

Pandas supports hierarchical indexes, with MultiIndex. Create a hierarchical index by passing multiple columns to set_index.

For example, let's make a series that indexes ratings by user and item IDs:

In [32]:

idx_ratings = ratings.set_index(['userID', 'movieID'])
idx_ratings

Out[32]:

		rating	timestamp
userID	movieID
75	3	1.0	1162160236000
	32	4.5	1162160624000
	110	4.0	1162161008000
	160	2.0	1162160212000
	163	4.0	1162160970000
...	...	...	...
71534	44555	4.0	1196647538000
	46578	4.0	1196647004000
	48516	4.5	1196646826000
	61075	5.0	1223625365000
	62049	4.5	1223625490000

855598 rows × 2 columns

We see on the left that there are two levels to the index: userID and movieID. We can see this with .names:

In [33]:

idx_ratings.index.names

Out[33]:

FrozenList(['userID', 'movieID'])

The multi-level index lets us look up values by just the outermost index, getting all rows with that outer index value. For example, to get all ratings by user 190:

In [43]:

idx_ratings.loc[190]

Out[43]:

	rating	timestamp
movieID
1	4.5	1057778398000
2	4.0	1053457414000
7	2.0	1049823937000
32	3.0	1151336652000
95	2.0	1078019174000
...	...	...
54001	4.5	1185219688000
54259	4.0	1217875029000
58559	4.5	1217258754000
59315	4.0	1210627067000
64497	3.0	1230991072000

296 rows × 2 columns

Since we are looking up one user, it drops the user level from the index. If we looked up multiple users, it would not:

In [44]:

idx_ratings.loc[[75, 190]]

Out[44]:

		rating	timestamp
userID	movieID
75	3	1.0	1162160236000
	32	4.5	1162160624000
	110	4.0	1162161008000
	160	2.0	1162160212000
	163	4.0	1162160970000
...	...	...	...
190	54001	4.5	1185219688000
	54259	4.0	1217875029000
	58559	4.5	1217258754000
	59315	4.0	1210627067000
	64497	3.0	1230991072000

351 rows × 2 columns

Multi-indexes are one place where Python makes a distinction between lists and tuples. If we look up (75, 32), it will look up user 75's rating for item 32:

In [45]:

idx_ratings.loc[(75, 32)]

Out[45]:

rating       4.500000e+00
timestamp    1.162161e+12
Name: (75, 32), dtype: float64

We can index by a list of tuples:

In [46]:

idx_ratings.loc[[(75, 32), (190, 58559)]]

Out[46]:

		rating	timestamp
userID	movieID
75	32	4.5	1162160624000
190	58559	4.5	1217258754000

So we have the following principle:

The index locator can be a key, or a list or array of keys
A key can be a single value, looking up in the outermost index level; or a tuple of values, looking up by multiple levels simultaneously

Be careful: hierarchical indexes are not as useful as they look. They are great for lookups, and joins, but they often have problems too. In particular, if you have a large data frame with a hiearchical index, there are several operations — including groupby — that will cause memory explosion. If you call melt with multiple ID columns, it will turn them into a hierarchical index.

I usually use hierarchical indexes for two purpose:

Storing data for multidimensional lookup, like the user-and-item ratings
Joining on mulitiple simultaneous columns

In both cases, once I have the data I need, I usually strip away the multi-index before moving forward (with reset_index, or with a join or lookup operation that leaves the data frame with a normal one-level index) to reduce the risk of unexpected memory explosion.

Sorting and Realigning¶

We can rearrange data frames based on their indexes.

Sorting¶

One useful operation to realign a data frame (or series) with respect to its index is to sort the index. Our index is already sorted by ID, but what if we wanted to use IMDB IDs?

In [26]:

imdb_movies = movies.reset_index().set_index('imdbID')
imdb_movies

Out[26]:

	id	title	spanishTitle	imdbPictureURL	year	rtID	rtAllCriticsRating	rtAllCriticsNumReviews	rtAllCriticsNumFresh	rtAllCriticsNumRotten	rtAllCriticsScore	rtTopCriticsRating	rtTopCriticsNumReviews	rtTopCriticsNumFresh	rtTopCriticsNumRotten	rtTopCriticsScore	rtAudienceRating	rtAudienceNumRatings	rtAudienceScore	rtPictureURL
imdbID
114709	1	Toy story	Toy story (juguetes)	http://ia.media-imdb.com/images/M/MV5BMTMwNDU0...	1995	toy_story	9.0	73.0	73.0	0.0	100.0	8.5	17.0	17.0	0.0	100.0	3.7	102338.0	81.0	http://content7.flixster.com/movie/10/93/63/10...
113497	2	Jumanji	Jumanji	http://ia.media-imdb.com/images/M/MV5BMzM5NjE1...	1995	1068044-jumanji	5.6	28.0	13.0	15.0	46.0	5.8	5.0	2.0	3.0	40.0	3.2	44587.0	61.0	http://content8.flixster.com/movie/56/79/73/56...
107050	3	Grumpy Old Men	Dos viejos gruñones	http://ia.media-imdb.com/images/M/MV5BMTI5MTgy...	1993	grumpy_old_men	5.9	36.0	24.0	12.0	66.0	7.0	6.0	5.0	1.0	83.0	3.2	10489.0	66.0	http://content6.flixster.com/movie/25/60/25602...
114885	4	Waiting to Exhale	Esperando un respiro	http://ia.media-imdb.com/images/M/MV5BMTczMTMy...	1995	waiting_to_exhale	5.6	25.0	14.0	11.0	56.0	5.5	11.0	5.0	6.0	45.0	3.3	5666.0	79.0	http://content9.flixster.com/movie/10/94/17/10...
113041	5	Father of the Bride Part II	Vuelve el padre de la novia (Ahora también abu...	http://ia.media-imdb.com/images/M/MV5BMTg1NDc2...	1995	father_of_the_bride_part_ii	5.3	19.0	9.0	10.0	47.0	5.4	5.0	1.0	4.0	20.0	3.0	13761.0	64.0	http://content8.flixster.com/movie/25/54/25542...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
960731	65088	Bedtime Stories	Más allá de los sueños	http://ia.media-imdb.com/images/M/MV5BMjA5Njk5...	2008	bedtime_stories	4.4	104.0	26.0	78.0	25.0	4.7	26.0	6.0	20.0	23.0	3.5	108877.0	63.0	http://content6.flixster.com/movie/10/94/33/10...
25464	65091	Manhattan Melodrama	El enemigo público número 1	http://ia.media-imdb.com/images/M/MV5BMTUyODE3...	1934	manhattan_melodrama	7.0	12.0	10.0	2.0	83.0	0.0	4.0	2.0	2.0	50.0	3.7	344.0	71.0	http://content9.flixster.com/movie/66/44/64/66...
1024715	65126	Choke	Choke	http://ia.media-imdb.com/images/M/MV5BMTMxMDI4...	2008	choke	5.6	135.0	73.0	62.0	54.0	4.9	26.0	8.0	18.0	30.0	3.3	13893.0	55.0	http://content6.flixster.com/movie/10/85/09/10...
959337	65130	Revolutionary Road	Revolutionary Road	http://ia.media-imdb.com/images/M/MV5BMTI2MzY2...	2008	revolutionary_road	6.7	194.0	133.0	61.0	68.0	6.9	36.0	25.0	11.0	69.0	3.5	46044.0	70.0	http://content8.flixster.com/movie/10/88/40/10...
212579	65133	Blackadder Back & Forth	Blackadder Back & Forth	http://ia.media-imdb.com/images/M/MV5BMjA5MjU4...	1999	blackadder-back-forth	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	http://content7.flixster.com/movie/34/10/69/34...

10197 rows × 20 columns

Now let's sort and show again:

In [27]:

imdb_movies.sort_index(inplace=True)
imdb_movies

Out[27]:

	id	title	spanishTitle	imdbPictureURL	year	rtID	rtAllCriticsRating	rtAllCriticsNumReviews	rtAllCriticsNumFresh	rtAllCriticsNumRotten	rtAllCriticsScore	rtTopCriticsRating	rtTopCriticsNumReviews	rtTopCriticsNumFresh	rtTopCriticsNumRotten	rtTopCriticsScore	rtAudienceRating	rtAudienceNumRatings	rtAudienceScore	rtPictureURL
imdbID
439	6990	The Great Train Robbery	Asalto y robo al tren	http://ia.media-imdb.com/images/M/MV5BMTk3OTQy...	1903	the-great-train-robbery1903	7.6	6.0	6.0	0.0	100.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	http://content7.flixster.com/movie/10/87/70/10...
4972	7065	The Birth of a Nation	El nacimiento de una nación	http://ia.media-imdb.com/images/M/MV5BMjI2MzU4...	1915	birth_of_a_nation	8.0	31.0	31.0	0.0	100.0	0.0	4.0	4.0	0.0	100.0	3.3	1249.0	58.0	http://content9.flixster.com/movie/10/88/58/10...
6864	7243	Intolerance: Love's Struggle Throughout the Ages	Intolerancia	http://ia.media-imdb.com/images/M/MV5BODEyNDUy...	1916	1010644-intolerance	7.8	23.0	22.0	1.0	95.0	0.0	3.0	3.0	0.0	100.0	3.8	1397.0	77.0	http://content6.flixster.com/movie/31/14/31140...
8133	8511	The Immigrant	Charlot emigrante	http://ia.media-imdb.com/images/M/MV5BMjExNDE5...	1917	immigrant	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	http://content7.flixster.com/movie/10/85/69/10...
8395	48374	Otets Sergiy	El padre Sergio	NaN	1917	otets-sergiy-father-sergius	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	http://content7.flixster.com/movie/10/91/43/10...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1229827	4260	Jonas Brothers: The 3D Concert Experience	Jonas Brothers: The 3D Concert Experience	http://ia.media-imdb.com/images/M/MV5BMTU3NzE4...	2009	jonas_brothers_3d	4.2	71.0	16.0	55.0	22.0	4.6	22.0	7.0	15.0	31.0	3.3	74109.0	64.0	http://content7.flixster.com/movie/10/92/39/10...
1233247	757	Dung che sai duk	Este contraveneno del Oeste	http://ia.media-imdb.com/images/M/MV5BMTc0NzMx...	1994	ashes_of_time_redux	6.6	69.0	54.0	15.0	78.0	6.7	18.0	14.0	4.0	77.0	3.6	2166.0	71.0	http://content8.flixster.com/movie/10/93/19/10...
1233381	63698	Üç maymun	Tres monos	http://ia.media-imdb.com/images/M/MV5BMTMwNjM3...	2008	three_monkeys	6.7	56.0	43.0	13.0	76.0	7.7	13.0	12.0	1.0	92.0	3.6	1239.0	72.0	http://content8.flixster.com/movie/11/13/29/11...
1275532	4457	Bianca Beauchamp All Access 2: Rubberized	Bianca Beauchamp All Access 2: Rubberized	http://ia.media-imdb.com/images/M/MV5BMTI1ODQ1...	2008	bianca-beauchamp-all-access-2-rubberized	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	http://content6.flixster.com/movie/10/88/44/10...
1349938	418	Being Human	Being Human	http://ia.media-imdb.com/images/M/MV5BMjA1MzIw...	2008	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10197 rows × 20 columns

Now the index reports that it's sorted too:

In [29]:

imdb_movies.index.is_monotonic

Out[29]:

True

Reindexing¶

Another operation on an index is a reindex. When we reindex a data frame or series, we give it an array of key values, and it rearranges elements to correspond to the new array. It will insert missing values for keys that don't exist in the original frame.

So let's reindex the movie frame (we are not passing inplace, so it's going to return a new frame):

In [31]:

movies.reindex([1, 6, 101, 2038401])

Out[31]:

	title	imdbID	spanishTitle	imdbPictureURL	year	rtID	rtAllCriticsRating	rtAllCriticsNumReviews	rtAllCriticsNumFresh	rtAllCriticsNumRotten	rtAllCriticsScore	rtTopCriticsRating	rtTopCriticsNumReviews	rtTopCriticsNumFresh	rtTopCriticsNumRotten	rtTopCriticsScore	rtAudienceRating	rtAudienceNumRatings	rtAudienceScore	rtPictureURL
id
1	Toy story	114709.0	Toy story (juguetes)	http://ia.media-imdb.com/images/M/MV5BMTMwNDU0...	1995.0	toy_story	9.0	73.0	73.0	0.0	100.0	8.5	17.0	17.0	0.0	100.0	3.7	102338.0	81.0	http://content7.flixster.com/movie/10/93/63/10...
6	Heat	113277.0	Heat	http://ia.media-imdb.com/images/M/MV5BMTM1NDc4...	1995.0	1068182-heat	7.7	58.0	50.0	8.0	86.0	7.2	17.0	14.0	3.0	82.0	3.9	42785.0	92.0	http://content9.flixster.com/movie/26/80/26809...
101	Bottle Rocket	115734.0	Ladrón que roba a otro ladrón	http://ia.media-imdb.com/images/M/MV5BMTk1ODI3...	1996.0	bottle_rocket	6.6	42.0	33.0	9.0	78.0	6.1	9.0	5.0	4.0	55.0	3.7	16217.0	80.0	http://content6.flixster.com/movie/26/73/26732...
2038401	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

There is no movie 2038401, so it fills in NaN everywhere.

Alignment¶

Another useful operation is align. It reindexes two series so that their indexes match, and we can then perform more operations on them. When you do vectorized operations on two series, Pandas first aligns their indexes and then does the operation on the corresponding values.

For example, we can treat users' ratings as item-dimensional vectors, and compute the cosine as a measure of how similar their ratings are.

$$\mathrm{cos}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|_2 \|\vec{x}\|_2}$$

First, let's get the rating vectors for two users:

In [50]:

u1_rates = idx_ratings.loc[75, 'rating']
u2_rates = idx_ratings.loc[190, 'rating']

Let's multiply their ratings - Pandas will automatically align them:

In [54]:

u12_mult = u1_rates * u2_rates
u12_mult.describe()

Out[54]:

count    22.000000
mean     13.534091
std       6.139400
min       2.250000
25%      12.062500
50%      15.375000
75%      17.500000
max      22.500000
Name: rating, dtype: float64

In [55]:

u12_mult

Out[55]:

movieID
1         NaN
2         NaN
3         NaN
7         NaN
32       13.5
         ... 
54001     NaN
54259     NaN
58559     NaN
59315     NaN
64497     NaN
Name: rating, Length: 329, dtype: float64

There are 22 values that aren't NaN - those are the values where both users have a rating. This is because the alignment fills in NaN, and NaN times a rating value is NaN.

To finish the cosine, we can compute the sum of this, and the norms of the two vectors:

In [59]:

np.sum(u12_mult) / (spla.norm(u1_rates) * spla.norm(u2_rates))

Out[59]:

0.19577120060037112

The scipy.linalg.norm function computes the Euclidean norm $\| \cdot \|_2$.

But what does this alignment do? We can do it ourselves with .align. First let's look at our vectors:

In [60]:

u1_rates

Out[60]:

movieID
3        1.0
32       4.5
110      4.0
160      2.0
163      4.0
165      4.5
173      3.5
296      5.0
353      3.5
420      2.0
589      4.0
653      3.0
832      4.5
920      0.5
996      4.5
1036     4.0
1127     3.5
1215     4.5
1233     4.0
1304     2.5
1370     4.0
1374     4.0
1485     4.0
1527     4.5
1917     2.5
2011     2.0
2054     1.5
2058     4.0
2490     4.0
2571     4.5
2640     3.0
2688     3.0
2700     4.5
2762     3.5
2959     4.5
3258     1.5
3793     3.0
3889     3.0
3994     3.5
4993     3.5
5107     3.0
5833     2.5
5952     3.5
6213     4.0
6225     0.5
6333     4.0
7000     3.5
7007     4.5
7153     3.5
32029    4.5
32587    5.0
33437    3.5
39052    3.5
45431    3.5
45722    4.5
Name: rating, dtype: float64

In [62]:

u2_rates

Out[62]:

movieID
1        4.5
2        4.0
7        2.0
32       3.0
95       2.0
        ... 
54001    4.5
54259    4.0
58559    4.5
59315    4.0
64497    3.0
Name: rating, Length: 296, dtype: float64

The align method returns a tuple of the aligned series:

In [63]:

u1a, u2a = u1_rates.align(u2_rates)
u1a

Out[63]:

movieID
1        NaN
2        NaN
3        1.0
7        NaN
32       4.5
        ... 
54001    NaN
54259    NaN
58559    NaN
59315    NaN
64497    NaN
Name: rating, Length: 329, dtype: float64

In [64]:

u2a

Out[64]:

movieID
1        4.5
2        4.0
3        NaN
7        2.0
32       3.0
        ... 
54001    4.5
54259    4.0
58559    4.5
59315    4.0
64497    3.0
Name: rating, Length: 329, dtype: float64

They now have the same index, and the values are put in place. Missing values are filled in with NaN. These vectors are now ready to be multipled, or have other operations applied!

Index Gotchas¶

Indexes do not have to be unique - the same key can be in multiple rows.

I don't recommend using non-unique indexes. They often aren't terribly useful, and they very quickly get slow.

As described above, hierarchical indexes also have problems, particularly with groupby. They can eat an incredible amount of memory, send your system deep into swap, and slow it to a crawl.

Learning More¶

The textbook
Indexing and Selecting Data in the Pandas user guide
MultiIndex / Advanced Indexing in the Pandas user guide