Aggregates and Groups#

Setup#

Import our modules again:

import numpy as np
import pandas as pd

And load the MovieLens data. We’re going to pass the memory_use='deep' to info, so we can see the total memory use including the strings.

movies = pd.read_csv('ml-25m/movies.csv')
movies.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 9.6 MB
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

Quickly preview the ratings frame:

ratings
userId movieId rating timestamp
0 1 296 5.0 1147880044
1 1 306 3.5 1147868817
2 1 307 5.0 1147868828
3 1 665 5.0 1147878820
4 1 899 3.5 1147868510
... ... ... ... ...
25000090 162541 50872 4.5 1240953372
25000091 162541 55768 2.5 1240951998
25000092 162541 56176 2.0 1240950697
25000093 162541 58559 4.0 1240953434
25000094 162541 63876 5.0 1240952515

25000095 rows Γ— 4 columns

Aggregate Functions#

An aggregate function combines a series (or array) into a single value:

ratings['rating'].mean()
3.533854451353085
ratings['rating'].sum()
88346697.0

Alternate form β€” function from numpy:

np.sum(ratings['rating'])
88346697.0

Series Size#

How big is the series?

ratings['rating'].size
25000095
len(ratings['rating'])
25000095

We can also get the array shape:

ratings['rating'].shape
(25000095,)

And the count of non-null values (we don’t have any null values!):

ratings['rating'].count()
25000095
  • .shape returns the array shape as a tuple β€” the array is one-dimensional with length 25M

  • .size returns the series size (length). len(…) is identical.

  • .count() counts values, not including missing values

Quantiles#

Let’s see the quantile function:

ratings['rating'].quantile(0.5)
3.5
ratings['rating'].quantile(0.2)
3.0

Fascinating! 80% of ratings are 3.0 or higher.

Grouped Aggregates#

We can group by a column and compute aggregates within the group.

How many ratings per movie?

ratings.groupby('movieId')['rating'].count()
movieId
1         57309
2         24228
3         11804
4          2523
5         11714
          ...  
209157        1
209159        1
209163        1
209169        1
209171        1
Name: rating, Length: 59047, dtype: int64

We can compute multiple aggregates at the same time:

movie_stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
movie_stats
mean count
movieId
1 3.893708 57309
2 3.251527 24228
3 3.142028 11804
4 2.853547 2523
5 3.058434 11714
... ... ...
209157 1.500000 1
209159 3.000000 1
209163 4.500000 1
209169 3.000000 1
209171 3.000000 1

59047 rows Γ— 2 columns

This result is a data frame, again indexed by movieId.

Finding Largest#

We can get the 10 movies with the most ratings:

movie_stats.nlargest(10, 'count')
mean count
movieId
356 4.048011 81491
318 4.413576 81482
296 4.188912 79672
593 4.151342 74127
2571 4.154099 72674
260 4.120189 68717
480 3.679175 64144
527 4.247579 60411
110 4.002273 59184
2959 4.228311 58773

Joining#

We want to combine our stats with movie info.

  • on='movieId' says to use the movieId column of the first frame instead of its index.

  • Matches values to index in other frame

movie_info = movies.join(movie_stats, on='movieId')
movie_info
movieId title genres mean count
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.893708 57309.0
1 2 Jumanji (1995) Adventure|Children|Fantasy 3.251527 24228.0
2 3 Grumpier Old Men (1995) Comedy|Romance 3.142028 11804.0
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.853547 2523.0
4 5 Father of the Bride Part II (1995) Comedy 3.058434 11714.0
... ... ... ... ... ...
62418 209157 We (2018) Drama 1.500000 1.0
62419 209159 Window of the Soul (2001) Documentary 3.000000 1.0
62420 209163 Bad Poems (2018) Comedy|Drama 4.500000 1.0
62421 209169 A Girl Thing (2001) (no genres listed) 3.000000 1.0
62422 209171 Women of Devil's Island (1962) Action|Adventure|Drama 3.000000 1.0

62423 rows Γ— 5 columns

movie_info.nlargest(10, 'count')
movieId title genres mean count
351 356 Forrest Gump (1994) Comedy|Drama|Romance|War 4.048011 81491.0
314 318 Shawshank Redemption, The (1994) Crime|Drama 4.413576 81482.0
292 296 Pulp Fiction (1994) Comedy|Crime|Drama|Thriller 4.188912 79672.0
585 593 Silence of the Lambs, The (1991) Crime|Horror|Thriller 4.151342 74127.0
2480 2571 Matrix, The (1999) Action|Sci-Fi|Thriller 4.154099 72674.0
257 260 Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Sci-Fi 4.120189 68717.0
475 480 Jurassic Park (1993) Action|Adventure|Sci-Fi|Thriller 3.679175 64144.0
522 527 Schindler's List (1993) Drama|War 4.247579 60411.0
108 110 Braveheart (1995) Action|Drama|War 4.002273 59184.0
2867 2959 Fight Club (1999) Action|Crime|Drama|Thriller 4.228311 58773.0
movie_info['count'].describe()
count    59047.000000
mean       423.393144
std       2477.885821
min          1.000000
25%          2.000000
50%          6.000000
75%         36.000000
max      81491.000000
Name: count, dtype: float64