Aggregates and Groups¶
Setup¶
Import our modules again:
import numpy as np
import pandas as pd
And load the MovieLens data. We're going to pass the memory_use='deep'
to info
, so we can see the total memory use including the strings.
movies = pd.read_csv('ml-25m/movies.csv')
movies.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'> RangeIndex: 62423 entries, 0 to 62422 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 movieId 62423 non-null int64 1 title 62423 non-null object 2 genres 62423 non-null object dtypes: int64(1), object(2) memory usage: 9.6 MB
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25000095 entries, 0 to 25000094 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 userId int64 1 movieId int64 2 rating float64 3 timestamp int64 dtypes: float64(1), int64(3) memory usage: 762.9 MB
Quickly preview the ratings
frame:
ratings
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 296 | 5.0 | 1147880044 |
1 | 1 | 306 | 3.5 | 1147868817 |
2 | 1 | 307 | 5.0 | 1147868828 |
3 | 1 | 665 | 5.0 | 1147878820 |
4 | 1 | 899 | 3.5 | 1147868510 |
... | ... | ... | ... | ... |
25000090 | 162541 | 50872 | 4.5 | 1240953372 |
25000091 | 162541 | 55768 | 2.5 | 1240951998 |
25000092 | 162541 | 56176 | 2.0 | 1240950697 |
25000093 | 162541 | 58559 | 4.0 | 1240953434 |
25000094 | 162541 | 63876 | 5.0 | 1240952515 |
25000095 rows × 4 columns
Aggregate Functions¶
An aggregate function combines a series (or array) into a single value:
ratings['rating'].mean()
3.533854451353085
ratings['rating'].sum()
88346697.0
Alternate form --- function from numpy
:
np.sum(ratings['rating'])
88346697.0
Series Size¶
How big is the series?
ratings['rating'].size
25000095
len(ratings['rating'])
25000095
We can also get the array shape:
ratings['rating'].shape
(25000095,)
And the count of non-null values (we don't have any null values!):
ratings['rating'].count()
25000095
.shape
returns the array shape as a tuple — the array is one-dimensional with length 25M.size
returns the series size (length).len(…)
is identical..count()
counts values, not including missing values
Quantiles¶
Let's see the quantile
function:
ratings['rating'].quantile(0.5)
3.5
ratings['rating'].quantile(0.2)
3.0
Fascinating! 80% of ratings are 3.0 or higher.
Grouped Aggregates¶
We can group by a column and compute aggregates within the group.
How many ratings per movie?
ratings.groupby('movieId')['rating'].count()
movieId 1 57309 2 24228 3 11804 4 2523 5 11714 ... 209157 1 209159 1 209163 1 209169 1 209171 1 Name: rating, Length: 59047, dtype: int64
We can compute multiple aggregates at the same time:
movie_stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
movie_stats
mean | count | |
---|---|---|
movieId | ||
1 | 3.893708 | 57309 |
2 | 3.251527 | 24228 |
3 | 3.142028 | 11804 |
4 | 2.853547 | 2523 |
5 | 3.058434 | 11714 |
... | ... | ... |
209157 | 1.500000 | 1 |
209159 | 3.000000 | 1 |
209163 | 4.500000 | 1 |
209169 | 3.000000 | 1 |
209171 | 3.000000 | 1 |
59047 rows × 2 columns
This result is a data frame, again indexed by movieId
.
Finding Largest¶
We can get the 10 movies with the most ratings:
movie_stats.nlargest(10, 'count')
mean | count | |
---|---|---|
movieId | ||
356 | 4.048011 | 81491 |
318 | 4.413576 | 81482 |
296 | 4.188912 | 79672 |
593 | 4.151342 | 74127 |
2571 | 4.154099 | 72674 |
260 | 4.120189 | 68717 |
480 | 3.679175 | 64144 |
527 | 4.247579 | 60411 |
110 | 4.002273 | 59184 |
2959 | 4.228311 | 58773 |
Joining¶
We want to combine our stats with movie info.
on='movieId'
says to use themovieId
column of the first frame instead of its index.- Matches values to index in other frame
movie_info = movies.join(movie_stats, on='movieId')
movie_info
movieId | title | genres | mean | count | |
---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 3.893708 | 57309.0 |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.251527 | 24228.0 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 3.142028 | 11804.0 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 2.853547 | 2523.0 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.058434 | 11714.0 |
... | ... | ... | ... | ... | ... |
62418 | 209157 | We (2018) | Drama | 1.500000 | 1.0 |
62419 | 209159 | Window of the Soul (2001) | Documentary | 3.000000 | 1.0 |
62420 | 209163 | Bad Poems (2018) | Comedy|Drama | 4.500000 | 1.0 |
62421 | 209169 | A Girl Thing (2001) | (no genres listed) | 3.000000 | 1.0 |
62422 | 209171 | Women of Devil's Island (1962) | Action|Adventure|Drama | 3.000000 | 1.0 |
62423 rows × 5 columns
movie_info.nlargest(10, 'count')
movieId | title | genres | mean | count | |
---|---|---|---|---|---|
351 | 356 | Forrest Gump (1994) | Comedy|Drama|Romance|War | 4.048011 | 81491.0 |
314 | 318 | Shawshank Redemption, The (1994) | Crime|Drama | 4.413576 | 81482.0 |
292 | 296 | Pulp Fiction (1994) | Comedy|Crime|Drama|Thriller | 4.188912 | 79672.0 |
585 | 593 | Silence of the Lambs, The (1991) | Crime|Horror|Thriller | 4.151342 | 74127.0 |
2480 | 2571 | Matrix, The (1999) | Action|Sci-Fi|Thriller | 4.154099 | 72674.0 |
257 | 260 | Star Wars: Episode IV - A New Hope (1977) | Action|Adventure|Sci-Fi | 4.120189 | 68717.0 |
475 | 480 | Jurassic Park (1993) | Action|Adventure|Sci-Fi|Thriller | 3.679175 | 64144.0 |
522 | 527 | Schindler's List (1993) | Drama|War | 4.247579 | 60411.0 |
108 | 110 | Braveheart (1995) | Action|Drama|War | 4.002273 | 59184.0 |
2867 | 2959 | Fight Club (1999) | Action|Crime|Drama|Thriller | 4.228311 | 58773.0 |
movie_info['count'].describe()
count 59047.000000 mean 423.393144 std 2477.885821 min 1.000000 25% 2.000000 50% 6.000000 75% 36.000000 max 81491.000000 Name: count, dtype: float64