Aggregates and Groups¶
Setup¶
Import our modules again:
import numpy as np
import pandas as pd
And load the MovieLens data. We're going to pass the memory_use='deep' to info, so we can see the total memory use including the strings.
movies = pd.read_csv('ml-25m/movies.csv')
movies.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'> RangeIndex: 62423 entries, 0 to 62422 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 movieId 62423 non-null int64 1 title 62423 non-null object 2 genres 62423 non-null object dtypes: int64(1), object(2) memory usage: 9.6 MB
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25000095 entries, 0 to 25000094 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 userId int64 1 movieId int64 2 rating float64 3 timestamp int64 dtypes: float64(1), int64(3) memory usage: 762.9 MB
Quickly preview the ratings frame:
ratings
| userId | movieId | rating | timestamp | |
|---|---|---|---|---|
| 0 | 1 | 296 | 5.0 | 1147880044 |
| 1 | 1 | 306 | 3.5 | 1147868817 |
| 2 | 1 | 307 | 5.0 | 1147868828 |
| 3 | 1 | 665 | 5.0 | 1147878820 |
| 4 | 1 | 899 | 3.5 | 1147868510 |
| ... | ... | ... | ... | ... |
| 25000090 | 162541 | 50872 | 4.5 | 1240953372 |
| 25000091 | 162541 | 55768 | 2.5 | 1240951998 |
| 25000092 | 162541 | 56176 | 2.0 | 1240950697 |
| 25000093 | 162541 | 58559 | 4.0 | 1240953434 |
| 25000094 | 162541 | 63876 | 5.0 | 1240952515 |
25000095 rows × 4 columns
Aggregate Functions¶
An aggregate function combines a series (or array) into a single value:
ratings['rating'].mean()
3.533854451353085
ratings['rating'].sum()
88346697.0
Alternate form --- function from numpy:
np.sum(ratings['rating'])
88346697.0
Series Size¶
How big is the series?
ratings['rating'].size
25000095
len(ratings['rating'])
25000095
We can also get the array shape:
ratings['rating'].shape
(25000095,)
And the count of non-null values (we don't have any null values!):
ratings['rating'].count()
25000095
.shapereturns the array shape as a tuple — the array is one-dimensional with length 25M.sizereturns the series size (length).len(…)is identical..count()counts values, not including missing values
Quantiles¶
Let's see the quantile function:
ratings['rating'].quantile(0.5)
3.5
ratings['rating'].quantile(0.2)
3.0
Fascinating! 80% of ratings are 3.0 or higher.
Grouped Aggregates¶
We can group by a column and compute aggregates within the group.
How many ratings per movie?
ratings.groupby('movieId')['rating'].count()
movieId
1 57309
2 24228
3 11804
4 2523
5 11714
...
209157 1
209159 1
209163 1
209169 1
209171 1
Name: rating, Length: 59047, dtype: int64
We can compute multiple aggregates at the same time:
movie_stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
movie_stats
| mean | count | |
|---|---|---|
| movieId | ||
| 1 | 3.893708 | 57309 |
| 2 | 3.251527 | 24228 |
| 3 | 3.142028 | 11804 |
| 4 | 2.853547 | 2523 |
| 5 | 3.058434 | 11714 |
| ... | ... | ... |
| 209157 | 1.500000 | 1 |
| 209159 | 3.000000 | 1 |
| 209163 | 4.500000 | 1 |
| 209169 | 3.000000 | 1 |
| 209171 | 3.000000 | 1 |
59047 rows × 2 columns
This result is a data frame, again indexed by movieId.
Finding Largest¶
We can get the 10 movies with the most ratings:
movie_stats.nlargest(10, 'count')
| mean | count | |
|---|---|---|
| movieId | ||
| 356 | 4.048011 | 81491 |
| 318 | 4.413576 | 81482 |
| 296 | 4.188912 | 79672 |
| 593 | 4.151342 | 74127 |
| 2571 | 4.154099 | 72674 |
| 260 | 4.120189 | 68717 |
| 480 | 3.679175 | 64144 |
| 527 | 4.247579 | 60411 |
| 110 | 4.002273 | 59184 |
| 2959 | 4.228311 | 58773 |
Joining¶
We want to combine our stats with movie info.
on='movieId'says to use themovieIdcolumn of the first frame instead of its index.- Matches values to index in other frame
movie_info = movies.join(movie_stats, on='movieId')
movie_info
| movieId | title | genres | mean | count | |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 3.893708 | 57309.0 |
| 1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.251527 | 24228.0 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 3.142028 | 11804.0 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 2.853547 | 2523.0 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.058434 | 11714.0 |
| ... | ... | ... | ... | ... | ... |
| 62418 | 209157 | We (2018) | Drama | 1.500000 | 1.0 |
| 62419 | 209159 | Window of the Soul (2001) | Documentary | 3.000000 | 1.0 |
| 62420 | 209163 | Bad Poems (2018) | Comedy|Drama | 4.500000 | 1.0 |
| 62421 | 209169 | A Girl Thing (2001) | (no genres listed) | 3.000000 | 1.0 |
| 62422 | 209171 | Women of Devil's Island (1962) | Action|Adventure|Drama | 3.000000 | 1.0 |
62423 rows × 5 columns
movie_info.nlargest(10, 'count')
| movieId | title | genres | mean | count | |
|---|---|---|---|---|---|
| 351 | 356 | Forrest Gump (1994) | Comedy|Drama|Romance|War | 4.048011 | 81491.0 |
| 314 | 318 | Shawshank Redemption, The (1994) | Crime|Drama | 4.413576 | 81482.0 |
| 292 | 296 | Pulp Fiction (1994) | Comedy|Crime|Drama|Thriller | 4.188912 | 79672.0 |
| 585 | 593 | Silence of the Lambs, The (1991) | Crime|Horror|Thriller | 4.151342 | 74127.0 |
| 2480 | 2571 | Matrix, The (1999) | Action|Sci-Fi|Thriller | 4.154099 | 72674.0 |
| 257 | 260 | Star Wars: Episode IV - A New Hope (1977) | Action|Adventure|Sci-Fi | 4.120189 | 68717.0 |
| 475 | 480 | Jurassic Park (1993) | Action|Adventure|Sci-Fi|Thriller | 3.679175 | 64144.0 |
| 522 | 527 | Schindler's List (1993) | Drama|War | 4.247579 | 60411.0 |
| 108 | 110 | Braveheart (1995) | Action|Drama|War | 4.002273 | 59184.0 |
| 2867 | 2959 | Fight Club (1999) | Action|Crime|Drama|Thriller | 4.228311 | 58773.0 |
movie_info['count'].describe()
count 59047.000000 mean 423.393144 std 2477.885821 min 1.000000 25% 2.000000 50% 6.000000 75% 36.000000 max 81491.000000 Name: count, dtype: float64