Aggregates and Groups#

Setup#

Import our modules again:

import numpy as np
import pandas as pd

And load the MovieLens data. We’re going to pass the memory_use='deep' to info, so we can see the total memory use including the strings.

movies = pd.read_csv('ml-25m/movies.csv')
movies.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 9.6 MB

ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

Quickly preview the ratings frame:

ratings

	userId	movieId	rating	timestamp
0	1	296	5.0	1147880044
1	1	306	3.5	1147868817
2	1	307	5.0	1147868828
3	1	665	5.0	1147878820
4	1	899	3.5	1147868510
...	...	...	...	...
25000090	162541	50872	4.5	1240953372
25000091	162541	55768	2.5	1240951998
25000092	162541	56176	2.0	1240950697
25000093	162541	58559	4.0	1240953434
25000094	162541	63876	5.0	1240952515

25000095 rows × 4 columns

Aggregate Functions#

An aggregate function combines a series (or array) into a single value:

ratings['rating'].mean()

3.533854451353085

ratings['rating'].sum()

88346697.0

Alternate form — function from numpy:

np.sum(ratings['rating'])

88346697.0

Series Size#

How big is the series?

ratings['rating'].size

25000095

len(ratings['rating'])

25000095

We can also get the array shape:

ratings['rating'].shape

(25000095,)

And the count of non-null values (we don’t have any null values!):

ratings['rating'].count()

25000095

.shape returns the array shape as a tuple — the array is one-dimensional with length 25M
.size returns the series size (length). len(…) is identical.
.count() counts values, not including missing values

Quantiles#

Let’s see the quantile function:

ratings['rating'].quantile(0.5)

3.5

ratings['rating'].quantile(0.2)

3.0

Fascinating! 80% of ratings are 3.0 or higher.

Grouped Aggregates#

We can group by a column and compute aggregates within the group.

How many ratings per movie?

ratings.groupby('movieId')['rating'].count()

movieId
1         57309
2         24228
3         11804
4          2523
5         11714
          ...  
209157        1
209159        1
209163        1
209169        1
209171        1
Name: rating, Length: 59047, dtype: int64

We can compute multiple aggregates at the same time:

movie_stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
movie_stats

	mean	count
movieId
1	3.893708	57309
2	3.251527	24228
3	3.142028	11804
4	2.853547	2523
5	3.058434	11714
...	...	...
209157	1.500000	1
209159	3.000000	1
209163	4.500000	1
209169	3.000000	1
209171	3.000000	1

59047 rows × 2 columns

This result is a data frame, again indexed by movieId.

Finding Largest#

We can get the 10 movies with the most ratings:

movie_stats.nlargest(10, 'count')

	mean	count
movieId
356	4.048011	81491
318	4.413576	81482
296	4.188912	79672
593	4.151342	74127
2571	4.154099	72674
260	4.120189	68717
480	3.679175	64144
527	4.247579	60411
110	4.002273	59184
2959	4.228311	58773

Joining#

We want to combine our stats with movie info.

on='movieId' says to use the movieId column of the first frame instead of its index.
Matches values to index in other frame

movie_info = movies.join(movie_stats, on='movieId')
movie_info

	movieId	title	genres	mean	count
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	3.893708	57309.0
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy	3.251527	24228.0
2	3	Grumpier Old Men (1995)	Comedy\|Romance	3.142028	11804.0
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	2.853547	2523.0
4	5	Father of the Bride Part II (1995)	Comedy	3.058434	11714.0
...	...	...	...	...	...
62418	209157	We (2018)	Drama	1.500000	1.0
62419	209159	Window of the Soul (2001)	Documentary	3.000000	1.0
62420	209163	Bad Poems (2018)	Comedy\|Drama	4.500000	1.0
62421	209169	A Girl Thing (2001)	(no genres listed)	3.000000	1.0
62422	209171	Women of Devil's Island (1962)	Action\|Adventure\|Drama	3.000000	1.0

62423 rows × 5 columns

movie_info.nlargest(10, 'count')

	movieId	title	genres	mean	count
351	356	Forrest Gump (1994)	Comedy\|Drama\|Romance\|War	4.048011	81491.0
314	318	Shawshank Redemption, The (1994)	Crime\|Drama	4.413576	81482.0
292	296	Pulp Fiction (1994)	Comedy\|Crime\|Drama\|Thriller	4.188912	79672.0
585	593	Silence of the Lambs, The (1991)	Crime\|Horror\|Thriller	4.151342	74127.0
2480	2571	Matrix, The (1999)	Action\|Sci-Fi\|Thriller	4.154099	72674.0
257	260	Star Wars: Episode IV - A New Hope (1977)	Action\|Adventure\|Sci-Fi	4.120189	68717.0
475	480	Jurassic Park (1993)	Action\|Adventure\|Sci-Fi\|Thriller	3.679175	64144.0
522	527	Schindler's List (1993)	Drama\|War	4.247579	60411.0
108	110	Braveheart (1995)	Action\|Drama\|War	4.002273	59184.0
2867	2959	Fight Club (1999)	Action\|Crime\|Drama\|Thriller	4.228311	58773.0

movie_info['count'].describe()

count    59047.000000
mean       423.393144
std       2477.885821
min          1.000000
25%          2.000000
50%          6.000000
75%         36.000000
max      81491.000000
Name: count, dtype: float64

CS 533 Fall 2022

Aggregates and Groups

Contents