Sessionization¶
This notebook shows how to use vectorized operations to compute session identifiers from timestamped user activity data, where we consider a user to have begun a new session if it has been at least an hour since their last activity.
Setup¶
Import our Python modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Import progress bar library & install on Pandas:
from tqdm.notebook import tqdm
tqdm.pandas()
Load the rating data - we're going to specify data types for user and movie IDs to save space:
ratings = pd.read_csv('ml-25m/ratings.csv', dtype={
'userId': 'int32',
'movieId': 'int32'
})
ratings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25000095 entries, 0 to 25000094 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 userId int32 1 movieId int32 2 rating float64 3 timestamp int64 dtypes: float64(1), int32(2), int64(1) memory usage: 572.2 MB
Compute user statistics:
user_stats = ratings.groupby('userId')['rating'].agg(['mean', 'count'])
user_stats.sort_values('count', ascending=False).head()
mean | count | |
---|---|---|
userId | ||
72315 | 3.080601 | 32202 |
80974 | 3.280290 | 9178 |
137293 | 3.184001 | 8913 |
33844 | 2.580124 | 7919 |
20055 | 3.208868 | 7488 |
Computing Sessions¶
Now we're going to find one user with a lot of ratings and compute their sesion IDs.
bigu_ratings = ratings[ratings['userId'] == 20055]
bigu_ratings
userId | movieId | rating | timestamp | |
---|---|---|---|---|
3037025 | 20055 | 1 | 5.0 | 1160463838 |
3037026 | 20055 | 2 | 3.0 | 1160533952 |
3037027 | 20055 | 3 | 2.5 | 1163336461 |
3037028 | 20055 | 5 | 1.5 | 1163336723 |
3037029 | 20055 | 6 | 4.0 | 1160532816 |
... | ... | ... | ... | ... |
3044508 | 20055 | 92643 | 4.0 | 1336287468 |
3044509 | 20055 | 93116 | 3.0 | 1344823537 |
3044510 | 20055 | 93838 | 4.5 | 1336287562 |
3044511 | 20055 | 99114 | 3.0 | 1423722951 |
3044512 | 20055 | 109374 | 4.5 | 1423722957 |
7488 rows × 4 columns
bigu_ratings['timestamp'].agg(['min', 'max'])
min 1160462492 max 1423722957 Name: timestamp, dtype: int64
We want to determine the user's sessions.
We'll break sessions at gaps of 1 hour or more. (1h = 3600s)
Step 1: sort by timestamp
bigu_ratings = bigu_ratings.sort_values('timestamp')
bigu_ratings.head()
userId | movieId | rating | timestamp | gap | |
---|---|---|---|---|---|
3037863 | 20055 | 1252 | 5.0 | 1160462492 | -21497210.0 |
3037889 | 20055 | 1278 | 5.0 | 1160462504 | -519240.0 |
3037142 | 20055 | 163 | 3.0 | 1160462585 | -86467.0 |
3038773 | 20055 | 2599 | 4.5 | 1160462620 | -2709870.0 |
3039404 | 20055 | 3408 | 4.0 | 1160462658 | -29119552.0 |
Step 2: find inter-activity gap.
2 ways:
lag
shifts a series by 1diff
subtracts from previous value
We'll use diff
:
bigu_ratings['gap'] = bigu_ratings['timestamp'].diff()
bigu_ratings.head()
userId | movieId | rating | timestamp | gap | |
---|---|---|---|---|---|
3037863 | 20055 | 1252 | 5.0 | 1160462492 | NaN |
3037889 | 20055 | 1278 | 5.0 | 1160462504 | 12.0 |
3037142 | 20055 | 163 | 3.0 | 1160462585 | 81.0 |
3038773 | 20055 | 2599 | 4.5 | 1160462620 | 35.0 |
3039404 | 20055 | 3408 | 4.0 | 1160462658 | 38.0 |
Note: s.diff()
is the same as s - s.lag()
Step 3: mark start-of-new-session.
bigu_ratings['new_session'] = bigu_ratings['gap'] >= 3600
bigu_ratings.head()
userId | movieId | rating | timestamp | gap | new_session | |
---|---|---|---|---|---|---|
3037863 | 20055 | 1252 | 5.0 | 1160462492 | NaN | False |
3037889 | 20055 | 1278 | 5.0 | 1160462504 | 12.0 | False |
3037142 | 20055 | 163 | 3.0 | 1160462585 | 81.0 | False |
3038773 | 20055 | 2599 | 4.5 | 1160462620 | 35.0 | False |
3039404 | 20055 | 3408 | 4.0 | 1160462658 | 38.0 | False |
bigu_ratings[bigu_ratings['new_session']]
userId | movieId | rating | timestamp | gap | new_session | |
---|---|---|---|---|---|---|
3043463 | 20055 | 46850 | 3.5 | 1160531047 | 65771.0 | True |
3039293 | 20055 | 3265 | 4.0 | 1160546577 | 10687.0 | True |
3038163 | 20055 | 1757 | 4.0 | 1160625516 | 70234.0 | True |
3037233 | 20055 | 306 | 5.0 | 1160639746 | 11676.0 | True |
3037234 | 20055 | 307 | 4.5 | 1160981034 | 339863.0 | True |
... | ... | ... | ... | ... | ... | ... |
3044254 | 20055 | 70423 | 4.0 | 1402399658 | 43043898.0 | True |
3044106 | 20055 | 62662 | 4.5 | 1404551809 | 2152151.0 | True |
3044367 | 20055 | 78895 | 4.5 | 1404734374 | 182479.0 | True |
3042931 | 20055 | 26547 | 4.5 | 1404813926 | 79501.0 | True |
3044511 | 20055 | 99114 | 3.0 | 1423722951 | 18909023.0 | True |
275 rows × 6 columns
Step 4: compute session IDs. cumsum
to the rescue!
It will increment its sum each time it hits a True
, and then keep the sum (now the ID!) through the rest of the session.
bigu_ratings['session_id'] = bigu_ratings['new_session'].cumsum()
bigu_ratings
userId | movieId | rating | timestamp | gap | new_session | session_id | |
---|---|---|---|---|---|---|---|
3037863 | 20055 | 1252 | 5.0 | 1160462492 | NaN | False | 0 |
3037889 | 20055 | 1278 | 5.0 | 1160462504 | 12.0 | False | 0 |
3037142 | 20055 | 163 | 3.0 | 1160462585 | 81.0 | False | 0 |
3038773 | 20055 | 2599 | 4.5 | 1160462620 | 35.0 | False | 0 |
3039404 | 20055 | 3408 | 4.0 | 1160462658 | 38.0 | False | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
3042931 | 20055 | 26547 | 4.5 | 1404813926 | 79501.0 | True | 274 |
3044361 | 20055 | 78276 | 2.5 | 1404813928 | 2.0 | False | 274 |
3044511 | 20055 | 99114 | 3.0 | 1423722951 | 18909023.0 | True | 275 |
3044502 | 20055 | 91529 | 3.5 | 1423722955 | 4.0 | False | 275 |
3044512 | 20055 | 109374 | 4.5 | 1423722957 | 2.0 | False | 275 |
7488 rows × 7 columns
Sessionizing All Users¶
Let's put that logic in a function:
def sessionize(udf):
gap = udf['timestamp'].diff()
new_s = gap >= 3600
return new_s.cumsum()
sessionize(bigu_ratings)
3037863 0 3037889 0 3037142 0 3038773 0 3039404 0 ... 3042931 274 3044361 274 3044511 275 3044502 275 3044512 275 Name: timestamp, Length: 7488, dtype: int32
Set up the ratings as sorted:
ratings.sort_values('timestamp', inplace=True)
ratings.reset_index(inplace=True, drop=True)
ratings.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 2262 | 21 | 3.0 | 789652009 |
1 | 2262 | 1079 | 3.0 | 789652009 |
2 | 2262 | 47 | 5.0 | 789652009 |
3 | 102689 | 1 | 4.0 | 822873600 |
4 | 102689 | 39 | 5.0 | 822873600 |
And call:
sessions = ratings.groupby('userId').progress_apply(sessionize)
sessions.head()
userId 1 10978976 0 10978977 0 10978978 0 10978979 0 10978991 0 Name: timestamp, dtype: int32
Now we do 3 things:
- Drop the outermost index level (userID), since the inner one is row numbers that match the
RangeIndex
onratings
- Convert to a frame with a column name
- Join with
ratings
to connect to the original data
rating_sessions = ratings.join(sessions.reset_index(level=0, drop=True).to_frame('session_id'))
rating_sessions.head()
userId | movieId | rating | timestamp | session_id | |
---|---|---|---|---|---|
0 | 2262 | 21 | 3.0 | 789652009 | 0 |
1 | 2262 | 1079 | 3.0 | 789652009 | 0 |
2 | 2262 | 47 | 5.0 | 789652009 | 0 |
3 | 102689 | 1 | 4.0 | 822873600 | 0 |
4 | 102689 | 39 | 5.0 | 822873600 | 0 |