Sessionization#

This notebook shows how to use vectorized operations to compute session identifiers from timestamped user activity data. We consider a user to have begun a new session whenever at least an hour has passed since their last activity.

Setup#

Import our Python modules:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Import progress bar library & install on Pandas:

from tqdm.notebook import tqdm
tqdm.pandas()

Load the MovieLens 25M rating data - we’re going to specify narrower data types for the user and movie IDs to save space:

ratings = pd.read_csv('ml-25m/ratings.csv', dtype={
    'userId': 'int32',
    'movieId': 'int32'
})
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int32  
 1   movieId    int32  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int32(2), int64(1)
memory usage: 572.2 MB
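
As a rough check on what the narrower dtypes buy us, we can look at per-column memory use (a sketch; exact numbers vary by platform and pandas version):

# userId and movieId take 4 bytes per row instead of the default 8,
# saving about 4 bytes x 25M rows per column, i.e. roughly 200 MB in total
ratings.memory_usage()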

Compute user statistics:

user_stats = ratings.groupby('userId')['rating'].agg(['mean', 'count'])
user_stats.sort_values('count', ascending=False).head()
            mean  count
userId
72315   3.080601  32202
80974   3.280290   9178
137293  3.184001   8913
33844   2.580124   7919
20055   3.208868   7488

Computing Sessions#

Now we’re going to take one user with a lot of ratings (user 20055 from the table above) and compute their session IDs.

bigu_ratings = ratings[ratings['userId'] == 20055]
bigu_ratings
         userId  movieId  rating   timestamp
3037025   20055        1     5.0  1160463838
3037026   20055        2     3.0  1160533952
3037027   20055        3     2.5  1163336461
3037028   20055        5     1.5  1163336723
3037029   20055        6     4.0  1160532816
...         ...      ...     ...         ...
3044508   20055    92643     4.0  1336287468
3044509   20055    93116     3.0  1344823537
3044510   20055    93838     4.5  1336287562
3044511   20055    99114     3.0  1423722951
3044512   20055   109374     4.5  1423722957

7488 rows × 4 columns

bigu_ratings['timestamp'].agg(['min', 'max'])
min    1160462492
max    1423722957
Name: timestamp, dtype: int64

We want to determine the user’s sessions.

We’ll break sessions at gaps of 1 hour or more. (1h = 3600s)

Step 1: sort by timestamp.

bigu_ratings = bigu_ratings.sort_values('timestamp')
bigu_ratings.head()
         userId  movieId  rating   timestamp
3037863   20055     1252     5.0  1160462492
3037889   20055     1278     5.0  1160462504
3037142   20055      163     3.0  1160462585
3038773   20055     2599     4.5  1160462620
3039404   20055     3408     4.0  1160462658

Step 2: find the inter-activity gap.

Two ways:

  • shift lags a series by one position, so we can subtract the shifted series from the original

  • diff subtracts the previous value from each value directly

We’ll use diff:

bigu_ratings['gap'] = bigu_ratings['timestamp'].diff()
bigu_ratings.head()
         userId  movieId  rating   timestamp   gap
3037863   20055     1252     5.0  1160462492   NaN
3037889   20055     1278     5.0  1160462504  12.0
3037142   20055      163     3.0  1160462585  81.0
3038773   20055     2599     4.5  1160462620  35.0
3039404   20055     3408     4.0  1160462658  38.0

Note: s.diff() is the same as s - s.shift()
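
For comparison, the same gap computed with shift should match the diff result exactly, leading NaN included (a sketch):

gap_via_shift = bigu_ratings['timestamp'] - bigu_ratings['timestamp'].shift(1)
gap_via_shift.equals(bigu_ratings['gap'])   # True - identical values, NaN in the same spot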

Step 3: mark the start of each new session.

bigu_ratings['new_session'] = bigu_ratings['gap'] >= 3600
bigu_ratings.head()
         userId  movieId  rating   timestamp   gap  new_session
3037863   20055     1252     5.0  1160462492   NaN        False
3037889   20055     1278     5.0  1160462504  12.0        False
3037142   20055      163     3.0  1160462585  81.0        False
3038773   20055     2599     4.5  1160462620  35.0        False
3039404   20055     3408     4.0  1160462658  38.0        False

bigu_ratings[bigu_ratings['new_session']]
         userId  movieId  rating   timestamp         gap  new_session
3043463   20055    46850     3.5  1160531047     65771.0         True
3039293   20055     3265     4.0  1160546577     10687.0         True
3038163   20055     1757     4.0  1160625516     70234.0         True
3037233   20055      306     5.0  1160639746     11676.0         True
3037234   20055      307     4.5  1160981034    339863.0         True
...         ...      ...     ...         ...         ...          ...
3044254   20055    70423     4.0  1402399658  43043898.0         True
3044106   20055    62662     4.5  1404551809   2152151.0         True
3044367   20055    78895     4.5  1404734374    182479.0         True
3042931   20055    26547     4.5  1404813926     79501.0         True
3044511   20055    99114     3.0  1423722951  18909023.0         True

275 rows × 6 columns

Step 4: compute session IDs. cumsum to the rescue!

It will increment its sum each time it hits a True (a session start), then hold that value, which becomes the session ID, through the rest of the session.
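
For intuition, here is cumsum on a small hand-made boolean series (hypothetical values, not from the ratings data):

pd.Series([False, False, True, False, True, False]).cumsum()
# 0, 0, 1, 1, 2, 2 - the running total only goes up at the True (session-start) positions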

bigu_ratings['session_id'] = bigu_ratings['new_session'].cumsum()
bigu_ratings
         userId  movieId  rating   timestamp         gap  new_session  session_id
3037863   20055     1252     5.0  1160462492         NaN        False           0
3037889   20055     1278     5.0  1160462504        12.0        False           0
3037142   20055      163     3.0  1160462585        81.0        False           0
3038773   20055     2599     4.5  1160462620        35.0        False           0
3039404   20055     3408     4.0  1160462658        38.0        False           0
...         ...      ...     ...         ...         ...          ...         ...
3042931   20055    26547     4.5  1404813926     79501.0         True         274
3044361   20055    78276     2.5  1404813928         2.0        False         274
3044511   20055    99114     3.0  1423722951  18909023.0         True         275
3044502   20055    91529     3.5  1423722955         4.0        False         275
3044512   20055   109374     4.5  1423722957         2.0        False         275

7488 rows × 7 columns
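
A quick sanity check (just a sketch): since the first session gets ID 0, the number of distinct session IDs should be one more than the number of True flags:

# 276 distinct IDs (0 through 275) versus 275 session-start flags
bigu_ratings['session_id'].nunique() == bigu_ratings['new_session'].sum() + 1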

Sessionizing All Users#

Let’s put that logic in a function:

def sessionize(udf):
    # assumes udf is already sorted by timestamp
    gap = udf['timestamp'].diff()   # seconds since the user's previous rating
    new_s = gap >= 3600             # True where a new session starts
    return new_s.cumsum()           # running count of session starts = session ID
sessionize(bigu_ratings)
3037863      0
3037889      0
3037142      0
3038773      0
3039404      0
          ... 
3042931    274
3044361    274
3044511    275
3044502    275
3044512    275
Name: timestamp, Length: 7488, dtype: int32

Sort the full ratings frame by timestamp (and reset the index), so that each user’s rows are already in time order when we group by user:

ratings.sort_values('timestamp', inplace=True)
ratings.reset_index(inplace=True, drop=True)
ratings.head()
   userId  movieId  rating  timestamp
0    2262       21     3.0  789652009
1    2262     1079     3.0  789652009
2    2262       47     5.0  789652009
3  102689        1     4.0  822873600
4  102689       39     5.0  822873600

And apply it to each user:

sessions = ratings.groupby('userId').progress_apply(sessionize)
sessions.head()

userId          
1       10978976    0
        10978977    0
        10978978    0
        10978979    0
        10978991    0
Name: timestamp, dtype: int32

Now we do 3 things:

  1. Drop the outermost index level (userId), since the inner level holds the row numbers that match the RangeIndex on ratings

  2. Convert to a frame with a column name

  3. Join with ratings to connect to the original data

rating_sessions = ratings.join(sessions.reset_index(level=0, drop=True).to_frame('session_id'))
rating_sessions.head()
   userId  movieId  rating  timestamp  session_id
0    2262       21     3.0  789652009           0
1    2262     1079     3.0  789652009           0
2    2262       47     5.0  789652009           0
3  102689        1     4.0  822873600           0
4  102689       39     5.0  822873600           0
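
As a final check (again just a sketch), the joined result should agree with the session IDs we computed by hand for user 20055:

# should be 275, matching the last session ID in the manual computation above
rating_sessions.loc[rating_sessions['userId'] == 20055, 'session_id'].max()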