Sessionization

This notebook shows how to use vectorized operations to compute session identifiers from timestamped user activity data, where we consider a user to have begun a new session if it has been at least an hour since their last activity.

Setup

Import our Python modules:

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Import the tqdm progress bar library and register it with pandas:

In [3]:

from tqdm.notebook import tqdm
tqdm.pandas()

Load the rating data. We specify 32-bit integer types for the user and movie IDs to save space:

In [22]:

ratings = pd.read_csv('ml-25m/ratings.csv', dtype={
    'userId': 'int32',
    'movieId': 'int32'
})
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int32  
 1   movieId    int32  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int32(2), int64(1)
memory usage: 572.2 MB
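
To illustrate why the narrower dtype saves space, here is a small sketch on synthetic data (the array and its size are made up for illustration). An int32 column should take half the bytes of an int64 column holding the same values.

```python
import numpy as np
import pandas as pd

# Synthetic IDs; the size is arbitrary and chosen only for illustration.
ids = np.arange(1_000_000)

wide = pd.Series(ids, dtype='int64')
narrow = pd.Series(ids, dtype='int32')

# memory_usage(index=False) reports only the bytes used by the values.
wide_bytes = wide.memory_usage(index=False)
narrow_bytes = narrow.memory_usage(index=False)
print(wide_bytes, narrow_bytes)
```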

Compute user statistics:

In [6]:

user_stats = ratings.groupby('userId')['rating'].agg(['mean', 'count'])
user_stats.sort_values('count', ascending=False).head()

Out[6]:

	mean	count
userId
72315	3.080601	32202
80974	3.280290	9178
137293	3.184001	8913
33844	2.580124	7919
20055	3.208868	7488

Computing Sessions

Now we're going to pick one user with a lot of ratings and compute their session IDs.

In [7]:

bigu_ratings = ratings[ratings['userId'] == 20055]
bigu_ratings

Out[7]:

	userId	movieId	rating	timestamp
3037025	20055	1	5.0	1160463838
3037026	20055	2	3.0	1160533952
3037027	20055	3	2.5	1163336461
3037028	20055	5	1.5	1163336723
3037029	20055	6	4.0	1160532816
...	...	...	...	...
3044508	20055	92643	4.0	1336287468
3044509	20055	93116	3.0	1344823537
3044510	20055	93838	4.5	1336287562
3044511	20055	99114	3.0	1423722951
3044512	20055	109374	4.5	1423722957

7488 rows × 4 columns

In [8]:

bigu_ratings['timestamp'].agg(['min', 'max'])

Out[8]:

min    1160462492
max    1423722957
Name: timestamp, dtype: int64

We want to determine the user's sessions.

We'll break sessions at gaps of 1 hour or more. (1h = 3600s)
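
As a quick sanity check, the timestamps can be read as Unix epoch seconds and converted to dates, and the one-hour threshold can be derived rather than hard-coded. This sketch reuses the min and max values shown above; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Earliest and latest timestamps for this user, copied from the output above.
first_ts, last_ts = 1160462492, 1423722957

# Interpret the integers as seconds since the Unix epoch.
first_dt = pd.to_datetime(first_ts, unit='s')
last_dt = pd.to_datetime(last_ts, unit='s')
print(first_dt, last_dt)

# One hour expressed in seconds, matching the 3600 threshold.
gap_threshold = int(pd.Timedelta(hours=1).total_seconds())
print(gap_threshold)
```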

Step 1: sort by timestamp

In [11]:

bigu_ratings = bigu_ratings.sort_values('timestamp')
bigu_ratings.head()

Out[11]:

	userId	movieId	rating	timestamp
3037863	20055	1252	5.0	1160462492
3037889	20055	1278	5.0	1160462504
3037142	20055	163	3.0	1160462585
3038773	20055	2599	4.5	1160462620
3039404	20055	3408	4.0	1160462658

Step 2: find inter-activity gap.

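There are two ways to do this:

shift moves a series down by one row, so the previous value can be subtracted explicitly
diff subtracts the previous value from each value directly

We'll use diff: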

In [12]:

bigu_ratings['gap'] = bigu_ratings['timestamp'].diff()
bigu_ratings.head()

Out[12]:

	userId	movieId	rating	timestamp	gap
3037863	20055	1252	5.0	1160462492	NaN
3037889	20055	1278	5.0	1160462504	12.0
3037142	20055	163	3.0	1160462585	81.0
3038773	20055	2599	4.5	1160462620	35.0
3039404	20055	3408	4.0	1160462658	38.0

Note: s.diff() is the same as s - s.shift()
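
The equivalence can be checked on a small series. The values below are made up for illustration; both approaches leave NaN in the first row, since it has no predecessor.

```python
import pandas as pd

# Toy timestamps (made up for illustration).
s = pd.Series([100, 112, 193, 228])

by_diff = s.diff()
by_shift = s - s.shift()  # shift() moves each value down one row

print(by_diff.tolist())

# equals() treats NaNs in the same position as equal.
same = by_diff.equals(by_shift)
print(same)
```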

Step 3: mark start-of-new-session.

In [13]:

bigu_ratings['new_session'] = bigu_ratings['gap'] >= 3600
bigu_ratings.head()

Out[13]:

	userId	movieId	rating	timestamp	gap	new_session
3037863	20055	1252	5.0	1160462492	NaN	False
3037889	20055	1278	5.0	1160462504	12.0	False
3037142	20055	163	3.0	1160462585	81.0	False
3038773	20055	2599	4.5	1160462620	35.0	False
3039404	20055	3408	4.0	1160462658	38.0	False

In [14]:

bigu_ratings[bigu_ratings['new_session']]

Out[14]:

	userId	movieId	rating	timestamp	gap	new_session
3043463	20055	46850	3.5	1160531047	65771.0	True
3039293	20055	3265	4.0	1160546577	10687.0	True
3038163	20055	1757	4.0	1160625516	70234.0	True
3037233	20055	306	5.0	1160639746	11676.0	True
3037234	20055	307	4.5	1160981034	339863.0	True
...	...	...	...	...	...	...
3044254	20055	70423	4.0	1402399658	43043898.0	True
3044106	20055	62662	4.5	1404551809	2152151.0	True
3044367	20055	78895	4.5	1404734374	182479.0	True
3042931	20055	26547	4.5	1404813926	79501.0	True
3044511	20055	99114	3.0	1423722951	18909023.0	True

275 rows × 6 columns
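
Note that the very first row is not flagged, because its gap is NaN and any comparison against NaN is False. A minimal sketch with made-up gaps shows this behaviour:

```python
import numpy as np
import pandas as pd

# Toy gaps: the first row has no predecessor, so its gap is NaN.
gap = pd.Series([np.nan, 12.0, 4000.0, 35.0])

# Comparing NaN with >= yields False, so row 0 is not marked as a session start.
new_session = gap >= 3600
print(new_session.tolist())
```

This is why the first session ends up with ID 0 in the next step rather than 1.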

Step 4: compute session IDs. cumsum to the rescue!

It will increment its sum each time it hits a True, and then keep the sum (now the ID!) through the rest of the session.

In [15]:

bigu_ratings['session_id'] = bigu_ratings['new_session'].cumsum()
bigu_ratings

Out[15]:

	userId	movieId	rating	timestamp	gap	new_session	session_id
3037863	20055	1252	5.0	1160462492	NaN	False	0
3037889	20055	1278	5.0	1160462504	12.0	False	0
3037142	20055	163	3.0	1160462585	81.0	False	0
3038773	20055	2599	4.5	1160462620	35.0	False	0
3039404	20055	3408	4.0	1160462658	38.0	False	0
...	...	...	...	...	...	...	...
3042931	20055	26547	4.5	1404813926	79501.0	True	274
3044361	20055	78276	2.5	1404813928	2.0	False	274
3044511	20055	99114	3.0	1423722951	18909023.0	True	275
3044502	20055	91529	3.5	1423722955	4.0	False	275
3044512	20055	109374	4.5	1423722957	2.0	False	275

7488 rows × 7 columns
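
To see the cumsum trick in isolation, here is a sketch on a handful of made-up flags. Booleans are summed as 0 and 1, so the running total steps up at each True and stays flat in between.

```python
import pandas as pd

# Toy flags (made up): True marks the first row of a new session.
new_session = pd.Series([False, False, True, False, True, False])

# The running total of the flags serves as the session ID.
session_id = new_session.cumsum()
print(session_id.tolist())
```

On the real data above, 275 rows are flagged, so the IDs run from 0 to 275, giving 276 sessions for this user.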

Sessionizing All Users

Let's put that logic in a function:

In [16]:

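def sessionize(udf):
    # udf: one user's ratings, already sorted by timestamp
    gap = udf['timestamp'].diff()
    # a gap of an hour or more (3600 s) starts a new session
    new_s = gap >= 3600
    # the running count of session starts is the session ID
    return new_s.cumsum()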

In [21]:

sessionize(bigu_ratings)

Out[21]:

3037863      0
3037889      0
3037142      0
3038773      0
3039404      0
          ... 
3042931    274
3044361    274
3044511    275
3044502    275
3044512    275
Name: timestamp, Length: 7488, dtype: int32
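
The function relies on its input being sorted by timestamp. The sketch below, on a made-up frame, shows how unsorted input gives different IDs. `sessionize_any_order` is a hypothetical wrapper introduced here for illustration; it is not part of the notebook.

```python
import pandas as pd

def sessionize(udf):
    # Same logic as the notebook's function; expects rows sorted by timestamp.
    gap = udf['timestamp'].diff()
    return (gap >= 3600).cumsum()

def sessionize_any_order(udf):
    # Hypothetical wrapper: sort first, then restore the caller's row order.
    ordered = udf.sort_values('timestamp')
    return sessionize(ordered).reindex(udf.index)

# Toy frame with the rows deliberately out of order.
toy = pd.DataFrame({'timestamp': [5000, 0, 10, 5010]},
                   index=['c', 'a', 'b', 'd'])

unsorted_ids = sessionize(toy)
sorted_ids = sessionize_any_order(toy)
print(unsorted_ids.tolist())
print(sorted_ids.tolist())
```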

Sort the ratings by timestamp so that each user's rows are in chronological order, and reset the index to a fresh RangeIndex:

In [17]:

ratings.sort_values('timestamp', inplace=True)
ratings.reset_index(inplace=True, drop=True)
ratings.head()

Out[17]:

	userId	movieId	rating	timestamp
0	2262	21	3.0	789652009
1	2262	1079	3.0	789652009
2	2262	47	5.0	789652009
3	102689	1	4.0	822873600
4	102689	39	5.0	822873600

Then apply sessionize to each user's ratings:

In [18]:

sessions = ratings.groupby('userId').progress_apply(sessionize)
sessions.head()

Out[18]:

userId          
1       10978976    0
        10978977    0
        10978978    0
        10978979    0
        10978991    0
Name: timestamp, dtype: int32

Now we do three things:

Drop the outermost index level (userId), since the inner one is row numbers that match the RangeIndex on ratings
Convert to a frame with a column name
Join with ratings to connect to the original data
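
The three steps can be traced on a toy example. The `sessions` series below is hand-built to mimic the shape of the grouped result shown above (a two-level index of user ID and row number); all values are made up for illustration.

```python
import pandas as pd

# Toy ratings with a default RangeIndex, sorted by timestamp.
ratings = pd.DataFrame({
    'userId':    [1, 2, 1, 2],
    'timestamp': [0, 5, 9000, 10],
}).sort_values('timestamp').reset_index(drop=True)

# Hand-built stand-in for the grouped result: (userId, row number) -> session.
sessions = pd.Series(
    [0, 1, 0, 0],
    index=pd.MultiIndex.from_tuples(
        [(1, 0), (1, 3), (2, 1), (2, 2)], names=['userId', None]),
    name='timestamp',
)

# 1. Drop the userId level so only the row numbers remain.
flat = sessions.reset_index(level=0, drop=True)
# 2. Convert to a one-column frame with a meaningful name.
frame = flat.to_frame('session_id')
# 3. Join on the index, which aligns by row number rather than by position.
rating_sessions = ratings.join(frame)
print(rating_sessions)
```

The join aligns on index labels, so the grouped order of `sessions` does not need to match the row order of `ratings`.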

In [20]:

rating_sessions = ratings.join(sessions.reset_index(level=0, drop=True).to_frame('session_id'))
rating_sessions.head()

Out[20]:

	userId	movieId	rating	timestamp
0	2262	21	3.0	789652009
1	2262	1079	3.0	789652009
2	2262	47	5.0	789652009
3	102689	1	4.0	822873600
4	102689	39	5.0	822873600
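
As a possible alternative to calling sessionize once per user, the same IDs can likely be computed with grouped diff and cumsum directly, which avoids running a Python function for every group. This is a sketch on made-up data, not the method used in this notebook, and it has not been timed against the apply version.

```python
import pandas as pd

# Toy ratings (made up), sorted by timestamp as in the notebook.
toy = pd.DataFrame({
    'userId':    [1, 2, 1, 2, 1],
    'timestamp': [0, 5, 10, 4000, 9000],
})

# Per-user gap: diff() is computed within each user's rows.
gap = toy.groupby('userId')['timestamp'].diff()

# Flag session starts; cast to integers so the grouped running sum is explicit.
new_session = (gap >= 3600).astype('int32')

# Per-user running count of session starts gives the session ID.
toy['session_id'] = new_session.groupby(toy['userId']).cumsum()
print(toy)
```

Because both grouped operations return results aligned to the original index, no reset_index or join step is needed in this variant.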
