Advanced Pipeline Example¶
This notebook demonstrates a more advanced use of SciKit-Learn pipelines: more sophisticated than my basic pipeline example, but simpler than the TDS example.
The classification task is to predict whether or not a movie is an action movie, using regularized logistic regression. If we can effectively predict this from a movie's ratings (number of ratings and/or rating values), that is evidence that action movies have different rating patterns than other movies; maybe they are more popular, or (on average) higher- or lower-rated.
Setup¶
As usual, we want to import things:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
And we’ll import many things from SciKit-Learn:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
And seed the RNG:
import seedbank
seedbank.initialize(20211028)
SeedSequence(
    entropy=20211028,
)
Data Preparation¶
Let’s first load the movie data:
movies = pd.read_csv('../data/hetrec2011-ml/movies.dat', sep='\t', encoding='latin1',
na_values='\\N')
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 10197 non-null int64
1 title 10197 non-null object
2 imdbID 10197 non-null int64
3 spanishTitle 10197 non-null object
4 imdbPictureURL 10016 non-null object
5 year 10197 non-null int64
6 rtID 9886 non-null object
7 rtAllCriticsRating 9967 non-null float64
8 rtAllCriticsNumReviews 9967 non-null float64
9 rtAllCriticsNumFresh 9967 non-null float64
10 rtAllCriticsNumRotten 9967 non-null float64
11 rtAllCriticsScore 9967 non-null float64
12 rtTopCriticsRating 9967 non-null float64
13 rtTopCriticsNumReviews 9967 non-null float64
14 rtTopCriticsNumFresh 9967 non-null float64
15 rtTopCriticsNumRotten 9967 non-null float64
16 rtTopCriticsScore 9967 non-null float64
17 rtAudienceRating 9967 non-null float64
18 rtAudienceNumRatings 9967 non-null float64
19 rtAudienceScore 9967 non-null float64
20 rtPictureURL 9967 non-null object
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB
And we’ll need genres:
genres = pd.read_csv('../data/hetrec2011-ml/movie_genres.dat', sep='\t', encoding='latin1',
na_values='\\N')
genres
       movieID      genre
0            1  Adventure
1            1  Animation
2            1   Children
3            1     Comedy
4            1    Fantasy
...        ...        ...
20804    65126     Comedy
20805    65126      Drama
20806    65130      Drama
20807    65130    Romance
20808    65133     Comedy

[20809 rows x 2 columns]
We’re going to get a list of action movie IDs and make our outcome column:
action_movies = genres.loc[genres['genre'] == 'Action', 'movieID']
movies['isAction'] = movies['id'].isin(action_movies)
The Rotten Tomatoes data uses 0 to mark unknown ratings, so convert those zeroes to missing values:
movies.loc[movies['rtAllCriticsRating'] == 0, 'rtAllCriticsRating'] = np.nan
movies.loc[movies['rtTopCriticsRating'] == 0, 'rtTopCriticsRating'] = np.nan
movies.loc[movies['rtAudienceRating'] == 0, 'rtAudienceRating'] = np.nan
Create our train/test split:
train, test = train_test_split(movies, test_size=0.25)
train.set_index('id', inplace=True)
test.set_index('id', inplace=True)
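As a quick sanity check, the two frames should hold about 75% and 25% of the movies, respectively:

len(train), len(test)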
Exploration¶
Now that we have training data, let’s look at our class balance:
sns.countplot(x=train['isAction'])
<AxesSubplot:xlabel='isAction', ylabel='count'>
Non-action is our majority class. What is that fraction?
maj_frac = 1 - train['isAction'].mean()
maj_frac
0.8595527657905061
If our accuracy is less than 0.859, we aren't beating the majority class.
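We can get the same baseline from SciKit-Learn itself. This is a quick sketch using DummyClassifier, whose 'most_frequent' strategy always predicts the majority class (it ignores the feature values entirely):

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(train, train['isAction'])
dummy.score(train, train['isAction'])  # accuracy of always guessing False; should equal maj_frac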
Let’s start to look at some possible correlations with our outcome variable.
rate_cols = [
'rtAllCriticsRating', 'rtAllCriticsNumReviews',
'rtTopCriticsRating', 'rtTopCriticsNumReviews',
'rtAudienceRating', 'rtAudienceNumRatings'
]
pivot = train[rate_cols].melt(ignore_index=False)  # keep the movie IDs in the index so the join lines up
pivot = pivot.join(train['isAction'])
grid = sns.FacetGrid(data=pivot, col='variable', hue='isAction', sharey=False, sharex=False)
grid.map(sns.kdeplot, 'value')
plt.show()
That isn’t looking promising. But the purpose of this tutorial is to demonstrate pipelines, not to build a great classifier.
Building the Classifier¶
For this tutorial, we are going to apply different transformations to different columns.
For the average ratings, we will standardize the variables, then fill in missing values with 0 (after standardization, 0 is the mean, so unknown ratings carry no signal).
For the counts, we will take the log, standardize the logs, and again fill missing values with 0 (the mean of the standardized log counts).
The SciKit-Learn tool sklearn.compose.ColumnTransformer allows us to specify a transform that will do that. SciKit-Learn doesn't have a direct log transform, but we can use sklearn.preprocessing.FunctionTransformer with numpy.log1p().
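To see what that wrapper does on its own, here is a minimal sketch (the input array is made up for illustration):

FunctionTransformer(np.log1p).fit_transform(np.array([[0.0, 9.0], [99.0, 999.0]]))
# log1p(0) == 0, so a count of zero stays zero before standardization

Now we can define the column operations: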
col_ops = ColumnTransformer([
('ratings', Pipeline([('std', StandardScaler()), ('fill', SimpleImputer(strategy='constant', fill_value=0))]),
['rtAllCriticsRating', 'rtTopCriticsRating', 'rtAudienceRating']),
('counts', Pipeline([
('log', FunctionTransformer(np.log1p)),
('std', StandardScaler()),
('fill', SimpleImputer(strategy='constant', fill_value=0))]),
['rtAllCriticsNumReviews', 'rtTopCriticsNumReviews', 'rtAudienceNumRatings'])
])
By default, the column transformer drops any columns that aren't used by one of its transformers. It knows enough about Pandas to select columns by name.
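If we wanted to keep the unused columns instead, ColumnTransformer accepts a remainder option. A hypothetical variant (not useful here, since the remaining columns are mostly text):

col_ops_keep = ColumnTransformer(col_ops.transformers, remainder='passthrough')

We'll stick with the default and drop them.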
Let’s see it in action:
col_ops.fit_transform(train)
array([[-1.42368447, -0.85740736, -1.08079972, 0.13052435, 0.43061088,
0.89681628],
[ 0. , 0. , 0. , -1.4082733 , -1.28161907,
-1.53417271],
[ 0.54852497, 0. , 0.46093164, -0.53448073, -0.0274333 ,
-0.10633124],
...,
[ 0.61426528, 0. , 1.12167365, -0.10789001, -0.4255041 ,
0.7642107 ],
[-0.50332006, -0.59676893, -0.64030504, 1.42116578, 1.46636589,
0.98862888],
[-0.04313786, 0. , 0. , -0.46530863, -1.28161907,
-1.53417271]])
We're also going to allow the model to use interaction terms: the product of the (standardized) rating count and rating value, for example. The sklearn.preprocessing.PolynomialFeatures transformer can do that (with its default settings it also adds squared terms and a constant column).
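To make the expansion concrete, here is a tiny sketch with one made-up row of two features; the degree-2 expansion of [a, b] yields [1, a, b, a², ab, b²]:

PolynomialFeatures().fit_transform(np.array([[2.0, 3.0]]))
# array([[1., 2., 3., 4., 6., 9.]])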
So we will put it in a pipeline with our column operations and our final logistic regression:
lg_pipe = Pipeline([
('features', col_ops),
('interact', PolynomialFeatures()),
('predict', LogisticRegressionCV(penalty='l1', solver='liblinear'))
])
Now let’s fit that model:
lg_pipe.fit(train, train['isAction'])
Pipeline(steps=[('features',
ColumnTransformer(transformers=[('ratings',
Pipeline(steps=[('std',
StandardScaler()),
('fill',
SimpleImputer(fill_value=0,
strategy='constant'))]),
['rtAllCriticsRating',
'rtTopCriticsRating',
'rtAudienceRating']),
('counts',
Pipeline(steps=[('log',
FunctionTransformer(func=<ufunc 'log1p'>)),
('std',
StandardScaler()),
('fill',
SimpleImputer(fill_value=0,
strategy='constant'))]),
['rtAllCriticsNumReviews',
'rtTopCriticsNumReviews',
'rtAudienceNumRatings'])])),
('interact', PolynomialFeatures()),
('predict',
LogisticRegressionCV(penalty='l1', solver='liblinear'))])
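LogisticRegressionCV chooses the regularization strength by cross-validation. If we're curious what it settled on, we can reach into the fitted pipeline; named_steps gives us the fitted estimator, and its C_ attribute holds the selected value(s):

lg_pipe.named_steps['predict'].C_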
How do we do on that test data?
preds = lg_pipe.predict(test)
Compute the accuracy:
np.mean(preds == test['isAction'])
0.8545098039215686
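Equivalently, the pipeline's score method runs the test data through all the transformations and computes the final estimator's default metric (accuracy for classifiers):

lg_pipe.score(test, test['isAction'])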
Did we just predict everything is False?
np.sum(preds)
0
Yes, yes we did. We predicted the majority class.
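A confusion matrix makes the failure mode explicit; since every prediction is False, all the counts land in the first (predicted-negative) column:

from sklearn.metrics import confusion_matrix

confusion_matrix(test['isAction'], preds)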
This classifier is not good! That happens sometimes, but we have now seen an example of a more sophisticated SciKit-Learn pipeline.