Advanced Pipeline Example

This notebook demonstrates a more advanced use of SciKit-Learn pipelines: more sophisticated than my basic pipeline example, but simpler than the TDS example.

The classification task is to predict whether or not a movie is an action movie. We are going to use regularized logistic regression for this task. If we can effectively predict whether a movie is an action movie from its ratings (number of ratings and/or rating values), that is evidence that action movies have different patterns than other movies; maybe they are more popular, or (on average) higher- or lower-rated.


As usual, we want to import things:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

And we’ll import many things from SciKit-Learn:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

And seed the RNG:

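import seedbank
# Assumption: the seed value is not shown in the source; 42 is a placeholder
seedbank.initialize(42)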

Data Preparation

Let’s first load the movie data:

movies = pd.read_csv('../data/hetrec2011-ml/movies.dat', sep='\t', encoding='latin1')
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB

And we’ll need genres:

genres = pd.read_csv('../data/hetrec2011-ml/movie_genres.dat', sep='\t', encoding='latin1')
genres

       movieID      genre
0            1  Adventure
1            1  Animation
2            1   Children
3            1     Comedy
4            1    Fantasy
...        ...        ...
20804    65126     Comedy
20805    65126      Drama
20806    65130      Drama
20807    65130    Romance
20808    65133     Comedy

20809 rows × 2 columns

We’re going to get a list of action movie IDs and make our outcome column:

action_movies = genres.loc[genres['genre'] == 'Action', 'movieID']
movies['isAction'] = movies['id'].isin(action_movies)
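
To make the outcome construction concrete, here is a minimal sketch on toy data. The frames toy_movies and toy_genres and their values are made up for illustration; only the filter-then-isin pattern mirrors the code above.

```python
import pandas as pd

# Toy stand-ins for the movies and genres tables (made-up values)
toy_movies = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_genres = pd.DataFrame({
    'movieID': [1, 1, 2, 3],
    'genre': ['Action', 'Comedy', 'Drama', 'Action'],
})

# IDs of movies that have at least one 'Action' row
toy_action = toy_genres.loc[toy_genres['genre'] == 'Action', 'movieID']

# True where the movie ID appears among the action IDs
toy_movies['isAction'] = toy_movies['id'].isin(toy_action)
print(toy_movies)
```

A movie with several genres still gets a single True/False value, because isin only asks whether its ID appears anywhere in the action list.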

Treat zero ratings as missing by converting them to NaN:

movies.loc[movies['rtAllCriticsRating'] == 0, 'rtAllCriticsRating'] = np.nan
movies.loc[movies['rtTopCriticsRating'] == 0, 'rtTopCriticsRating'] = np.nan
movies.loc[movies['rtAudienceRating'] == 0, 'rtAudienceRating'] = np.nan
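
The same zero-to-missing conversion can probably be written in one step with DataFrame.replace. This is only a sketch on a toy frame with made-up values, not a change to the code above.

```python
import numpy as np
import pandas as pd

# Toy frame with zeros standing in for "no rating" (made-up values)
toy = pd.DataFrame({
    'rtAllCriticsRating': [0.0, 6.5, 7.2],
    'rtAudienceRating': [3.4, 0.0, 0.0],
})

# Replace exact zeros with NaN in every listed column at once
rating_cols = ['rtAllCriticsRating', 'rtAudienceRating']
toy[rating_cols] = toy[rating_cols].replace(0, np.nan)
print(toy.isna().sum())
```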

Create our train/test split:

train, test = train_test_split(movies, test_size=0.25)
train.set_index('id', inplace=True)
test.set_index('id', inplace=True)


Now that we have training data, let’s look at our class balance:

sns.countplot(x='isAction', data=train)

[Figure: count plot with isAction on the x-axis and count on the y-axis]

Non-action is our majority class. What is that fraction?

maj_frac = 1 - train['isAction'].mean()

If our accuracy is less than 0.859, we aren’t beating the majority-class baseline.
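
To see why the majority fraction is the number to beat, here is a small sketch with a made-up outcome vector: a classifier that always predicts the majority class scores exactly that fraction.

```python
import numpy as np

# Toy outcome vector: 2 of 10 items are in the positive class (made-up values)
y = np.array([True, False, False, False, False,
              False, True, False, False, False])

# Fraction of the majority (negative) class
toy_maj_frac = 1 - y.mean()

# A classifier that always predicts False is right exactly on the negatives
always_false = np.zeros_like(y, dtype=bool)
baseline_acc = np.mean(always_false == y)
print(toy_maj_frac, baseline_acc)
```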

Let’s start to look at some possible correlations with our outcome variable.

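rate_cols = [
    'rtAllCriticsRating', 'rtAllCriticsNumReviews',
    'rtTopCriticsRating', 'rtTopCriticsNumReviews',
    'rtAudienceRating', 'rtAudienceNumRatings'
]
# Keep the movie id index through the melt so the join on isAction lines up
pivot = train[rate_cols].melt(ignore_index=False)
pivot = pivot.join(train['isAction'])
grid = sns.FacetGrid(data=pivot, col='variable', hue='isAction', sharey=False, sharex=False)
# Assumption: the plotting function is cut off in the source; sns.kdeplot is a guess
grid.map(sns.kdeplot, 'value')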

That isn’t looking promising. But the purpose of this tutorial is to demonstrate pipelines, not to build a great classifier.

Building the Classifier

For this tutorial, we are going to apply different transformations to different columns.

  • For average ratings, we will standardize the variables, and fill in missing values with 0 (unknown -> ignore)

  • For counts, we will take the log, and then standardize the logs; fill missing values with 0 (one rating)
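
The "standardize, then fill with 0" recipe from the bullets can be sketched on a single toy column (the numbers are made up). StandardScaler skips NaN values when fitting and passes them through when transforming, so filling with 0 afterwards places each missing value at the column mean.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# One toy rating column with a missing value (made-up numbers)
X = np.array([[2.0], [np.nan], [4.0]])

toy_pipe = Pipeline([
    ('std', StandardScaler()),   # NaNs are ignored in fit and kept in transform
    ('fill', SimpleImputer(strategy='constant', fill_value=0)),
])
out = toy_pipe.fit_transform(X)
print(out.ravel())
```

The observed values land at -1 and 1, and the missing entry becomes 0, which is the mean of the standardized column.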

The SciKit-Learn tool sklearn.compose.ColumnTransformer allows us to specify a transform that will do that.

SciKit-Learn doesn’t have a direct log transform, but we can use sklearn.preprocessing.FunctionTransformer with numpy.log1p().

col_ops = ColumnTransformer([
    # Average ratings: standardize, then fill missing values with 0
    ('ratings', Pipeline([
        ('std', StandardScaler()),
        ('fill', SimpleImputer(strategy='constant', fill_value=0))]),
     ['rtAllCriticsRating', 'rtTopCriticsRating', 'rtAudienceRating']),
    # Counts: log-transform, standardize, then fill missing values with 0
    ('counts', Pipeline([
        ('log', FunctionTransformer(np.log1p)),
        ('std', StandardScaler()),
        ('fill', SimpleImputer(strategy='constant', fill_value=0))]),
     ['rtAllCriticsNumReviews', 'rtTopCriticsNumReviews', 'rtAudienceNumRatings'])
])

By default, the column transformer drops any columns that aren’t named in one of its transformers. It also knows enough about Pandas to select columns by name.
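
A quick way to check the drop-by-default behavior is a toy frame (made-up columns and values) where only one column is named in the transformer. The default is remainder='drop', so the other columns do not appear in the output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame; 'count' and 'title' are never referenced below
toy = pd.DataFrame({
    'rating': [1.0, 2.0, 3.0],
    'count': [10.0, 20.0, 30.0],
    'title': ['A', 'B', 'C'],
})

# Only 'rating' is selected, by column name
ct = ColumnTransformer([
    ('std', StandardScaler(), ['rating']),
])
out = ct.fit_transform(toy)
print(out.shape)
```

This is also why the full training frame, including the title and the outcome column, can be passed to the pipeline without those columns leaking into the features.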

Let’s see it in action:

col_ops.fit_transform(train)

array([[-1.42368447, -0.85740736, -1.08079972,  0.13052435,  0.43061088, ...],
       [ 0.        ,  0.        ,  0.        , -1.4082733 , -1.28161907, ...],
       [ 0.54852497,  0.        ,  0.46093164, -0.53448073, -0.0274333 , ...],
       [ 0.61426528,  0.        ,  1.12167365, -0.10789001, -0.4255041 ,  0.7642107 ],
       [-0.50332006, -0.59676893, -0.64030504,  1.42116578,  1.46636589, ...],
       [-0.04313786,  0.        ,  0.        , -0.46530863, -1.28161907, ...]])

We’re also going to allow the model to use first-order interaction terms: the product of the (standardized) rating count and rating value, for example. The sklearn.preprocessing.PolynomialFeatures transformer can do that. So we will put it in a pipeline with our column operations and our final logistic regression:

lg_pipe = Pipeline([
    ('features', col_ops),
    ('interact', PolynomialFeatures()),
    ('predict', LogisticRegressionCV(penalty='l1', solver='liblinear'))
])
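
It may help to see what PolynomialFeatures() generates with its defaults. In this sketch on a made-up two-feature row, the default (degree 2, with a bias column) adds squared terms as well as the pairwise product, while interaction_only=True keeps only the product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with two features a=2, b=3 (made-up values)
X = np.array([[2.0, 3.0]])

# Default: degree=2 with a bias column -> 1, a, b, a^2, a*b, b^2
full = PolynomialFeatures().fit_transform(X)

# interaction_only=True drops the squares -> 1, a, b, a*b
inter = PolynomialFeatures(interaction_only=True).fit_transform(X)
print(full)
print(inter)
```

With six input features, the default setting therefore hands the logistic regression quite a few extra columns, which is one reason an L1 penalty is a reasonable match here.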

Now let’s fit that model:

lg_pipe.fit(train, train['isAction'])

Pipeline(steps=[('features', ColumnTransformer(...)),
                ('interact', PolynomialFeatures()),
                ('predict',
                 LogisticRegressionCV(penalty='l1', solver='liblinear'))])

How do we do on that test data?

preds = lg_pipe.predict(test)

Compute the accuracy:

np.mean(preds == test['isAction'])

Did we just predict everything is False?


Yes, yes we did. We predicted the majority class.
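
One way to run this kind of check is sketched below on made-up arrays; toy_preds and toy_truth are illustrative names, not the notebook’s variables. Accuracy alone hides the problem, while recall on the positive class makes it obvious.

```python
import numpy as np

# Toy predictions and labels (made-up values)
toy_preds = np.array([False, False, False, False, False])
toy_truth = np.array([False, True, False, False, False])

# True when no item was predicted positive
all_negative = not toy_preds.any()

# Accuracy equals the majority-class fraction in this situation
accuracy = np.mean(toy_preds == toy_truth)

# Share of the true positives that the classifier recovered
recall = (toy_preds & toy_truth).sum() / toy_truth.sum()
print(all_negative, accuracy, recall)
```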

This classifier is not good! That happens sometimes, but we have now seen an example of a more sophisticated SciKit-Learn pipeline.