SciKit-Learn Linear Regression Demo¶

This notebook demonstrates running a linear regression with sklearn.linear_model.LinearRegression and comparing its results with those of a model fit with statsmodels.formula.api.ols().

Setup¶

First, let’s import our basic modules:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns

And initialize the RNG:

import seedbank
seedbank.initialize(20211026)
SeedSequence(
    entropy=20211026,
)

And we’ll load the movies we’ve been working with in several other examples:

movies = pd.read_csv('../data/hetrec2011-ml/movies.dat', sep='\t', encoding='latin1',
                     na_values='\\N')
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB

And clear out the zeros, as discussed in Missing Data:

movies.loc[movies['rtAllCriticsRating'] == 0, 'rtAllCriticsRating'] = np.nan
movies.loc[movies['rtAudienceRating'] == 0, 'rtAudienceRating'] = np.nan

Finally, we’ll create a test set of movies and keep the rest as training data:

test_movies = movies.sample(frac=0.2)
train_mask = pd.Series(True, index=movies.index)
train_mask[test_movies.index] = False
train_movies = movies[train_mask]

We could also do this with sklearn.model_selection.train_test_split(), but I’m using Pandas directly here so we can see clearly what the split is doing.
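
If we did want to use scikit-learn for the split, a roughly equivalent call would look like the sketch below; the variable names and the random_state value are just illustrative, not what the rest of this notebook uses.

from sklearn.model_selection import train_test_split

# hold out 20% of the movies; random_state is an arbitrary example value
train_alt, test_alt = train_test_split(movies, test_size=0.2, random_state=42)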

Statsmodels¶

We’re first going to use Statsmodels to compute a linear regression, as we have done with other data sets. Our goal is to use the all-critics rating to predict the audience rating:

lm_sm = smf.ols('rtAudienceRating ~ rtAllCriticsRating', data=train_movies)
lm_sm_fit = lm_sm.fit()
lm_sm_fit.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:       rtAudienceRating   R-squared:                       0.511
Model:                            OLS   Adj. R-squared:                  0.511
Method:                 Least Squares   F-statistic:                     6002.
Date:                Thu, 28 Oct 2021   Prob (F-statistic):               0.00
Time:                        14:30:25   Log-Likelihood:                -1554.7
No. Observations:                5749   AIC:                             3113.
Df Residuals:                    5747   BIC:                             3127.
Df Model:                           1
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              2.0949      0.017    121.401      0.000       2.061       2.129
rtAllCriticsRating     0.2137      0.003     77.474      0.000       0.208       0.219
==============================================================================
Omnibus:                       24.811   Durbin-Watson:                   1.863
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.568
Skew:                           0.064   Prob(JB):                     1.40e-07
Kurtosis:                       3.340   Cond. No.                         26.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We could also try to do this with a standardized predictor variable:

critic_ratings = train_movies['rtAllCriticsRating']
cr_mean = critic_ratings.mean()
cr_std = critic_ratings.std()
critic_std = (critic_ratings - cr_mean) / cr_std
train_norm = pd.DataFrame({
    'critics': critic_std,
    'audience': train_movies['rtAudienceRating']
})
lm_nsm = smf.ols('audience ~ critics', data=train_norm)
lm_nsm_fit = lm_nsm.fit()
lm_nsm_fit.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:               audience   R-squared:                       0.511
Model:                            OLS   Adj. R-squared:                  0.511
Method:                 Least Squares   F-statistic:                     6002.
Date:                Thu, 28 Oct 2021   Prob (F-statistic):               0.00
Time:                        14:30:25   Log-Likelihood:                -1554.7
No. Observations:                5749   AIC:                             3113.
Df Residuals:                    5747   BIC:                             3127.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.3926      0.004    811.043      0.000       3.384       3.401
critics        0.3271      0.004     77.474      0.000       0.319       0.335
==============================================================================
Omnibus:                       24.811   Durbin-Watson:                   1.863
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.568
Skew:                           0.064   Prob(JB):                     1.40e-07
Kurtosis:                       3.340   Cond. No.                         1.01
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This means that an increase of 1 standard deviation in the mean critic rating corresponds to an increase of about 0.33 in the audience rating (since the audience ratings have not been standardized, that is 0.33 on the actual rating scale, not 0.33 standard deviations in the response variable).
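
As a quick sanity check of the relationship between the two fits, multiplying the unstandardized slope by the predictor’s standard deviation should reproduce the standardized slope; this sketch reuses the lm_sm_fit and cr_std objects defined above.

# the standardized slope equals the raw slope times the predictor's standard deviation
lm_sm_fit.params['rtAllCriticsRating'] * cr_std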

SciKit-Learn¶

Now we want to implement this model with SciKit-Learn’s linear regression model.

Import the linear regression class:

from sklearn.linear_model import LinearRegression

SciKit-Learn doesn’t automatically drop missing data, so we’re going to create a filtered copy of our training data that doesn’t have nulls in our predictor or outcome variables:

train_filt = train_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])

Now we’re going to create an X matrix with one column to contain our predictor variable:

train_X = train_filt[['rtAllCriticsRating']]
train_X.head()
rtAllCriticsRating
0 9.0
1 5.6
4 5.3
5 7.7
6 7.4

And a y vector with the outcomes:

train_y = train_filt['rtAudienceRating']
train_y.head()
0    3.7
1    3.2
4    3.0
5    3.9
6    3.8
Name: rtAudienceRating, dtype: float64

With the data prepared, we can create and train our linear regression object. sklearn.linear_model.LinearRegression does not support regularization, so we can just use it as-is to get a model equivalent to Statsmodels:

linreg = LinearRegression()
linreg.fit(train_X, train_y)
LinearRegression()

Let’s look at its coefficient(s):

linreg.coef_
array([0.21366369])

That’s exactly what we got from statsmodels.

linreg.intercept_
2.0949060615377193

Also the same.
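
The coef_ attribute is a bare NumPy array, so when there is more than one predictor it can be convenient to pair the coefficients with their column names; a small sketch:

# label each coefficient with its predictor column (only one column in this model)
pd.Series(linreg.coef_, index=train_X.columns)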

Let’s create a test X matrix so we can measure some predictions, again filtering out rows with missing values first:

test_movies = test_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])
test_X = test_movies[['rtAllCriticsRating']]

And a y vector containing the outcomes for the same movies (note how indexing with test_X.index keeps the outcomes aligned with the rows of the X matrix):

test_y = test_movies.loc[test_X.index, 'rtAudienceRating']

And we can generate predictions. SciKit-Learn does not return Pandas objects, so the result will be a NumPy array rather than a Series:

preds = linreg.predict(test_X)
preds
array([3.95378017, 3.56918553, 3.01365993, ..., 2.88546172, 3.33415547,
       3.44098731])
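
If we want the predictions aligned with the movie index again (for example, to join them back onto the test frame), we can wrap the array in a Series ourselves; pred_series here is just an illustrative name:

# re-attach the test movies' index to the raw prediction array
pred_series = pd.Series(preds, index=test_X.index)
pred_series.head()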

And we can compute our errors and our mean squared error:

err = test_y - preds
err
8979    0.146220
6752   -0.069186
904    -0.213660
2293    0.453052
4588   -0.221363
          ...   
625    -0.476017
5342   -0.476017
8581    0.314538
6114   -0.234155
702    -0.540987
Name: rtAudienceRating, Length: 1463, dtype: float64
np.mean(np.square(err))
0.09531922808833015

The sklearn.metrics.mean_squared_error() function can compute the mean squared error for us:

from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, preds)
0.09531922808833015
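
If we want the error on the original rating scale rather than in squared units, we can take the square root to get the root mean squared error (using NumPy here rather than any scikit-learn option, to keep the sketch version-agnostic):

# RMSE is expressed in rating units rather than squared rating units
np.sqrt(mean_squared_error(test_y, preds))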

Standardizing with SciKit-Learn¶

Now let’s think about that standardized model. We can do it with sklearn.preprocessing.StandardScaler and a pipeline. Import our classes:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
lm_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('predict', LinearRegression())
])
lm_pipe
Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])

We can fit the whole model:

lm_pipe.fit(train_X, train_y)
Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])

Now let’s see the coefficients:

lm_pipe.named_steps['predict'].coef_
array([0.32407576])

That is essentially what we got from statsmodels for this model (the small difference in the third decimal place is most likely because the scaler is fit only on rows where both variables are non-missing, while our earlier standardization used all non-missing critic ratings). Let’s check our prediction error; it should be the same as before, since the new coefficient and intercept compensate for the scaling to produce the same predictions (if we had regularization, this would not be true):

mean_squared_error(test_y, lm_pipe.predict(test_X))
0.09531922808833014
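
We can also undo the scaling by hand: dividing the standardized coefficient by the scaler’s per-feature standard deviation (its scale_ attribute) should recover the unstandardized slope we saw earlier. A quick check:

# dividing by the scaler's standard deviation converts the slope back to the original units
lm_pipe.named_steps['predict'].coef_ / lm_pipe.named_steps['scale'].scale_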

Regularization¶

Finally, I would like to demonstrate sklearn.linear_model.Ridge to try this with L2 (ridge) regularization; in particular, we’ll use sklearn.linear_model.RidgeCV to choose a good regularization strength with cross-validation on the training data.

from sklearn.linear_model import RidgeCV
reg_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('predict', RidgeCV([0.001, 0.1, 1, 5, 10, 20]))
])
reg_pipe
Pipeline(steps=[('scale', StandardScaler()),
                ('predict',
                 RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])

Fit the data:

reg_pipe.fit(train_X, train_y)
Pipeline(steps=[('scale', StandardScaler()),
                ('predict',
                 RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])

What regularization strength did it choose?

reg_pipe.named_steps['predict'].alpha_
1.0
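
Since this penalty is small relative to the number of training rows, the coefficient should be barely shrunk; we can compare it directly to the unregularized standardized slope from the earlier pipeline (a quick sketch):

# ridge slope (standardized scale) next to the unregularized standardized slope
reg_pipe.named_steps['predict'].coef_, lm_pipe.named_steps['predict'].coef_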

And let’s look at the prediction accuracy:

mean_squared_error(test_y, reg_pipe.predict(test_X))
0.0953203458413433

Very, very slightly higher. With a model this simple there is little for regularization to do, and here it actually makes the test error negligibly worse; the difference is almost certainly not statistically significant.