SciKit-Learn Linear Regression Demo¶

This notebook demonstrates running a linear regression with sklearn.linear_model.LinearRegression, and comparing its results with those of a model from statsmodels.formula.api.ols().

Setup¶

First, let’s import our basic modules:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns

And initialize the RNG:

import seedbank
seedbank.initialize(20211026)

SeedSequence(
    entropy=20211026,
)

And we’ll load the movies we’ve been working with in several other examples:

movies = pd.read_csv('../data/hetrec2011-ml/movies.dat', sep='\t', encoding='latin1',
                     na_values='\\N')

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB

And clear out the zeros, as discussed in Missing Data:

movies.loc[movies['rtAllCriticsRating'] == 0, 'rtAllCriticsRating'] = np.nan
movies.loc[movies['rtAudienceRating'] == 0, 'rtAudienceRating'] = np.nan

Finally, we’ll create a test set of movies, and store the rest in training data:

test_movies = movies.sample(frac=0.2)

train_mask = pd.Series(True, index=movies.index)
train_mask[test_movies.index] = False
train_movies = movies[train_mask]

We could also do this with sklearn.model_selection.train_test_split(), but I’m using Pandas directly here so we can see clearly what the split is doing.

Statsmodels¶

We’re first going to use Statsmodels to compute a linear regression, like we have done on other data sets. Our goal is to use the all-critics rating to predict the audience rating:

lm_sm = smf.ols('rtAudienceRating ~ rtAllCriticsRating', data=train_movies)
lm_sm_fit = lm_sm.fit()
lm_sm_fit.summary()

OLS Regression Results
Dep. Variable:	rtAudienceRating	R-squared:	0.511
Model:	OLS	Adj. R-squared:	0.511
Method:	Least Squares	F-statistic:	6002.
Date:	Thu, 28 Oct 2021	Prob (F-statistic):	0.00
Time:	14:30:25	Log-Likelihood:	-1554.7
No. Observations:	5749	AIC:	3113.
Df Residuals:	5747	BIC:	3127.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	2.0949	0.017	121.401	0.000	2.061	2.129
rtAllCriticsRating	0.2137	0.003	77.474	0.000	0.208	0.219

Omnibus:	24.811	Durbin-Watson:	1.863
Prob(Omnibus):	0.000	Jarque-Bera (JB):	31.568
Skew:	0.064	Prob(JB):	1.40e-07
Kurtosis:	3.340	Cond. No.	26.4

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We could also try to do this with a standardized predictor variable:

critic_ratings = train_movies['rtAllCriticsRating']
cr_mean = critic_ratings.mean()
cr_std = critic_ratings.std()
critic_std = (critic_ratings - cr_mean) / cr_std
train_norm = pd.DataFrame({
    'critics': critic_std,
    'audience': train_movies['rtAudienceRating']
})

lm_nsm = smf.ols('audience ~ critics', data=train_norm)
lm_nsm_fit = lm_nsm.fit()
lm_nsm_fit.summary()

OLS Regression Results
Dep. Variable:	audience	R-squared:	0.511
Model:	OLS	Adj. R-squared:	0.511
Method:	Least Squares	F-statistic:	6002.
Date:	Thu, 28 Oct 2021	Prob (F-statistic):	0.00
Time:	14:30:25	Log-Likelihood:	-1554.7
No. Observations:	5749	AIC:	3113.
Df Residuals:	5747	BIC:	3127.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	3.3926	0.004	811.043	0.000	3.384	3.401
critics	0.3271	0.004	77.474	0.000	0.319	0.335

Omnibus:	24.811	Durbin-Watson:	1.863
Prob(Omnibus):	0.000	Jarque-Bera (JB):	31.568
Skew:	0.064	Prob(JB):	1.40e-07
Kurtosis:	3.340	Cond. No.	1.01

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This means that an increase of 1 standard deviation in the mean critic rating corresponds to an increase of 1 in the audience rating (since the audience ratings have not been standardized, it is 1 actual mean rating value, not 1 standard deviation in the response variable).

SciKit-Learn¶

Now we want to implement this model with SciKit-Learn’s linear regression model.

Import the linear regression class:

from sklearn.linear_model import LinearRegression

SciKit-Learn doesn’t automatically drop missing data, so we’re going to create a filtered copy of our training data that doesn’t have nulls in our predictor or outcome variables:

train_filt = train_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])

Now we’re going to create an X matrix with one column to contain our predictor variable:

train_X = train_filt[['rtAllCriticsRating']]
train_X.head()

	rtAllCriticsRating
0	9.0
1	5.6
4	5.3
5	7.7
6	7.4

And a y vector with the outcomes:

train_y = train_filt['rtAudienceRating']
train_y.head()

0    3.7
1    3.2
4    3.0
5    3.9
6    3.8
Name: rtAudienceRating, dtype: float64

With the data prepared, we can create and train our linear regression object. sklearn.linear_model.LinearRegression does not support regularization, so we can just use it as-is to get a model equivalent to Statsmodels:

linreg = LinearRegression()
linreg.fit(train_X, train_y)

LinearRegression()

Let’s look at its coefficient(s):

linreg.coef_

array([0.21366369])

That’s exactly what we got from statsmodels.

linreg.intercept_

2.0949060615377193

Also the same.

Let’s create a test X matrix to measure some predictions, after filtering:

test_movies = test_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])

test_X = test_movies[['rtAllCriticsRating']]

And a y that contains the outcomes for the same movies (make sure you understand how this next row works):

test_y = test_movies.loc[test_X.index, 'rtAudienceRating']

And we can generate predictions — SciKit-Learn doesn’t know about Pandas, so it will be an array:

preds = linreg.predict(test_X)
preds

array([3.95378017, 3.56918553, 3.01365993, ..., 2.88546172, 3.33415547,
       3.44098731])

And we can compute our errors and our mean squared error:

err = test_y - preds
err

8979    0.146220
6752   -0.069186
904    -0.213660
2293    0.453052
4588   -0.221363
          ...   
625    -0.476017
5342   -0.476017
8581    0.314538
6114   -0.234155
702    -0.540987
Name: rtAudienceRating, Length: 1463, dtype: float64

np.mean(np.square(err))

0.09531922808833015

The sklearn.metrics.mean_squared_error() function can do mean squared error for us:

from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, preds)

0.09531922808833015

Standardizing with SciKit-Learn¶

Now let’s think about that standardized model. We can do it with sklearn.preprocessing.StandardScaler and a pipeline. Import our classes:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

lm_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('predict', LinearRegression())
])
lm_pipe

Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])

We can fit the whole model:

lm_pipe.fit(train_X, train_y)

Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])

Now let’s see the coefficients:

lm_pipe.named_steps['predict'].coef_

array([0.32407576])

That matches what we got from statsmodels for this model! Let’s check our prediction error - it should be the same, since the new coefficient and intercept will compensate for the scaling to produce the same predictions (if we had regularization, this would not be true):

mean_squared_error(test_y, lm_pipe.predict(test_X))

0.09531922808833014

Regularization¶

Finally, I would like to demonstrate sklearn.linear_model.Ridge to try this with L2 regression; in particular, we’ll use sklearn.linear_model.RidgeCV to learn a good regression strength with cross-validation on the training data.

from sklearn.linear_model import RidgeCV

reg_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('predict', RidgeCV([0.001, 0.1, 1, 5, 10, 20]))
])
reg_pipe

Pipeline(steps=[('scale', StandardScaler()),
                ('predict',
                 RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])

Fit the data:

reg_pipe.fit(train_X, train_y)

Pipeline(steps=[('scale', StandardScaler()),
                ('predict',
                 RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])

What regularization did it learn?

reg_pipe.named_steps['predict'].alpha_

1.0

And let’s look at the prediction accuracy:

mean_squared_error(test_y, reg_pipe.predict(test_X))

0.0953203458413433

Very, very slightly higher. Even with a very simple model, regularization can sometimes improve generalization. The difference in this case is almost certainly not statistically significant.

CS 533 Fall 2021