# SciKit-Learn Linear Regression Demo¶

This notebook demonstrates running a linear regression with sklearn.linear_model.LinearRegression, and comparing its results with those of a model from statsmodels.formula.api.ols().

## Setup¶

First, let’s import our basic modules:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns


And initialize the RNG:

import seedbank
seedbank.initialize(20211026)

SeedSequence(
entropy=20211026,
)


And we’ll load the movies we’ve been working with in several other examples:

movies = pd.read_csv('../data/hetrec2011-ml/movies.dat', sep='\t', encoding='latin1',
na_values='\\N')

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
#   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
0   id                      10197 non-null  int64
1   title                   10197 non-null  object
2   imdbID                  10197 non-null  int64
3   spanishTitle            10197 non-null  object
4   imdbPictureURL          10016 non-null  object
5   year                    10197 non-null  int64
6   rtID                    9886 non-null   object
7   rtAllCriticsRating      9967 non-null   float64
8   rtAllCriticsNumReviews  9967 non-null   float64
9   rtAllCriticsNumFresh    9967 non-null   float64
10  rtAllCriticsNumRotten   9967 non-null   float64
11  rtAllCriticsScore       9967 non-null   float64
12  rtTopCriticsRating      9967 non-null   float64
13  rtTopCriticsNumReviews  9967 non-null   float64
14  rtTopCriticsNumFresh    9967 non-null   float64
15  rtTopCriticsNumRotten   9967 non-null   float64
16  rtTopCriticsScore       9967 non-null   float64
17  rtAudienceRating        9967 non-null   float64
18  rtAudienceNumRatings    9967 non-null   float64
19  rtAudienceScore         9967 non-null   float64
20  rtPictureURL            9967 non-null   object
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB


And clear out the zeros, as discussed in Missing Data:

movies.loc[movies['rtAllCriticsRating'] == 0, 'rtAllCriticsRating'] = np.nan
movies.loc[movies['rtAudienceRating'] == 0, 'rtAudienceRating'] = np.nan


Finally, we’ll create a test set of movies, and store the rest in training data:

test_movies = movies.sample(frac=0.2)

train_mask = pd.Series(True, index=movies.index)


We could also do this with sklearn.model_selection.train_test_split(), but I’m using Pandas directly here so we can see clearly what the split is doing.

## Statsmodels¶

We’re first going to use Statsmodels to compute a linear regression, like we have done on other data sets. Our goal is to use the all-critics rating to predict the audience rating:

lm_sm = smf.ols('rtAudienceRating ~ rtAllCriticsRating', data=train_movies)
lm_sm_fit = lm_sm.fit()
lm_sm_fit.summary()

Dep. Variable: R-squared: rtAudienceRating 0.511 OLS 0.511 Least Squares 6002. Thu, 28 Oct 2021 0.00 14:30:25 -1554.7 5749 3113. 5747 3127. 1 nonrobust
coef std err t P>|t| [0.025 0.975] 2.0949 0.017 121.401 0.000 2.061 2.129 0.2137 0.003 77.474 0.000 0.208 0.219
 Omnibus: Durbin-Watson: 24.811 1.863 0 31.568 0.064 1.4e-07 3.34 26.4

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We could also try to do this with a standardized predictor variable:

critic_ratings = train_movies['rtAllCriticsRating']
cr_mean = critic_ratings.mean()
cr_std = critic_ratings.std()
critic_std = (critic_ratings - cr_mean) / cr_std
train_norm = pd.DataFrame({
'critics': critic_std,
'audience': train_movies['rtAudienceRating']
})

lm_nsm = smf.ols('audience ~ critics', data=train_norm)
lm_nsm_fit = lm_nsm.fit()
lm_nsm_fit.summary()

Dep. Variable: R-squared: audience 0.511 OLS 0.511 Least Squares 6002. Thu, 28 Oct 2021 0.00 14:30:25 -1554.7 5749 3113. 5747 3127. 1 nonrobust
coef std err t P>|t| [0.025 0.975] 3.3926 0.004 811.043 0.000 3.384 3.401 0.3271 0.004 77.474 0.000 0.319 0.335
 Omnibus: Durbin-Watson: 24.811 1.863 0 31.568 0.064 1.4e-07 3.34 1.01

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This means that an increase of 1 standard deviation in the mean critic rating corresponds to an increase of 1 in the audience rating (since the audience ratings have not been standardized, it is 1 actual mean rating value, not 1 standard deviation in the response variable).

## SciKit-Learn¶

Now we want to implement this model with SciKit-Learn’s linear regression model.

Import the linear regression class:

from sklearn.linear_model import LinearRegression


SciKit-Learn doesn’t automatically drop missing data, so we’re going to create a filtered copy of our training data that doesn’t have nulls in our predictor or outcome variables:

train_filt = train_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])


Now we’re going to create an X matrix with one column to contain our predictor variable:

train_X = train_filt[['rtAllCriticsRating']]

rtAllCriticsRating
0 9.0
1 5.6
4 5.3
5 7.7
6 7.4

And a y vector with the outcomes:

train_y = train_filt['rtAudienceRating']

0    3.7
1    3.2
4    3.0
5    3.9
6    3.8
Name: rtAudienceRating, dtype: float64


With the data prepared, we can create and train our linear regression object. sklearn.linear_model.LinearRegression does not support regularization, so we can just use it as-is to get a model equivalent to Statsmodels:

linreg = LinearRegression()
linreg.fit(train_X, train_y)

LinearRegression()


Let’s look at its coefficient(s):

linreg.coef_

array([0.21366369])


That’s exactly what we got from statsmodels.

linreg.intercept_

2.0949060615377193


Also the same.

Let’s create a test X matrix to measure some predictions, after filtering:

test_movies = test_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])

test_X = test_movies[['rtAllCriticsRating']]


And a y that contains the outcomes for the same movies (make sure you understand how this next row works):

test_y = test_movies.loc[test_X.index, 'rtAudienceRating']


And we can generate predictions — SciKit-Learn doesn’t know about Pandas, so it will be an array:

preds = linreg.predict(test_X)
preds

array([3.95378017, 3.56918553, 3.01365993, ..., 2.88546172, 3.33415547,
3.44098731])


And we can compute our errors and our mean squared error:

err = test_y - preds
err

8979    0.146220
6752   -0.069186
904    -0.213660
2293    0.453052
4588   -0.221363
...
625    -0.476017
5342   -0.476017
8581    0.314538
6114   -0.234155
702    -0.540987
Name: rtAudienceRating, Length: 1463, dtype: float64

np.mean(np.square(err))

0.09531922808833015


The sklearn.metrics.mean_squared_error() function can do mean squared error for us:

from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, preds)

0.09531922808833015


## Standardizing with SciKit-Learn¶

Now let’s think about that standardized model. We can do it with sklearn.preprocessing.StandardScaler and a pipeline. Import our classes:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

lm_pipe = Pipeline([
('scale', StandardScaler()),
('predict', LinearRegression())
])
lm_pipe

Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])


We can fit the whole model:

lm_pipe.fit(train_X, train_y)

Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])


Now let’s see the coefficients:

lm_pipe.named_steps['predict'].coef_

array([0.32407576])


That matches what we got from statsmodels for this model! Let’s check our prediction error - it should be the same, since the new coefficient and intercept will compensate for the scaling to produce the same predictions (if we had regularization, this would not be true):

mean_squared_error(test_y, lm_pipe.predict(test_X))

0.09531922808833014


## Regularization¶

Finally, I would like to demonstrate sklearn.linear_model.Ridge to try this with L2 regression; in particular, we’ll use sklearn.linear_model.RidgeCV to learn a good regression strength with cross-validation on the training data.

from sklearn.linear_model import RidgeCV

reg_pipe = Pipeline([
('scale', StandardScaler()),
('predict', RidgeCV([0.001, 0.1, 1, 5, 10, 20]))
])
reg_pipe

Pipeline(steps=[('scale', StandardScaler()),
('predict',
RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])


Fit the data:

reg_pipe.fit(train_X, train_y)

Pipeline(steps=[('scale', StandardScaler()),
('predict',
RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])


What regularization did it learn?

reg_pipe.named_steps['predict'].alpha_

1.0


And let’s look at the prediction accuracy:

mean_squared_error(test_y, reg_pipe.predict(test_X))

0.0953203458413433


Very, very slightly higher. Even with a very simple model, regularization can sometimes improve generalization. The difference in this case is almost certainly not statistically significant.