SciKit-Learn Linear Regression Demo¶
This notebook demonstrates running a linear regression with sklearn.linear_model.LinearRegression, and comparing its results with those of a model from statsmodels.formula.api.ols().
Setup¶
First, let’s import our basic modules:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
And initialize the RNG:
import seedbank
seedbank.initialize(20211026)
SeedSequence(
entropy=20211026,
)
And we’ll load the movies we’ve been working with in several other examples:
movies = pd.read_csv('../data/hetrec2011-ml/movies.dat', sep='\t', encoding='latin1',
na_values='\\N')
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 10197 non-null int64
1 title 10197 non-null object
2 imdbID 10197 non-null int64
3 spanishTitle 10197 non-null object
4 imdbPictureURL 10016 non-null object
5 year 10197 non-null int64
6 rtID 9886 non-null object
7 rtAllCriticsRating 9967 non-null float64
8 rtAllCriticsNumReviews 9967 non-null float64
9 rtAllCriticsNumFresh 9967 non-null float64
10 rtAllCriticsNumRotten 9967 non-null float64
11 rtAllCriticsScore 9967 non-null float64
12 rtTopCriticsRating 9967 non-null float64
13 rtTopCriticsNumReviews 9967 non-null float64
14 rtTopCriticsNumFresh 9967 non-null float64
15 rtTopCriticsNumRotten 9967 non-null float64
16 rtTopCriticsScore 9967 non-null float64
17 rtAudienceRating 9967 non-null float64
18 rtAudienceNumRatings 9967 non-null float64
19 rtAudienceScore 9967 non-null float64
20 rtPictureURL 9967 non-null object
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB
And clear out the zeros, as discussed in Missing Data:
movies.loc[movies['rtAllCriticsRating'] == 0, 'rtAllCriticsRating'] = np.nan
movies.loc[movies['rtAudienceRating'] == 0, 'rtAudienceRating'] = np.nan
Finally, we’ll create a test set of movies, and store the rest in training data:
test_movies = movies.sample(frac=0.2)
train_mask = pd.Series(True, index=movies.index)
train_mask[test_movies.index] = False
train_movies = movies[train_mask]
We could also do this with sklearn.model_selection.train_test_split(), but I’m using Pandas directly here so we can see clearly what the split is doing.
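For comparison, here is a minimal sketch of the same 80/20 split using that function (the variable names and random_state are illustrative, and it will select a different random subset than the Pandas-based split above):

from sklearn.model_selection import train_test_split

# illustrative only: produces a different random 80/20 split of the movies
train_alt, test_alt = train_test_split(movies, test_size=0.2, random_state=20211026)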
Statsmodels¶
We’re first going to use Statsmodels to compute a linear regression, like we have done on other data sets. Our goal is to use the all-critics rating to predict the audience rating:
lm_sm = smf.ols('rtAudienceRating ~ rtAllCriticsRating', data=train_movies)
lm_sm_fit = lm_sm.fit()
lm_sm_fit.summary()
| Dep. Variable:    | rtAudienceRating | R-squared:          | 0.511   |
|-------------------|------------------|---------------------|---------|
| Model:            | OLS              | Adj. R-squared:     | 0.511   |
| Method:           | Least Squares    | F-statistic:        | 6002.   |
| Date:             | Thu, 28 Oct 2021 | Prob (F-statistic): | 0.00    |
| Time:             | 14:30:25         | Log-Likelihood:     | -1554.7 |
| No. Observations: | 5749             | AIC:                | 3113.   |
| Df Residuals:     | 5747             | BIC:                | 3127.   |
| Df Model:         | 1                |                     |         |
| Covariance Type:  | nonrobust        |                     |         |

|                    | coef   | std err | t       | P>\|t\| | [0.025 | 0.975] |
|--------------------|--------|---------|---------|---------|--------|--------|
| Intercept          | 2.0949 | 0.017   | 121.401 | 0.000   | 2.061  | 2.129  |
| rtAllCriticsRating | 0.2137 | 0.003   | 77.474  | 0.000   | 0.208  | 0.219  |

| Omnibus:       | 24.811 | Durbin-Watson:    | 1.863    |
|----------------|--------|-------------------|----------|
| Prob(Omnibus): | 0.000  | Jarque-Bera (JB): | 31.568   |
| Skew:          | 0.064  | Prob(JB):         | 1.40e-07 |
| Kurtosis:      | 3.340  | Cond. No.         | 26.4     |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We could also try to do this with a standardized predictor variable:
critic_ratings = train_movies['rtAllCriticsRating']
cr_mean = critic_ratings.mean()
cr_std = critic_ratings.std()
critic_std = (critic_ratings - cr_mean) / cr_std
train_norm = pd.DataFrame({
'critics': critic_std,
'audience': train_movies['rtAudienceRating']
})
lm_nsm = smf.ols('audience ~ critics', data=train_norm)
lm_nsm_fit = lm_nsm.fit()
lm_nsm_fit.summary()
| Dep. Variable:    | audience         | R-squared:          | 0.511   |
|-------------------|------------------|---------------------|---------|
| Model:            | OLS              | Adj. R-squared:     | 0.511   |
| Method:           | Least Squares    | F-statistic:        | 6002.   |
| Date:             | Thu, 28 Oct 2021 | Prob (F-statistic): | 0.00    |
| Time:             | 14:30:25         | Log-Likelihood:     | -1554.7 |
| No. Observations: | 5749             | AIC:                | 3113.   |
| Df Residuals:     | 5747             | BIC:                | 3127.   |
| Df Model:         | 1                |                     |         |
| Covariance Type:  | nonrobust        |                     |         |

|           | coef   | std err | t       | P>\|t\| | [0.025 | 0.975] |
|-----------|--------|---------|---------|---------|--------|--------|
| Intercept | 3.3926 | 0.004   | 811.043 | 0.000   | 3.384  | 3.401  |
| critics   | 0.3271 | 0.004   | 77.474  | 0.000   | 0.319  | 0.335  |

| Omnibus:       | 24.811 | Durbin-Watson:    | 1.863    |
|----------------|--------|-------------------|----------|
| Prob(Omnibus): | 0.000  | Jarque-Bera (JB): | 31.568   |
| Skew:          | 0.064  | Prob(JB):         | 1.40e-07 |
| Kurtosis:      | 3.340  | Cond. No.         | 1.01     |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This means that an increase of 1 standard deviation in the mean critic rating corresponds to an increase of about 0.33 in the audience rating (since the audience ratings have not been standardized, that 0.33 is in actual rating units, not standard deviations of the response variable).
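As a quick sanity check (a sketch reusing the variables defined above): rescaling a predictor by a constant rescales its OLS slope by the same constant, so multiplying the unstandardized slope by the predictor’s standard deviation should reproduce the critics coefficient of about 0.327:

# unstandardized slope times the predictor's standard deviation
# should approximately equal the standardized 'critics' slope
lm_sm_fit.params['rtAllCriticsRating'] * cr_std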
SciKit-Learn¶
Now we want to implement this model with SciKit-Learn’s linear regression model.
Import the linear regression class:
from sklearn.linear_model import LinearRegression
SciKit-Learn doesn’t automatically drop missing data, so we’re going to create a filtered copy of our training data that doesn’t have nulls in our predictor or outcome variables:
train_filt = train_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])
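It can be worth checking how many rows the filter removed (a quick sketch; the exact counts depend on the random split above):

# compare the unfiltered and filtered training set sizes
len(train_movies), len(train_filt)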
Now we’re going to create an X matrix with one column to contain our predictor variable:
train_X = train_filt[['rtAllCriticsRating']]
train_X.head()
|   | rtAllCriticsRating |
|---|--------------------|
| 0 | 9.0                |
| 1 | 5.6                |
| 4 | 5.3                |
| 5 | 7.7                |
| 6 | 7.4                |
And a y vector with the outcomes:
train_y = train_filt['rtAudienceRating']
train_y.head()
0 3.7
1 3.2
4 3.0
5 3.9
6 3.8
Name: rtAudienceRating, dtype: float64
With the data prepared, we can create and train our linear regression object. sklearn.linear_model.LinearRegression does not support regularization, so we can just use it as-is to get a model equivalent to Statsmodels:
linreg = LinearRegression()
linreg.fit(train_X, train_y)
LinearRegression()
Let’s look at its coefficient(s):
linreg.coef_
array([0.21366369])
That’s exactly what we got from statsmodels.
linreg.intercept_
2.0949060615377193
Also the same.
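To make the fitted model concrete: a prediction is just the intercept plus the coefficient times the critic rating. For example, for a (purely illustrative) critic rating of 7.0:

# manual prediction for an illustrative critic rating of 7.0
linreg.intercept_ + linreg.coef_[0] * 7.0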
Let’s create a test X matrix so we can measure some predictions, after filtering out rows with missing values as we did for the training data:
test_movies = test_movies.dropna(subset=['rtAllCriticsRating', 'rtAudienceRating'])
test_X = test_movies[['rtAllCriticsRating']]
And a y that contains the outcomes for the same movies (make sure you understand how this next row works):
test_y = test_movies.loc[test_X.index, 'rtAudienceRating']
And we can generate predictions — SciKit-Learn doesn’t know about Pandas, so it will be an array:
preds = linreg.predict(test_X)
preds
array([3.95378017, 3.56918553, 3.01365993, ..., 2.88546172, 3.33415547,
3.44098731])
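SciKit-Learn returns a plain NumPy array; if we’d like the predictions labeled by movie, we can wrap them in a Series ourselves (a convenience sketch, and the name preds_s is ours; the steps below work fine with the raw array):

# re-attach the test index so predictions align with test_y by label
preds_s = pd.Series(preds, index=test_X.index)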
And we can compute our errors and our mean squared error:
err = test_y - preds
err
8979 0.146220
6752 -0.069186
904 -0.213660
2293 0.453052
4588 -0.221363
...
625 -0.476017
5342 -0.476017
8581 0.314538
6114 -0.234155
702 -0.540987
Name: rtAudienceRating, Length: 1463, dtype: float64
np.mean(np.square(err))
0.09531922808833015
The sklearn.metrics.mean_squared_error() function can compute the mean squared error for us:
from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, preds)
0.09531922808833015
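Taking the square root gives the root mean squared error, which is back in the same units as the ratings themselves (about 0.31 here):

# RMSE, in rating units
np.sqrt(mean_squared_error(test_y, preds))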
Standardizing with SciKit-Learn¶
Now let’s think about that standardized model. We can do it with sklearn.preprocessing.StandardScaler and a pipeline. Import our classes:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
lm_pipe = Pipeline([
('scale', StandardScaler()),
('predict', LinearRegression())
])
lm_pipe
Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])
We can fit the whole model:
lm_pipe.fit(train_X, train_y)
Pipeline(steps=[('scale', StandardScaler()), ('predict', LinearRegression())])
Now let’s see the coefficients:
lm_pipe.named_steps['predict'].coef_
array([0.32407576])
That’s very close to what we got from statsmodels for this model (0.3271); the small remaining difference comes from computing the standardization statistics on slightly different rows (the scaler here is fit only on rows that also have audience ratings, and it uses the population standard deviation). Let’s check our prediction error - it should be the same, since the new coefficient and intercept will compensate for the scaling to produce the same predictions (if we had regularization, this would not be true):
mean_squared_error(test_y, lm_pipe.predict(test_X))
0.09531922808833014
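We can also confirm directly that the two models make numerically identical predictions (a small sanity-check sketch):

# the standardized pipeline and the plain regression should agree
# up to floating-point error
np.allclose(lm_pipe.predict(test_X), linreg.predict(test_X))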
Regularization¶
Finally, I would like to demonstrate sklearn.linear_model.Ridge to try this with L2-regularized (ridge) regression; in particular, we’ll use sklearn.linear_model.RidgeCV to learn a good regularization strength with cross-validation on the training data.
from sklearn.linear_model import RidgeCV
reg_pipe = Pipeline([
('scale', StandardScaler()),
('predict', RidgeCV([0.001, 0.1, 1, 5, 10, 20]))
])
reg_pipe
Pipeline(steps=[('scale', StandardScaler()),
('predict',
RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])
Fit the data:
reg_pipe.fit(train_X, train_y)
Pipeline(steps=[('scale', StandardScaler()),
('predict',
RidgeCV(alphas=array([1.e-03, 1.e-01, 1.e+00, 5.e+00, 1.e+01, 2.e+01])))])
What regularization did it learn?
reg_pipe.named_steps['predict'].alpha_
1.0
And let’s look at the prediction accuracy:
mean_squared_error(test_y, reg_pipe.predict(test_X))
0.0953203458413433
Very, very slightly higher. With a model this simple there is very little overfitting for regularization to correct, so it doesn’t help here; the difference in this case is almost certainly not statistically significant either way.
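To see where that tiny difference comes from, we can compare the ridge slope with the unregularized pipeline’s slope (a sketch reusing the pipelines above); with alpha = 1 and several thousand training rows, the penalty shrinks the coefficient only very slightly:

# standardized slopes: ridge vs. ordinary least squares
reg_pipe.named_steps['predict'].coef_, lm_pipe.named_steps['predict'].coef_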