Rebuilding Regression#
This notebook shows how to learn regression parameters using SciPy’s optimization functions. The point is to understand the concept of minimizing a loss function.
Setup#
We’re going to import our modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as sps
import scipy.optimize as spopt
import statsmodels.formula.api as smf
rng = np.random.default_rng(20201017)
We’re going to use the Penguins data:
penguins = pd.read_csv('penguins.csv')
penguins.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
I’m also going to create a utility function for plotting lines:
def plot_line(intercept, slope, xmin, xmax, **kwargs):
xs = np.linspace(xmin, xmax, 100)
ys = intercept + slope * xs
plt.plot(xs, ys, **kwargs)
Now let’s standardize the variables, like we did in Week 8:
def zscore(xs):
xbar = xs.mean()
s = xs.std()
return (xs - xbar) / s
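As a quick, illustrative sanity check, a standardized variable should have a mean of approximately 0 and a standard deviation of approximately 1:
# mean should be ~0 (up to floating-point error) and std should be 1
z = zscore(penguins['flipper_length_mm'])
z.mean(), z.std()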
And create standardized versions of our variables:
penguin_std = pd.DataFrame({
'species': penguins['species'],
'sex': penguins['sex'],
'mass': zscore(penguins['body_mass_g']),
'flipper': zscore(penguins['flipper_length_mm']),
'bill_len': zscore(penguins['bill_length_mm']),
'bill_depth': zscore(penguins['bill_depth_mm'])
})
penguin_std
| | species | sex | mass | flipper | bill_len | bill_depth |
|---|---|---|---|---|---|---|
| 0 | Adelie | male | -0.563317 | -1.416272 | -0.883205 | 0.784300 |
| 1 | Adelie | female | -0.500969 | -1.060696 | -0.809939 | 0.126003 |
| 2 | Adelie | female | -1.186793 | -0.420660 | -0.663408 | 0.429833 |
| 3 | Adelie | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | female | -0.937403 | -0.562890 | -1.322799 | 1.088129 |
| ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | male | -0.251578 | 0.432721 | 2.175637 | 1.341320 |
| 340 | Chinstrap | female | -0.999750 | 0.077145 | -0.077282 | 0.480471 |
| 341 | Chinstrap | male | -0.532143 | -0.562890 | 1.040019 | 0.531109 |
| 342 | Chinstrap | male | -0.126883 | 0.646066 | 1.259816 | 0.936215 |
| 343 | Chinstrap | female | -0.532143 | -0.207315 | 1.149917 | 0.784300 |
344 rows × 6 columns
Building Regression#
OLS stands for ordinary least squares. That is, it finds the solution that minimizes \(\mathrm{SS}_{\mathrm{resid}}\), the sum of squared errors on the training data.
Remember that the linear model looks like this:

\[ \hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} \]
We can frame the problem of fitting a linear model as minimizing a loss function:

\[ L(\beta) = \mathrm{SS}_{\mathrm{resid}} = \sum_i \left(y_i - \hat{y}_i\right)^2 \]
Before optimizing this, though, we need a function that uses coefficients to predict mass from flipper length. This function takes its parameters as a list or array, for compatibility with the APIs we’ll see in a moment:
def predict_mass_uv(params, data=penguin_std):
b0, b1 = params
return b0 + data['flipper'] * b1
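For example, we can call it with made-up coefficients (an intercept of 0 and a slope of 0.5, chosen arbitrarily just to exercise the interface):
# hypothetical parameters, only to demonstrate the call signature
predict_mass_uv([0.0, 0.5]).head()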
Now we can predict mass from flipper length. Let’s fit a Statsmodels linear model to get real parameters so we can test this function:
uv_mod = smf.ols('mass ~ flipper', data=penguin_std)
uv_fit = uv_mod.fit()
uv_fit.summary()
| Dep. Variable: | mass | R-squared: | 0.759 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.758 |
| Method: | Least Squares | F-statistic: | 1071. |
| Date: | Sat, 17 Oct 2020 | Prob (F-statistic): | 4.37e-107 |
| Time: | 19:54:18 | Log-Likelihood: | -241.46 |
| No. Observations: | 342 | AIC: | 486.9 |
| Df Residuals: | 340 | BIC: | 494.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.939e-17 | 0.027 | 2.61e-15 | 1.000 | -0.052 | 0.052 |
| flipper | 0.8712 | 0.027 | 32.722 | 0.000 | 0.819 | 0.924 |

| Omnibus: | 5.634 | Durbin-Watson: | 2.190 |
|---|---|---|---|
| Prob(Omnibus): | 0.060 | Jarque-Bera (JB): | 5.585 |
| Skew: | 0.313 | Prob(JB): | 0.0613 |
| Kurtosis: | 3.019 | Cond. No. | 1.00 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
test_preds = predict_mass_uv(uv_fit.params.values)
test_preds
0 -1.233858
1 -0.924080
2 -0.366480
3 NaN
4 -0.490391
...
339 0.376987
340 0.067209
341 -0.490391
342 0.562854
343 -0.180613
Name: flipper, Length: 344, dtype: float64
Do these match the Statsmodels predictions?
uv_fit.fittedvalues
0 -1.233858
1 -0.924080
2 -0.366480
4 -0.490391
5 -0.676258
...
339 0.376987
340 0.067209
341 -0.490391
342 0.562854
343 -0.180613
Length: 342, dtype: float64
Yes! They match (Statsmodels drops the rows with missing data, which is why it has 342 fitted values instead of 344). Our linear model function is working correctly.
Now let’s make a loss function that computes the loss of a model with these parameters:
def squared_loss_uv(params, data=penguin_std):
preds = predict_mass_uv(params, data)
err = data['mass'] - preds
return np.sum(np.square(err))
What’s the squared error with our data?
squared_loss_uv(uv_fit.params.values)
82.18355089922281
That should equal the residual sum of squares from Statsmodels, if we’re writing this code correctly:
uv_fit.ssr
82.18355089922281
Optimization#
OK. We now have code that can compute the model’s predictions and the sum of squared errors (the loss) for any set of parameters.

Now we want to solve for the parameters that minimize the loss. The scipy.optimize.minimize() function searches for a parameter vector (in this case, two coefficients) that minimizes a function (in this case, our loss function).
It needs two things:
The loss function
An initial guess for the parameters
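Before applying it to our regression, here is a toy illustration of the API (not part of our analysis): we minimize \((x - 3)^2\), whose minimum is at \(x = 3\).
# minimize (x - 3)^2 starting from an initial guess of 0;
# the result should be very close to 3
toy = spopt.minimize(lambda x: (x[0] - 3) ** 2, [0.0])
toy.x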
We’ll use standard normals for the parameter guess - since we have standardized variables, this will work pretty well.
init_guess = rng.standard_normal(2)
init_guess
array([-0.18652471, 0.9562162 ])
And now we minimize the loss function:
params = spopt.minimize(squared_loss_uv, init_guess)
params
fun: 82.18355089922284
hess_inv: array([[0.17155552, 0.37538334],
[0.37538334, 0.82990841]])
jac: array([ 6.67572021e-06, -1.90734863e-06])
message: 'Optimization terminated successfully.'
nfev: 15
nit: 2
njev: 5
status: 0
success: True
x: array([1.12829633e-09, 8.71201757e-01])
The x entry holds the optimized parameter values; fun holds the smallest value of the function that was found. That’s really close to our residual sum of squares from the OLS model! The parameters are also pretty close. Let’s put them side by side:
pd.DataFrame({
'OLS': uv_fit.params,
'SPOpt': params.x
})
| | OLS | SPOpt |
|---|---|---|
| Intercept | 6.938894e-17 | 1.128296e-09 |
| flipper | 8.712018e-01 | 8.712018e-01 |
The coefficient in particular is (nearly) identical, and the intercept is a little different but is essentially 0 in both cases.
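Since we defined plot_line in the setup, we can also check the result visually; this is an illustrative plot of the standardized data with the line given by the optimized parameters.
# scatter the standardized data, then draw the optimized regression line
plt.scatter(penguin_std['flipper'], penguin_std['mass'], alpha=0.3)
plot_line(params.x[0], params.x[1], -2.5, 2.5, color='red')
plt.xlabel('flipper (standardized)')
plt.ylabel('mass (standardized)')
plt.show()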
It’s not going to be exactly the same, because Statsmodels OLS uses an exact solution to the least squares problem, while what we get from minimize is an approximate, iterative solution. Approximate solutions like this will be good enough for many other problems!
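For reference, the exact solution can be computed directly with a linear-algebra least-squares routine. Here is a sketch using NumPy’s lstsq (this is not necessarily how Statsmodels solves it internally):
# drop rows with missing values, build a design matrix with an intercept
# column, and solve the least-squares problem exactly
complete = penguin_std.dropna(subset=['mass', 'flipper'])
X = np.column_stack([np.ones(len(complete)), complete['flipper']])
exact_coefs, ssr, rank, sv = np.linalg.lstsq(X, complete['mass'], rcond=None)
exact_coefs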
Multivariate Regression#
Now let’s learn the multivariate model. In the regression notebook, the model we settled on was:
mass ~ flipper + species * sex
This expands to:

\[ \hat{y} = \beta_0 + \beta_1 x_{\mathrm{flipper}} + \beta_2 x_{\mathrm{Chinstrap}} + \beta_3 x_{\mathrm{Gentoo}} + \beta_4 x_{\mathrm{male}} + \beta_5 x_{\mathrm{Chinstrap}} x_{\mathrm{male}} + \beta_6 x_{\mathrm{Gentoo}} x_{\mathrm{male}} \]

where the species and sex variables are dummy-coded 0/1 indicators.
We’re going to start by dummy-coding our data using pandas.get_dummies():
penguin_all = penguin_std.join(pd.get_dummies(penguin_std['species'], drop_first=True))
penguin_all = penguin_all.join(pd.get_dummies(penguin_std['sex'], drop_first=True))
penguin_all
| | species | sex | mass | flipper | bill_len | bill_depth | Chinstrap | Gentoo | male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | male | -0.563317 | -1.416272 | -0.883205 | 0.784300 | 0 | 0 | 1 |
| 1 | Adelie | female | -0.500969 | -1.060696 | -0.809939 | 0.126003 | 0 | 0 | 0 |
| 2 | Adelie | female | -1.186793 | -0.420660 | -0.663408 | 0.429833 | 0 | 0 | 0 |
| 3 | Adelie | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 4 | Adelie | female | -0.937403 | -0.562890 | -1.322799 | 1.088129 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | male | -0.251578 | 0.432721 | 2.175637 | 1.341320 | 1 | 0 | 1 |
| 340 | Chinstrap | female | -0.999750 | 0.077145 | -0.077282 | 0.480471 | 1 | 0 | 0 |
| 341 | Chinstrap | male | -0.532143 | -0.562890 | 1.040019 | 0.531109 | 1 | 0 | 1 |
| 342 | Chinstrap | male | -0.126883 | 0.646066 | 1.259816 | 0.936215 | 1 | 0 | 1 |
| 343 | Chinstrap | female | -0.532143 | -0.207315 | 1.149917 | 0.784300 | 1 | 0 | 0 |
344 rows × 9 columns
Now let’s set up our prediction function. We’re going to vectorize this: if we do a matrix multiply between an \(n \times k\) matrix and a \(k\)-dimensional array, we get a length-\(n\) array of predictions:
def pred_mass_mv(params, data=penguin_all):
intercept = params[0]
slopes = params[1:]
cols = data[['flipper', 'Chinstrap', 'Gentoo', 'male']].copy()
cols['Chinstrap:male'] = cols['Chinstrap'] * cols['male']
cols['Gentoo:male'] = cols['Gentoo'] * cols['male']
return intercept + cols @ slopes
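A quick, illustrative shape check: even with all-zero parameters, we get one prediction per row of the data.
# an n-by-6 design matrix times a length-6 slope vector gives n predictions
pred_mass_mv(np.zeros(7)).shape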
Now let’s make a guess:
init_guess = rng.standard_normal(7)
init_guess
array([ 0.30068623, -0.46800143, 1.05910917, -1.55308712, 0.06762369,
0.47486707, -0.72624577])
Let’s test our prediction function; it should give us one prediction per row:
pred_mass_mv(init_guess)
0 1.031127
1 0.797094
2 0.497556
3 NaN
4 0.564120
...
339 1.699772
340 1.323691
341 2.165720
342 1.599926
343 1.456819
Length: 344, dtype: float64
Now we’ll define our loss function:
def squared_loss_mv(params, data=penguin_all):
preds = pred_mass_mv(params, data)
err = data['mass'] - preds
return np.sum(np.square(err))
How bad is that loss with our guess?
squared_loss_mv(init_guess)
2150.413368100089
Now let’s minimize the loss function:
mv_opt = spopt.minimize(squared_loss_mv, init_guess)
mv_opt
fun: 44.02326049720256
hess_inv: array([[ 0.01480044, 0.00888065, -0.00903515, -0.02240841, -0.0094197 ,
0.00427484, 0.00407089],
[ 0.00888065, 0.00940691, -0.00278176, -0.0169299 , -0.00317235,
-0.00226389, -0.00251838],
[-0.00903515, -0.00278176, 0.0219765 , 0.01141677, 0.00732328,
-0.02046506, -0.00564343],
[-0.02240841, -0.0169299 , 0.01141677, 0.04498431, 0.01215354,
-0.00234171, -0.01004325],
[-0.0094197 , -0.00317235, 0.00732328, 0.01215354, 0.0143767 ,
-0.01249076, -0.01250795],
[ 0.00427484, -0.00226389, -0.02046506, -0.00234171, -0.01249076,
0.04323162, 0.01386635],
[ 0.00407089, -0.00251838, -0.00564343, -0.01004325, -0.01250795,
0.01386635, 0.03046714]])
jac: array([-0.00000000e+00, 0.00000000e+00, 4.76837158e-07, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00])
message: 'Optimization terminated successfully.'
nfev: 128
nit: 13
njev: 16
status: 0
success: True
x: array([-0.68863129, 0.35659433, 0.08029619, 0.97323106, 0.70694806,
-0.40084566, 0.0853413 ])
This gives us an optimized residual sum of squares of 44! If we compare with the regression notebook, the parameters are pretty close! They aren’t quite in the same order, so we have to be careful, but this is clearly an approximation of the same model.
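To make that comparison easier, we can label the optimized coefficients with the columns our prediction function uses; this is a small sketch, and the names won’t exactly match the Statsmodels output.
# attach names to the optimized parameters in the order pred_mass_mv expects
pd.Series(mv_opt.x, index=[
    'intercept', 'flipper', 'Chinstrap', 'Gentoo', 'male',
    'Chinstrap:male', 'Gentoo:male'
])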