Rebuilding Regression
This notebook shows how to learn regression parameters using SciPy's optimization functions. The purpose is to illustrate the core concept of fitting a model by minimizing a loss function.
Setup
We're going to import our modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as sps
import scipy.optimize as spopt
import statsmodels.formula.api as smf
rng = np.random.default_rng(20201017)
We're going to use the Penguins data:
penguins = pd.read_csv('penguins.csv')
penguins.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
I'm also going to create a utility function for plotting lines:
def plot_line(intercept, slope, xmin, xmax, **kwargs):
xs = np.linspace(xmin, xmax, 100)
ys = intercept + slope * xs
plt.plot(xs, ys, **kwargs)
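As a quick illustration of how we might use it (a hypothetical example, not from the original notebook), we can overlay a line on a scatterplot of the raw data; the intercept and slope here are made-up values purely for demonstration:

# Hypothetical usage of plot_line: scatter the raw penguin data and draw an
# arbitrary line over it. The intercept and slope are illustrative only.
sns.scatterplot(x='flipper_length_mm', y='body_mass_g', data=penguins)
plot_line(-5000, 50, 170, 235, color='red')
plt.show()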
Now let's standardize the variables, like we did in Week 8:
def zscore(xs):
xbar = xs.mean()
s = xs.std()
return (xs - xbar) / s
And create standardized versions of our variables:
penguin_std = pd.DataFrame({
'species': penguins['species'],
'sex': penguins['sex'],
'mass': zscore(penguins['body_mass_g']),
'flipper': zscore(penguins['flipper_length_mm']),
'bill_len': zscore(penguins['bill_length_mm']),
'bill_depth': zscore(penguins['bill_depth_mm'])
})
penguin_std
|     | species | sex | mass | flipper | bill_len | bill_depth |
|-----|---------|-----|------|---------|----------|------------|
| 0   | Adelie | male | -0.563317 | -1.416272 | -0.883205 | 0.784300 |
| 1   | Adelie | female | -0.500969 | -1.060696 | -0.809939 | 0.126003 |
| 2   | Adelie | female | -1.186793 | -0.420660 | -0.663408 | 0.429833 |
| 3   | Adelie | NaN | NaN | NaN | NaN | NaN |
| 4   | Adelie | female | -0.937403 | -0.562890 | -1.322799 | 1.088129 |
| ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | male | -0.251578 | 0.432721 | 2.175637 | 1.341320 |
| 340 | Chinstrap | female | -0.999750 | 0.077145 | -0.077282 | 0.480471 |
| 341 | Chinstrap | male | -0.532143 | -0.562890 | 1.040019 | 0.531109 |
| 342 | Chinstrap | male | -0.126883 | 0.646066 | 1.259816 | 0.936215 |
| 343 | Chinstrap | female | -0.532143 | -0.207315 | 1.149917 | 0.784300 |

344 rows × 6 columns
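As a quick sanity check (a sketch, not in the original notebook), the standardized columns should each have a mean of approximately 0 and a standard deviation of approximately 1:

# The numeric columns should now have mean ~0 and standard deviation ~1
# (up to floating-point error and the rows with missing values).
penguin_std[['mass', 'flipper', 'bill_len', 'bill_depth']].describe().loc[['mean', 'std']]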
Building Regression
OLS stands for ordinary least squares. That is, it finds the solution that minimizes $\mathrm{SS}_{\mathrm{resid}}$, the sum of squared error on the training data.
Remember that the linear model looks like this:
$$\hat{y} = \beta_0 + \beta_1 x_1$$

We can frame the problem of fitting a linear model as minimizing a loss function:

$$\begin{align*} L(\beta_0, \beta_1) & = \sum_i (y_i - \hat{y}_i)^2 \\ \hat{\beta}_0, \hat{\beta}_1 & = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\, L(\beta_0, \beta_1) \end{align*}$$

Before optimizing this, though, we need a function that uses the coefficients to predict mass from flipper length. This function takes its parameters as a single list or array, for compatibility with the optimization APIs we'll see in a moment:
def predict_mass_uv(params, data=penguin_std):
b0, b1 = params
return b0 + data['flipper'] * b1
Now we can predict mass from flipper length. Let's run a Statsmodels linear model to get the parameters so we can test this function:
uv_mod = smf.ols('mass ~ flipper', data=penguin_std)
uv_fit = uv_mod.fit()
uv_fit.summary()
| Dep. Variable: | mass | R-squared: | 0.759 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.758 |
| Method: | Least Squares | F-statistic: | 1071. |
| Date: | Sat, 17 Oct 2020 | Prob (F-statistic): | 4.37e-107 |
| Time: | 19:54:18 | Log-Likelihood: | -241.46 |
| No. Observations: | 342 | AIC: | 486.9 |
| Df Residuals: | 340 | BIC: | 494.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.939e-17 | 0.027 | 2.61e-15 | 1.000 | -0.052 | 0.052 |
| flipper | 0.8712 | 0.027 | 32.722 | 0.000 | 0.819 | 0.924 |

| Omnibus: | 5.634 | Durbin-Watson: | 2.190 |
|---|---|---|---|
| Prob(Omnibus): | 0.060 | Jarque-Bera (JB): | 5.585 |
| Skew: | 0.313 | Prob(JB): | 0.0613 |
| Kurtosis: | 3.019 | Cond. No. | 1.00 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
test_preds = predict_mass_uv(uv_fit.params.values)
test_preds
0     -1.233858
1     -0.924080
2     -0.366480
3           NaN
4     -0.490391
         ...
339    0.376987
340    0.067209
341   -0.490391
342    0.562854
343   -0.180613
Name: flipper, Length: 344, dtype: float64
Do these match the Statsmodels predictions?
uv_fit.fittedvalues
0     -1.233858
1     -0.924080
2     -0.366480
4     -0.490391
5     -0.676258
         ...
339    0.376987
340    0.067209
341   -0.490391
342    0.562854
343   -0.180613
Length: 342, dtype: float64
Yes! Statsmodels dropped the rows with missing data, but the predictions it did make match ours. Our linear model function is working correctly.
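We can also check this programmatically (a quick sketch, not in the original notebook); since Statsmodels dropped the rows with missing data, we align our predictions on its index before comparing:

# Align our predictions to the rows Statsmodels kept, then compare values.
aligned_preds = test_preds.loc[uv_fit.fittedvalues.index]
np.allclose(aligned_preds, uv_fit.fittedvalues)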
Now let's make a loss function that computes the loss of a model with these parameters:
def squared_loss_uv(params, data=penguin_std):
preds = predict_mass_uv(params, data)
err = data['mass'] - preds
return np.sum(np.square(err))
What's the squared error with our data?
squared_loss_uv(uv_fit.params.values)
82.18355089922281
That should equal the residual sum of squares from Statsmodels, if we're writing this code correctly:
uv_fit.ssr
82.18355089922281
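We can verify this programmatically too (a quick sketch, not in the original notebook):

# The two sums of squares should agree to floating-point precision.
np.isclose(squared_loss_uv(uv_fit.params.values), uv_fit.ssr)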
Optimization
OK. We now have code that correctly computes the model predictions and the sum of squared error (the loss) we want to minimize.
Now we want to solve. Let's do that. The scipy.optimize.minimize
function searches for a parameter vector (in this case, two coefficients) that minimizes a function (in this case, our loss function).
It needs two things:
- The loss function
- An initial guess for the parameters
We'll use standard normals for the parameter guess - since we have standardized variables, this will work pretty well.
init_guess = rng.standard_normal(2)
init_guess
array([-0.18652471, 0.9562162 ])
And now we minimize the loss function:
params = spopt.minimize(squared_loss_uv, init_guess)
params
      fun: 82.18355089922284
 hess_inv: array([[0.17155552, 0.37538334],
       [0.37538334, 0.82990841]])
      jac: array([ 6.67572021e-06, -1.90734863e-06])
  message: 'Optimization terminated successfully.'
     nfev: 15
      nit: 2
     njev: 5
   status: 0
  success: True
        x: array([1.12829633e-09, 8.71201757e-01])
The x attribute has the optimized parameter values; fun has the smallest value of the function that was found. That's really close to our residual sum of squares from the OLS model! The parameters are pretty close too. Let's put them side by side:
pd.DataFrame({
'OLS': uv_fit.params,
'SPOpt': params.x
})
|           | OLS | SPOpt |
|-----------|-----|-------|
| Intercept | 6.938894e-17 | 1.128296e-09 |
| flipper   | 8.712018e-01 | 8.712018e-01 |
The coefficient in particular is (nearly) identical, and the intercept is a little different but is essentially 0 in both cases.
It's not going to be quite exact, because Statsmodels OLS uses an exact solution for the least squares problem, while what we get with minimize
is an approximate, numerical solution. Approximate solutions will serve us well for many other problems that don't have exact solutions!
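For comparison, here's a minimal sketch (not in the original notebook) of computing the exact least-squares solution directly with NumPy, assuming the same standardized data with missing rows dropped:

# Build a design matrix with an explicit intercept column and solve exactly.
uv_data = penguin_std[['mass', 'flipper']].dropna()
X = np.column_stack([np.ones(len(uv_data)), uv_data['flipper']])
exact_coefs, *_ = np.linalg.lstsq(X, uv_data['mass'], rcond=None)
exact_coefs  # should match uv_fit.params and params.x very closely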
Multivariate Regression
Now let's learn the multivariate model. The model we settled on in the regression notebook was:
mass ~ flipper + species * sex
This expands to:
$$\hat{y} = \beta_0 + \beta_1 x_f + \beta_2 x_c + \beta_3 x_g + \beta_4 x_m + \beta_5 x_c x_m + \beta_6 x_g x_m$$

We're going to start by dummy-coding our data:
penguin_all = penguin_std.join(pd.get_dummies(penguin_std['species'], drop_first=True))
penguin_all = penguin_all.join(pd.get_dummies(penguin_std['sex'], drop_first=True))
penguin_all
|     | species | sex | mass | flipper | bill_len | bill_depth | Chinstrap | Gentoo | male |
|-----|---------|-----|------|---------|----------|------------|-----------|--------|------|
| 0   | Adelie | male | -0.563317 | -1.416272 | -0.883205 | 0.784300 | 0 | 0 | 1 |
| 1   | Adelie | female | -0.500969 | -1.060696 | -0.809939 | 0.126003 | 0 | 0 | 0 |
| 2   | Adelie | female | -1.186793 | -0.420660 | -0.663408 | 0.429833 | 0 | 0 | 0 |
| 3   | Adelie | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 4   | Adelie | female | -0.937403 | -0.562890 | -1.322799 | 1.088129 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | male | -0.251578 | 0.432721 | 2.175637 | 1.341320 | 1 | 0 | 1 |
| 340 | Chinstrap | female | -0.999750 | 0.077145 | -0.077282 | 0.480471 | 1 | 0 | 0 |
| 341 | Chinstrap | male | -0.532143 | -0.562890 | 1.040019 | 0.531109 | 1 | 0 | 1 |
| 342 | Chinstrap | male | -0.126883 | 0.646066 | 1.259816 | 0.936215 | 1 | 0 | 1 |
| 343 | Chinstrap | female | -0.532143 | -0.207315 | 1.149917 | 0.784300 | 1 | 0 | 0 |

344 rows × 9 columns
Now let's set up our prediction function. We're going to vectorize this - if we do a matrix multiply between an $n \times k$ matrix and a $k$-dimensional array, we'll get an $n$-dimensional array of predictions, one per row:
def pred_mass_mv(params, data=penguin_all):
    # The first parameter is the intercept; the rest are slopes.
    intercept = params[0]
    slopes = params[1:]
    # Predictor columns: flipper length plus the species and sex dummies.
    cols = data[['flipper', 'Chinstrap', 'Gentoo', 'male']].copy()
    # Interaction terms from the species:sex interaction.
    cols['Chinstrap:male'] = cols['Chinstrap'] * cols['male']
    cols['Gentoo:male'] = cols['Gentoo'] * cols['male']
    # Matrix-vector product: one prediction per row.
    return intercept + cols @ slopes
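An equivalent formulation (a sketch under the same assumptions, not from the original notebook) builds the full $n \times (k+1)$ design matrix explicitly, with the intercept as a column of ones, and multiplies by the whole parameter vector at once:

# Hypothetical alternative: an explicit NumPy design matrix.
def pred_mass_mv_matrix(params, data=penguin_all):
    # Columns: intercept, flipper, species dummies, sex dummy, interactions.
    X = np.column_stack([
        np.ones(len(data)),
        data['flipper'],
        data['Chinstrap'],
        data['Gentoo'],
        data['male'],
        data['Chinstrap'] * data['male'],
        data['Gentoo'] * data['male'],
    ])
    return X @ params  # plain NumPy array, one prediction per row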
Now let's make a guess:
init_guess = rng.standard_normal(7)
init_guess
array([ 0.30068623, -0.46800143, 1.05910917, -1.55308712, 0.06762369, 0.47486707, -0.72624577])
Let's test our prediction function - it should give us one prediction per row:
pred_mass_mv(init_guess)
0      1.031127
1      0.797094
2      0.497556
3           NaN
4      0.564120
         ...
339    1.699772
340    1.323691
341    2.165720
342    1.599926
343    1.456819
Length: 344, dtype: float64
Now we'll define our loss function:
def squared_loss_mv(params, data=penguin_all):
preds = pred_mass_mv(params, data)
err = data['mass'] - preds
return np.sum(np.square(err))
How bad is that loss with our guess?
squared_loss_mv(init_guess)
2150.413368100089
Now let's minimize the loss:
mv_opt = spopt.minimize(squared_loss_mv, init_guess)
mv_opt
      fun: 44.02326049720256
 hess_inv: array([[ 0.01480044,  0.00888065, -0.00903515, -0.02240841, -0.0094197 ,  0.00427484,  0.00407089],
       [ 0.00888065,  0.00940691, -0.00278176, -0.0169299 , -0.00317235, -0.00226389, -0.00251838],
       [-0.00903515, -0.00278176,  0.0219765 ,  0.01141677,  0.00732328, -0.02046506, -0.00564343],
       [-0.02240841, -0.0169299 ,  0.01141677,  0.04498431,  0.01215354, -0.00234171, -0.01004325],
       [-0.0094197 , -0.00317235,  0.00732328,  0.01215354,  0.0143767 , -0.01249076, -0.01250795],
       [ 0.00427484, -0.00226389, -0.02046506, -0.00234171, -0.01249076,  0.04323162,  0.01386635],
       [ 0.00407089, -0.00251838, -0.00564343, -0.01004325, -0.01250795,  0.01386635,  0.03046714]])
      jac: array([-0.00000000e+00,  0.00000000e+00,  4.76837158e-07,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00])
  message: 'Optimization terminated successfully.'
     nfev: 128
      nit: 13
     njev: 16
   status: 0
  success: True
        x: array([-0.68863129,  0.35659433,  0.08029619,  0.97323106,  0.70694806, -0.40084566,  0.0853413 ])
This gives us an optimized residual sum of squares of 44! If we compare with the regression notebook, the parameters are pretty close! They aren't quite in the same order, so we have to be careful, but this is clearly an approximation of the same model.
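To line the parameters up directly, one option (a sketch, not run in the original notebook) is to fit the same formula with Statsmodels and compare its coefficients to mv_opt.x, keeping the difference in parameter order in mind:

# Fit the same formula with Statsmodels and inspect its coefficients.
mv_fit = smf.ols('mass ~ flipper + species * sex', data=penguin_std).fit()
mv_fit.params  # compare with mv_opt.x (the ordering differs)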