Rebuilding Regression

This notebook shows how to learn regression parameters using SciPy’s optimization functions. The purpose here is to learn the concept of minimizing a loss function.

Setup

We’re going to import our modules.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as sps
import scipy.optimize as spopt
import statsmodels.formula.api as smf
rng = np.random.default_rng(20201017)

We’re going to use the Penguins data:

penguins = pd.read_csv('penguins.csv')
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007

I’m also going to create a utility function for plotting lines:

def plot_line(intercept, slope, xmin, xmax, **kwargs):
    xs = np.linspace(xmin, xmax, 100)
    ys = intercept + slope * xs
    plt.plot(xs, ys, **kwargs)

Now let’s standardize the variables, like we did in Week 8:

def zscore(xs):
    xbar = xs.mean()
    s = xs.std()
    return (xs - xbar) / s

And create standardized versions of our variables:

penguin_std = pd.DataFrame({
    'species': penguins['species'],
    'sex': penguins['sex'],
    'mass': zscore(penguins['body_mass_g']),
    'flipper': zscore(penguins['flipper_length_mm']),
    'bill_len': zscore(penguins['bill_length_mm']),
    'bill_depth': zscore(penguins['bill_depth_mm'])
})
penguin_std
species sex mass flipper bill_len bill_depth
0 Adelie male -0.563317 -1.416272 -0.883205 0.784300
1 Adelie female -0.500969 -1.060696 -0.809939 0.126003
2 Adelie female -1.186793 -0.420660 -0.663408 0.429833
3 Adelie NaN NaN NaN NaN NaN
4 Adelie female -0.937403 -0.562890 -1.322799 1.088129
... ... ... ... ... ... ...
339 Chinstrap male -0.251578 0.432721 2.175637 1.341320
340 Chinstrap female -0.999750 0.077145 -0.077282 0.480471
341 Chinstrap male -0.532143 -0.562890 1.040019 0.531109
342 Chinstrap male -0.126883 0.646066 1.259816 0.936215
343 Chinstrap female -0.532143 -0.207315 1.149917 0.784300

344 rows × 6 columns

Building Regression

OLS stands for ordinary least squares. That is, it finds the solution that minimizes \(\mathrm{SS}_{\mathrm{resid}}\), the sum of squared error (on the test data).

Remember that the linear model looks like this:

\[\hat{y} = \beta_0 + \beta_1 x_1\]

We can frame the problem of fitting a linear model as minimizing a loss function:

\[\begin{split}\begin{align*} L(\beta_0, \beta_1) & = \sum_i (y_i - \hat{y}_i)^2 \\ \beta_0, \beta_1 & = \underset{\beta_0, \beta_1}{\operatorname{argmin}} L(\beta_0, \beta_1) \end{align*}\end{split}\]

Before optimizing this, though, we need a function that will use coefficients to preidct the mass with flipper length. This function will take its parameters as a list or array, for compatibility with APIs we’ll see in a moment:

def predict_mass_uv(params, data=penguin_std):
    b0, b1 = params
    return b0 + data['flipper'] * b1

Now we can predict flipper length. Let’s run a Statsmodels linear model to get the parameters so we can test this function:

uv_mod = smf.ols('mass ~ flipper', data=penguin_std)
uv_fit = uv_mod.fit()
uv_fit.summary()
OLS Regression Results
Dep. Variable: mass R-squared: 0.759
Model: OLS Adj. R-squared: 0.758
Method: Least Squares F-statistic: 1071.
Date: Sat, 17 Oct 2020 Prob (F-statistic): 4.37e-107
Time: 19:54:18 Log-Likelihood: -241.46
No. Observations: 342 AIC: 486.9
Df Residuals: 340 BIC: 494.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.939e-17 0.027 2.61e-15 1.000 -0.052 0.052
flipper 0.8712 0.027 32.722 0.000 0.819 0.924
Omnibus: 5.634 Durbin-Watson: 2.190
Prob(Omnibus): 0.060 Jarque-Bera (JB): 5.585
Skew: 0.313 Prob(JB): 0.0613
Kurtosis: 3.019 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
test_preds = predict_mass_uv(uv_fit.params.values)
test_preds
0     -1.233858
1     -0.924080
2     -0.366480
3           NaN
4     -0.490391
         ...   
339    0.376987
340    0.067209
341   -0.490391
342    0.562854
343   -0.180613
Name: flipper, Length: 344, dtype: float64

Do these match the Statsmodels predictions?

uv_fit.fittedvalues
0     -1.233858
1     -0.924080
2     -0.366480
4     -0.490391
5     -0.676258
         ...   
339    0.376987
340    0.067209
341   -0.490391
342    0.562854
343   -0.180613
Length: 342, dtype: float64

Yes! Our linear model function is working correctly.

Now let’s make a loss function that computes the loss of a model with these parameters:

def squared_loss_uv(params, data=penguin_std):
    preds = predict_mass_uv(params, data)
    err = data['mass'] - preds
    return np.sum(np.square(err))

What’s the squared error with our data?

squared_loss_uv(uv_fit.params.values)
82.18355089922281

That should equal the residual sum of squares from Statsmodels, if we’re writing this code correctly:

uv_fit.ssr
82.18355089922281

Optimization

Ok. We now have code that can correctly compute the sum of squares (loss), and the model predictions we’re trying to optimize.

Now we want to solve. Let’s do that. The scipy.optimize.minimize() function searches for a parameter vector (in this case, two coefficients) that minimizes a function (in this case, our loss function).

It needs two things:

  1. The loss function

  2. An initial guess for the parameters

We’ll use standard normals for the parameter guess - since we have standardized variables, this will work pretty well.

init_guess = rng.standard_normal(2)
init_guess
array([-0.18652471,  0.9562162 ])

And now we minimize the loss function:

params = spopt.minimize(squared_loss_uv, init_guess)
params
      fun: 82.18355089922284
 hess_inv: array([[0.17155552, 0.37538334],
       [0.37538334, 0.82990841]])
      jac: array([ 6.67572021e-06, -1.90734863e-06])
  message: 'Optimization terminated successfully.'
     nfev: 15
      nit: 2
     njev: 5
   status: 0
  success: True
        x: array([1.12829633e-09, 8.71201757e-01])

The x parameter has the optimized parameter values; fun has the smallest value of the function found. That’s really close to our residual sum of squares from the OLS model! The parameters are also pretty close too. Let’s put them side by side:

pd.DataFrame({
    'OLS': uv_fit.params,
    'SPOpt': params.x
})
OLS SPOpt
Intercept 6.938894e-17 1.128296e-09
flipper 8.712018e-01 8.712018e-01

The coefficient in particular is (nearly) identical, and the intercept is a little different but is essentially 0 in both cases.

It’s not going to be quite exact, because Statsmodels OLS uses an exact solution for the least squares problem, while what we get with minimize is an approximate solution. Approximate solutions will be good for many other problems!

Multivariate Regression

Now let’s learn the multivariate model. From the regression model, the one we settled on was:

mass ~ flipper + species * sex

This expands to:

\[\hat{y} = \beta_0 + \beta_1 x_f + \beta_2 x_c + \beta_3 x_g + \beta_4 x_m + \beta_5 x_c x_m + \beta_6 x_g x_m\]

We’re going to start by dummy-coding our data using pandas.get_dummies():

penguin_all = penguin_std.join(pd.get_dummies(penguin_std['species'], drop_first=True))
penguin_all = penguin_all.join(pd.get_dummies(penguin_std['sex'], drop_first=True))
penguin_all
species sex mass flipper bill_len bill_depth Chinstrap Gentoo male
0 Adelie male -0.563317 -1.416272 -0.883205 0.784300 0 0 1
1 Adelie female -0.500969 -1.060696 -0.809939 0.126003 0 0 0
2 Adelie female -1.186793 -0.420660 -0.663408 0.429833 0 0 0
3 Adelie NaN NaN NaN NaN NaN 0 0 0
4 Adelie female -0.937403 -0.562890 -1.322799 1.088129 0 0 0
... ... ... ... ... ... ... ... ... ...
339 Chinstrap male -0.251578 0.432721 2.175637 1.341320 1 0 1
340 Chinstrap female -0.999750 0.077145 -0.077282 0.480471 1 0 0
341 Chinstrap male -0.532143 -0.562890 1.040019 0.531109 1 0 1
342 Chinstrap male -0.126883 0.646066 1.259816 0.936215 1 0 1
343 Chinstrap female -0.532143 -0.207315 1.149917 0.784300 1 0 0

344 rows × 9 columns

Now let’s set up our prediction function. We’re going to vectorize this - if we do a matrix multiply between a \(n \times k\) matrix and a \(k\)-dimensional array, we’ll get an \(n \times 1\) array of predictions:

def pred_mass_mv(params, data=penguin_all):
    intercept = params[0]
    slopes = params[1:]
    cols = data[['flipper', 'Chinstrap', 'Gentoo', 'male']].copy()
    cols['Chinstrap:male'] = cols['Chinstrap'] * cols['male']
    cols['Gentoo:male'] = cols['Gentoo'] * cols['male']
    return intercept + cols @ slopes

Now let’s make a guess:

init_guess = rng.standard_normal(7)
init_guess
array([ 0.30068623, -0.46800143,  1.05910917, -1.55308712,  0.06762369,
        0.47486707, -0.72624577])

Test our prediction function - should give us predictions:

pred_mass_mv(init_guess)
0      1.031127
1      0.797094
2      0.497556
3           NaN
4      0.564120
         ...   
339    1.699772
340    1.323691
341    2.165720
342    1.599926
343    1.456819
Length: 344, dtype: float64

Now we’ll define our loss function:

def squared_loss_mv(params, data=penguin_all):
    preds = pred_mass_mv(params, data)
    err = data['mass'] - preds
    return np.sum(np.square(err))

How bad is that loss with our guess?

squared_loss_mv(init_guess)
2150.413368100089

Now let’s minimize the parameters:

mv_opt = spopt.minimize(squared_loss_mv, init_guess)
mv_opt
      fun: 44.02326049720256
 hess_inv: array([[ 0.01480044,  0.00888065, -0.00903515, -0.02240841, -0.0094197 ,
         0.00427484,  0.00407089],
       [ 0.00888065,  0.00940691, -0.00278176, -0.0169299 , -0.00317235,
        -0.00226389, -0.00251838],
       [-0.00903515, -0.00278176,  0.0219765 ,  0.01141677,  0.00732328,
        -0.02046506, -0.00564343],
       [-0.02240841, -0.0169299 ,  0.01141677,  0.04498431,  0.01215354,
        -0.00234171, -0.01004325],
       [-0.0094197 , -0.00317235,  0.00732328,  0.01215354,  0.0143767 ,
        -0.01249076, -0.01250795],
       [ 0.00427484, -0.00226389, -0.02046506, -0.00234171, -0.01249076,
         0.04323162,  0.01386635],
       [ 0.00407089, -0.00251838, -0.00564343, -0.01004325, -0.01250795,
         0.01386635,  0.03046714]])
      jac: array([-0.00000000e+00,  0.00000000e+00,  4.76837158e-07,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00])
  message: 'Optimization terminated successfully.'
     nfev: 128
      nit: 13
     njev: 16
   status: 0
  success: True
        x: array([-0.68863129,  0.35659433,  0.08029619,  0.97323106,  0.70694806,
       -0.40084566,  0.0853413 ])

This gives us an optimized residual sum of squares of 44! If we compare with the regression notebook, the parameters are pretty close! They aren’t quite in the same order, so we have to be careful, but this is clearly an approximation of the same model.