Rebuilding Regression
This notebook shows how to learn regression parameters using SciPy's optimization functions. The purpose is to illustrate the core concept of fitting a model by minimizing a loss function.
Setup
We're going to import our modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as sps
import scipy.optimize as spopt
import statsmodels.formula.api as smf
rng = np.random.default_rng(20201017)
We're going to use the Penguins data:
penguins = pd.read_csv('penguins.csv')
penguins.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
I'm also going to create a utility function for plotting lines:
def plot_line(intercept, slope, xmin, xmax, **kwargs):
xs = np.linspace(xmin, xmax, 100)
ys = intercept + slope * xs
plt.plot(xs, ys, **kwargs)
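As a quick illustration of how we might use it (a hypothetical example, not from the original notebook), we can overlay a line on a scatterplot of the raw data; the intercept and slope here are made-up values purely for demonstration:

# Hypothetical usage of plot_line: scatter the raw penguin data and draw an
# arbitrary line over it. The intercept and slope are illustrative only.
sns.scatterplot(x='flipper_length_mm', y='body_mass_g', data=penguins)
plot_line(-5000, 50, 170, 235, color='red')
plt.show()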
Now let's standardize the variables, like we did in Week 8:
def zscore(xs):
xbar = xs.mean()
s = xs.std()
return (xs - xbar) / s
And create standardized versions of our variables:
penguin_std = pd.DataFrame({
'species': penguins['species'],
'sex': penguins['sex'],
'mass': zscore(penguins['body_mass_g']),
'flipper': zscore(penguins['flipper_length_mm']),
'bill_len': zscore(penguins['bill_length_mm']),
'bill_depth': zscore(penguins['bill_depth_mm'])
})
penguin_std
|     | species | sex | mass | flipper | bill_len | bill_depth |
|-----|---------|-----|------|---------|----------|------------|
| 0   | Adelie | male | -0.563317 | -1.416272 | -0.883205 | 0.784300 |
| 1   | Adelie | female | -0.500969 | -1.060696 | -0.809939 | 0.126003 |
| 2   | Adelie | female | -1.186793 | -0.420660 | -0.663408 | 0.429833 |
| 3   | Adelie | NaN | NaN | NaN | NaN | NaN |
| 4   | Adelie | female | -0.937403 | -0.562890 | -1.322799 | 1.088129 |
| ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | male | -0.251578 | 0.432721 | 2.175637 | 1.341320 |
| 340 | Chinstrap | female | -0.999750 | 0.077145 | -0.077282 | 0.480471 |
| 341 | Chinstrap | male | -0.532143 | -0.562890 | 1.040019 | 0.531109 |
| 342 | Chinstrap | male | -0.126883 | 0.646066 | 1.259816 | 0.936215 |
| 343 | Chinstrap | female | -0.532143 | -0.207315 | 1.149917 | 0.784300 |

344 rows × 6 columns
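As a quick sanity check (a sketch, not in the original notebook), the standardized columns should each have a mean of approximately 0 and a standard deviation of approximately 1:

# The numeric columns should now have mean ~0 and standard deviation ~1
# (up to floating-point error and the rows with missing values).
penguin_std[['mass', 'flipper', 'bill_len', 'bill_depth']].describe().loc[['mean', 'std']]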
Building Regression
OLS stands for ordinary least squares. That is, it finds the solution that minimizes $\mathrm{SS}_{\mathrm{resid}}$, the sum of squared error on the training data.
Remember that the linear model looks like this:
$$\hat{y} = \beta_0 + \beta_1 x_1$$

We can frame the problem of fitting a linear model as minimizing a loss function:

$$\begin{align*} L(\beta_0, \beta_1) & = \sum_i (y_i - \hat{y}_i)^2 \\ \hat{\beta}_0, \hat{\beta}_1 & = \underset{\beta_0, \beta_1}{\operatorname{argmin}}\, L(\beta_0, \beta_1) \end{align*}$$

Before optimizing this, though, we need a function that uses the coefficients to predict mass from flipper length. This function takes its parameters as a single list or array, for compatibility with the optimization APIs we'll see in a moment:
def predict_mass_uv(params, data=penguin_std):
b0, b1 = params
return b0 + data['flipper'] * b1
Now we can predict mass from flipper length. Let's run a Statsmodels linear model to get the parameters so we can test this function:
uv_mod = smf.ols('mass ~ flipper', data=penguin_std)
uv_fit = uv_mod.fit()
uv_fit.summary()
| Dep. Variable: | mass | R-squared: | 0.759 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.758 |
| Method: | Least Squares | F-statistic: | 1071. |
| Date: | Sat, 17 Oct 2020 | Prob (F-statistic): | 4.37e-107 |
| Time: | 19:54:18 | Log-Likelihood: | -241.46 |
| No. Observations: | 342 | AIC: | 486.9 |
| Df Residuals: | 340 | BIC: | 494.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.939e-17 | 0.027 | 2.61e-15 | 1.000 | -0.052 | 0.052 |
| flipper | 0.8712 | 0.027 | 32.722 | 0.000 | 0.819 | 0.924 |

| Omnibus: | 5.634 | Durbin-Watson: | 2.190 |
|---|---|---|---|
| Prob(Omnibus): | 0.060 | Jarque-Bera (JB): | 5.585 |
| Skew: | 0.313 | Prob(JB): | 0.0613 |
| Kurtosis: | 3.019 | Cond. No. | 1.00 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
test_preds = predict_mass_uv(uv_fit.params.values)
test_preds
0     -1.233858
1     -0.924080
2     -0.366480
3           NaN
4     -0.490391
         ...
339    0.376987
340    0.067209
341   -0.490391
342    0.562854
343   -0.180613
Name: flipper, Length: 344, dtype: float64
Do these match the Statsmodels predictions?
uv_fit.fittedvalues
0     -1.233858
1     -0.924080
2     -0.366480
4     -0.490391
5     -0.676258
         ...
339    0.376987
340    0.067209
341   -0.490391
342    0.562854
343   -0.180613
Length: 342, dtype: float64
Yes! Statsmodels dropped the rows with missing data, but the predictions it did make match ours. Our linear model function is working correctly.
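We can also check this programmatically (a quick sketch, not in the original notebook); since Statsmodels dropped the rows with missing data, we align our predictions on its index before comparing:

# Align our predictions to the rows Statsmodels kept, then compare values.
aligned_preds = test_preds.loc[uv_fit.fittedvalues.index]
np.allclose(aligned_preds, uv_fit.fittedvalues)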
Now let's make a loss function that computes the loss of a model with these parameters:
def squared_loss_uv(params, data=penguin_std):
preds = predict_mass_uv(params, data)
err = data['mass'] - preds
return np.sum(np.square(err))
What's the squared error with our data?
squared_loss_uv(uv_fit.params.values)
82.18355089922281
That should equal the residual sum of squares from Statsmodels, if we're writing this code correctly:
uv_fit.ssr
82.18355089922281
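We can verify this programmatically too (a quick sketch, not in the original notebook):

# The two sums of squares should agree to floating-point precision.
np.isclose(squared_loss_uv(uv_fit.params.values), uv_fit.ssr)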
Optimization
OK. We now have code that correctly computes the model predictions and the sum of squared error (the loss) we want to minimize.
Now we want to solve. Let's do that. The scipy.optimize.minimize
function searches for a parameter vector (in this case, two coefficients) that minimizes a function (in this case, our loss function).
It needs two things:
- The loss function
- An initial guess for the parameters
We'll use standard normals for the parameter guess - since we have standardized variables, this will work pretty well.
init_guess = rng.standard_normal(2)
init_guess
array([-0.18652471, 0.9562162 ])
And now we minimize the loss function:
params = spopt.minimize(squared_loss_uv, init_guess)
params
      fun: 82.18355089922284
 hess_inv: array([[0.17155552, 0.37538334],
       [0.37538334, 0.82990841]])
      jac: array([ 6.67572021e-06, -1.90734863e-06])
  message: 'Optimization terminated successfully.'
     nfev: 15
      nit: 2
     njev: 5
   status: 0
  success: True
        x: array([1.12829633e-09, 8.71201757e-01])
The x attribute has the optimized parameter values; fun has the smallest value of the function that was found. That's really close to our residual sum of squares from the OLS model! The parameters are pretty close too. Let's put them side by side:
pd.DataFrame({
'OLS': uv_fit.params,
'SPOpt': params.x
})
|           | OLS | SPOpt |
|-----------|-----|-------|
| Intercept | 6.938894e-17 | 1.128296e-09 |
| flipper   | 8.712018e-01 | 8.712018e-01 |
The coefficient in particular is (nearly) identical, and the intercept is a little different but is essentially 0 in both cases.
It's not going to be quite exact, because Statsmodels OLS uses an exact solution for the least squares problem, while what we get with minimize
is an approximate, numerical solution. Approximate solutions will serve us well for many other problems that don't have exact solutions!
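For comparison, here's a minimal sketch (not in the original notebook) of computing the exact least-squares solution directly with NumPy, assuming the same standardized data with missing rows dropped:

# Build a design matrix with an explicit intercept column and solve exactly.
uv_data = penguin_std[['mass', 'flipper']].dropna()
X = np.column_stack([np.ones(len(uv_data)), uv_data['flipper']])
exact_coefs, *_ = np.linalg.lstsq(X, uv_data['mass'], rcond=None)
exact_coefs  # should match uv_fit.params and params.x very closely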
Multivariate Regression
Now let's learn the multivariate model. The model we settled on in the regression notebook was:
mass ~ flipper + species * sex
This expands to:
$$\hat{y} = \beta_0 + \beta_1 x_f + \beta_2 x_c + \beta_3 x_g + \beta_4 x_m + \beta_5 x_c x_m + \beta_6 x_g x_m$$

We're going to start by dummy-coding our data:
penguin_all = penguin_std.join(pd.get_dummies(penguin_std['species'], drop_first=True))
penguin_all = penguin_all.join(pd.get_dummies(penguin_std['sex'], drop_first=True))
penguin_all
|     | species | sex | mass | flipper | bill_len | bill_depth | Chinstrap | Gentoo | male |
|-----|---------|-----|------|---------|----------|------------|-----------|--------|------|
| 0   | Adelie | male | -0.563317 | -1.416272 | -0.883205 | 0.784300 | 0 | 0 | 1 |
| 1   | Adelie | female | -0.500969 | -1.060696 | -0.809939 | 0.126003 | 0 | 0 | 0 |
| 2   | Adelie | female | -1.186793 | -0.420660 | -0.663408 | 0.429833 | 0 | 0 | 0 |
| 3   | Adelie | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 4   | Adelie | female | -0.937403 | -0.562890 | -1.322799 | 1.088129 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | male | -0.251578 | 0.432721 | 2.175637 | 1.341320 | 1 | 0 | 1 |
| 340 | Chinstrap | female | -0.999750 | 0.077145 | -0.077282 | 0.480471 | 1 | 0 | 0 |
| 341 | Chinstrap | male | -0.532143 | -0.562890 | 1.040019 | 0.531109 | 1 | 0 | 1 |
| 342 | Chinstrap | male | -0.126883 | 0.646066 | 1.259816 | 0.936215 | 1 | 0 | 1 |
| 343 | Chinstrap | female | -0.532143 | -0.207315 | 1.149917 | 0.784300 | 1 | 0 | 0 |

344 rows × 9 columns
Now let's set up our prediction function. We're going to vectorize this - if we do a matrix multiply between an $n \times k$ matrix and a $k$-dimensional array, we'll get an $n$-dimensional array of predictions, one per row:
def pred_mass_mv(params, data=penguin_all):
    # The first parameter is the intercept; the rest are slopes.
    intercept = params[0]
    slopes = params[1:]
    # Predictor columns: flipper length plus the species and sex dummies.
    cols = data[['flipper', 'Chinstrap', 'Gentoo', 'male']].copy()
    # Interaction terms from the species:sex interaction.
    cols['Chinstrap:male'] = cols['Chinstrap'] * cols['male']
    cols['Gentoo:male'] = cols['Gentoo'] * cols['male']
    # Matrix-vector product: one prediction per row.
    return intercept + cols @ slopes
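An equivalent formulation (a sketch under the same assumptions, not from the original notebook) builds the full $n \times (k+1)$ design matrix explicitly, with the intercept as a column of ones, and multiplies by the whole parameter vector at once:

# Hypothetical alternative: an explicit NumPy design matrix.
def pred_mass_mv_matrix(params, data=penguin_all):
    # Columns: intercept, flipper, species dummies, sex dummy, interactions.
    X = np.column_stack([
        np.ones(len(data)),
        data['flipper'],
        data['Chinstrap'],
        data['Gentoo'],
        data['male'],
        data['Chinstrap'] * data['male'],
        data['Gentoo'] * data['male'],
    ])
    return X @ params  # plain NumPy array, one prediction per row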
Now let's make a guess:
init_guess = rng.standard_normal(7)
init_guess
array([ 0.30068623, -0.46800143, 1.05910917, -1.55308712, 0.06762369, 0.47486707, -0.72624577])
Let's test our prediction function - it should give us one prediction per row:
pred_mass_mv(init_guess)
0      1.031127
1      0.797094
2      0.497556
3           NaN
4      0.564120
         ...
339    1.699772
340    1.323691
341    2.165720
342    1.599926
343    1.456819
Length: 344, dtype: float64
Now we'll define our loss function:
def squared_loss_mv(params, data=penguin_all):
preds = pred_mass_mv(params, data)
err = data['mass'] - preds
return np.sum(np.square(err))
How bad is that loss with our guess?
squared_loss_mv(init_guess)
2150.413368100089
Now let's minimize the loss:
mv_opt = spopt.minimize(squared_loss_mv, init_guess)
mv_opt
      fun: 44.02326049720256
 hess_inv: array([[ 0.01480044,  0.00888065, -0.00903515, -0.02240841, -0.0094197 ,  0.00427484,  0.00407089],
       [ 0.00888065,  0.00940691, -0.00278176, -0.0169299 , -0.00317235, -0.00226389, -0.00251838],
       [-0.00903515, -0.00278176,  0.0219765 ,  0.01141677,  0.00732328, -0.02046506, -0.00564343],
       [-0.02240841, -0.0169299 ,  0.01141677,  0.04498431,  0.01215354, -0.00234171, -0.01004325],
       [-0.0094197 , -0.00317235,  0.00732328,  0.01215354,  0.0143767 , -0.01249076, -0.01250795],
       [ 0.00427484, -0.00226389, -0.02046506, -0.00234171, -0.01249076,  0.04323162,  0.01386635],
       [ 0.00407089, -0.00251838, -0.00564343, -0.01004325, -0.01250795,  0.01386635,  0.03046714]])
      jac: array([-0.00000000e+00,  0.00000000e+00,  4.76837158e-07,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00])
  message: 'Optimization terminated successfully.'
     nfev: 128
      nit: 13
     njev: 16
   status: 0
  success: True
        x: array([-0.68863129,  0.35659433,  0.08029619,  0.97323106,  0.70694806, -0.40084566,  0.0853413 ])
This gives us an optimized residual sum of squares of 44! If we compare with the regression notebook, the parameters are pretty close! They aren't quite in the same order, so we have to be careful, but this is clearly an approximation of the same model.
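To line the parameters up directly, one option (a sketch, not run in the original notebook) is to fit the same formula with Statsmodels and compare its coefficients to mv_opt.x, keeping the difference in parameter order in mind:

# Fit the same formula with Statsmodels and inspect its coefficients.
mv_fit = smf.ols('mass ~ flipper + species * sex', data=penguin_std).fit()
mv_fit.params  # compare with mv_opt.x (the ordering differs)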