Regressions¶
This notebook demonstrates various regressions, and accompanies the videos in Week 8.
Setup¶
We're going to import our modules. Of particular note is statsmodels.formula.api, which gives us the regression APIs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as sps
import statsmodels.api as sm
import statsmodels.formula.api as smf
We're going to use the Penguins data:
penguins = pd.read_csv('penguins.csv')
penguins.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
I'm also going to create a utility function for plotting lines:
def plot_line(intercept, slope, xmin, xmax, **kwargs):
    "Plot the line y = intercept + slope * x over [xmin, xmax]."
    xs = np.linspace(xmin, xmax, 100)
    ys = intercept + slope * xs
    plt.plot(xs, ys, **kwargs)
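For example, to draw the line $y = 2x + 1$ over $[0, 10]$ (the intercept and slope here are arbitrary, just to show the call):

plot_line(1, 2, 0, 10, color='orange')

Now let's look at the pairwise relationships between our variables: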
sns.pairplot(penguins[['species', 'sex', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']], hue='species')
Let's also look at the distribution of sex and species, to see if we have sex imbalances between species:
sns.countplot(x='species', hue='sex', data=penguins)
Nope!
Now a correlation matrix between our various numeric variables:
penguins.drop(columns=['year']).corr()
|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| bill_length_mm | 1.000000 | -0.235053 | 0.656181 | 0.595110 |
| bill_depth_mm | -0.235053 | 1.000000 | -0.583851 | -0.471916 |
| flipper_length_mm | 0.656181 | -0.583851 | 1.000000 | 0.871202 |
| body_mass_g | 0.595110 | -0.471916 | 0.871202 | 1.000000 |
The .corr method on a data frame will compute correlations between all pairs of numeric variables, and show them in a table. The diagonal is 1, because a variable is always correlated with itself.
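If we only need one pair, Series have a corr method too; for example:

# Pearson correlation between flipper length and body mass
penguins['flipper_length_mm'].corr(penguins['body_mass_g'])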
Dummy Variable Demo¶
Here's the code that demonstrates dummy-coding from the Categorical Predictors lecture:
pd.get_dummies(penguins['species'])
|   | Adelie | Chinstrap | Gentoo |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 339 | 0 | 1 | 0 |
| 340 | 0 | 1 | 0 |
| 341 | 0 | 1 | 0 |
| 342 | 0 | 1 | 0 |
| 343 | 0 | 1 | 0 |

344 rows × 3 columns
And dropping the first level:
pd.get_dummies(penguins['species'], drop_first=True)
|   | Chinstrap | Gentoo |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| ... | ... | ... |
| 339 | 1 | 0 |
| 340 | 1 | 0 |
| 341 | 1 | 0 |
| 342 | 1 | 0 |
| 343 | 1 | 0 |

344 rows × 2 columns
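Dropping the first level avoids making the dummies redundant with a model's intercept. We don't usually need to do this by hand, though: the formula interface dummy-codes categorical predictors automatically, dropping a reference level for us. A quick sketch (not part of the lecture code):

# The formula interface dummy-codes 'species' itself,
# using the first level (Adelie) as the reference
smf.ols('body_mass_g ~ species', data=penguins).fit().params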
Flipper Regression¶
From the pair plot, the flipper length and body mass look like good candidates for regression. Let's show them with a regplot:
sns.regplot(x='flipper_length_mm', y='body_mass_g', data=penguins, line_kws={'color': 'orange'})
Now we're going to fit a linear regression.
The first step is to create the OLS model. This has two inputs:
- The formula specifying the regression model (in this case, predicting body mass with flipper length)
- The data to train the model on
Let's do it:
bm_mod = smf.ols('body_mass_g ~ flipper_length_mm', data=penguins)
Once we have created the model, we need to fit it. The fit method will return an object containing the results of fitting the model, such as its parameters; it can be summarized with summary:
bmf = bm_mod.fit()
bmf.summary()
| Dep. Variable: | body_mass_g | R-squared: | 0.759 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.758 |
| Method: | Least Squares | F-statistic: | 1071. |
| Date: | Sat, 10 Oct 2020 | Prob (F-statistic): | 4.37e-107 |
| Time: | 19:13:14 | Log-Likelihood: | -2528.4 |
| No. Observations: | 342 | AIC: | 5061. |
| Df Residuals: | 340 | BIC: | 5069. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -5780.8314 | 305.815 | -18.903 | 0.000 | -6382.358 | -5179.305 |
| flipper_length_mm | 49.6856 | 1.518 | 32.722 | 0.000 | 46.699 | 52.672 |

| Omnibus: | 5.634 | Durbin-Watson: | 2.190 |
|---|---|---|---|
| Prob(Omnibus): | 0.060 | Jarque-Bera (JB): | 5.585 |
| Skew: | 0.313 | Prob(JB): | 0.0613 |
| Kurtosis: | 3.019 | Cond. No. | 2.89e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.89e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
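Besides summary, the fit result exposes the estimates programmatically; for example:

# Individual pieces of the fit are available as attributes/methods:
bmf.params      # coefficient estimates
bmf.rsquared    # R-squared
bmf.conf_int()  # 95% confidence intervals for the coefficients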
Now we need to check our residual assumptions. There are two plots to create:
- Residuals vs. Fitted
- Residual Q-Q
I'm going to create a function that plots them both:
def plot_lm_diag(fit):
    "Plot linear fit diagnostics"
    sns.regplot(x=fit.fittedvalues, y=fit.resid)
    plt.xlabel('Fitted')
    plt.ylabel('Residuals')
    plt.title('Residuals vs. Fitted')
    plt.show()
    sm.qqplot(fit.resid, fit=True, line='45')
    plt.title('Residuals')
    plt.show()
And now use it:
plot_lm_diag(bmf)
Standardizing Variables¶
We can also standardize our variables before fitting a model. This converts variable values into z-scores, such that

$$z_i = \frac{x_i - \bar{x}}{s}$$

Z-standardized (or z-normalized) variables have a mean of 0 and a standard deviation of 1 (and since $1^2 = 1$, the variance is also 1). The result is that model coefficients are in units of standard deviations.
Let's write a function to convert a series to its z-scores:
def zscore(xs):
    "Convert a series to z-scores (mean 0, standard deviation 1)."
    xbar = xs.mean()
    s = xs.std()
    return (xs - xbar) / s
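A quick sanity check: the standardized series should have mean 0 and standard deviation 1, up to floating-point error (pandas skips the missing values for us):

z = zscore(penguins['body_mass_g'])
print(z.mean(), z.std())  # expect approximately 0 and 1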
And create standardized versions of our variables:
penguin_std = pd.DataFrame({
    'species': penguins['species'],
    'sex': penguins['sex'],
    'mass': zscore(penguins['body_mass_g']),
    'flipper': zscore(penguins['flipper_length_mm']),
    'bill_len': zscore(penguins['bill_length_mm']),
    'bill_depth': zscore(penguins['bill_depth_mm'])
})
penguin_std
|   | species | sex | mass | flipper | bill_len | bill_depth |
|---|---|---|---|---|---|---|
| 0 | Adelie | male | -0.563317 | -1.416272 | -0.883205 | 0.784300 |
| 1 | Adelie | female | -0.500969 | -1.060696 | -0.809939 | 0.126003 |
| 2 | Adelie | female | -1.186793 | -0.420660 | -0.663408 | 0.429833 |
| 3 | Adelie | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | female | -0.937403 | -0.562890 | -1.322799 | 1.088129 |
| ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | male | -0.251578 | 0.432721 | 2.175637 | 1.341320 |
| 340 | Chinstrap | female | -0.999750 | 0.077145 | -0.077282 | 0.480471 |
| 341 | Chinstrap | male | -0.532143 | -0.562890 | 1.040019 | 0.531109 |
| 342 | Chinstrap | male | -0.126883 | 0.646066 | 1.259816 | 0.936215 |
| 343 | Chinstrap | female | -0.532143 | -0.207315 | 1.149917 | 0.784300 |

344 rows × 6 columns
And now we'll fit and describe a model using these:
bm_mod = smf.ols('mass ~ flipper', data=penguin_std)
bmf = bm_mod.fit()
bmf.summary()
| Dep. Variable: | mass | R-squared: | 0.759 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.758 |
| Method: | Least Squares | F-statistic: | 1071. |
| Date: | Sat, 10 Oct 2020 | Prob (F-statistic): | 4.37e-107 |
| Time: | 19:13:14 | Log-Likelihood: | -241.46 |
| No. Observations: | 342 | AIC: | 486.9 |
| Df Residuals: | 340 | BIC: | 494.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.939e-17 | 0.027 | 2.61e-15 | 1.000 | -0.052 | 0.052 |
| flipper | 0.8712 | 0.027 | 32.722 | 0.000 | 0.819 | 0.924 |

| Omnibus: | 5.634 | Durbin-Watson: | 2.190 |
|---|---|---|---|
| Prob(Omnibus): | 0.060 | Jarque-Bera (JB): | 5.585 |
| Skew: | 0.313 | Prob(JB): | 0.0613 |
| Kurtosis: | 3.019 | Cond. No. | 1.00 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
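Note that with both variables standardized, the slope is just the Pearson correlation between them: 0.8712 here, matching the correlation matrix above. We can verify:

# For a simple regression on z-scores, the slope equals Pearson's r
penguin_std['mass'].corr(penguin_std['flipper'])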
The residual plots will be identical to the unstandardized model, except for the scales of the axes, because all we did was rescale and recenter the variables:
plot_lm_diag(bmf)
Mass and Bill Length¶
Let's do a regression we don't expect to work well: mass vs. bill length.
First, we'll fit the regression:
bm_mod = smf.ols('mass ~ bill_len', data=penguin_std)
bmf = bm_mod.fit()
bmf.summary()
| Dep. Variable: | mass | R-squared: | 0.354 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.352 |
| Method: | Least Squares | F-statistic: | 186.4 |
| Date: | Sat, 10 Oct 2020 | Prob (F-statistic): | 3.81e-34 |
| Time: | 19:13:15 | Log-Likelihood: | -410.02 |
| No. Observations: | 342 | AIC: | 824.0 |
| Df Residuals: | 340 | BIC: | 831.7 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.939e-17 | 0.044 | 1.59e-15 | 1.000 | -0.086 | 0.086 |
| bill_len | 0.5951 | 0.044 | 13.654 | 0.000 | 0.509 | 0.681 |

| Omnibus: | 5.671 | Durbin-Watson: | 0.866 |
|---|---|---|---|
| Prob(Omnibus): | 0.059 | Jarque-Bera (JB): | 4.857 |
| Skew: | -0.211 | Prob(JB): | 0.0882 |
| Kurtosis: | 2.597 | Cond. No. | 1.00 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
$R^2$ is much lower than in the previous models, indicating that this model explains much less of the variance.
Let's look at our diagnostic plots:
plot_lm_diag(bmf)