Overfitting Simulation

This notebook demonstrates overfitting, where we learn noise too well.

Setup

Load our modules:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as sps

And initialize a random number generator:

rng = np.random.default_rng(20201015)

Simulating Data

Let’s define a function to generate n data points. This does three things:

  1. Generates n samples from \(X\) uniformly at random in the range \([0,10)\) (by multiplying random by 10)

  2. Computes \(Y\) by \(y_i = 2 x_i + \epsilon\), where \(\epsilon \sim \mathrm{Normal}(0,2)\)

  3. Puts \(X\) and \(Y\) into a data frame

The function:

def generate(n):
    xs = rng.random(n) * 10
    ys = xs * 2 + rng.standard_normal(n) * 2
    return pd.DataFrame({'X': xs, 'Y': ys})

Now we’re going to define a function to generate polynomials of a varible. For order \(k\), it will produce \(x^1, x^2, \dots, x^k\) as columns of a data frame:

def poly_expand(series, order=4):
    df = pd.DataFrame({'X': series})
    for i in range(2, order+1):
        df[f'X{i}'] = series ** i
    return df

Now let’s generate our initial data and plot it:

data = generate(30)
data.plot.scatter('X', 'Y')
<matplotlib.axes._subplots.AxesSubplot at 0x165d85a6160>
../../../_images/OverfittingSimulation_10_1.png

So far, we’ve been using StatsModels formula interface. That is syntax sugar on top of a lower-level matrix-based interface. For this analysis, we will use that, because it’s easier to programatically set up and manipulate.

The matrix interface requires us to construct an OLS instance with a \(n \times 1\) array of endogenous (outcome) variables Y, and a \(n \times k\) matrix of exogenous variables. The rows are data points, and the columns of this matrix are predictor variables. If we want an intercept, it needs to be included as a predictor variable whose value is always 1 — the sm.add_constant function does this.

Let’s set it up and fit the model:

X = sm.add_constant(data[['X']])
lin = sm.OLS(data['Y'], X)
lin = lin.fit()
lin.summary()
OLS Regression Results
Dep. Variable: Y R-squared: 0.919
Model: OLS Adj. R-squared: 0.916
Method: Least Squares F-statistic: 317.7
Date: Fri, 16 Oct 2020 Prob (F-statistic): 8.13e-17
Time: 18:51:19 Log-Likelihood: -59.322
No. Observations: 30 AIC: 122.6
Df Residuals: 28 BIC: 125.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.7818 0.639 -1.223 0.231 -2.091 0.527
X 2.2016 0.124 17.824 0.000 1.949 2.455
Omnibus: 7.323 Durbin-Watson: 1.736
Prob(Omnibus): 0.026 Jarque-Bera (JB): 5.772
Skew: 0.801 Prob(JB): 0.0558
Kurtosis: 4.431 Cond. No. 10.3


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now let’s fit a 4th-order polynomial model

\[\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\]

We’ll this by adding a constant to the poly_expand of the data frame, and fitting the model:

X = sm.add_constant(poly_expand(data['X']))
poly = sm.OLS(data['Y'], X)
poly = poly.fit()
poly.summary()
OLS Regression Results
Dep. Variable: Y R-squared: 0.938
Model: OLS Adj. R-squared: 0.928
Method: Least Squares F-statistic: 94.82
Date: Fri, 16 Oct 2020 Prob (F-statistic): 9.90e-15
Time: 18:52:06 Log-Likelihood: -55.274
No. Observations: 30 AIC: 120.5
Df Residuals: 25 BIC: 127.6
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.1426 1.135 -0.126 0.901 -2.479 2.194
X -1.1816 1.803 -0.655 0.518 -4.896 2.532
X2 1.7365 0.805 2.157 0.041 0.079 3.394
X3 -0.2776 0.130 -2.144 0.042 -0.544 -0.011
X4 0.0138 0.007 2.027 0.053 -0.000 0.028
Omnibus: 5.714 Durbin-Watson: 1.689
Prob(Omnibus): 0.057 Jarque-Bera (JB): 4.196
Skew: 0.622 Prob(JB): 0.123
Kurtosis: 4.345 Cond. No. 1.98e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.98e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

And now the same with the 10th-order polynomial:

\[\hat{y} = \sum_{i=0}^{10} \beta_i x^i\]
X = sm.add_constant(poly_expand(data['X'], order=10))
poly10 = sm.OLS(data['Y'], X)
poly10 = poly10.fit()
poly10.summary()
OLS Regression Results
Dep. Variable: Y R-squared: 0.946
Model: OLS Adj. R-squared: 0.918
Method: Least Squares F-statistic: 33.49
Date: Fri, 16 Oct 2020 Prob (F-statistic): 5.98e-10
Time: 18:53:18 Log-Likelihood: -53.154
No. Observations: 30 AIC: 128.3
Df Residuals: 19 BIC: 143.7
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.5949 2.061 0.289 0.776 -3.719 4.909
X -1.1089 19.180 -0.058 0.954 -41.254 39.036
X2 -18.4133 51.012 -0.361 0.722 -125.182 88.355
X3 35.0879 60.911 0.576 0.571 -92.401 162.577
X4 -25.8969 39.040 -0.663 0.515 -107.608 55.814
X5 10.3116 14.824 0.696 0.495 -20.715 41.338
X6 -2.4437 3.495 -0.699 0.493 -9.759 4.872
X7 0.3545 0.517 0.686 0.501 -0.727 1.436
X8 -0.0309 0.047 -0.664 0.515 -0.128 0.066
X9 0.0015 0.002 0.636 0.532 -0.003 0.006
X10 -3.017e-05 4.97e-05 -0.607 0.551 -0.000 7.39e-05
Omnibus: 4.482 Durbin-Watson: 1.745
Prob(Omnibus): 0.106 Jarque-Bera (JB): 3.177
Skew: 0.422 Prob(JB): 0.204
Kurtosis: 4.353 Cond. No. 4.80e+11


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.8e+11. This might indicate that there are
strong multicollinearity or other numerical problems.

Now let’s plot the data and curves. This time, I’m going to create a linspace, as usual, and then use the fitted model’s predict function to generate the curves (this is easier than manually extracting and writing our own formula):

plt.scatter(data.X, data.Y)
xsp = np.linspace(np.min(data.X), np.max(data.X), 100)
# ysp = poly.predict(sm.add_constant(poly_expand(xsp)))
ysp10 = poly10.predict(sm.add_constant(poly_expand(xsp, order=10)))
ylp = lin.predict(sm.add_constant(xsp))
plt.plot(xsp, ylp, color='orange', label='linear')
# plt.plot(xsp, ysp, color='red', label='poly4')
plt.plot(xsp, ysp10, color='green', label='poly10')
plt.legend()
plt.show()
../../../_images/OverfittingSimulation_18_0.png

What are our squared errors on the training data?

np.mean(np.square(lin.resid))
3.0553433680686
np.mean(np.square(poly10.resid))
2.0253086976466195

New Data

What about new points from our distribution? Let’s generate some:

test = generate(30)
test['LinP'] = lin.predict(sm.add_constant(test[['X']]))
test['LinErr'] = test['Y'] - test['LinP']
test['Poly10P'] = poly10.predict(sm.add_constant(poly_expand(test.X, 10)))
test['Poly10Err'] = test['Y'] - test['Poly10P']

And compute our test error for each model:

np.mean(np.square(test['LinErr']))
2.992713620832882
np.mean(np.square(test['Poly10Err']))
4.382587964880173

And plot the new points, along wiht the old points and our model:

plt.scatter(data.X, data.Y, color='steelblue')
plt.scatter(test.X, test.Y, color='red', marker='+')
xsp = np.linspace(np.min(data.X), np.max(data.X), 100)
# ysp = poly.predict(sm.add_constant(poly_expand(xsp)))
ysp10 = poly10.predict(sm.add_constant(poly_expand(xsp, order=10)))
ylp = lin.predict(sm.add_constant(xsp))
plt.plot(xsp, ylp, color='orange', label='linear')
# plt.plot(xsp, ysp, color='red', label='poly4')
plt.plot(xsp, ysp10, color='green', label='poly10')
plt.legend()
plt.show()
../../../_images/OverfittingSimulation_28_0.png