Assignment 4
Assignment 4 has two purposes:
- To give you more experience with linear regression
- To introduce simulation as a means of studying the behavior of statistical techniques
This assignment is due October 25, 2020 at the end of the day.
Note
In this writeup, I will use capital letters (e.g. $X$) to refer to random variables and lowercase letters (e.g. $x$) to refer to observed values.
You will find the probability notes useful for understanding the derivations in this assignment.
Revision Log
- October 19, 2020
- Added clarifying notes to Warmup.
Simulation
One common way to understand the behavior of statistical techniques is to use simulation (often called Monte Carlo simulation). In a simulation, we use a pseudorandom number generator to make up data that follows particular patterns (or lack of patterns). We call this data synthetic data.
We then apply a statistical technique, such as a correlation coefficient or a linear regression, to the synthetic data, and see how closely its results match the parameters we put into our simulation. If the analysis reliably estimates the simulation's parameters, we say it recovers the parameters. We can do this many times to estimate that reliability — we can run the simulation 1000 times, for example, and examine the distribution of the error of its parameter estimates to see whether it is unbiased and how broad the errors are.
This technique is commonly used in statistics research (that is, research about statistics itself, rather than research that uses statistics to study other topics) in order to examine the behavior of statistical methods. By simulating samples of different sizes from a population with known parameters, we can compare the results of analyzing those samples with the actual values the statistical method is supposed to estimate. Further, by mapping its behavior over a range of scenarios, we can gain insight into what a statistical technique is likely doing with the particular data we have in front of us.
This is distinct from bootstrapping. In bootstrapping, we are resampling our sample to try to estimate the sampling distribution of a statistic with respect to the population our sample was drawn from; we have actual data, but do not know the actual population parameters. In simulation, we know the population parameters, and do not have any actual data because we make it all up with the random number generator.
Generating Random Numbers
NumPy's Generator class is the starting point for generating random numbers. It has methods for generating numbers from a range of distributions.
For more sophisticated distributions, the distribution objects in the scipy.stats module also support random draws.
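For example, here is a minimal sketch of drawing from a SciPy distribution; the exponential distribution and its scale parameter are chosen purely for illustration:

import scipy.stats as stats

# draw 100 values from an exponential distribution with scale 2
draws = stats.expon(scale=2).rvs(size=100, random_state=20201014)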
Random number generators have a seed that is the starting point for picking numbers. Two identical generators with the same seed will produce the same sequence of values.
We can create a generator with np.random.default_rng:
rng = np.random.default_rng(20201014)
In my class examples, I have been using the current date as my seed. If you do not specify a seed, it will pick a fresh one every time you start the program; for reproducibility, it is advised to pick a seed for any particular analysis. It's also useful to re-run the analysis with a different seed and double-check that none of the conclusions changed.
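As a quick check of that reproducibility (not part of the assignment), two generators created with the same seed produce identical draws:

import numpy as np

rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)
# same seed, same sequence, so these two arrays are equal
np.array_equal(rng1.standard_normal(5), rng2.standard_normal(5))  # True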
We can then use the random number generator to generate random numbers from various distributions. It's important to note that random does not mean uniform — the uniform distribution is just one kind of random distribution.
For example, we can draw 100 samples from the standard normal distribution (mean 0, variance 1):
xs = rng.standard_normal(100)
Warmup: Correlation (10%)
If two variables are independent, their correlation should be zero, right? We can simulate this by drawing two arrays of 100 standard normal variables each, and computing their correlation coefficient:
xs = pd.Series(rng.standard_normal(100))
ys = pd.Series(rng.standard_normal(100))
xs.corr(ys)
Note
This code takes 100 draws (or samples) from the standard normal, twice (once for xs and again for ys).
Mathematically, we write this as $X \sim \mathrm{Normal}(0, 1)$ and $Y \sim \mathrm{Normal}(0, 1)$, with $X$ and $Y$ independent.
Run 1000 iterations of this simulation to compute 1000 correlation coefficients. What is the mean and variance of these simulated coefficients? Plot their distribution. Are the results what you expect for computing correlations of uncorrelated variables?
Tip
What you need to do for this is run the code example above (which computes the correlation between two 100-item samples) one thousand times.
This will draw a total of 200,000 numbers (100 each for $x$ and $y$, in each simulation iteration).
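Here is one possible shape for that loop. This is just a sketch (there are other reasonable ways to structure it), and it assumes pandas is imported as pd and rng is the generator created earlier:

corrs = []
for i in range(1000):
    xs = pd.Series(rng.standard_normal(100))
    ys = pd.Series(rng.standard_normal(100))
    corrs.append(xs.corr(ys))

# collect the 1000 coefficients in a Series for summarizing and plotting
corrs = pd.Series(corrs)
corrs.mean(), corrs.var()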
Repeat the previous simulation, but using 1000 draws per iteration instead of 100. How does this change the mean and variance of the resulting coefficients?
Tip
Now you need to modify the code to draw 1000 normals for $x$ and 1000 normals for $y$ in each iteration. Remember that the example above draws 100 normals for each variable.
Remember the covariance of two variables is defined as:

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$
And the correlation is:

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
If we want to generate correlated variables, we can do so by combining two random variables to form a third:

$$Z = X + Y$$
We can draw them with:
xs = pd.Series(rng.standard_normal(100))
ys = pd.Series(rng.standard_normal(100))
zs = xs + ys
With these variables, we have:

$$E[X] = E[Y] = 0, \quad \mathrm{Var}(X) = \mathrm{Var}(Y) = 1$$

$$E[Z] = E[X + Y] = E[X] + E[Y] = 0$$

This last identity is from a property called linearity of expectation.

We can now determine the covariance between $X$ and $Z$:

$$\mathrm{Cov}(X, Z) = \mathrm{Cov}(X, X + Y) = \mathrm{Var}(X) + \mathrm{Cov}(X, Y) = 1 + 0 = 1$$

With that:

$$\mathrm{Var}(Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) = 2$$

The correlation coefficient depends on the standard deviations $\sigma_X = 1$ and $\sigma_Z = \sqrt{2}$. Therefore we have:

$$\mathrm{Corr}(X, Z) = \frac{\mathrm{Cov}(X, Z)}{\sigma_X \sigma_Z} = \frac{1}{\sqrt{2}} \approx 0.707$$
Covariance
You can compute the covariance with the Pandas .cov method.
It's instructive to also plot that!
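For example, assuming xs and zs are the Series defined above and seaborn is imported as sns (any plotting tool works):

xs.cov(zs)    # sample covariance; the derivation above says this should be near 1
xs.corr(zs)   # sample correlation; should be near 0.707
sns.scatterplot(x=xs, y=zs)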
Run 1000 iterations simulating these correlated variables to compute 1000 correlation coefficients (xs.corr(zs)). Compute the mean and variance of these coefficients, and plot their distributions. Does this match what we expect from the analytic results? What happens when we compute correlations of 1000-element arrays in each iteration? What about 10000-element arrays?
Linear Regression (40%)
If we want to simulate a single-variable linear regression:

$$Y = \beta_0 + \beta_1 X + \epsilon$$
there are four things we need to control:
- the distribution of $X$
- the intercept $\beta_0$
- the slope $\beta_1$
- the variance of the errors $\sigma^2$
Remember that the linear regression model assumes errors are i.i.d. normal, and the OLS model will result in a mean error of 0; thus we have $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$. To simulate from this model, we:
- Sample $x_i$ from the distribution of $X$
- Sample $\epsilon_i \sim \mathrm{Normal}(0, \sigma^2)$
- Compute $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
Let's start with a very simple example:
xs = rng.standard_normal(1000)
errs = rng.standard_normal(1000)
ys = 0 + 1 * xs + errs
data = pd.DataFrame({
'X': xs,
'Y': ys
})
Fit a linear model to this data, predicting $Y$ from $X$.
Repeat the simulation 1000 times, fitting a linear model each time. Show the mean, variance, and a distribution plot of the intercept, slope, and $R^2$ values.
Extracting Parameters
The RegressionResults class returned by .fit() contains the model parameters. The .params field has the coefficients (including the intercept), and .rsquared has the $R^2$ value. For example, you can get the slope for X with:

fit.params['X']
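Putting the pieces together, here is a minimal sketch of fitting the model and extracting the estimates. It assumes the statsmodels formula interface (which matches the y ~ x formulas used later in this assignment) and the data frame from the example above:

import statsmodels.formula.api as smf

fit = smf.ols('Y ~ X', data=data).fit()
intercept = fit.params['Intercept']   # estimated intercept
slope = fit.params['X']               # estimated slope
r2 = fit.rsquared                     # R^2 of the fit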
Fit a model to data with
Nonlinear Data (15%)
Generate 1000 data points with the following distributions and formula:
Fit a linear model predicting $Y$ from $X$.
Draw a scatter plot of $X$ and $Y$.
Drawing Normals
You can draw from a normal distribution with standard deviation 5 using the normal method of Generator, or by drawing an array of standard normals and multiplying it by 5.
Tip
The NumPy function np.exp computes $e^x$, element-wise over an array.
Repeat with
Non-Normal Covariates (15%)
Generate 1000 data points with the model:
- Plot the distributions of $X$ and $Y$
- Fit a linear model predicting $Y$ with $X$
- How well does this model fit? Do the assumptions hold?
Gamma Distributions
You can draw 1000 values from a gamma distribution with shape 2 and scale 1 with:

rng.gamma(2, 1, 1000)
Multiple Regression (10%)
Now we're going to look at regression with two or more independent variables.
We will use the following data generating process:
Tip
To draw from a normal distribution with mean $\mu$ and standard deviation $\sigma$, you can draw xs from a standard normal and compute xs * σ + μ.
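For instance, a small sketch with placeholder values (substitute whatever mean and standard deviation the model specifies):

mu, sigma = 10, 2   # placeholder mean and standard deviation
x1 = rng.standard_normal(1000) * sigma + mu
# the Generator can also do this directly:
x1_alt = rng.normal(mu, sigma, 1000)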
Fit a linear model y ~ x1 + x2 on 1000 data points drawn from this model. What are the intercept and coefficients from the model? Are they what you expect?
Check the model assumptions — do they hold?
Note
You can draw both $X_1$ and $X_2$ at once using a multivariate normal:
xs = rng.multivariate_normal([10, -2], [[2, 0], [0, 5]], 100)
# turn into a data frame
xdf = pd.DataFrame(xs, columns=['X1', 'X2'])
The multivariate normal distribution is parameterized by a list (or array) of means and a symmetric, positive semi-definite covariance matrix defined as follows:

$$\Sigma = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_1, X_2) & \mathrm{Var}(X_2) \end{bmatrix}$$
That is, the diagonals of the matrix are the variances of the individual variables, and the other cells are the covariances between pairs of variables. The example code sets up the following matrix:

$$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 5 \end{bmatrix}$$

so $X_1$ has variance 2, $X_2$ has variance 5, and the two variables are uncorrelated.
Correlated Predictors (10%)
Now we're going to see what happens when we have correlated predictor variables. Remember I said those were a problem?
We're going to use the multivariate normal from the hint in the previous part to draw correlated variables $X_1$ and $X_2$.
- Draw 1000 samples of variables $X_1$ and $X_2$ from a multivariate normal with means of 1 and 3, variances of 1, and a covariance of 0.85:

  xs = rng.multivariate_normal([1, 3], [[1, 0.85], [0.85, 1]], 1000)
- Draw the error terms.
- Compute $Y$ from the model.
Show a pairplot of our variables
Fit a linear regression for y ~ x1 + x2. How well does it fit? Do its assumptions hold?
Run this simulation (drawing 1000 data points and fitting a linear model) 100 times. Show the mean, variance, and distribution plots of the estimated intercepts and coefficients (for x1 and x2).
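One way to structure that repeated simulation is sketched below; make_data is a hypothetical helper standing in for whatever data-generating code you wrote above, and smf and pd are assumed to be statsmodels.formula.api and pandas:

results = []
for i in range(100):
    data = make_data(rng, n=1000)   # hypothetical helper: draws x1, x2, and y as above
    fit = smf.ols('y ~ x1 + x2', data=data).fit()
    results.append(fit.params)

# one row per run, with columns Intercept, x1, and x2
results = pd.DataFrame(results)
results.mean(), results.var()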
Repeat the repeated simulation for a variety of different covariances from 0 to 1 (including at least 0, 0.9, 0.99, and 1).
Create line plots (or a single plot with multiple colors) that show how the variance of the estimated regression parameters (intercept and coefficients) changes as the covariance between $X_1$ and $X_2$ increases.
Reflection (10%)
Write a couple of paragraphs about what you learned from this assignment.
Expected Time
As in A3, I'm providing some estimates of how long I expect each part might take you.
- Warmup: 1 hour
- Linear Regression: 2 hours
- Nonlinear Data: 30 minutes
- Non-normal Covariates: 30 minutes
- Multiple Regression: 1 hour
- Correlated Predictors: 2 hours
- Reflection: 1 hour