Sampling Distributions¶
This notebook has the plots and code to support the sampling, confidence, and boostrap lecture.
Setup¶
Python modules:
import numpy as np
import pandas as pd
import scipy.stats as sps
import matplotlib.pyplot as plt
import seaborn as sns
Set up a random number generator:
rng = np.random.default_rng(20200912)
Load the Penguin data:
penguins = pd.read_csv('../penguins.csv')
And we're going to want to look at just the chinstraps sometimes:
chinstraps = penguins[penguins['species'] == 'Chinstrap']
chinstraps.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
276 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | female | 2007 |
277 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | male | 2007 |
278 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | male | 2007 |
279 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | female | 2007 |
280 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | male | 2007 |
That's all we need this time!
Sampling Distributions¶
Let's say we have a population that is accurately described by the normal distribution $\mathrm{Normal(50, 5)}$ — the population mean $\mu = 50$ and its standard deviation $\sigma = 5$, and it's normal.
We can draw a sample of size 50 from this distribution:
dist = sps.norm(50, 5)
sample = dist.rvs(size=50, random_state=rng)
And show its distribution:
sns.distplot(sample)
<matplotlib.axes._subplots.AxesSubplot at 0x2b419cc5b80>
Note: The curve in a Seaborn
distplot
is the kernel density estimate: it is an estimate of the probability density function.
We can also compute statistics, such as the mean:
np.mean(sample)
50.462609949153304
That mean is pretty close to our true population mean of 50! But it isn't quite exact - any given sample is going to have some error. There is natural variation in samples that affects the statistic.
But what if we did this 1000 times? And each time, we took another sample of 100, and computed its mean?
How would this sample of sample means be distributed?
Note: The syntax
[expr for var in iterable]
is a list comprehension. It lets is quickly write lists based on loops.
sample_means = [np.mean(dist.rvs(size=50, random_state=rng)) for i in range(1000)]
sns.distplot(sample_means)
<matplotlib.axes._subplots.AxesSubplot at 0x2b41a449f40>
We call this distribution the sampling distribution of a statistic.
Let's overlay these on top of each other:
plt.plot(np.linspace(30, 70), dist.pdf(np.linspace(30, 70)), label='Population')
sns.distplot(sample, label='Sample')
sns.kdeplot(sample_means, label='Sampling Dist')
<matplotlib.axes._subplots.AxesSubplot at 0x2b41a4e6250>
The sample mean is normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. Let's see that for a few sample sizes, with the true mean plotted as a vertical line:
xs = np.linspace(45, 55, 1000)
plt.axvline(50, color='black', linestyle=':')
plt.plot(xs, sps.norm(50, 5 / np.sqrt(10)).pdf(xs), label='n=10')
plt.plot(xs, sps.norm(50, 5 / np.sqrt(50)).pdf(xs), label='n=50')
plt.plot(xs, sps.norm(50, 5 / np.sqrt(100)).pdf(xs), label='n=100')
plt.legend()
plt.show()
Confidence Intervals¶
I'm going to write a little function for computing confidence intervals. It will return a (low, high) tuple.
def ci95(mean, sd, n, width=1.96):
se = sd / np.sqrt(n)
off = se * width
return mean - off, mean + off
What's the estimated CI from our sample?
Note:
ddof=1
tells NumPy'sstd
function to compute the sample standard deviation.
ci95(np.mean(sample), np.std(sample, ddof=1), 50)
(49.113141238970556, 51.81207865933605)
Remember from the video and Confidence in Confidence.
What's the range of the middle 95% if we do it a thousand times? Since we can, of course!