We've talked about summary statistics.
We've started talking about sampling.
How do these go together? (Hint: your video got you started!)
Install r-VGAM before running.
library(VGAM)
library(plyr)
library(tidyverse)
options(repr.plot.height=4.5, repr.matrix.max.rows=20)
It is common to take a sample, and then measure some statistic of it.
In frequentist statistical parlance, we use a statistic of the sample to estimate a parameter of the population.
Let's start doing some sampling. Let's take a sample of 100 random normally-distributed numbers:
nums = data.frame(x=rnorm(100, 10, 2))
nums
summary(nums)
These numbers are randomly drawn from a population with mean μ=10 and standard deviation σ=2.
Note: The standard deviation is a measure of how spread out the numbers are.
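For example, we can compute the sample mean and standard deviation directly; they should come out near 10 and 2, but not exactly:
mean(nums$x)
sd(nums$x)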
If we plot them, we'll see something that looks normal:
qplot(nums$x, geom='histogram', binwidth=0.25)
An alternative to a histogram for a continuous variable is a kernel density plot:
qplot(nums$x, geom='density')
The qplot function is a shortcut for the fuller ggplot syntax:
ggplot(nums) +
aes(x) +
geom_density()
We can also layer plots on top of each other! This is the beauty of ggplot:
ggplot(nums) +
aes(x) +
geom_histogram(binwidth=0.25, mapping=aes(y=..density..)) +
geom_density(color='red')
But, as we saw, the mean of our numbers is not quite 10!
In statistical inference, we want to ask: what if we did this a bunch of times? If we repeated this experiment (taking the mean of a sample of 100) many times, say 1000, what is the distribution of our observed means?
Let's sample!
# repeat the experiment 1000 times: draw a sample of 100 and record its mean
sample_means = aaply(1:1000, 1, function(n) {
mean(rnorm(100, 10, 2))
})
qplot(sample_means, geom='histogram')
That almost looks... normal! And its mean is right around the population mean:
mean(sample_means)
sd(sample_means)
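The standard deviation of the sample means is called the standard error of the mean; for samples of size n from a population with standard deviation σ, theory says it should be about σ/√n, here 2/√100 = 0.2. We can check that against what we just computed:
2 / sqrt(100)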
What happens if we take samples of 10 numbers?
Does the distribution of sample means get wider (higher SD) or narrower?
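One possible sketch (just re-run the simulation with samples of 10; the name sample_means_10 is only illustrative):
sample_means_10 = aaply(1:1000, 1, function(n) {
mean(rnorm(10, 10, 2))
})
qplot(sample_means_10, geom='histogram')
sd(sample_means_10)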
So the sampling distribution of the sample mean looks normal when our population is normal. What if our population is something else? Say log-normal?
A log-normal variable is one whose logarithm is normally distributed; we can randomly generate such values by drawing normally-distributed values and exponentiating them.
# make a data frame
nums = data.frame(x=exp(rnorm(100, 5, 1)))
# and plot it, with the PDF
ggplot(nums, aes(x)) +
geom_histogram(bins=50) +
stat_function(fun=function(x) {dlnorm(x, 5, 1) * 3600}, color='red') # dlnorm is the log-normal PDF; 3600 is a rough scaling term to overlay it on the count histogram
Draw 1000 such samples, and plot the distribution of the mean.
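Following the same pattern as before, one possible sketch (lnorm_means is just an illustrative name):
lnorm_means = aaply(1:1000, 1, function(n) {
mean(exp(rnorm(100, 5, 1)))
})
qplot(lnorm_means, geom='histogram')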
Log-normal is pretty different. Let's go more absurd - how about a power law?
A power-law distribution is one that decays as a power of x (polynomially, rather than exponentially); one example is the Pareto distribution. Power law distributions are extremely common in data that arises from human activity.
# make a data frame
nums = data.frame(x=rpareto(500, 1, 5))
# and plot it, with the PDF
ggplot(nums, aes(x)) +
geom_histogram(bins=50) +
stat_function(fun=function(x) {dpareto(x, 1, 5) * 25}, color='red') # 25 is a scaling term
What's the distribution of the sample mean if we take 1000 100-item samples from a Pareto?
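One possible sketch, again reusing the pattern (pareto_means is just an illustrative name; rpareto comes from VGAM):
pareto_means = aaply(1:1000, 1, function(n) {
mean(rpareto(100, 1, 5))
})
qplot(pareto_means, geom='histogram')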
Our histograms look normal-ish; that is a good first check of how close they are to normal. But there is another kind of plot we use for normality testing: the Q-Q (quantile-quantile) plot. Data from a perfectly normal distribution falls along a straight line on a Q-Q plot.
ggplot(nums) +
aes(sample=x) +
stat_qq()
Use Q-Q plots to assess the normality of your sampling distributions.
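For example, a sketch for the normal-case sample means from earlier (stat_qq_line, available in recent versions of ggplot2, adds a reference line; the column name m is just illustrative):
ggplot(data.frame(m=sample_means)) +
aes(sample=m) +
stat_qq() +
stat_qq_line()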