In the previous lesson, we looked at discrete, and particularly categorical, probablity distributions. In this lesson, we're going to talk about more types of variables, and continuous probability distributions.
Let's grab that NYC data again.
library(tidyverse)
library(nycflights13)
options(repr.matrix.max.rows=20, repr.plot.height=4.5)
We can categorize random variables (or observations, or parameters) into four types:
Types 1–3 are all types of discrete values: they take on distinct values. Type 4 is just continuous.
In the previous demo, we counted up the number of flights by a particular airline, or from a particular airport, to estimate probabilities.
The NYC data also includes weather for each of the airports. Let's look at that data:
head(weather)
Dealing with weather by the hour seems a little obnoxious; let's bin it by day.
day_weather = weather %>%
group_by(origin, year, month, day) %>%
summarize(avg_temp=mean(temp, na.rm=TRUE), max_temp=max(temp), min_temp=min(temp),
avg_wind=mean(wind_speed), max_wind=max(wind_gust),
precip=sum(precip))
head(day_weather)
Let's try to count up average temperatures:
avg_temps = day_weather %>%
filter(origin == 'EWR') %>%
group_by(avg_temp) %>%
summarize(ndays=n()) %>%
ungroup() %>%
mutate(prob=ndays / sum(ndays))
avg_temps
Individual temperatures are not very common! Further, what if we want the probability that it will be 82 degrees?
avg_temps %>% filter(avg_temp == 82)
No entries! But what if we plot the distribution of average temps?
ggplot(day_weather %>% filter(origin == 'EWR')) +
aes(x=avg_temp) +
geom_histogram()
Based on this we'd expect 82 to happen sometimes!
It turns out that, when we have a continuous variable, it isn't meaningful to ask about the probability of it taking on a particular value. If we think about the infinite possibilities, we can see why: even with a limited variable, such as one that only takes on values in the range $[0,1]$, there are infinitely many different values! Since the whole probability distribution must sum to 1, each individual value's probability must be infinitely close to 0.
This… does not sound useful. But it turns out there is a different, related question we can ask. We can count how many observations have values less than or equal to a particular value! And we can ask how likely it is to see a value less than or equal to a particular value.
mean(day_weather$avg_temp <= 50)
mean(day_weather$avg_temp <= 82)
This is called the cumulative probability. We can also ask about the probability of seeing a value in a particular range by subtracting: $P(a < X \le b)$ = $P(X \le b) - P(X < a)$
mean(day_weather$avg_temp <= 82.5) - mean(day_weather$avg_temp <= 81.5)
There is a 1.2% chance of a day having an average temperature within 0.5 of 82. Cool!
Counting things up turns out to not be a very friendly way of thinking about continuous probability distributions. Instead, there are a bunch of standard probability distributions defined with mathematical functions.
One of these is particularly common and important: the normal or Gaussian distribution. Let's see its CDF:
ggplot(data.frame(x=seq(-4, 4, 0.01)) %>% mutate(p=pnorm(x))) +
aes(x=x, y=p) +
geom_line()
We describe the cumulative probability with a cumulative distribution function, or CDF. For a CDF $F$ for a random variable $X$, $F(x) = P(X \le x)$. R supports many probability distributions, and the p
function like pnorm
is the cumulative distribution for that distribution. The CDF of the ‘standard normal’ distribution is
A CDF is monotonic — if $x > y$, then $F(x) \ge F(y)$ — and has a minimum of 0 and a maximum of 1.
Cumulative distribution is not the only way we can think of probabilities, though. This distribution may be more familiar if we look at it like this:
ggplot(data.frame(x=seq(-4, 4, 0.01)) %>% mutate(p=dnorm(x))) +
aes(x=x, y=p) +
geom_line()
This is the probability density function (PDF), and it corresponds to the probability distributions we saw in the discrete case. However, it is important to note that probability densities are not probabilities! CDF values are probabilities, but PDF values are not - they are the density of the probability in a small region surrounding the value.
For a continuous probability distribution, the cumulative function $F$ and density function $D$ have the following relationship:
$$\begin{align*} D(x) = \frac{d}{dx} F(x) \\ F(x) = \int_{-\infty}^{x} D(t) dt \end{align*}$$That is, the probability density is the derivative of the cumulative distribution function. $D(x)$ indicates how much of the probability gets added at $x$, roughly. But it is meaningless to ask about $P(X=x)$, as we discussed above, and the probability density can be quite large (greater than 1, even!). The probability density does, however, integrate to 1.
Many natural phenomena follow normal distributions. Unfortunately, our weather doesn't quite. But we can try - let's look at some general statsitics of the weather data:
mean(day_weather$avg_temp)
sd(day_weather$avg_temp)
These are the mean and standard deviation — a measure of how diffuse the values are — of the data. In its general form, the normal distribution is parameterized by mean and standard distribution, and we can use that to overlay a normal distribution on top of the data:
ggplot(day_weather %>% filter(origin == 'EWR')) +
aes(x=avg_temp) +
geom_histogram(mapping=aes(y=..density..)) +
geom_line(mapping=aes(y=dnorm(avg_temp, 55.199, 17.158)), color="green")
Not a great fit, but not abjectly awful. Many natural processes will follow a normal distribution more or less well.
Each probability distribution in R has 4 functions:
p
, the cumulative distribution function (p
is for probability)d
, the density functionq
, the quantile function. This takes a probability and returns the value $x$ such that $F(x) = p$.r
, the function to draw random valuesSo, for the normal distribution, we have already seen p
and d
; with q
, we can ask for the value at the 75% point:
qnorm(0.75, 0, 1)
If we have a standard normal variable (its mean is 0 and its standard deviation is 1), we expect 75% of its observed values to be less than or equal to 0.6745.
We can also draw a number of values:
rnorm(10, 0, 1)
We can use this to plot things!
ggplot(data.frame(v=rnorm(1000, 0, 1))) +
aes(x=v) +
geom_histogram(mapping = aes(y=..density..)) +
geom_line(mapping=aes(y=dnorm(v, 0, 1)), color="green")
We can see that our random values are, in fact, distributed as more or less as we would expect.
The exponential distribution is one of several probability distributions over non-negative integers where most values are small, and values become more likely the larger they get. Here's one exponential CDF:
ggplot(data.frame(x=seq(0,10,0.01)) %>% mutate(p=pexp(x, 1))) +
aes(x=x, y=p) +
geom_line()
And its PDF:
ggplot(data.frame(x=seq(0,10,0.01)) %>% mutate(d=dexp(x, 1))) +
aes(x=x, y=d) +
geom_line()