Variable Types and Continuous Probability¶

In the previous lesson, we looked at discrete, and particularly categorical, probablity distributions. In this lesson, we're going to talk about more types of variables, and continuous probability distributions.

Download notebook

Let's grab that NYC data again.

library(tidyverse)
library(nycflights13)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

options(repr.matrix.max.rows=20, repr.plot.height=4.5)

Types of Variables¶

We can categorize random variables (or observations, or parameters) into four types:

Categorical variables take on distinct values with no relationship between different values. Origin airport and airline from the previous example are both categorical variables.
Ordinal variables take on distinct, ordered values. For example, if we describe rain intensity as ‘none’, ‘light’, ‘moderate’, and ‘heavy’, that is ordinal: the values are distinct, but are ordered in increasing amounts of rainfall. Likert-style survey responses (‘disagree’, ‘agree’, ‘strongly agree’) are a common kind of ordinal data.
Cardinal variables take on distinct ordered values that support basic math (particularly subtraction). The most common example is integers, commonly expressing counts: if we count the number of airplanes, we can compare two counts by subtracting them; we can't, however, say anything numeric about the difference between ‘moderate’ and ‘light’ rainfall.
Continuous variables have values that are real numbers - there are infinitely many possible values, even if the range is limited, and for any two different values observed, it would be possible to have a value between them; further, they support all of our usual mathematical operations. But thinking of these as real numbers is the right thing to do. Non-count measurements are typically continuous.

Types 1–3 are all types of discrete values: they take on distinct values. Type 4 is just continuous.

Continuous Variables¶

In the previous demo, we counted up the number of flights by a particular airline, or from a particular airport, to estimate probabilities.

The NYC data also includes weather for each of the airports. Let's look at that data:

head(weather)

Dealing with weather by the hour seems a little obnoxious; let's bin it by day.

day_weather = weather %>%
    group_by(origin, year, month, day) %>%
    summarize(avg_temp=mean(temp, na.rm=TRUE), max_temp=max(temp), min_temp=min(temp),
              avg_wind=mean(wind_speed), max_wind=max(wind_gust),
              precip=sum(precip))
head(day_weather)

Let's try to count up average temperatures:

avg_temps = day_weather %>%
    filter(origin == 'EWR') %>%
    group_by(avg_temp) %>%
    summarize(ndays=n()) %>%
    ungroup() %>%
    mutate(prob=ndays / sum(ndays))
avg_temps

Individual temperatures are not very common! Further, what if we want the probability that it will be 82 degrees?

avg_temps %>% filter(avg_temp == 82)

No entries! But what if we plot the distribution of average temps?

ggplot(day_weather %>% filter(origin == 'EWR')) +
    aes(x=avg_temp) +
    geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on this we'd expect 82 to happen sometimes!

It turns out that, when we have a continuous variable, it isn't meaningful to ask about the probability of it taking on a particular value. If we think about the infinite possibilities, we can see why: even with a limited variable, such as one that only takes on values in the range , there are infinitely many different values! Since the whole probability distribution must sum to 1, each individual value's probability must be infinitely close to 0.

This… does not sound useful. But it turns out there is a different, related question we can ask. We can count how many observations have values less than or equal to a particular value! And we can ask how likely it is to see a value less than or equal to a particular value.

mean(day_weather$avg_temp <= 50)

mean(day_weather$avg_temp <= 82)

This is called the cumulative probability. We can also ask about the probability of seeing a value in a particular range by subtracting: =

mean(day_weather$avg_temp <= 82.5) - mean(day_weather$avg_temp <= 81.5)

There is a 1.2% chance of a day having an average temperature within 0.5 of 82. Cool!

Distributions¶

Counting things up turns out to not be a very friendly way of thinking about continuous probability distributions. Instead, there are a bunch of standard probability distributions defined with mathematical functions.

One of these is particularly common and important: the normal or Gaussian distribution. Let's see its CDF:

ggplot(data.frame(x=seq(-4, 4, 0.01)) %>% mutate(p=pnorm(x))) +
    aes(x=x, y=p) +
    geom_line()

We describe the cumulative probability with a cumulative distribution function, or CDF. For a CDF for a random variable , . R supports many probability distributions, and the p function like pnorm is the cumulative distribution for that distribution. The CDF of the ‘standard normal’ distribution is

A CDF is monotonic — if , then — and has a minimum of 0 and a maximum of 1.

Cumulative distribution is not the only way we can think of probabilities, though. This distribution may be more familiar if we look at it like this:

ggplot(data.frame(x=seq(-4, 4, 0.01)) %>% mutate(p=dnorm(x))) +
    aes(x=x, y=p) +
    geom_line()

This is the probability density function (PDF), and it corresponds to the probability distributions we saw in the discrete case. However, it is important to note that probability densities are not probabilities! CDF values are probabilities, but PDF values are not - they are the density of the probability in a small region surrounding the value.

For a continuous probability distribution, the cumulative function and density function have the following relationship:

That is, the probability density is the derivative of the cumulative distribution function. indicates how much of the probability gets added at , roughly. But it is meaningless to ask about , as we discussed above, and the probability density can be quite large (greater than 1, even!). The probability density does, however, integrate to 1.

Many natural phenomena follow normal distributions. Unfortunately, our weather doesn't quite. But we can try - let's look at some general statsitics of the weather data:

mean(day_weather$avg_temp)
sd(day_weather$avg_temp)

These are the mean and standard deviation — a measure of how diffuse the values are — of the data. In its general form, the normal distribution is parameterized by mean and standard distribution, and we can use that to overlay a normal distribution on top of the data:

ggplot(day_weather %>% filter(origin == 'EWR')) +
    aes(x=avg_temp) +
    geom_histogram(mapping=aes(y=..density..)) +
    geom_line(mapping=aes(y=dnorm(avg_temp, 55.199, 17.158)), color="green")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Not a great fit, but not abjectly awful. Many natural processes will follow a normal distribution more or less well.

R Distribution Functions¶

Each probability distribution in R has 4 functions:

p, the cumulative distribution function (p is for probability)
d, the density function
q, the quantile function. This takes a probability and returns the value such that .
r, the function to draw random values

So, for the normal distribution, we have already seen p and d; with q, we can ask for the value at the 75% point:

qnorm(0.75, 0, 1)

If we have a standard normal variable (its mean is 0 and its standard deviation is 1), we expect 75% of its observed values to be less than or equal to 0.6745.

We can also draw a number of values:

rnorm(10, 0, 1)

We can use this to plot things!

ggplot(data.frame(v=rnorm(1000, 0, 1))) +
    aes(x=v) +
    geom_histogram(mapping = aes(y=..density..)) +
    geom_line(mapping=aes(y=dnorm(v, 0, 1)), color="green")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that our random values are, in fact, distributed as more or less as we would expect.

Exponential Distributions¶

The exponential distribution is one of several probability distributions over non-negative integers where most values are small, and values become more likely the larger they get. Here's one exponential CDF:

ggplot(data.frame(x=seq(0,10,0.01)) %>% mutate(p=pexp(x, 1))) +
    aes(x=x, y=p) +
    geom_line()

And its PDF:

ggplot(data.frame(x=seq(0,10,0.01)) %>% mutate(d=dexp(x, 1))) +
    aes(x=x, y=d) +
    geom_line()

origin	year	month	day	hour	temp	dewp	humid	wind_dir	wind_speed	wind_gust	pressure	visib	time_hour
EWR	2013	1	1	0	37.04	21.92	53.97	230	10.35702	11.91865	1013.9	10	2013-01-01 01:00:00
EWR	2013	1	1	1	37.04	21.92	53.97	230	13.80936	15.89154	1013.0	10	2013-01-01 02:00:00
EWR	2013	1	1	2	37.94	21.92	52.09	230	12.65858	14.56724	1012.6	10	2013-01-01 03:00:00
EWR	2013	1	1	3	37.94	23.00	54.51	230	13.80936	15.89154	1012.7	10	2013-01-01 04:00:00
EWR	2013	1	1	4	37.94	24.08	57.04	240	14.96014	17.21583	1012.8	10	2013-01-01 05:00:00
EWR	2013	1	1	6	39.02	26.06	59.37	270	10.35702	11.91865	1012.0	10	2013-01-01 07:00:00

origin	year	month	day	avg_temp	max_temp	min_temp	avg_wind	max_wind
EWR	2013	1	1	38.4800	41.00	33.98	12.758648	18.54012
EWR	2013	1	2	28.8350	33.98	24.08	12.514732	21.18871
EWR	2013	1	3	29.4575	33.98	26.06	7.863663	13.24295
EWR	2013	1	4	33.4775	39.92	28.94	13.857309	27.81019
EWR	2013	1	5	36.7325	44.06	32.00	10.836512	21.18871
EWR	2013	1	6	37.9700	48.02	32.00	8.007511	17.21583

avg_temp	ndays	prob
15.49250	1	0.002747253
17.67500	1	0.002747253
17.84750	1	0.002747253
20.11250	1	0.002747253
23.27000	1	0.002747253
24.62750	1	0.002747253
25.07750	1	0.002747253
25.14435	1	0.002747253
25.42250	1	0.002747253
25.46750	1	0.002747253
...	...	...
83.5550	1	0.002747253
83.9450	1	0.002747253
84.2675	1	0.002747253
86.3300	1	0.002747253
86.9900	1	0.002747253
87.3725	1	0.002747253
87.5525	1	0.002747253
87.5825	2	0.005494505
89.5175	1	0.002747253
91.1675	1	0.002747253