This notebook introduces the basic concepts of discrete probability distributions. Thinking in terms of probabilities is an important skill in analyzing data and interpreting statistical analyses.
It is inspired by Dr. Kennington's probability examples from Boise State University CS 597.
You can download the raw notebook.
library(tidyverse)
library(nycflights13)
options(repr.plot.height=4)
The flights table from nycflights13 contains data on over 300,000 flights leaving New York City in 2013. We'll use it as our example in this worksheet.
head(flights)
Suppose we see a plane leaving the NYC area, and want to know which of the 3 New York airports (EWR, LGA, and JFK) it probably came from. If we know nothing other than ‘a plane left NYC’, then we can look at the relative frequency of flights from the airports: which airport produces the most flights?
We can do this by counting the number of flights from each airport. dplyr makes this easy with group_by and summarize:
origins = flights %>%
group_by(origin) %>%
summarize(count=n())
origins
We assign the value to the variable origins, and then we ask for the value of origins on a new line to see the data we just computed. Storing the result in a variable lets us make use of this data later!
Also, this data type is called a data frame. A data frame is like a little spreadsheet - it has named columns of data.
The %>% business is called a pipeline, and it is the standard way to process data with tidyverse (or more specifically dplyr). It pipes the results of each operation into the next, until we finally have our results.
It is often convenient to plot data like this, so we can see it visually:
ggplot(flights) +
aes(x=origin) +
geom_bar()
EWR (Newark) has the most departing flights. However, these numbers aren't very convenient - 120835 flights left EWR, but that is a little unwieldy if we want to estimate the chances of another flight coming from Newark.
Fortunately, there is a way we can make these numbers easier to deal with: make them sum to 1, so each value is the fraction of flights that left that airport. Let's do that by dividing each count by the total count:
origins$prob = origins$count / sum(origins$count)
origins
This new column, prob, indicates the probability of a flight departing from the specified airport, given the observations that we have. We are making the assumption here that the flights leaving in 2013 are representative of flights leaving New York City in general, so that we can infer things about future flights from this data. We'll explore in more detail later when we can and can't make this kind of assumption.
A probability is a real number between 0 and 1 that expresses how likely something is. (There is a subtle difference between likelihood and probability, but for our current purposes that difference does not matter.)
Origin airport is an example of what we call a discrete value: it has one of a finite set of distinct values (in this case, just 3: EWR, JFK, and LGA). We also call the origin airport a variable: like variables in computer programs, it is one of the parameters that characterizes an observation (one of the flights). When we are trying to reason about the probability of the variable having different values, we call it a random variable: a variable that takes on random values.
Note: We often think of things in terms of random variables and probabilities even when we don't necessarily think that the way they are produced is actually random. Randomness just provides a convenient way for us to think about the uncertainty we have about our knowledge.
Our table forms a discrete probability distribution. A discrete probability distribution associates each possible value of a discrete variable with a probability of the variable having that value. Each probability must be in the range 0 to 1 (inclusive); in addition, all probabilities in the distribution must sum to 1. We can check this sum:
sum(origins$prob)
More formally, a probability distribution $P(x)$ over the values of a random variable $X$ is a function $P: X \to \mathbb{R}$ such that:

$$0 \le P(x) \le 1 \text{ for all } x \in X \qquad \text{and} \qquad \sum_{x \in X} P(x) = 1$$
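Our origins table satisfies both conditions: we already checked that its probabilities sum to 1, and we can check the range condition as well:
all(origins$prob >= 0 & origins$prob <= 1)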
In R, the $ operator accesses a column of a data frame. It's a lot like . in Java or Python. Each column of this data frame is a vector, which is R-speak for an array. The sum function adds up the elements of a vector and returns the total.
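For example, with the origins data frame we built above:
origins$count
sum(origins$count)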
Notice that above, when we converted counts into probabilities, we did not write a loop! In R, most operations are vectorized: when you apply them to vectors, they operate on the whole vector element-by-element. If we take two vectors and add them, we get the pairwise sum:
c(1,2,3) + c(10, 20, 30)
In R, there is no such thing as a single value - a value is a vector of length 1. And when two vectors have different lengths, R will recycle the shorter one:
c(1,2,3) + 5
This can get us in trouble sometimes if we don't have our vectors quite straight. Fortunately, R warns us in the common error case where we have two multi-item vectors but their lengths aren't compatible:
c(1,2,3) + c(1,2)
Most R computations work over vectors. A basic rule of thumb is to never use loops. R has them, but you won't need them very often at all. Vectorized operations are much faster than manually looping, and are easier to write.
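To illustrate, here is the count-to-probability computation from above written both ways; the loop version works, and the variable probs is only introduced here for the sake of the example, but the vectorized version is the idiomatic one:
# vectorized: divide every count by the total in one expression
origins$count / sum(origins$count)
# the same computation with an explicit loop (not idiomatic R)
probs = numeric(length(origins$count))
for (i in seq_along(origins$count)) {
    probs[i] = origins$count[i] / sum(origins$count)
}
probs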
Our distribution above, over origin airports, is what is called a multinomial distribution. That is, it is a distribution over multiple discrete ‘categorical’ values (we'll see what categorical means in the next lesson).
The simplest kind of multinomial distribution is the binomial distribution: a distribution over two values $\mathsf{H}$ and $\mathsf{T}$. This can be parameterized with a single value $p \in [0,1]$ such that:
$$\begin{align*} P(\mathsf{H}) & = p \\ P(\mathsf{T}) & = 1-p \end{align*}$$

We use the $P(\dots)$ notation to indicate a probability.
It is easy to check that this distribution satisfies our two probability laws: since $p \in [0,1]$, both $p$ and $1-p$ are between 0 and 1, and they sum to $p + (1-p) = 1$.
I have called our two outcomes $\mathsf{H}$ and $\mathsf{T}$ because we often think of them as corresponding to the flip of a (possibly weighted) coin with two sides, heads and tails. A fair coin has $p=0.5$, so that both heads and tails are equally likely.
Let's see the flips of 20 fair coins (don't worry about the details of the flip function for now):
# flip n coins, each coming up 'H' with probability p
flip = function(n, p=0.5) {
    sample(c('H', 'T'), n, replace=TRUE, prob=c(p, 1-p))
}
flip(20)
More generally, the binomial distribution gives the probability of observing $k$ successes (in our case, heads) in $n$ flips (or trials). Let's count the successes in a series of 20 flips:
sum(flip(20) == 'H')
This will often sum to 10, but not always - it may be 7 or 9 or 11.
R Note: The == operator tests for equality, and like most other R operations, it is vectorized - it compares each element of the left-hand vector with the corresponding element of the right-hand vector; when the right-hand vector has length 1, it just reuses that element for all left-hand values. The result is a logical vector (TRUE and FALSE); summing it counts the TRUE values.
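For example:
c('H', 'T', 'H', 'H') == 'H'
sum(c('H', 'T', 'H', 'H') == 'H')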
We can take advantage of another couple of R operations, : to generate sequences and sapply to apply a function to each element of a vector, to carry out several trials of 20-flip sequences and see how often we see different values:
repeated_sequences = sapply(1:100, function(t) {
sum(flip(20) == 'H')
})
repeated_sequences
In R, we don't need an explicit return statement - a function returns the value of its last expression. The sapply function takes a vector and a function f and returns a new vector that is the result of calling f(x) for each value x in the original vector.
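For example, a tiny sapply call that squares each element of a vector:
sapply(1:5, function(x) x^2)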
Now let's plot the simulated counts in repeated_sequences:
ggplot(data.frame(k=repeated_sequences)) +
aes(x=k) +
geom_histogram(binwidth=1)
We can see that values close to 10 are the most common.
What if we have a weighted coin, so that $P(\mathsf{H}) = 0.7$?
repeated_sequences = sapply(1:100, function(t) {
sum(flip(20, 0.7) == 'H')
})
ggplot(data.frame(k=repeated_sequences)) +
aes(x=k) +
geom_histogram(binwidth=1)
Now 13-16 are the most common values, which we would expect, since $0.7 \cdot 20 = 14$.
Now, we can directly compute the probability of observing $k$ successes in $n$ trials without needing to simulate all these trials. The probability $P(k|n,p)$ (read ‘the probability of $k$ given $n$ and $p$’) can be written:
$$P(k|n,p) = \binom{n}{k} p^k (1-p)^{n-k}$$

R has a built-in definition of this function called dbinom:
:
dbinom(14, 20, 0.7)
For fixed $n$ and $p$, this binomial distribution itself is a discrete distribution over the integers $0 \dots n$, and we can also visualize it:
ggplot(data.frame(k=0:20) %>% mutate(prob=dbinom(k, 20, 0.7))) +
aes(x=k, y=prob) +
geom_bar(stat='identity')
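As a quick sanity check, the probabilities dbinom assigns to 0 through 20 sum to 1, as any discrete probability distribution should:
sum(dbinom(0:20, 20, 0.7))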
We can observe that our simulated flips above have the same basic shape. Randomness means the simulation won't align perfectly with the theoretical distribution, but on average it will be pretty close.
We have now seen how we can start to think about the distribution of a single random variable by counting; often, though, we care about more than one variable.
Let's look at the carrier airlines for our NYC flights:
flights %>%
group_by(carrier) %>%
summarize(count=n())
We now have a bunch of carriers, and we can convert this to a probability distribution to estimate the probability of a plane being from a particular airline.
We can also start to think about airlines and origin airports together. Let's do a bit more R trickery! We can group by two variables:
origin_carrier_flights = flights %>%
group_by(origin, carrier) %>%
summarize(count=n()) %>%
ungroup() %>%
mutate(prob = count / sum(count))
head(origin_carrier_flights)
R Note: this contains 2 new functions. mutate is the dplyr way of doing the normalization we did previously for origins; it lets us compute a new variable based on other variables in the data frame. ungroup removes the grouping data introduced by group_by, so that sum sums over the entire data frame.
sum(origin_carrier_flights$prob)
This probability distribution is called a joint probability distribution: it is the probability of two variables simultaneously taking on the given values. We can write it $P(O, C)$: the probability of a specific origin and carrier. So $P(O=\mathsf{EWR}, C=\mathsf{AA}) \approx 0.010$.
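We can look that entry up directly from the tall data frame with filter:
origin_carrier_flights %>%
    filter(origin == 'EWR', carrier == 'AA')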
It can be easier to visualize this in a more matrix-like form. The spread function lets us convert data in this form (‘tall’) into a ‘wide’ format; the select(-count) operation removes the count column from the data frame:
origin_carrier_wide = spread(origin_carrier_flights %>% select(-count), origin, prob, fill=0)
origin_carrier_wide
We can then convert this into an R matrix:
origin_carrier_matrix = as.matrix(select(origin_carrier_wide, -carrier))
row.names(origin_carrier_matrix) = origin_carrier_wide$carrier
origin_carrier_matrix
This is our joint distribution: each cell contains the probability of a randomly selected flight being on the particular carrier and from the specified airport. We can check its sum again:
sum(origin_carrier_matrix)
Good - it sums to 1, as it should.
One of the things we often want to do with a joint probability distribution is compute the marginal distributions of its variables. If we have a joint distribution $P(A,B)$, the marginal distribution is $P(A) = \sum_{b \in B} P(A, B=b)$. When our joint distribution is a matrix, R makes it very easy to compute the marginals:
rowSums(origin_carrier_matrix)
This is the probability of a randomly selected flight being operated by the specified carrier.
colSums(origin_carrier_matrix)
If you compare these with the airport probabilities we estimated at the beginning, you should find them to be the same - a useful sanity check. But sometimes we only have a joint distribution, and we need to extract the marginal distributions from it, which is exactly what we just did.
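We can verify this directly; assuming both vectors list the airports in the same (alphabetical) order, the marginal we just computed should match the prob column from origins:
all.equal(as.numeric(colSums(origin_carrier_matrix)), origins$prob)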
Joint probability distributions are also symmetric - it makes no difference whether we write $P(O,C)$ or $P(C,O)$.
Another useful kind of probability we can derive from a joint distribution is the conditional probability. For example, the conditional probability $P(O|C)$ is the probability that an airplane left a particular airport given that we know it is operated by a given carrier.
The probability $P(O|C) = \frac{P(O,C)}{P(C)}$.
origin_carrier_cond = origin_carrier_matrix / rowSums(origin_carrier_matrix)
origin_carrier_cond
What did we just do? We divided a matrix by a vector that has as many entries as the matrix has rows. This divides every entry in the matrix by the value corresponding to its row. Neat, huh? We can check that each row is a probability distribution:
rowSums(origin_carrier_cond)
Each row of our new matrix is a probability distribution over airports, given that we know the carrier. Cool! If we know that the flight is United (UA), then it is most likely from Newark (EWR) - $P(\mathsf{EWR}|\mathsf{UA}) \approx 0.78$.
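We can read that value straight out of the matrix by indexing with the row (carrier) and column (origin) names:
origin_carrier_cond['UA', 'EWR']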
Now, unlike joint probabilities, conditional probabilities are not symmetric: $P(O|C) \ne P(C|O)$. If we want to compute $P(C|O)$, we can use t to transpose our matrix and normalize rows to be distributions again:
carrier_origin_cond = t(origin_carrier_matrix) / colSums(origin_carrier_matrix)
carrier_origin_cond
rowSums(carrier_origin_cond)
If we want to visualize conditional probabilities, the easiest way is with a faceted plot. First let's convert our conditional distribution to a tall data frame:
carrier_origin_frame = as.data.frame(carrier_origin_cond)
carrier_origin_frame$origin = row.names(carrier_origin_cond)
carrier_origin_tall = gather(carrier_origin_frame, carrier, prob, -origin)
head(carrier_origin_tall)
ggplot(carrier_origin_tall) +
aes(x=carrier, y=prob) +
geom_bar(stat='identity') +
facet_wrap(~ origin) +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Two variables $A$ and $B$ are independent if $P(A,B) = P(A) P(B)$ - that is, we can compute the probability of particular values of $A$ and $B$ occurring at the same time by independently computing their individual probabilities and multiplying them. What this means in practice is that knowing $A$ tells us nothing about $B$. We can see that our origin airport and carrier are not independent - observing either tells us quite a bit about the other.
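We can check this for a single cell of the matrices we built above: if origin and carrier were independent, the joint probability would equal the product of the corresponding marginals, but you should see that the two numbers differ noticeably:
origin_carrier_matrix['UA', 'EWR']
rowSums(origin_carrier_matrix)['UA'] * colSums(origin_carrier_matrix)['EWR']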
But let's go back to our binomial distribution: when flipping a coin, each flip is independent. Knowing I flipped heads tells me nothing about whether the next flip will be heads.
This is the key to making the binomial distribution formula work: the probability of flipping $\mathsf{HH}$ is $P(X_1=\mathsf{H},X_2=\mathsf{H}) = P(\mathsf{H})P(\mathsf{H})$, which for a fair coin is $0.5 \times 0.5 = 0.25$.
The same is true of rolling dice: the probability of any particular result from rolling two fair dice is the product of the individual die probabilities.
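As a small simulation sketch (using base R's sample function rather than our flight data, with double_sixes introduced just for this example), we can estimate the probability of rolling two sixes and compare it to $\frac{1}{6} \times \frac{1}{6} \approx 0.028$:
double_sixes = sapply(1:10000, function(t) all(sample(1:6, 2, replace=TRUE) == 6))
mean(double_sixes)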
Remember that $P(A|B) \ne P(B|A)$? There is, however, a way that we can convert between these two probabilities!
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

That is, with one conditional distribution and both marginal distributions, we can compute the other conditional distribution. To see why this is true, we can expand the definition of conditional probability:
$$\begin{align*} P(B|A) & = \frac{P(A,B)}{P(A)} \\ P(A|B) & = \frac{P(A,B)}{P(B)} \\ & = \frac{P(B|A)P(A)}{P(B)} \end{align*}$$
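As a closing sketch using the matrices we built above (p_o_given_c, p_c, and p_o are new names introduced just for this example), we can recover one entry of $P(C|O)$ from $P(O|C)$ and the two marginal distributions, and compare it with the value we computed directly:
# P(C=UA | O=EWR) via Bayes' theorem
p_o_given_c = origin_carrier_cond['UA', 'EWR']    # P(O=EWR | C=UA)
p_c = rowSums(origin_carrier_matrix)['UA']        # P(C=UA)
p_o = colSums(origin_carrier_matrix)['EWR']       # P(O=EWR)
p_o_given_c * p_c / p_o
carrier_origin_cond['EWR', 'UA']                  # the value we computed directly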