This notebook introduces the basic concepts of discrete probability distributions. Thinking in terms of probabilities is an important skill in analyzing data and interpreting statistical analyses.
It is inspired by Dr. Kennington's probability examples from Boise State University CS 597.
You can download the raw notebook.
library(tidyverse)
library(nycflights13)
options(repr.plot.height=4)
The flights table from nycflights13 contains data on over 300,000 flights leaving New York City in 2013. We'll use it as our example in this worksheet.
head(flights)
Suppose we see a plane leaving the NYC area, and want to know which of the 3 New York airports (EWR, LGA, and JFK) it probably came from. If we know nothing other than ‘a plane left NYC’, then we can look at the relative frequency of flights from the airports: which airport produces the most flights?
We can do this by counting the number of flights from each airport. dplyr makes this easy with group_by and summarize:
origins = flights %>%
group_by(origin) %>%
summarize(count=n())
origins
We assign the value to the variable origins, and then we ask for the value of origins on a new line to see the data we just computed. Storing the result in a variable lets us make use of this data later!
Also, this data type is called a data frame. A data frame is like a little spreadsheet - it has named columns of data.
The %>% business is called a pipeline, and it is the standard way to process data with tidyverse (or more specifically dplyr). It pipes the results of each operation into the next, until we finally have our results.
It is often convenient to plot data like this, so we can see it visually:
ggplot(flights) +
aes(x=origin) +
geom_bar()
EWR (Newark) has the most departing flights. However, these numbers aren't very convenient - 120835 flights left EWR, but that is a little unwieldy if we want to estimate the chances of another flight coming from Newark.
Fortunately, there is a way we can make these numbers easier to deal with: make them sum to 1, so each value is the fraction of flights that left that airport. Let's do that by dividing each count by the total count:
origins$prob = origins$count / sum(origins$count)
origins
This new column, prob, indicates the probability of a flight departing from the specified airport, given the observations that we have. We are making the assumption here that the flights leaving in 2013 are representative of flights leaving New York City in general, so that we can infer things about future flights from this data. We'll explore in more detail later when we can and can't make this kind of assumption.
A probability is a real number between 0 and 1 that expresses how likely something is. (There is a subtle difference between likelihood and probability, but for our current purposes that difference does not matter.)
Origin airport is an example of what we call a discrete value: it has one of a finite set of distinct values (in this case, just 3: EWR, JFK, and LGA). We also call the origin airport a variable: like variables in computer programs, it is one of the parameters that characterizes an observation (one of the flights). When we are trying to reason about the probability of the variable having different values, we call it a random variable: a variable that takes on random values.
Note: We often think of things in terms of random variables and probabilities even when we don't necessarily think that the way they are produced is actually random. Randomness just provides a convenient way for us to think about the uncertainty we have about our knowledge.
Our table forms a discrete probability distribution. A discrete probability distribution associates each possible value of a discrete variable with a probability of the variable having that value. Each probability must be in the range 0 to 1 (inclusive); in addition, all probabilities in the distribution must sum to 1. We can check this sum:
sum(origins$prob)
More formally, a probability distribution $P(x)$ over the values of a random variable $X$ is a function $P: X \to \mathbb{R}$ such that:

$$0 \le P(x) \le 1 \text{ for all } x \in X \qquad \text{and} \qquad \sum_{x \in X} P(x) = 1$$
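Our origins table satisfies both conditions: we already checked that its probabilities sum to 1, and we can check the range condition as well:
all(origins$prob >= 0 & origins$prob <= 1)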
In R, the $ operator accesses a column of a data frame. It's a lot like . in Java or Python. Each column of this data frame is a vector, which is R-speak for an array. The sum function adds up the elements of a vector and returns the total.
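For example, with the origins data frame we built above:
origins$count
sum(origins$count)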
Notice that above, when we converted counts into probabilities, we did not write a loop! In R, most operations are vectorized: when you apply them to vectors, they operate on the whole vector element-by-element. If we take two vectors and add them, we get the pairwise sum:
c(1,2,3) + c(10, 20, 30)
In R, there is no such thing as a single value - a value is a vector of length 1. And when two vectors have different lengths, R will recycle the shorter one:
c(1,2,3) + 5
This can get us in trouble sometimes if we don't have our vectors quite straight. Fortunately, R warns us in the common error case where we have two multi-item vectors but their lengths aren't compatible:
c(1,2,3) + c(1,2)
Most R computations work over vectors. A basic rule of thumb is to never use loops. R has them, but you won't need them very often at all. Vectorized operations are much faster than manually looping, and are easier to write.
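To illustrate, here is the count-to-probability computation from above written both ways; the loop version works, and the variable probs is only introduced here for the sake of the example, but the vectorized version is the idiomatic one:
# vectorized: divide every count by the total in one expression
origins$count / sum(origins$count)
# the same computation with an explicit loop (not idiomatic R)
probs = numeric(length(origins$count))
for (i in seq_along(origins$count)) {
    probs[i] = origins$count[i] / sum(origins$count)
}
probs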
Our distribution above, over origin airports, is what is called a multinomial distribution. That is, it is a distribution over multiple discrete ‘categorical’ values (we'll see what categorical means in the next lesson).
The simplest kind of multinomial distribution is the binomial distribution: a distribution over two values $\mathsf{H}$ and $\mathsf{T}$. This can be parameterized with a single value $p \in [0,1]$ such that:
$$\begin{align*} P(\mathsf{H}) & = p \\ P(\mathsf{T}) & = 1-p \end{align*}$$

We use the $P(\dots)$ notation to indicate a probability.
It is easy to check that this distribution satisfies our two probability laws: since $p \in [0,1]$, both $p$ and $1-p$ are between 0 and 1, and they sum to $p + (1-p) = 1$.
I have called our two outcomes $\mathsf{H}$ and $\mathsf{T}$ because we often think of them as corresponding to the flip of a (possibly weighted) coin with two sides, heads and tails. A fair coin has $p=0.5$, so that both heads and tails are equally likely.
Let's see the flips of 20 fair coins (don't worry about the details of the flip function for now):
# flip n coins, each coming up 'H' with probability p
flip = function(n, p=0.5) {
    sample(c('H', 'T'), n, replace=TRUE, prob=c(p, 1-p))
}
flip(20)
More generally, the binomial distribution gives the probability of observing $k$ successes (in our case, heads) in $n$ flips (or trials). Let's count the successes in a series of 20 flips:
sum(flip(20) == 'H')
This will often sum to 10, but not always - it may be 7 or 9 or 11.
R Note: The == operator tests for equality, and like most other R operations, it is vectorized - it compares each element of the left-hand vector with the corresponding element of the right-hand vector; when the right-hand vector has length 1, it just reuses that element for all left-hand values. The result is a logical vector (TRUE and FALSE); summing it counts the TRUE values.
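For example:
c('H', 'T', 'H', 'H') == 'H'
sum(c('H', 'T', 'H', 'H') == 'H')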
We can take advantage of another couple of R operations, : to generate sequences and sapply to apply a function to each element of a vector, to carry out several trials of 20-flip sequences and see how often we see different values:
repeated_sequences = sapply(1:100, function(t) {
sum(flip(20) == 'H')
})
repeated_sequences
In R, we don't need an explicit return statement - a function returns the value of its last expression. The sapply function takes a vector and a function f and returns a new vector that is the result of calling f(x) for each value x in the original vector.
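For example, a tiny sapply call that squares each element of a vector:
sapply(1:5, function(x) x^2)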
Now let's plot the simulated counts in repeated_sequences:
ggplot(data.frame(k=repeated_sequences)) +
aes(x=k) +
geom_histogram(binwidth=1)
We can see that values close to 10 are the most common.
What if we have a weighted coin, so that $P(\mathsf{H}) = 0.7$?
repeated_sequences = sapply(1:100, function(t) {
sum(flip(20, 0.7) == 'H')
})
ggplot(data.frame(k=repeated_sequences)) +
aes(x=k) +
geom_histogram(binwidth=1)
Now 13-16 are the most common values, which we would expect, since $0.7 \cdot 20 = 14$.
Now, we can directly compute the probability of observing $k$ successes in $n$ trials without needing to simulate all these trials. The probability $P(k|n,p)$ (read ‘the probability of $k$ given $n$ and $p$’) can be written:
$$P(k|n,p) = \binom{n}{k} p^k (1-p)^{n-k}$$

R has a built-in definition of this function called dbinom:
:
dbinom(14, 20, 0.7)
For fixed $n$ and $p$, this binomial distribution itself is a discrete distribution over the integers $0 \dots n$, and we can also visualize it:
ggplot(data.frame(k=0:20) %>% mutate(prob=dbinom(k, 20, 0.7))) +
aes(x=k, y=prob) +
geom_bar(stat='identity')
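As a quick sanity check, the probabilities dbinom assigns to 0 through 20 sum to 1, as any discrete probability distribution should:
sum(dbinom(0:20, 20, 0.7))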
We can observe that our simulated flips above have the same basic shape. Randomness means the simulation won't align perfectly with the theoretical distribution, but on average it will be pretty close.
We have now seen how we can start to think about the distribution of a single random variable by counting; often, though, we care about more than one variable.
Let's look at the carrier airlines for our NYC flights:
flights %>%
group_by(carrier) %>%
summarize(count=n())
We now have a bunch of carriers, and we can convert this to a probability distribution to estimate the probability of a plane being from a particular airline.
We can also start to think about airlines and origin airports together. Let's do a bit more R trickery! We can group by two variables:
origin_carrier_flights = flights %>%
group_by(origin, carrier) %>%
summarize(count=n()) %>%
ungroup() %>%
mutate(prob = count / sum(count))
head(origin_carrier_flights)
R Note: this contains 2 new functions. mutate is the dplyr way of doing the normalization we did previously for origins; it lets us compute a new variable based on other variables in the data frame. ungroup removes the grouping data introduced by group_by, so that sum sums over the entire data frame.
sum(origin_carrier_flights$prob)
This probability distribution is called a joint probability distribution: it is the probability of two variables simultaneously taking on the given values. We can write it $P(O, C)$: the probability of a specific origin and carrier. So $P(O=\mathsf{EWR}, C=\mathsf{AA}) \approx 0.010$.
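We can look that entry up directly from the tall data frame with filter:
origin_carrier_flights %>%
    filter(origin == 'EWR', carrier == 'AA')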
It can be easier to visualize this in a more matrix-like form. The spread function lets us convert data in this form (‘tall’) into a ‘wide’ format; the select(-count) operation removes the count column from the data frame:
origin_carrier_wide = spread(origin_carrier_flights %>% select(-count), origin, prob, fill=0)
origin_carrier_wide
We can then convert this into an R matrix:
origin_carrier_matrix = as.matrix(select(origin_carrier_wide, -carrier))
row.names(origin_carrier_matrix) = origin_carrier_wide$carrier
origin_carrier_matrix
This is our joint distribution: each cell contains the probability of a randomly selected flight being on the particular carrier and from the specified airport. We can check its sum again:
sum(origin_carrier_matrix)
Good - it sums to 1, as it should.
One of the things we often want to do with a joint probability distribution is compute the marginal distributions of its variables. If we have a joint distribution $P(A,B)$, the marginal distribution is $P(A) = \sum_{b \in B} P(A, B=b)$. When our joint distribution is a matrix, R makes it very easy to compute the marginals:
rowSums(origin_carrier_matrix)
This is the probability of a randomly selected flight being operated by the specified carrier.
colSums(origin_carrier_matrix)
If you compare these with the airport probabilities we estimated at the beginning, you should find them to be the same - a useful sanity check. But sometimes we only have a joint distribution, and we need to extract the marginal distributions from it, which is exactly what we just did.
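We can verify this directly; assuming both vectors list the airports in the same (alphabetical) order, the marginal we just computed should match the prob column from origins:
all.equal(as.numeric(colSums(origin_carrier_matrix)), origins$prob)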
Joint probability distributions are also symmetric - it makes no difference whether we write $P(O,C)$ or $P(C,O)$.
Another useful kind of probability we can derive from a joint distribution is the conditional probability. For example, the conditional probability $P(O|C)$ is the probability that an airplane left a particular airport given that we know it is operated by a given carrier.
The probability $P(O|C) = \frac{P(O,C)}{P(C)}$.
origin_carrier_cond = origin_carrier_matrix / rowSums(origin_carrier_matrix)
origin_carrier_cond
What did we just do? We divided a matrix by a vector that has as many entries as the matrix has rows. This divides every entry in the matrix by the value corresponding to its row. Neat, huh? We can check that each row is a probability distribution:
rowSums(origin_carrier_cond)
Each row of our new matrix is a probability distribution over airports, given that we know the carrier. Cool! If we know that the flight is United (UA), then it is most likely from Newark (EWR) - $P(\mathsf{EWR}|\mathsf{UA}) \approx 0.78$.
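We can read that value straight out of the matrix by indexing with the row (carrier) and column (origin) names:
origin_carrier_cond['UA', 'EWR']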
Now, unlike joint probabilities, conditional probabilities are not symmetric: $P(O|C) \ne P(C|O)$. If we want to compute $P(C|O)$, we can use t to transpose our matrix and normalize rows to be distributions again:
carrier_origin_cond = t(origin_carrier_matrix) / colSums(origin_carrier_matrix)
carrier_origin_cond
rowSums(carrier_origin_cond)
If we want to visualize conditional probabilities, the easiest way is with a faceted plot. First let's convert our conditional distribution to a tall data frame:
carrier_origin_frame = as.data.frame(carrier_origin_cond)
carrier_origin_frame$origin = row.names(carrier_origin_cond)
carrier_origin_tall = gather(carrier_origin_frame, carrier, prob, -origin)
head(carrier_origin_tall)
ggplot(carrier_origin_tall) +
aes(x=carrier, y=prob) +
geom_bar(stat='identity') +
facet_wrap(~ origin) +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
Two variables $A$ and $B$ are independent if $P(A,B) = P(A) P(B)$ - that is, we can compute the probability of particular values of $A$ and $B$ occurring at the same time by independently computing their individual probabilities and multiplying them. What this means in practice is that knowing $A$ tells us nothing about $B$. We can see that our origin airport and carrier are not independent - observing either tells us quite a bit about the other.
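We can check this for a single cell of the matrices we built above: if origin and carrier were independent, the joint probability would equal the product of the corresponding marginals, but you should see that the two numbers differ noticeably:
origin_carrier_matrix['UA', 'EWR']
rowSums(origin_carrier_matrix)['UA'] * colSums(origin_carrier_matrix)['EWR']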
But let's go back to our binomial distribution: when flipping a coin, each flip is independent. Knowing I flipped heads tells me nothing about whether the next flip will be heads.
This is the key to making the binomial distribution formula work: the probability of flipping $\mathsf{HH}$ is $P(X_1=\mathsf{H},X_2=\mathsf{H}) = P(\mathsf{H})P(\mathsf{H})$, which for a fair coin is $0.5 \times 0.5 = 0.25$.
The same is true of rolling dice: the probability of any particular result from rolling two fair dice is the product of the individual die probabilities.
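As a small simulation sketch (using base R's sample function rather than our flight data, with double_sixes introduced just for this example), we can estimate the probability of rolling two sixes and compare it to $\frac{1}{6} \times \frac{1}{6} \approx 0.028$:
double_sixes = sapply(1:10000, function(t) all(sample(1:6, 2, replace=TRUE) == 6))
mean(double_sixes)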
Remember that $P(A|B) \ne P(B|A)$? There is, however, a way that we can convert between these two probabilities!
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

That is, with one conditional distribution and both marginal distributions, we can compute the other conditional distribution. To see why this is true, we can expand the definition of conditional probability:
$$\begin{align*} P(B|A) & = \frac{P(A,B)}{P(A)} \\ P(A|B) & = \frac{P(A,B)}{P(B)} \\ & = \frac{P(B|A)P(A)}{P(B)} \end{align*}$$
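As a closing sketch using the matrices we built above (p_o_given_c, p_c, and p_o are new names introduced just for this example), we can recover one entry of $P(C|O)$ from $P(O|C)$ and the two marginal distributions, and compare it with the value we computed directly:
# P(C=UA | O=EWR) via Bayes' theorem
p_o_given_c = origin_carrier_cond['UA', 'EWR']    # P(O=EWR | C=UA)
p_c = rowSums(origin_carrier_matrix)['UA']        # P(C=UA)
p_o = colSums(origin_carrier_matrix)['EWR']       # P(O=EWR)
p_o_given_c * p_c / p_o
carrier_origin_cond['EWR', 'UA']                  # the value we computed directly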