Week 4 — Inference (9/12–16)#

These are the learning outcomes for the week:

Understand the elements of probability
Interpret and write conditional probabilities for events
Understand the key relationships between discrete and continuous probability
Compute and interpret a confidence interval

🧐 Content Overview#

Element	Length
🎥 Inference Intro	10m36s
🎥 Probability	13m46s
🎥 Joint and Conditional	11m50s
🎥 Continuous Probability	11m48s
📃 Notes on Probability	349 words
🎥 Distributions	9m24s
🎥 Sampling and the DGP	17m53s
🎥 Confidence	9m44s
📃 Having confidence in confindence intervals	225 words
🎥 The Bootstrap	7m26s

This week has 1h32m of video and 574 words of assigned readings. This week’s videos are available in a Panopto folder.

This week is at the upper end for total video of any week in the course, and also has some of the trickier concepts. The next week — Week 5 — is significantly lighter in terms of new material, and we’ll take a step back to try to solidify the things we’ve learned so far in the class before proceeding to Week 6.

📅 Deadlines#

Week 4 Quiz on 9/15 at 8AM

🎥 Introduction#

Video (10m36s)

Slides

In this video, I'm going to introduce this week's topic of probability and inference.
So we're learning outcomes. This we could understand probability notation to be able to express formulas using probability.
Understand the difference to an estimation and testing and to be able to compute
confidence intervals and other estimates of precision and significance of an effect.
So to review just a little bit, we've talked about the concept of a statistic, a measurement that we take from a collection of data,
for example, and the Penguins dataset, which we're going to be using a lot this week.
The Gentoo penguins have an average weight of five thousand seventy grams of mean weight.
So but we have the question of we're trying to see how averaged how heavy on average are adult Gentoo penguins?
One of the things we want to know is how precise is this estimate of five thousand seventy six?
Is this likely to be close to the the actual average way far off?
How can we measure how precise this estimate is? We can also start to look at comparisons.
The Gentoo penguins in our data have longer flippers than the chinstrap penguins.
But does this mean Gentoo penguins have longer flippers than chinstrap penguins usually?
Or do we randomly get some long flipper Gentoo penguins and some short flipper.
Chinstrap penguins? And our data doesn't actually tell us anything about the relative flipper length of chinstrap and Gentoo penguins.
So what we're going to be developed, one of the things we're going to be developing this week is some tools to be able
to start to answer some of these kinds of questions and go from this statistic,
we call it. We have some data.
We compute a statistic over it to being able to say things about the underlying constructs from which the data are collected in this case.
The body size characteristics of different species of penguin.
So inference is learning about data to be able to go from the data that we have and learn things about
its structure to learn the values of different parameters that describe its underlying existence.
And so there's a couple of thing.
There's a variety of things you can do it this, too, things are going to be doing is estimating the value of a parameter.
So if there's an underlying parameter, which is the average length of a Gentoo penguin slipper,
can we estimate the value of that parameter by observing penguins in the wild?
And then the testing, the data support for a hypothesis, maybe the hypothesis that.
Gentoo penguins have longer flippers than chinstrap penguins. To define a term, though, an estimate is an estimated value of some underlying quantity.
Often we talk about a point estimate. So if we take our penguins and we compute the mean of our penguins flipper lengths.
That gives us a point estimate for the mean flipper length.
But often we'll have a confidence interval or something similar that describes how precise and how confident we are in this estimate.
And estimates [INAUDIBLE] is a procedure which is the metric and the mechanism that we're gonna use for applying that metric,
for computing an estimate and an s demand is a thing we're trying to estimate.
So an estimate. So we have an estimate. And that's the estimate of an S demand.
And if we're using a statistic from a sample to estimate a parameter.
So we're where we have our penguins and we compute the mean mass of our Gentoo penguins.
And we say, well, we're gonna use that to estimate. The typical mass of a Gentoo penguin.
Then with the typical mass as a parameter and the parameter or the statistic in this case is the estimate and the parameter is the estimate.
The value of this statistic is the estimate, the process of computing.
The statistic itself as an object like meet the mean of five point of five thousand seventy six is the estimate.
The mean as a concept is the estimate here. And the parameter in this case is the estimate and.
So, in effect, size is the size of the difference between two groups or treatments.
A lot of the things that a lot of the basic principles of statistical tests and statistical analysis come from the idea of controlled experiments.
So if in a controlled experiment, if if we had a bunch of penguins, I'm not going to try to draw penguins.
But I'll say penguins. Still a bunch of penguins.
And we want to see if penguins who eat one kind of fish have better growth.
Maybe we have baby penguins, baby penguins who have won access to one kind of food.
Have better growth than access to another. So we'll take our penguins randomly split them into two groups.
We'll have food. One. We'll have food to.
And then we will measure growth. And then we want to compare.
The growth of these two of penguins that are given the two different kinds of food.
There's much more sophisticated experimental designs. But the effect size is how much more or less the food one penguins grew.
And the food to penguins because we want to see it.
We can see, OK. Did they grow more? But the effect size as well. How big is that growth?
And. Even when we're not actually performing a controlled experiment like this,
a randomized controlled trial, we're using the statistical tools and other contexts.
This is the this is the underlying theoretical construct in which they're easiest to understand.
And we're going to understand a lot of them in other contexts in terms of.
This kind of a treatment setup, maybe we just have two groups like this.
This penguin is a June Gentoo and this penguin is a chinstrap.
We can't assign those conditions. We can't take a bunch of penguins and say, OK, this half, we're going to make them Gentoo.
In this half, we're going to make them chinstrap. We're gonna use the same math to compare, say, the flipper size of the chinstrap and the Gentoo.
But this is the kind of experimental construct in which it's which a lot of these kinds of all of these statistical techniques arise.
So a little bit about my perspective. So, I mean, we all come to these topics with our prospect, with our own perspectives.
I'm going to provide access. I'm going to provide links to know a couple of articles about my that have informed my perspective.
These are not required reading. And you're not going to be tested on those articles. But one is that there are few bright line rules.
So it's not as simple as we computer test p less than two point zero five.
Great. We know he found an effect instead. Multiple pieces of evidence.
The statistical significance from a test in one experiment. The confidence interval, the precision of the estimate in another context.
Together increase our confidence in the.
In what we're understanding from the data. No one experiment or analysis is the end of the story.
I'm going to be teaching you how to do all of the classical statistical well, not all.
I'm going to be teaching you how to do some of the classical statistical techniques,
like how to do a test and compute its p value, how to compute P values in other ways.
But there is not.
These individual pieces on their own are evidence that we used to paint a bigger picture and gain car, increasing confidence in our results.
Second is that probability is meaningful, even without a specific random process to discuss.
So some take the approach that we can only.
That the problem, the mathematics of probability, only apply when we're talking about actual random processes and the outcomes of random processes.
I do not take that approach. I take an approach in which probability can also quantify degrees of belief.
And so we can talk about the probability. We can talk about probabilities of actual parameters, not just the values that arise from them.
But I'll probably make some more sense later. Also effect size.
It estimates in the seat the. Precision of our estimates are often more important and more useful than hypothesis tests.
We're gonna do hypothesis tests, but in many cases.
The questions we ask are better answered by estimates, by estimates.
These are not universally held perspectives and I'm going to be teaching you tools.
Regardless of my perspective, but I just wanted to be upfront with some of the perspective that I bring to the
table and how I think about this material and it's informed how I present it.
And the choice of the which material I've chosen to present to you.
So to wrap up,
we're going to go move beyond just computing and comparing statistics to actually be able to start reasoning about the magnitude of the differences.
We see the significance of those differences if they are for lack of a better term,
real or if just they're just artifacts of the data and the data collection process.
The foundations of this are going to be probability theory, which I'm going to get into in the next video.

🎥 Probability#

Video (13m46s)

Slides

Also in this video, I went to talk with you about the basic principles of probability to lay a foundation for some
of the reasoning we're going to need to do this week and throughout the rest of the semester.
So we're learning outcomes for this video are for you to know the fundamentals, set operations.
We're going to use sets to discuss logical events that become the basis of probability
and to know the fundamental public concepts of probability theory for single events.
So to introduce a set, a set is an unordered collection of distinct elements.
There's no duplicates in it. So if A were to somehow be twice in a collection, it's not a set.
And also the elements have no intrinsic order. We can impose an order for Elst from elsewhere.
But if we have a set, there's no intrinsic order to the elements in it by default.
We then in a few relationships around sets, if we can say that an element A is a member of a set.
We can say that one set is a subset of or another. We can also compute the size or the cardinality of a set.
Now, sets can be infinite. That is, they can contain infinitely many elements.
There are also multiple types or sizes of infinity.
The smallest infinity is what we call countable infinity.
And the countable infinity is the cardinality of the set of natural numbers.
So zero one or one, two, three, four or five on forever.
There are accountably, many such numbers. That's the smallest infinity.
There are larger infinities. The details of that aren't going to be superimportant. They're going to come up once or twice.
But this is not an advanced. We're not going to dove deep. But I.
I might need the word accountably occasionally. We then have several operations onsets.
So A Union B is the number is the set of items.
That's an either A or B or both. It can be a both. Can be an either of them.
The intersection is the items that are in both of the sets. And then we have the set difference.
The items that are in A but not in B Union, an intersection are both symmetrical.
You can swap the the sets around. Difference is not symmetrical.
And then we have the complement.
If A is a subset of some larger set, some universe than the set of all items, not in a universe minus A is a complement.
Now, with these sets, we can then start to talk about what we call events.
So probability is the probability theory.
We talk about the probability of events. And so we have a set of elementary events and these are the distinct individual outcomes.
For example, if we flip a coin, are elementary events are heads and tails.
If we roll a six sided die or elementary events are one, two, three, four, five, six.
But then an event is a set that is a subset of our set of elementary events and the elementary events themselves.
We don't treat those as events. We make Singleton sets. So we just have the set that just contains H.
And that's going to be the coin is heads. But this way we can talk about events that are combinations of others,
such as E itself, the set of all the elementary events that mean something happened.
It doesn't matter what happened, just something happened.
And the also we can talk about, for example, two four six is the event if we're have our dice, two four six,
the SEC containing those three dicer die rolls is the event road even number because the hole at the elementary events.
So the individual rolls. One, two, three, four, five, six. There are three of them that are even two, four and six.
And so we can say the set of those three is the event rolled even number.
So we think about events. We're going to talk about events in terms of these sets.
And you set notation to talk about the combinations of events.
This is going to build up the basis of our probability theory. So then set operations described logical events.
So A, Intersect B means both A and B happened at the same time.
We have a A as a set of elementary events where some property is true,
say even B is a set where some other property is true, say divisible by three, A Intersect.
B is the set or both properties are true. This set may be empty.
But in this case, even in divisible by three is gonna be a six.
A Union B is either a C, so intersection is equivalent to logical and a union B is either A or B happened or both.
If it's possible for them to both have it at the same time, it might be that both happen, but at least one of them happened.
This is logical. Or we can also we also have A happened, but not B, so if A is even and B is divisible by three A but not B would be three.
Excuse me, A, but not B would be two and four because two and four of the even numbers that are not divisible by three.
As I said, these are equivalent to the logical operators, so A and B, A or B and then A and not B.
So we have these these individual outcomes, the elementary events.
We have the events themselves that are subsets of it. We then decide this D.
Define a set F. That is going to be the set of all possible events.
These are subsets of E! So this is a set of stats. And this set has to has an important property.
It's closed under. So first E is in it. So E is is an F.
And it is closed under complement and countable unions. So if.
If set A is in this field F. Then its complement is as well.
So E minus A is in it. The way we think of that, if a as say, even if even as an event, while odd, is an event to.
Also, if we have. Solemn events, then their union is also an event.
So if and also if if it is infinite, then this is count what accountable union?
Any count? Any any collection of counterplay many events. Their union is an event.
The details of that ring will save for a probability theory course.
But to be complete, there it is. We call this this field f a sigma field or a sigma algebra.
But. But F, is this the setup?
This collection of events? So we the in The Incredibles, you have the guy, the father, he comes home and the kids waiting there and.
What are you waiting for? I know something amazing, I guess. Well, something amazing.
It is an event that set off all elementary events that are amazing for whatever amazing means.
But then that also means that something not amazing, maybe something boring is also an event.
So you get this compliment thing. Like if. If, if. The set of all monetary events with some property say amazing is an event.
Well, the set of all of them that don't have that property also is.
So if we have our six sided die, we have the elementary events of the three or the six different values.
All of the singletons are events. They're all in our field and also all subsets of e wind up being events.
This we call this the powers. The power set is the set of all subsets. So there's any subset of E is an event.
And if you have it, so long as our set of elementary events is discrete and finite, all of the subsets are going to be events.
So let's I want to show you quick a theorem that we can prove already with the values that we have so far.
So if we have two events, a one and a two. The axioms told us it's usually the definition of a sigma field.
Told us that their union is an event,
but also their intersection is an event because the intersection can be turned into a compliment of a union of compliments by the Morgens laws.
If you've taken the logic class, you might remember those.
The compliments are in the hour in the field by club because the field has all the compliments in it.
The union of Compliments, therefore, is also in the field since it's in the field, its complement.
The intersection is in the field. So from the axioms we have, we can prove that the intersection is is an event.
So with these fields, we can now define probability and a probability is a measure of we call it actually a measure.
So it's a probability measure. But it's a measure of how likely the different events are.
So a probability distribution. It's a function over this field F and it obeys a few rules.
The probability of E the set of all elementary events is one. Basically, if we got something happening.
Well, the probability of something happening again, observe something happening is one.
Also, probabilities are not negative. All probabilities are at least zero.
And then if we have a disjoint sets or disjoint events.
So two events are disjoint. If they're mutually exclusive, they can't both happen at the same time.
Their intersection is empty. If we have to decide if we have disjoint events, then the probability of their union.
So if A and B are disjoint, the probability of A or B happening is the probability of A plus.
The probability of B. So for disjoint events, events, that cannot happen.
At the same time, we sum up their individual property probabilities to get the probability of any of them happening.
So if we're going to apply this to our DI, the problem, if we have all our singletons sets set containing just the value one,
two, three, four, five, six, the probability of each of these is one sixth.
The Dyas is fair. All values are equally likely. So the probability of even we're going to stay.
Even the set of even dice is one half because three of the six equally likely values one, two, three, four, five, six are even.
So we have when all the values are equally likely, we can count up to par.
The number of possibilities where our action, where our property happens, divided by the total number of possibilities.
And that gives us the probability. If they're not all equally likely, then things quickly get more subtle.
One consequence of this, another theorem that we can prove is that probabilities are not greater than one.
So. And I'm going to let you study the proof here in the slides, Off-line.
But we can prove from those axioms that probabilities are not greater than one.
And so this gives us a few different facts about probability. Probabilities are in the range zero to one.
If we have the union of two probability of two events there,
the probability of the union is the sum of the individual probabilities, minus the probability of A and B.
I'm going to let you think about why this is true. And we can talk about it in the discussion.
This also gives us that the probability of. A or B?
Is less than or equal to the probability of a plus the probability of B.
You can see that because the probabilities are at least zero, then P of A plus P, a,
B plus P of A or B or minus P of A and B is going to be no more than P of A plus people P plus P of beef.
The probability of a compliment is one minus the probability of the original thing.
We can compute the probability of of a difference. And we can also have an inequality relationship for subsets of fey as a subset of B.
It's probability is no greater than the probability of B. So what does probability mean so far, I've given you these rules.
It's just a function from sets of values. It turns out this is a surprisingly deep philosophical question,
but the depth of the philosophical question doesn't have to stop us from using it basically in two broad views,
which for our purposes are not going to conflict that much are that it's the expected or long run outcome of a random process.
And so. Probability of a Diab being one six means if we go, we roll six, the dice six hundred times.
We expect about a hundred of those to be one. It also can describe a degree of belief or expectation.
So that then that is a more subjective view on it that connects to the probability is is
describing what I know about what's going to happen when I roll the dice to wrap up,
though, we can use we can use sets to describe events or outcomes in their logical combinations.
We can then use probability to quantify the likelihood of different events and probabilities, follow a number of rules with rich emergent properties.
We're going to see more notions of probability and things we can do with it in the subsequent videos.

🎥 Joint and Conditional Probability#

Video (11m50s)

Slides

So, again, in this video, I want to introduce joint conditional and marginal probability.
The previous video we talked about probability in general, particularly the probability of one event at a time.
We're going to start talking about probabilities of that.
Would that involve more than one event?
So are learning outcomes understand the definitions and of the relationship between joint conditional marginal probability?
We're going to introduce Bayes Theorem. I want you to be able to apply it to invert the probability.
And also then we're going to introduce the concept of what it means for two events to be independent.
So to start out, joint probability is the probability of both A and B occurring simultaneously.
And we write it with A and B in columns. And so and it's the same as the and.
So remember, from the previous video, the set Intersect Operator means.
And so we just write P of A Khabab B, that's the joint probability of A and B occurring at the same time.
Sometimes this is written with a semicolon instead of a comma.
That's mean. The same thing. And sometimes you use both because your your events, they fall naturally into different groups.
And so you use commas to separate individual events and use a semicolon to separate the groups.
Not going to see it super often, but if you see a semicolon, that means the same thing as a comma.
It's a joint probability, but it.
It's just used to provide some clarity and grouping.
So, for example, if I roll, if I roll to die, two dice and the problem, the probability that the first one is four and the second one is five.
That's a joint probability. I could roll.
I could just compute the probability of the first one, and that's doesn't it. Second, I have no impact on it, but this joint probability is due.
One is for. And if two is five. Probability of that is one thirty sixth.
This is not the same as rolling a four and a five. Because there is an order here.
If I if the first dice diez five in the second Dice's four dayas for this event didn't happen because we are talking about this order.
If we want a four and a five, I'm going to let you think about what the probability of that would be.
So we can start to think about two dimensional spaces. Probability spaces.
In the last video we saw one dimensional probability where we can think about the role of a single die.
So I rolled a four. We can also think about drawing a single card from a shuffled deck.
Where I can draw an ace. Now, a key difference here, if I want to start thinking about two, I can roll two dice.
I get a one and a five. And they have nothing to do with each other.
Or I can roll one die and I get a five, I can roll it again and I get another five.
But if I draw another card from my deck. I can't draw another ace.
So for the dice, the each. Each one had the same probability, the probability of rolling a one and one on day one.
And the five. On day two is one out of 36. But if I have the probability of drawing.
A jack of hearts on my next draw is different because I've already drawn an ace of spades.
It's going to be one over 50 two. Excuse me, one over fifty one instead of one over fifty three.
drawCard knitted native diamonds. Further, the probability of drawing a diamond now is different from the probability of drawing a club.
Because I've drawn one diamond and I haven't drawn any clubs. So if I want to say the probability of clubs, there's still 13 left in the deck.
But there's only 12 diamonds left. So. The the the cards they change.
Well, since I'm not putting the cards back between draws, the cards change as much as I progressed through.
And there's a relationship between one sample and another. So the conditional probability P of be given a.
That's how we read. That is the probability that B happened. Given that we know A happen.
So when our dice five roll one die for.
That tells you absolutely nothing about what the second die is going to be. It can be a two.
It can be whatever. But if I shuffle my cards. And I draw a card, and that's eight of hearts.
That changes what the next probability is the probability of of a heart, the first time was 13, over 52.
But the probability of a heart the second time is twelve over fifty one because a heart is gone and a card is gone.
The probability of a spade the next time is thirteen over fifty one because I have a heart.
I haven't taken any spades but I have taken a card so close in 13. Over 52 to 13.
Over fifty one. Next card it turns out, is actually a heart.
But we have knowing the first card tells us something about what the next card is going to be.
And we call this a conditional probability. We can decompose joint probabilities into conditional probabilities.
So the probability of A and B is the probability of A given B.
Times the probability of B. And likewise, we can it also works the other way around.
It's been given eight times the probability of a. We can use this to breakdown a joint probability into a conditional.
We sometimes call this a factor, joint probability, but it starts to establish relationships.
The probability of drawing an eight of hearts, followed by a three of hearts is the same as that is the probability of drawing an eight of hearts,
followed by the times, the probability of drawing a three of hearts.
Given that, I have already drawn an eight of hearts.
We can also talk about the marginal probabilities of the marginal probability is the probability of a single event.
And if we know the joint distribution, we can compute the probability by doing what's called marginalizing the joint distribution.
So if we we can think of a card probabilities also as a joint probability.
So the three of hearts is the joint probability of a three and of a heart.
And so we can say, well, the probability of a three.
We can compute that with the sum over our suits of the probability of us three.
And that of or six. And that suit. And this computes this will be called this marginal probability.
The reason it's called a marginal probability is because if we draw out a table of the joint distribution.
So in the rose, we have the first the first value or event set of the set of possible events in the first case,
the set of possible events in the second case.
And this this hat, this requires us to have events where we have a set, I say a set of mutually exclusive events that span each.
So we need to be able to have events that cover everything. And they need to be mutually exclusive.
In order for this sum to work. But the second to die will be one of these six value.
Will be one of these six values. So. We can we have the values of each individual pair.
So the value of one being a six and two being a five is one over thirty six.
And then the margin. So if we sum this row.
We get one over six, which is the probability of the first die being a two.
And that's why it's called the marginal probabilities. It's what happens if we if we compute the margin of this table of the joint probabilities.
And I want to get to independents. I've talked about how the DICER Independent ruling one tells us nothing about the second.
So formally, two events are independent. If no one tells you nothing about another, the other.
So the probability of A given B, so the probability of rolling a set of five on my second day,
given that I wrote a two in the first, is the same thing as just rolling a five.
And so B tells us knowing B happened tells us nothing about A equivalently goes the other way probability knowing A happened tells us nothing about B.
And if two events are independent, then that are joint probability is the product of their marginal probabilities.
I can take A if A and B are independent, then the probability of A and B is the probability of A times the probability of B.
I'm going to let you as an exercise to get more familiarity with these with these distributions.
Prove that they are or with these definitions is to prove that these two definitions are equivalent.
So now then finally, I want to introduce something called Bayes Theorem.
So conditional probabilities have a direction. The probability of being given A and A given B are not the same thing.
But. We can they are related.
So. The probability of a penguin having a flipper length of 217 millimeters,
given that it's a Gentoo penguin, is not the same thing as the probability of that.
It is a Gentoo penguin. Given that it has a flipper length of 270 millimeters.
But if we know the probability of flipper length,
given that it's a Gentoo and we know the probability that it's the marginal probability that it's a Gentoo and that it has that flipper length,
then we can compute the probability that it's a Gentoo. Given the flipper length.
So if a. Gentoo. B.
Flip. Of, I'll say approximately to 17 millimeters.
We can read if we know one, we can add the marginals, we can reverse it and get the other.
This becomes very useful for a lot of kinds of probabilistic inference.
So to wrap up. Joint probability is the simultaneous is the probability of multiple events happening simultaneously.
These can be multiple overlapping descriptions of the same thing.
For example, the probability that my card is a three and the probability that it's a heart.
They can also be probabilities that relates to individual dimensions of a multidimensional space,
like the probability for different die or the probability of of different cards in a sequence.
Conditional probability describes the probability of an event condition and that other information we may have and these things,
these probabilistic building blocks. There's a building blocks for more reasoning.
We're going to use them quite a bit as we start to reason and as we start to describe other things throughout the cement's.

🎥 Continuous Probability#

Video (11m48s)

Slides

In this video, I'm going to introduce continuous probability.
So far, we've been talking about probability over discrete variables, dice rolls, et cetera,
but we're going to now talk about continuous probability and introduce the concept of a random variable.
Learning outcomes are for you to be able to compute an expected value and to understand why continuous events have
probability densities and the relationship between these densities and distribution functions and actual probabilities.
So if X is the value of a six sided Deyrolle, so we roll it, I has six sides,
X is the value one through six of that result, we call X a random variable.
It's a variable that has a random numeric value. Technically of random variable is actually a function from elementary events to real values,
but for our purposes that nuance is not generally going to be very important.
The expected value of the variable is the sum.
Of the different values it can take of the value.
Times the probability of that value. So it's one times the probability of a one and you add two times the probability of a two.
This is equivalent to the mean over many roles. If we roll the dice a thousand times, what would it mean?
Approximately B and also we can think of it in a gambling kind of a gambling or betting setting of the expected return from a roll,
the expectation in the meet. So the expectation is the mean over many points, but it's also in the mean of the random variable.
The random variable has a mean, even if we don't have any observations for it.
And the expected value is that mean. One way to see this,
if we've got a sequence of data points in our process is pick one of them uniformly at random probability one over N and we have it's value,
then we can compute. The expected value is the probability value times.
There is the sum value times, the probability of that. But this probability is one over N.
So when we pull that outside we get one over ten times the sum, which is exactly the sample mean.
That's the formula for the mean. So we can see the expectation and the mean are equivalent concepts here.
But that's a discrete variable, one through six. Rove is excited.
I don't happen to have a continuously value to die, but what's the probability that so we randomly pick agenda Penguin,
what's the probability that it's flipper length is two hundred seventeen. Mm.
It's a continuous variable, so two hundred and seventeen millimeters, there's very values on both sides of it.
What if it's 217 17? But we find a penguin and it's flipper length is to seventeen point one millimeters.
It's 217 plus 10 to the negative 10 mm, I don't have a ruler that precise, but the value these values are continuous values.
There's a value between 217 and 217, plus ten, the minus 10.
So the probability of any individual value of a continue of a truly continuously
valued random variable is zero because it's it's it's effectively never.
Going to be exactly this value might be pretty close, but it's but the probability of it being exactly that value is effectively zero.
So we need a different way to approach probability for continuous variables.
And the way we do this is that we assign probabilities not to individual values of the variable, but to intervals or ranges.
And so are elementary. Events are still real numbers. The penguin that we randomly picked has a flipper length.
Those are the elementary events. But we don't have the singletons. The singletons are not in our set of actual events we care about.
The events we care about are intervals, the sets of the set of intervals and their complements and their countable unions.
There's a lot of different sets in here. Any real value is in.
Actually, infinitely many of any real value we can pick is actually an infinitely many different events
because the intervals are all events and it contains infinitesimally small intervals.
And no matter how small an interval you pick, there's one smaller in there,
but it does not contain the individual singleton values and we assign probability to these intervals.
We do that through a thing called a distribution function.
And so the distribution function F is the probability that the random variable takes on a value
less than or so F of X is the probability that the random variable takes on a value less than X.
And so at X equals zero, so this and this is sometimes called accumulative distribution function or a CDF C D,
I'm going to write that down for you, CDF.
Cumulative distribution function, so what X equals zero, it takes on the value, the probability that the random variable,
it has a value less than X, which for the first three curves, these are the CDs for different parameters of what's called the normal distribution.
It's the probability that it's less than that. We go up to one. X equals one, and the probability that it's less that is higher,
the cumulative distribution function is monotonic, non decreasing, and it has a maximum value.
The limit as X goes to infinity is one.
So the probability is excuse to infinity. The probability that's less than X is one.
And as X goes to negative infinity, the probability is zero.
But this this is the basis for establishing the probability of continuous values is we say,
well, what's the probability that it's less than some value?
And then if we have an interval, so we want to say X, the probability that X one is less than X,
which is less than X to what we can do is we can subtract these two probabilities.
So what we say is so. The what we have here is the events X is less than X two, and we have the event not.
Or X is less than X one compliment, X is not less than one, so it's greater than X one.
And we take the union of those two events excuse me, the intersection of those two events.
What's the probability that they both happen and we get that we need F of X to minus F of X one.
So we call the probability on an overall probability mass how much probability the mass is on this interval.
Also, when we have a discrete event, it's probability is also called a probability mass.
So if you have discrete variable, then the function of what the probability is of different values is called a probability mass function.
How much probability is on this discrete value? But we we can't have mass on discussing on individual discrete values of a continuous variable.
We can have probability mass on an interval and we can have a probability density on an individual value.
So a distribution is often defined.
So I've got us to the distributions of the probability of a of a variable having a particular value through the distribution function.
But the distribution function is often actually defined as the integral of a density
function in the density function is therefore the derivative of the distribution function.
We have a density functions that the distribution function is the integral from negative infinity to X of the density,
and that gives us the probability that it's less than or equal to X.
And the this is the graph of the density function and these are the densities for the same distribution functions you just saw.
So they go back down. They're not monotonic and they're showing how much density is at different values.
So X the purple one has the most density at X, the mean of the the mean of these three.
Is all the same, the expected values, the same, the distribution function at zero point five is the same, but the densities are different,
which means the purple one is more strongly concentrated around zero than the red one or the orange one.
Densities can exceed unlike probability, unlike probabilities,
a density can actually exceed one, they're not going to be negative, but they can exceed one.
You can have density. I've seen densities of 10 and some of my analyzes because the density is not a probability.
Instead, the density is the limit. As we see if we pick a point, the density at that point is the limit.
Of the so we we have a point here, so we've got our curve.
We've got a point, we have a window around it. Of wealth to Epsilon.
And it is the limit as that window gets smaller and smaller of the probability mass in that window divided by the width of the window,
the density because the density is the mass, divided by how much length the mass is spread over.
If you're going to compute the density of the weight of an object, you'd be the the weight over the amount of area.
We're dealing with one dimension here. So the limit as this window gets smaller and smaller of.
The mass divided by the width and because we're dividing by that width, the densities can exceed one.
You might remember this definition of that as the definition of the derivative from calculus that and that that's the relationship here.
The density is the derivative of the distribution function. The distribution function is the integral of the density.
We can also compute continuous expectation and continuous expectation is an integral.
So as the as the the expectation of a discrete variable was the sum of the values, timeliness,
the probabilities, the continuous expectation is just the integral of the values times their probabilities.
When we go from the script, from continuous, we have to go from segments to integrals. But the concept is the same.
It's still the mean. So to wrap up the probability of any individual value of a continuous variable is effectively zero.
It is zero. But so instead we use probability density, distribution functions and we assign probability mass,
not individual points of a continuous variable, but to intervals of it.
And the expectation also then that the expected value is the mean of a random variable.
Couple of things I haven't shown you.
We can also talk about conditional expectation, which is the expectation of the random variable, given some other information.
We're going to be building on this as we go forward. I'm also going to be posting some notebooks that work through computationally.
How do we actually start to compute and how do we count frequencies of events to start to estimate probability from data that we have?

📃 Notes on Probability#

My notes on probability provide a linear, summary treatment of the concepts of probability that we have discussed, along with pointers for further reading.

I expect you will likely need to return to the probability material as we progress through the semester and use it more and more. A few particularly important things you need to be able to understand are:

What does a probability \(\P[A]\) mean?
What does a conditional probability \(\P[A|B]\) mean?
What does a joint probability \(\P[A,B]\) mean?
What does an expected value \(\E[X]\) mean?

In my teaching of later material, I use probability notation a lot, as it is a concise but (relatively) unambiguous way to communicate many important concepts. Also, while the philosophy of probability is largely out of scope of this course, my own philosophy of probability (roughly, instrumentalism) means that I use probabilities to describe things that a strict philosophical frequentist likely would not. One of the most practical implications for this class is that I will use conditional probability as a shorthand for fractions of events or observations:

\[ \P[A|B] = \frac{|A \cap B|}{|B|} \]

You can derive this fraction yourself from \(\P[A] = \frac{|A|}{n}\), where \(n\) is the total number of possible events or observations, and cancelling the \(n\).

🎥 📓 Distributions#

Video (9m24s)

Slides

Hello, in this video, I want to talk about the concept of distribute, particularly named distributions, we introduced the fundamentals of probability.
We've gotten the continuous variables and random variables.
I want to talk about what it means when we talk about something, say, having the normal distribution.
And so our learning outcomes for this video were to understand the idea of a distribution,
to be able to identify parameters and statistics of a distribution.
The notebook and the linked reading have the linked material, has more information, more information on these distributions, many more distributions.
So I encourage you to spend some time exploring that particular distribution.
Families are one place where Wikipedia is a rather fantastic resource.
I've included a link to Wikipedia's list of probability distributions, which is organized by by distribution type.
And it each of the pages provides rather mathematically dense but useful detail on how the distribution is defined.
Some of its key properties and relationships to other distributions.
So but to get started, this idea of working me, focusing on distributions of numeric variables.
These can be both discrete and continuous. So discrete. We might have a distribution of accounts.
It can also be categorical so long as we encode them as a numbers.
So we may encode a a success or failure outcome as zero for failure and one for success.
Speaking of which, the banali distribution does exactly like it does exactly that.
If we have binary outcomes. Success and failure, which we code as one and zero respectively by convention.
And we have a parameter, the success probability. Then we have we can then also compute some statistics of this distribution.
Its mean, which is since the outcomes are just zero and one, it would be the fraction in the long run, run it infinitely many times.
What fraction would you expect to be successes? And then it'll also have a mode, which is the value that's going to occur more often.
For example, when Theta is point three, the mode is failure.
There are a few things we want to do to characterize the distribution. So we have its parameters.
So in the Bernoulli distribution, the parameter was the success probability. There are other parameters as well,
and some of the parameters come in particular types such as location and location
is a parameter that controls where on the x axis the distribution is located.
Sometimes it is the mean, but not always. The key thing is that the location parameter controls where on the x axis.
So it slides it back and forth in the x axis, some distribution. It's also going to change the shape somewhat as it slides.
But the location parameter controls where on the x axis it is the scale parameter controls,
how far, how wide it is, how far out it spreads out its density.
Some distributions called location scale distributions are controlled by location and scale parameters.
And the shape of the distribution doesn't change it at all, doesn't changes all at all as you move it around,
the location just shifts the same curve back and forth on the x axis.
And the the scale just spreads it out or contracts it on the x axis.
And so if you if you do a simple way of plotting it, the curve is gonna be exactly the same.
The X axis is just going to be shifted and smushed or spread out underneath the curve.
And then a shape parameter controls the shape of the distribution.
So this distribution, which is a normal distribution, doesn't have a shape parameter,
but there's a skewed normal which adds a shape parameter to let you get shapes like this.
And in it, that shape parameter allows you to shift between the normal, the traditional normal and this skewed normal.
And so there's other types of parameters as well. But three common kinds of these location, scale and shape parameters,
locations where scale is how wide and then shape is what it's actually shaped like there.
We then to characterize the distribution,
we have a density or a mass function that use probability density or probability mass function that uses these parameters to assign probability,
mass or density to different values. We then have what's called the support, which is the range of values the distribution is defined over.
Not all distributions are defined over the same range.
And there's often we want to it's useful to think about the underlying random process.
Most distributions are described as the result of a particular random process.
It does not mean those processes are the only things they're useful for describing,
but it's mathematically the distribution arises from analyzing that particular process.
We then can compute some statistics of a distribution.
It's mean the expected value of a very of a random variable with that distribution is one of them.
It's also called the first moment of the distribution. Then we can compute the standard deviation or variance.
For some distributions, for example, normal, though,
it has location and scale parameters that are the mean and standard deviation, the variance is called the second central moment.
We can also compute other things, such as the median or the fiftieth percentile of the distribution in other and quintiles.
So the binomial distribution is we take that Bernoulli and we stretch.
We do it multiple times. The binomial distribution is it has the parameters and which is the number of trials.
The number of Bernoulli trials we're going to run. And Theta is the success probability of each individual trial.
The trials are assumed to be independent and the burn the binomial distribution describes the distribution of how many successes.
We'll see if we run that many trials with that success probability.
It's mass function is. The probability of success times the number of successes.
The probability of failure times the number of failures and then multiplied by end, choose why and choose why.
Is the number of possible ways to arrange why failures out of any successes so.
So that's a probability mass function I show here. Also, the DI a graph of it's you can see where how the mass is distributed for N equals 10.
And Theta equals zero point three.
So you can see three. That's the mode, and that's also the mean and the median.
Three. If we flip awaited coin, that comes up heads 30 percent of the time.
Ten times three will be the most common number of heads.
But there's quite a bit of spread. We're going to see a lot of twos and a lot of fours as well.
And we'll see some runs with zero heads. We're not going to it's.
There is going to be very, very few runs that wind up having 10 heads with a coin with that bias.
The normal or the Gaussian distribution that I just showed you has mean and look, has its scale parameter or excuse me,
its location parameter is the means at scale parameter is the standard deviation.
When the mean is zero and the standard deviation is one, we call this a standard normal.
And the stand I'm showing here, the standard normal with both the probability density function and the the distribution function, the CTF.
So you can see how those relate to each other of align them vertically.
The code to generate all of these plots is in their notebooks. You can see how I did them.
You will see we see here that the cumulative probability goes from zero to one.
Cumulative probability always does that. The density, the bell curve goes out to zero on both sides.
The density is concentrated in a particular window. And the.
And it's a density. So it goes up to whatever that the maximum density is.
If we if we contracted that, if if we decreased the scale, the density would increase, the maximum density would increase.
So to wrap up, random variables are often described where probability distributions, which are in turn governed by parameters.
There's quite a few different standard probability distributions.
I encourage you to spend some time with the notebook and the linked resources to learn more about different distributions.
And we're going to be introducing more as they come up for various things throughout the rest of the class.

Resources#

The Probability Distribution notebook
Wikipedia has a good list of probability distributions

🎥 Sampling and the Data Generation Process#

Video (17m53s)

Slides

CS 533INTRO TO DATA SCIENCE Michael Ekstrand SAMPLING AND THE DATA GENERATION PROCESS Learning Outcomes Understand the relationship between a sample and a population Identify when we can make an inference about the population and when it’s just a statistic of the data Determine whether data is likely to be independent and identically distributed Photo by Dan Meyers on Unsplash Sampling The inferential logic of statistics is based on samples From a distribution Generate random numbers! From a population Select them with a representative sampling strategy Sampling the Population Sample of Penguins Statistic Is the sample representative? Does it teach us about population? Ideal Penguin Representative Samples We need a couple of things for a sample: Representative of the population (w.r.t. parameter of interest) Biases affect this (sampling, selection, response, etc.) Large enough to allow inference of parameter of interest This size does not depend on population size! Better data often better than more data! Historically, much statistics concerned w/ efficiently using samples Uniform Sampling All population members equally likely to be sampled Harder than it sounds in practice Resulting statistical analysis relatively straightforward Small subgroups easy to omit! More Strategies Stratified Sampling Make sure different groups are represented, possibly equally Oversampling Sample more from a minority group for in-group data Correct (resample or reweight) for whole-sample inference Penguins We have: 3 species of penguin Measurements for a sample within each What is the population? Can we answer: Distribution of penguin species? Typical measurements within a species? Two Sampling Strategies sample Spec. 1 Spec. 2 Analysis Spec. 1 Spec. 2 Analysis Sample 1 Sample 2 The Data Generating Process How did we get our data? People and movies exist People find movies and watch them Netflix recommends more movies People maybe watch them (feed back into 2) Reasoning about DGP helps identify sample status, I.I.D. Common desiderata for values, and samples: independent and identically distributed Independent: one value does not affect another Identically distributed: all drawn from the same distribution Equal mean, variance, distribution family Uniform at random from large pop is i.i.d. Small pop: sampling removes items! (unless with replacement) German Tank Problem You capture a tank. It has serial number 2089. How many tanks does the enemy have? This is inference for the max. Wrapping Up Our data comes by some process. Classically, we think about this in terms of sampling – how did we pick these items to analyze? The data generation process is how our data comes into existence. Photo by Museums Victoria on Unsplash

This video I want to talk with you about sampling and the data generation process,
we've talked about probability and that's going to give us the foundations to be able
to start reasoning about the process of sampling and how we get the data that we have.
So our learning outcomes are to understand the relationship between a sample and a population we've mentioned.
I've mentioned that before, but we're going to get into it a little bit more on this video,
identify when we can make an inference about a population and when it's just something that we can say
about the data and determine whether the data is likely to be independent and identically distributed.
I'm going to introduce that concept today. So the inferential logic of statistics is based on sampling and samples.
A lot of many of our statistical techniques are designed based on analysis of
what happens if you apply them repeatedly with certain randomization structure.
That randomization structure comes in our sampling to select which data we're going to be analyzing.
We can sample from a couple of different things. We can sample from a distribution.
So we've introduced distributions. I can sample from a normal distribution and I'll get a bunch of values that will be normally distributed.
That's really useful for doing simulation studies and for probing how our statistical techniques work.
We can also sample from a population and if we want to understand something about a population,
we can sample from it with with a sampling strategy that's going to ensure that we have representative sample.
So if we think about the penguin data set we've been working with a little bit and we're going to see more this week.
If we have the population of penguins and they have some flipper length mew fouls, we've got our penguins, they have a flipper length.
And it's important. This is this is this is the population of penguins that exist.
And really, it's the population of hypothetical penguins that could exist.
But we have this population of penguins. We then take a sample of the penguins.
So we have it, but there's a bunch of Gentoo penguins out there or chinstrap penguins.
We take a sample of penguins and then we compute a statistic from the sample.
And there's one key thing we need to know about the sample is whether it is representative.
And by that, we mean that. Studying the sample is going to teach us about the population.
There's not parts of the population that are systematically excluded from the sample.
There's not parts that are disproportionately more likely to appear in a way that we can't control for, etc.
What we want is that the populace, looking at the sample,
gives us things that look gives us results that look like the population scaled down to our sample size.
Particularly for like, if we're going to infer the mean flipper length,
then we want the sample to have the same distribution of flipper lengths as the population.
If there was something that caused longer flipper penguins to be easier for us to pass in our sample, then it would not be representative.
With respect to studying flipper length, depending on our philosophy of statistics and of science,
we may consider that the penguins themselves are instantiations of a distribution governed by some ideal penguin.
In which case we're trying to understand the properties of Penguin Enis.
Platonic ideal penguin.
Not just the population of penguins that happen to exist on Earth, but for the purposes of the vast majority of what we're doing.
Whether or not the ideal penguin is a thing that's that exists and or is reasonable to talk about won't much matter.
But it sometimes might come up as a conceptual entity. So representative sampling mean a couple of things for a sample.
We need to be representative of the population, particularly with respect to the parameters of interest.
It doesn't necessarily have to be representative in every possible way, so long as it's representative for the things we're inferring.
But if it's not representative in some way, identifiable way, that usually should give us pause about the quality of our sample.
It also used to be large enough to allow us to properly and for the do rigorous inference of the parameters that we're interested in.
One crucial thing to note is that the size of the sample does not depend on the sample size or excuse me, on the population size.
Once we start seeing the tools for quantifying the precision of an estimate based on sampling theory.
The size of the population is not going to be a variable in the precision of our estimate.
This is a common. This is a common misunderstanding of sampling.
You'll see it comes up poor sometimes, particularly around political polling.
Raising the question of, well, how is a poor 15 hundred people represented to get accurate representation of of.
US with the opinion of the U.S. populace. But the accuracy, so long as they're sampling strategy, is good.
The accuracy and the precision of the estimate does not depend on the sample size so long as your sample is big.
So does not depend on the population size, so long as your sample is big enough.
So also, better data is often better than more data.
Historically, a lot of that has because it takes time to get data.
It takes time to run experiments. It's expensive. There's a lot of statistical inference.
It's concerned with how do we make efficient use of modestly sized samples.
This is one of the areas where. Data science diverges a little bit from classical statistics.
I don't want to say that they're different, though. What we. The tools of statistics are the building blocks for it.
And it's statistician's that really are leading the way in figuring out what we actually need to do here.
But we have such vast quantities of data. What how we have to handle it changes because we're no longer dealing with relatively small samples.
Uniform sampling is an easy if you can actually do a uniform sample.
It's a good way to do a representative sample for a lot of purposes. All population members are equally likely actually achieving.
This is not as easy as it sounds, but uniform sampling has a lot of nice properties.
The resulting statistical analysis is relatively straightforward as the analysis go.
But one one downside of uniform sampling, if you have a large population, is that your uniform?
There might be subgroups and there might be subgroups of interest that just never show up in your sample.
And so you need if you want to make sure that those subgroups are reflected,
especially if you're looking at at the subgroups, the preference within subgroups.
So if you read the details, say a say a Pew Research survey or poll.
You'll have they'll have their high order results and they'll break them down by a bunch of subgroups.
You have to make sure that your samples. The samples from each of the subgroups are representative in addition to the whole.
So additional strategies are stratified sampling, which can make sure that different groups are represented.
So you have you might sample a fixed number.
You might make sure that you sample and people from each state to make sure you can.
You have data and relatively comparable data on each state. You may over samples.
You may have your first order sample and then go get more samples from a smaller community to
make sure you have enough data to do robust inferences about their particular preferences.
And either these strategy areas you have to re sample or reweight.
You have to somehow correct. To be able to do inference from the whole sample.
But that certainly doable over sampling does not mean that.
The results, the conclusions from the overall sample are biased towards the over the over sampled groups opinion,
because you're going to correct for that when you do the top level, the top level analysis.
This is another thing I see.
I come up with with Ill-founded complaints on political polling is using the fact that a poll oversampled to discredit its top
level conclusions without investigating whether they corrected for the oversampling when they were computing those conclusions.
But if we go back to our penguins, we have three species of penguins. And then we have a set of penguins from each species.
And a key question we want to think about here is what is the population?
And then there's other questions we want to be able to think about.
Can we answer the questions to answer the questions about the distribution of penguin species?
And can we answer any questions about typical measurements within a species?
So if we think about our penguins, there's two different ways the data can come to us.
The sampling strategies we could have the universe of all possible penguins.
And then we sample some penguins and we see, OK, what species? One, what species to take their measurements.
And then we put it in our analysis.
The other one is we have penguins and the penguins have different species and we sample from each species of penguin.
And then we go and we take their measurements, information and smuggled into our analysis and the.
The second one is probably what was happening with this data, because they have a particular place,
they're going and collecting the penguins, they're collecting them by a particular species.
I mean, this this this island or this beach on the island has a particular penguin that's going gonna be more common.
They're getting a global uniform sample of penguins is difficult.
And so if we think about it in terms of having we have samples from each species,
we have a sample of chinstrap, a sample of Idalia and a sample of Gentoo penguins.
Then we can infer about the.
Within the species distribution site, how is the flipper length of a chinstrap distributed this upper length of a deli?
How do those compare? But assuming that our samples are representative of the species of penguin,
but what we probably can't do with this data is say chinstrap or more adult common than Adelie or vice versa,
depending on which one is more common in the data set. The.
By understanding the sampling strategy. And here to make rigorous conclusions, we need to go spend time with the paper documenting the data.
But by thinking about an understanding the sampling strategy, we can start to get out,
what can we actually use to try to infer about the data and what's just a statistic or excuse me?
What can we use to try to infer about the population and what's just a statistic of the data?
We have like we have data on this many of penguins of each species.
But that doesn't mean that's how many penguins there are of each species.
So we need to think about that, what we call the data generating process or the DGP.
And this has this is around how everything that goes into how the data came to us.
And oftentimes what we're trying to do is infer parameters of the data generating process.
And so if we think about the movie data, people and movies exist, we could think about the process by which they come into existence.
We have to scope our investigation so we can say we assume people in the movies exist.
People find movies and watch them somehow. Netflix recommends more movies.
People maybe watch them. This feeds back into two with the but understanding the data generating process and identifying.
Oh, Netflix is recommender. Is in the loop on Netflix's data.
Lets us understand. Well, actually, when someone recommend read when someone decides to watch a movie.
That choice is not entirely what we call exoticness from outside the system.
It's affected by the system because it's in response to recommendations,
because we start to think about what is the debt data generating process that gives us a bunch of data about what movies people watch.
Reasoning about the data's generating process allows us to understand our sample strengths and weaknesses and capabilities and to.
And to find opportunities to do our inference and check where it may have gone wrong.
So I now want to introduce briefly this concept of independent and identically distributed.
A lot of. A lot of. A lot of statistical techniques required data to be either the data itself or the data after
some processing or the errors in the data to be independently and identically distributed.
And what this means is that the values do not affect one another in any way, and they're all drawn from the same distribution.
Particularly, we don't have changes in the mean or the variance as the distribution, as we say, move through time.
If we sample uniformly at random from a large population,
it's we can generally treat it as I.D. Each at each sample is independent of each other as well,
unless there is some some mechanism that causes them particularly to be linked.
If we have a small population and we're sampling a large fraction of it, then they're not necessarily independent because if you sample one,
you're not going to sample that again unless you're sampling with replacement as opposed to if you have a very large population,
you sample your you're not reducing the pool very much at all.
One last thing I want to mention, I'm not going to get into the details of it as an illustration for the power of thinking about
sampling and what it can let us do in our inference is a problem known as the German tank problem.
And this problem arose in the Second World War when the allied military forces were trying to estimate axis production capacity.
And so you capture a tank. And you find a serial number on it, and it's serial number would say two zero eight nine.
You want to know how many tanks have been produced? You want to know how many tanks have been produced?
And so. It turns out if the serial numbers are being assigned sequentially, which they were.
And you have observations of a bunch of, say, tanks,
you can applying sampling theory and the statistics that come about reasoning about how we get samples.
You can infer how many tanks probably exist.
Effectively, what you're doing is you're inferring the max. Oftentimes we infer for the mean, you can infer from many different statistics.
They figured out how to infer from the max based on a number of observations.
And the max of those observations to figure out how many tanks were likely produced after the war,
comparing the statistical analysis to the actual production records recovered from Nazi factories.
The statistical estimates were far closer to the actual production records than traditional intelligence
estimates that were using spies and surveillance to try to figure out how many tanks were being produced.
But the reasoning about the sampling process and pushing through the math of sampling allows us to
then be able to make probabilistic statements that can infer some relatively remarkable things,
such as the number of tanks produced based on the serial numbers that we've seen so far.
There are then countermeasures for that, such as randomizing your serial number so they don't show up in order and you have a much larger pool
of serial numbers and then you can't just use serial numbers to infer how many have been built.
So to wrap up, our data comes to us by some process. And classically, we think about this in terms of sampling.
How did we pick these items to analyze and then what would happen if we did that sample multiple times?
A lot of the properties of our statistical techniques are defined in terms of
what would happen if we ran the sample and computed the statistic many times.
And the data generating process is the mechanism by which our data comes into existence.
And we may model it in some way in order to be able to try to do inference about the human or physical processes that our data reflect.

Resources#

Sampling notebook

🎥 Confidence#

This video introduces confidence intervals.

Video (9m44s)

Slides

This video, we're going to start to getting to actually being able to augment our statistic with come
up with a measure of the confidence we have with regards to how it relates to the mean.
These are going to be extremely subtle to interpret correctly,
but they do get us towards being able to say things more robustly about the relationship of our estimate,
the statistic and the parameter that we're trying to estimate.
So our goals with for this lecture are for you to be able computer confidence interval for the mean,
using the standard error and correctly interpret a confidence interval. So let's return to our sample of the populations.
We've got our penguins and and we've got we're sampling our penguins and we're computing a statistic.
One question we can ask is, how does the statistic relate to the underlying parameter?
So this is important to be able to do inferences that can compute the mean flipper length of chinstrap penguins.
But how does that relate to the main flipper length of chinstrap penguins?
Well, the chinstrap penguins in the world world,
so classical frequenters statistics does this by thinking about what would happen if we repeated the experiment multiple times.
So let's say you want to measure penguins and I'm a penguin measuring robot.
And you tell me to go man, to go collect a measure that the flipper thinks and 50 penguins.
And you tell me to do that a hundred a hundred times each time taking a different random sample of penguins.
This allows us to the.
The mean flipper lengths I come back to you with are going to have a distribution.
So we have the flipper lengths, have a distribution. We go and we've you go have your your penguin measuring robot.
Go sample 50 penguins could measure the flipper lengths and give you the mean hundred times.
That also was going to have a distribution and we call this the sampling distribution of the statistic,
because it's what happens when we compute the sample, the statistic of our sample.
And we do it repeatedly.
And so this chart shows us the blue line, which you kind of can't see very well, because it's right up with the orange line is getting at the.
There you can see they're slightly different. The blue line is the true population.
So here are my samples are being drawn from a random number generator that I've configured.
So I know the population distribution. Precisely. We have the sample.
The sample is the orange histograms and there's an orange density estimate on top of them, so.
That's the sample itself. That's the distribution of values in my sample. And this is this is with 50 a sample size of 50, this green distribution.
That is the sampling distribution of the sample mean.
And so if I were to take one hundred of these orange samples or a thousand of these orange samples.
The green curve is the distribution, the means of those samples.
What have. So, yeah, as I said, we take a sample, we compute the experiment,
and we think about what the distribution is of the statistic from compute, from repeatedly doing this experiment.
So the sampling distribution of the sample mean. So we take a sample of size N and we compute the sample mean X bar.
This the sampling distribution is normal and it has a mean of the population mean.
And a standard deviation. Of the population, standard deviation divided by the square root of the sample size.
So. And also, for this to be true, X does not need to be normal.
And this is starting to get at what I mentioned in the previous video about the
precision and the accuracy of our estimate does not depend on the population size.
So. If the population has a standard deviation of one and we take sample sizes of size one hundred.
Then there's the means from those samples is going are going to be normally distributed with standard deviation.
One tenth, regardless of how big the population is.
The only variable to compute how far our our sample mean is likely to be the only things we need to compute.
How far our sample mean is likely to be from the true mean.
Are this this standard deviation of the population and the size of the sample?
We don't actually need the size of the population.
So but we don't have the population mean and the population standard deviation to be able to use that distribution directly.
But what we do compute is a confidence interval. And so while the sampling distribution lets us say things about the distribution
of the sampling mean and what happens if I compute a thousand of those means?
The confidence interval is also something that we understand in terms of what were to happen
wouldn't if we did it in infinitely many or sufficiently close to infinitely many times.
And so you compute a confidence dance interval with this statistic.
And this piece here asks over the square root of N is the standard error.
So. And our distribution here,
we have the standard deviation over the square root of N and that is that's really the error that what that's doing is it's bounding the error.
We're going to have and trying to estimate the mean or it's not bounding.
It's characterizing the error. We're going to how we expect to have when we're using the sample, mean to estimate the true mean.
So we call an estimate of that statistic using the sample standard deviation, we call this the standard error.
And the standard error is approximately the sample standard deviation of the sampling distribution,
one point nine six comes from the definition of it comes from the normal distribution in a standard normal.
Ninety five percent of the probability Nasse is between plus and minus one point nine six.
But if we computer confidence interval in this way. So we. So if we say X bar, the mean plus or minus one point nine, six times the standard error.
That gives us an interval upper and lower bounds. So for our chinstrap penguins, we have a mean of one in nine five point eight.
Two, we have a standard deviation sample size. And we can computer standard error so that we can say that that was a standard error.
And the confidence interval,
so we can say our chinstrap penguin confidence interval is one point nine five point eight two plus or minus one point six nine.
And what the confidence interval means is it's the result of a procedure that we can perform.
And remember our little penguin measuring robot?
Well, if we had the penguin measuring robot returned the confidence interval instead of the the mean itself.
And we have it go measure confident, we have it go measure penguins, sample penguins and measure them a thousand times.
Nine hundred and fifty of the confidence intervals that it gives us approximately.
This is all probabilistic, so it might be a little more, a little less,
but approximately ninety five percent of the confidence intervals that it gives us contain the true mean flipper length for a chinstrap penguin.
That's what the confidence interval means. And so the width of this interval is an estimate of our precision.
It's really important to understand a couple of things. It's really important to understand a couple of things.
One, the it is not we are not 95 percent confident the mean Lise's in the interval.
Then it is not that the mean lies within the interval with probability point nine five.
That would be another thing related thing called a baozi incredible interval.
It also isn't precisely a statement about X Bar itself. What the confidence interval means is.
It's an interval generated from a using a procedure that ninety five percent of the time will include the mean with the true mean within the interval.
So a brief aside, I mentioned the population standard deviation,
which we usually don't know because we don't have the population and the sample standard deviation,
and there's actually a slight difference in how we compute them in the samples that we're computing, the standard deviation from a sample.
We divide by N minus one instead of N.
And this is the key reason for this is that the samples to the standard sample standard deviation has what we call N minus one degrees of freedom.
Because we've already computed X bar. This is an intermediate statistic.
That we've already computed.
Given X bar and X one through N minus one, you can solve for X and only and minus one of the of the sample values are allowed to vary.
And still have the same X bar. And so because we have this is called degrees of freedom, because we have N minus one degrees of freedom.
We. We can.
We have to divide back and minus one. No, they think of it since X Bar is computed from the X's.
The Xs are not independent of X Bar in the population.
The the the the instances are independent of each other, given the mean.
But in this in the sample they aren't because we could given we can compute the last one.
If we had the mean and we have the first ten minus one. So we can start to compare values based on confidence intervals.
So I computed here the confidence intervals for the flipper lengths for our three types of penguins.
And there's no overlap in the confidence intervals. This is evidence that the flipper that the penguins have different flipper lengths.
If they had the same if they tended to have the same flipper length, then of the confidence interval would probably overlap.
We're gonna see later methods for directly comparing two means.
But we can start by using the confidence interval to see. OK.
Do they overlap? And the confidence interval also allows us to estimate if it's different.
If it if it's far away from a value. If we've.
If. The confidence interval, the low or high end of it, is pretty far from zero.
Then a reasonably far from zero, then zeroes. Probably not.
I mean. So but as I said, we have to be really, really careful about how we interpret confidence intervals.
We take us. The procedure is take a sample of size n compute the statistic infinitely many times.
Ninety five percent. In this case, the statistic is the upper and lower bounds of the confidence interval.
Ninety five percent of the time, the Trumaine will be in this interval.
Ninety five percent of the time this procedure will return an interval that contains the true mean.
We could also have other confidence interval, such as a ninety nine percent one.
But as I said, the confidence interval is not where we're 95 percent sure that parameter is.
We also have a couple of outstanding issues with the confidence interval.
So to wrap up, taking a sample and computing a statistic is a random process that results in a sampling distribution.
And this is the sampling distribution is the probability distribution from the process.
Take a sample and computer statistic. We can use this to try to start estimating the precision of our of our estimates.
So we have the X bar, which is Selke. This is my estimate for the main.
Mina, the penguins I saw were gonna use that as the estimate for the mean of penguins.
But we can use knowledge of the sampling process to develop techniques that let us not only estimate but estimate how far off our estimate is.

📃 Confidence in Confidence#

Read Having confidence in confidence intervals by Ellie Murray.

🎥 The Bootstrap#

Video (7m26s)

Slides

This video, we're going to talk about the bootstrap,
a technique for understanding properties of sampling distributions and we can't take a bunch of samples.
Our goals for this video are to be able to approximate the sampling distribution,
using the bootstrap and compute a bootstrapped confidence interval for a statistic.
So let's go back to our penguins. We've got a population of penguins. We sample of penguins.
The sampling that we compute a statistic over the sample. We repeat this a bunch of times.
And doing so gives us the sampling distribution of the statistic.
So the sampling distribution for the means really well understood, normal with particular parameters.
Estimating this depends on the accuracy of of the standard of the sample standard deviation, usually pretty good.
It's a parametric mach estimate. So it's estimating in terms of distribution with parameters.
Other statistics have other distributions. And we may not know cleanly what all of them are or they might be quite complex.
So or we may violate the assumptions of a statistical method.
So using the the the standard 95 percent confidence interval for the mean is not to going to get you too far off.
But what if you want to compute a confidence interval for a median or you want
to compute the confidence interval of a new statistic that you developed?
So you can do a lot of difficult probability, theoretic calculus.
But what you're trying to get at is what's gonna happen if we take many samples and compute the sampling distribution.
But if we do that, it's very expensive. Going and measuring 50 penguins and computing their mean flipper length a thousand times costs a lot of money.
So. We can cheat, kind of.
We can sample from our sample. And re sampling from the sample allows us to approximate the sampling distribution.
So if we have a sample, we can construct a new sample by sampling from the original with the key thing here is with replacement.
So just because we picked X to our new sample, X one is X two from the old sample, doesn't mean we can't reuse X two.
The idea is that if the sample is drawn evenly and representatively from the distribution, the population distribution,
then the relative frequencies of different items in the sample reflects the relative frequencies of of approximately those values in the population.
And so if we sample from this, if we. Sample each data point from the sample.
The whole sample, then that's comparable to sampling it from the whole distribution.
So we do that multiple times to get an entire new sample, we want to have the same length as the old sample.
We compute from the statistics, from our new sample. Then we do this a bunch of times, a thousand times, 10000 times.
And the distribution of this statistic from doing the boot.
This and this technique is called the bootstrap. The distribution of the statistic from the bootstrap approximates the sampling distribution.
It's not perfect because there's stuff in the population that might not have made it into the sample,
but it's going to approximate the sampling distribution well enough to use it to start to estimate confidence and other properties of sampling.
The statistic. So to bootstrap a confidence interval, one way we can do it.
This is called the percentile method. Is to compute the mean of a bunch of samples.
So the mean we're computing a mean and this I'm doing all this with no high.
So if X is our sample. This can be a panda series.
The choice, the choice random number generator method is going to draw a sample of size N.
And the num pi choice function by default samples with replacements.
We're going to draw a sample of size N where N is the same length as as the Xs so and equals one of X.
S. We're going to do this ten thousand times, this is Arby.
And then we can compute. The two point five percentile and the ninety seven point five percentile to get
a window of where is 95 percent of the probability mass in this distribution.
And so we do this with our penguins and we get with one of our species of penguins and we get a confidence interval.
When you ask seabourne cap plot to plot the mean of the values grouped by category.
The error bars it gives you or the confidence bars it gives you are done using a 95 percent bootstrapped confidence interval like this.
It's the bootstrap itself is very simple. Its power is that we can replace mean with basically any statistic.
And we use Rie's sampling from the sample in order to estimate the sampling distribution.
Also, this syntax I'm using here. This square brackets is how we make a list.
Remember this syntax where we have an expression.
And then we have for I in range or I n anything that's iterable is what's called a list comprehension.
It's a very convenient way to build up a list from a loop. This is a place where we do use a loop.
It might be possible to vectorized the bootstrap. Doing so is difficult.
And also, so long as we vectorized each individual bootstrap vectorized in the bootstrap process
itself isn't as important because the bulk of the computation is within the iterations.
So we're gonna go ahead, do a for loop over our bootstrap intervals and then the bootstrapped the actual.
Bootstrapping itself. What we do within each bootstrap iteration.
That's thoroughly vectorized. So this gives. But doing this process, this boot means this is a distribution of the boot means.
And it shows that we've got the sample mean and it shows it matches up with those Quantico's that we just saw in the previous slide.
So we can estimate the sampling distribution for any statistic and we can estimate arbitrary properties of the sampling distribution,
which notes Quantrill's wants to know its variance. Lots of different things we can do with the sampling distribution by doing the bootstrap.
It's not a perfect method,
but it's a remarkably powerful method that allows us to do quite a few different things to understand the sampling behavior of our statistics.
To wrap up the sampling distribution requires taking multiple sample to the sampling.
Distribution is about what happens. We take a lot of samples from the population.
We can simulate this by re sampling the sample itself using a technique called the bootstrap.

🚩 Week 4 Quiz#

The Week 4 quiz is on Canvas, and is due at 12pm (noon) on Monday, Sep. 20.

📓 Penguin Inference#

The Penguin Inference notebook shows confidence intervals and hypothesis tests on the penguin data.

📚 Further Reading#

If you want to dive more deeply into probability theory, Michael Betancourt’s case studies are rather mathematically dense but quite good:

Probability Theory (For Scientists and Engineers)
Conditional Probability
Product Placement (probability over product spaces)

For a book:

Probability for Data Science
Introduction to Probability by Grinstead and Snell
An Introduction to Probability and Simulation - a hands-on online book using Python simulations

📚 Extra Reading (Philosophy)#

Moving to a World Beyond “p < 0.05”, by Wasserstein, Schirm, and Lazar.
Abandon Statistical Significance, by McShane, Gal, Gelman, Robert, and Tackett. While the title is provocative, this article is not advocating against computing statistical significance measures. It advocates using them as one piece of evidence among many, instead of as an end-of-the-story bright-line rule for establishing discovery.
Interpretations of Probability. I primarily operate from somewhere in the subjective school, with a strong dose of instrumentalism.

📩 Assignment 2#

Assignment 2 is due on 9/25.

CS 533 Fall 2022

Week 4 — Inference (9/12–16)

Contents

Week 4 — Inference (9/12–16)#

🧐 Content Overview#

📅 Deadlines#

🎥 Introduction#

🎥 Probability#

🎥 Joint and Conditional Probability#

🎥 Continuous Probability#

📃 Notes on Probability#

🎥 📓 Distributions#

Resources#

🎥 Sampling and the Data Generation Process#

Resources#

🎥 Confidence#

📃 Confidence in Confidence#

🎥 The Bootstrap#

🚩 Week 4 Quiz#

📓 Penguin Inference#

📚 Further Reading#

📚 Extra Reading (Philosophy)#

📩 Assignment 2#