In this video, I introduce single-variable regression.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
SINGLE REGRESSION
Learning Outcomes
Model a linear relationship between two variables
Estimate the parameters of this model
Identify the assumptions needed for the model’s inferential validity
Photo by Alessio Lin on Unsplash
Variables and Model
This is univariate (one-variable) regression
Regression Model
Regression Model
Note: I have extended this plot to have 0 at the left end of the x-axis to highlight the intercept. The intercept is where the line crosses x = 0, not where it crosses the left y-axis.
DGP
Movie: *exists*
Critics: this is ok, I guess
Audience: 👏
Critic ratings do not cause audience ratings
Relationship does not need to imply causality
Fitting Lines
We can (almost) always fit a line
Resulting line is least squares
Can be used to predict
Inference makes assumptions:
Linear relationship
Independent observations
Normal residuals
Equal variance of residuals
Result: residuals are i.i.d. normal
Penguins
Explains 76% of variance
1mm flipper ⇔ 49.7g mass
statsmodels warns us about the condition number; not actually a problem here, but we can still make it go away.
Penguins
Wrapping Up
Linear regression predicts one variable with another using a linear relationship.
Inference makes several key assumptions.
Standardizing variables puts coefficients in units of standard deviations.
Photo by Gemma Evans on Unsplash
- In this video, I'm going to talk more about how single regression actually works.
- There's a notebook that goes with this sequence of videos; it includes the regressions and gives you the code, so you can see how to run them and how to set up the data to get them to work.
- So our goal here is to be able to model a linear relationship between two variables, estimate the parameters of this model, and identify the assumptions that are needed for the model to be what we call inferentially valid.
- So remember, we've got our dependent variable and our independent variables; we're trying to predict the outcome with the predictors.
- This is a univariate regression: we're trying to predict the audience rating with the top critic rating. This won't be the only example I show you today.
- So when we do this with a regression model, we use the statsmodels formula interface, which lets me write a little formula that says I want to predict the audience rating.
- That's the outcome; then we've got the tilde as a separator, and then the predictors or features I'm using to predict that outcome.
- So that's how the code is set up: the formula means predict the stuff on the left-hand side of the tilde with the stuff on the right.
- Right now, there's just one variable, because as I said, this is univariate or single-variable regression.
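As a concrete sketch, a formula-interface regression looks like this; the file name and the column names AudienceRating and TopCriticRating are assumptions, so substitute whatever your movie data frame actually uses:

```python
import pandas as pd
import statsmodels.formula.api as smf

movies = pd.read_csv('movies.csv')  # hypothetical file

# left of ~ is the outcome; right of ~ is the predictor(s)
fit = smf.ols('AudienceRating ~ TopCriticRating', data=movies).fit()
print(fit.summary())
```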
- And this gives us an intercept: β₀ is the intercept. It also gives us a coefficient, β₁.
- For each of these coefficients, we get the estimate itself; here, that's 0.1838.
- We can think of the intercept as the coefficient for a variable that's always one, and internally, that's what statsmodels does: it augments your data with one more variable that is always one, and the intercept is the coefficient for that variable.
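A minimal sketch of what that augmentation looks like if you use the array interface instead of a formula; the column names are still assumptions:

```python
import statsmodels.api as sm

# add_constant prepends a column of ones named 'const';
# the intercept is simply the coefficient on that column
X = sm.add_constant(movies[['TopCriticRating']])
fit2 = sm.OLS(movies['AudienceRating'], X).fit()
print(fit2.params)  # 'const' matches the formula version's intercept
```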
- We also have a standard error of the coefficient and a confidence interval of the coefficient.
- It's just like the confidence intervals that we've had before: an estimate of the precision of this coefficient.
- There's also a p-value, which tests the null hypothesis that the coefficient is zero.
- We have a p-value for the overall model, and an R-squared, which, as I said previously, is the fraction of the variance that's explained.
- Audience ratings have some variance; 39.4 percent of that variance can be explained with the top critic rating.
- So if we take away the effect of the top critic rating, the variance of the residuals will be about 60 percent of the variance of the original data, because we've explained about 40 percent of the variance.
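A quick sketch of that relationship in code, assuming fit is the fitted model from above (the same n cancels in the ratio, so NumPy's default variance matches the R-squared arithmetic):

```python
import numpy as np

total_var = np.var(movies['AudienceRating'])  # variance of the raw outcome
resid_var = np.var(fit.resid)                 # variance left after subtracting the line
print(1 - resid_var / total_var)              # equals fit.rsquared
```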
- So we can draw our line; here, I've spread out the x-axis so we can see it across the whole frame.
- Our intercept is right here, where the line crosses zero, and that's 2.28; and then we have a slope of 0.18.
- So if we look at where the line crosses 10 and where it crosses the intercept, there should be a difference of about 1.8, because it's going to be 10 times 0.18.
- So that's the structure of this line: we have a slope and we have an intercept.
- And then the variance of the data itself is based on this whole vertical height; that's the variance of the data.
- But if you were to tilt your head and look at the data around this line, that's a smaller variance.
- That's what we're saying when we talk about explained variance: the R-squared, the explained variance, is the difference between the overall variance of the data and the variance after we have accounted for the effect that we're modeling.
- You can see that the variance is going to decrease if you look at it centered around this fitted line instead of just centered vertically.
- So we get the model: as I said, 2.28 plus 0.18 times the x-axis value.
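Written as an equation, the fitted line from this example is

$$\hat{y} = 2.28 + 0.18\,x,$$

so at $x = 10$ the prediction is $2.28 + 0.18 \times 10 = 4.08$, which is $1.8$ above the intercept.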
- So the data generating process we're looking at here is: the movie exists, the critic gives it a rating, and the audience gives it a rating.
- This is what the DAG, the directed acyclic graph, should look like; we don't deal with cyclic causality.
- We've got the movie, which produces critic ratings and produces audience ratings.
- There might be some slight causal pathway, like if people go watch more of the movies that are rated highly, but what we're measuring is the correlation between critic rating and audience rating.
- This is the underlying DGP. We're not saying critic ratings cause audience ratings; we don't have to have causality for it to be a useful predictor, or for there to be a useful and meaningful relationship. (Taking out other effects is complicated and subtle.)
- So, to talk just a little bit about fitting these lines: we can almost always fit a line. There are a couple of degenerate edge cases where it doesn't work, but given two variables, you can fit a line through them.
- The resulting line is the least squares line: the best linear predictor when we measure the resulting error with least squares.
- And we can use it to make predictions. But if we want to do inference, if we want to use the model to tell us that an increase of critic ratings by one star increases audience ratings by 0.18 stars, and that's a reliable effect, then we need to understand the inferential assumptions of the linear regression model.
- It assumes that there is a linear relationship. It assumes that our observations are independent. It assumes that the residuals are normally distributed and that they have equal variance.
- Effectively, what these last three say is that after we account for our linear effect, the residuals should be independent and identically distributed from a normal distribution; some quick visual checks of that are sketched below.
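Here is a minimal sketch of those visual checks, assuming fit is a fitted OLS model as above; these are informal looks, not formal tests:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Q-Q plot: points near the line suggest roughly normal residuals
sm.qqplot(fit.resid, line='45', fit=True)
plt.show()

# residuals vs. fitted values: a level, even band suggests equal variance
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color='gray')
plt.show()
```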
- We'll have formal tests for pieces of that later. So let's look now, though, at the penguin data.
- I'm going to try to predict a penguin's body mass using the length of its flipper.
- So we've got our penguin: it has some feet, a flipper, a head, and a little beak. I'm not very good at drawing, but there is a penguin.
- We're trying to use its flipper length to predict its mass. We can do that here, and we're explaining 76 percent of the variance: R-squared is 0.759.
- And we have a coefficient of about 49.7.
- What that means is: if a penguin has a one-millimeter-longer flipper, then it probably has almost 50 grams more mass.
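A minimal sketch of that regression, assuming the data has the columns body_mass_g and flipper_length_mm (the usual Palmer Penguins names, but check your own copy):

```python
import pandas as pd
import statsmodels.formula.api as smf

penguins = pd.read_csv('penguins.csv').dropna()  # hypothetical file

pg_fit = smf.ols('body_mass_g ~ flipper_length_mm', data=penguins).fit()
print(pg_fit.rsquared)  # about 0.76
print(pg_fit.params)    # slope about 49.7 grams per millimeter
```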
- When we run this code, statsmodels also warns us about a large condition number.
- The specific problem it's talking about, multicollinearity, is not a problem here, because we only have one variable. We can still make the warning go away, though.
- Right now, we're regressing our raw values. We say we regress body mass against flipper length; that's just the way we talk about it.
- We're regressing body mass in grams, its original units, against flipper length in millimeters, and the resulting coefficient is interpretable in the original units.
- So the resulting coefficient is grams per millimeter. But we can also rerun the model with the values normalized with what we call z-scores.
- A z-score is the value minus the mean, divided by the standard deviation: zᵢ = (xᵢ − x̄) / sₓ. These are called z-normalized, or standardized, variables.
- The result is that they have a mean of zero and a standard deviation of one: z̄ = 0 and s_z = 1.
- And now the coefficients are in standard deviations: we have a coefficient of 0.87.
- What that means is that an increase in flipper length of one standard deviation results in an increase in body mass of 0.87 standard deviations.
- Depending on your particular inferential needs, regressing standardized variables can be more interpretable, because you're talking in terms of standard deviations rather than in terms of raw units.
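A sketch of the standardized version, reusing the hypothetical penguins frame from above; standardizing also shrinks the condition number, which makes the warning go away:

```python
import pandas as pd
import statsmodels.formula.api as smf

def zscore(xs):
    # (value - mean) / standard deviation
    return (xs - xs.mean()) / xs.std()

std_pg = pd.DataFrame({
    'mass_z': zscore(penguins['body_mass_g']),
    'flipper_z': zscore(penguins['flipper_length_mm']),
})
z_fit = smf.ols('mass_z ~ flipper_z', data=std_pg).fit()
print(z_fit.params)  # slope about 0.87 SDs of mass per SD of flipper length
```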
- So, to wrap up: linear regression predicts one variable with another using what we call a linear relationship, a sum of scalar multiplications.
- Inference using this model makes several key assumptions, which we'll be talking more about in a later video.
- And we can standardize variables, which results in a model where the coefficients are in units of standard deviations rather than in units of the underlying raw measurements.
Slide Clarification
On slide 6, where I show the slope, intercept, and variance in a model, I have extended the plot to include 0 at the left end of the x-axis.
This is to highlight the meaning of the intercept. It is important to note that the intercept is where the line crosses zero, not where it crosses the left Y-axis.
Also, when discussing this slide, I am imprecise and make it sound like the unexplained variance is the remainder after projecting the data onto the line.
It is the variance remaining after subtracting the line.
A video in Week 9 provides more clarity on this relationship.