# Week 8 — Regression (10/10–14)

In this week, we are learning about linear regression with StatsModels.
All the examples will use the StatsModels OLS (ordinary least squares) model, generally with the
formula interface.

## 🧐 Content Overview

This week has **1h23m** of video and **2914 words** of assigned readings. This week’s videos are available in a Panopto folder.

## 🎥 Introducing Regression

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
INTRODUCING REGRESSION
Learning Outcomes (Week)
Compute single- and multi-variable linear regressions
Test for violations of linear model assumptions
Measure the predictive accuracy of a regression model
Photo by John Rodenn Castillo on Unsplash
Variables and Model
What Is a Model?
A mathematical representation
Of the essential dynamics
Of a system of interest
Purposes of Modeling
Understand predictor/outcome relationships
Correlational, not causal (at least alone)
Predict future outcomes
Estimate hypothetical outcomes
Correct for effects for further analysis
Regression predicts (or estimates) a continuous variable
Linearity
Wrapping Up
Models summarize data generating processes and let us predict future outcomes.
Linear models are a particularly simple but flexible class of models.
Photo by Drew Graham on Unsplash

- Hello. This week, we're going to talk more about regression. We introduced it when we started talking about two variables.
- We'll be studying a lot more about how to use it and how to evaluate it in this week's material.
- Our learning outcomes are for you to be able to compute single- and multi-variable regressions, to
- test for violations of their assumptions, and to assess the predictive accuracy of a regression model.
- So if you'll remember, we have a dependent variable y called the outcome.
- In this case, we're going to try to predict the Rotten Tomatoes audience rating with the critics rating,
- the regression version of what you did in Assignment 2. We have our independent variables, X, the predictors.
- In this case, that is the all-critics rating, and our goal is to compute an estimate, which we call Y hat.
- That'll be our standard notation for an estimate.
- We try to estimate the audience rating using the critics rating.
- And so we can learn a linear model, a line where we have an intercept. We have a slope.
- And then we have some residuals. The goal is that we learn these parameters in order to minimize the least squares.
- We're going to be talking more in subsequent videos about what precisely that means.
- But this is the setup. We have this outcome and we want to try to predict it.
- This is a very common setup in data science tasks. A lot of times what we're going to try to do is predict some outcome.
- This is also a really common task in machine learning.
- If you take the machine learning class in the spring, you'll be doing a lot of this.
- So let's talk about this. We're going to be using the term model quite a bit.
- What does that mean? What is a model, for our purposes? There are lots of different meanings of the word model.
- It's a mathematical representation of the essential dynamics of a system of interest.
- It's a simplification, but it captures the piece of the system that we're interested in measuring.
- For example, if critics are doing their job of assessing the quality of a movie and in fact recommending effectively to audiences,
- we would expect critic ratings to match, or to map in some respect to, audience ratings.
- And so we can look at that relationship with a model. It's a gross simplification,
- but we can use it to model the pathways by which critic ratings and audience ratings might be related.
- So there are several different purposes of modeling. One is to understand the relationships between our predictor and our outcome variables.
- By themselves, these models are correlational. They are not causal. You have to take great care when ascribing causality to what we learn.
- But another purpose is to be able to predict future outcomes.
- And this doesn't require the model to be causal or even to be what we call inferentially valid.
- It just means there has to be a strong enough and reliable enough relationship that you can use it for the prediction.
- You can also use it to estimate hypothetical other outcomes. You have to be careful with this.
- But if you've got a model, you can use it to estimate, well, what would happen if the data were like this, but a little bit different.
- And then you can also use a model to estimate the effect of something, and then correct for that effect in later analysis.
- Regression is a type of modeling that predicts or estimates a continuous variable.
- We have a continuous outcome variable, and we have one or more
- explanatory variables, or features, that we're going to use to predict that continuous outcome.
- We're particularly focused here on linear models.
- It's linear regression because the model is a linear equation, which is an equation where
- the whole thing is a sum of individual scalar multiplications: x sub k times a constant,
- x sub one times a constant, and beta zero times one.
- So a linear model is a linear equation: the sum of these multiples.
- Linear models are incredibly capable. Useful for a wide range of things.
- And also, we can transform many nonlinear problems into a linear problem.
- Either transform our features and now the problem's linear, or transform our outcomes and now the problem's linear; there's a variety of techniques for doing that.
- So for a large range of problems, linear models are a very good first thing to try.
- And they're also a very good baseline comparison for more sophisticated models.
- So we're going to be learning this week about linear regression models; then in later weeks, we're going to see linear classification models.
- So to wrap up: models summarize data generating processes and let us predict future outcomes that we might not have observed yet.
- And linear models are a particularly simple but flexible class of models that we're going to be spending time with over the next weeks.
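
As a minimal sketch of the "sum of scalar multiplications" idea from this video: the coefficients and data below are made up for illustration (they are not from the movie data), and the intercept is treated as the coefficient of a constant column of ones.

```python
import numpy as np

# Hypothetical coefficients: beta[0] is the intercept, the rest are slopes.
beta = np.array([2.0, 0.5, -1.0])

# Three made-up observations with two features each.
X = np.array([
    [1.0, 2.0],
    [3.0, 0.0],
    [0.0, 4.0],
])

# Prepend a column of ones so the intercept becomes "beta zero times one".
X1 = np.column_stack([np.ones(len(X)), X])

# Each prediction is a sum of scalar multiplications:
# y_hat = beta_0 * 1 + beta_1 * x_1 + beta_2 * x_2
y_hat = X1 @ beta
print(y_hat)
```

Writing the model this way makes it easy to see why transforming a feature (for example, taking its logarithm before putting it in `X`) still leaves a linear model: the equation stays a weighted sum of columns.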

## 🎥 Statistical Modeling

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
MODELING
Learning Outcomes
Understand the relationship of models to reality
Photo by Andrew Neel on Unsplash
What Is a Model?
A mathematical representation
Of the essential dynamics
Of a system of interest
Data Pipeline (diagram): Things → Phenomena / Experiment → “Raw” Data → Data Set → Inferences → Answers
Data Generating Process (diagram): Things → Phenomena / Experiment → “Raw” Data
Statistical Modeling
Simplify the DGP to predict an outcome
Correlational models look at relationships between observables
No causal claim unless experiment design enables it
Causal models model DGP more deeply to estimate causality
Subtle and quick to anger
Simplifications
Models necessarily simplify the problem
They hopefully capture enough to infer or predict usefully.
Unpredicted value is residual or error.
Movie Genres (diagram): Movie → Genre(s); Movie → Reviews; Genre(s) → Reviews (?)
Parameters
Model parameters are numeric values as part of the model.
We estimate them from the data.
Mean critic rating for each genre (how to handle multiple membership?)
If the model captures the DGP, these are also underlying population parameters.
Example (adapted from Wikipedia)
Wrapping Up
Models simplify a process so we can understand or predict features of interest.
They are simplifications of the data generating process, to learn about it.
Photo by Scarbor Siu on Unsplash

- Hello. The previous video introduced the idea of modeling, and in this video, we're going to talk more about it.
- The learning outcome is to understand the relationship of models to reality.
- And here we're not talking about walking down runways wearing various fabric-created things that make ordinary people ask,
- "How much do they have to pay you to put that on and pretend it's clothes?", or about building model ships and airplanes and various other things.
- We're going be talking primarily about statistical models.
- So as I said in the previous video, a model is a mathematical representation of the essential dynamics of a system of interest.
- If we think about our data pipeline and our data generating process, we have some things that are out there happening in the world.
- These give rise to either some observable phenomenon or an experiment that we can run that will elucidate
- these observable phenomena. We then collect data from these, and that is going to be our raw data:
- our observations, either from the experiment or collected from
- the phenomena, which we then process into a data set that's useful for our task.
- We make inferences, and then finally we hopefully get some answers. We can think about this top piece here as our data generating process,
- the DGP. How did our raw data come into existence?
- Things happened that were observable phenomena. We collected the data. How did that happen?
- So in statistical modeling, we simplify the data generating process to be able to predict an outcome.
- There are, broadly speaking, two kinds of models that we can think about.
- Correlational models look at relationships between observables. So in our movie data,
- if we want to keep using the movie data as an example, you've got the critic
- and you've got the audience.
- We can look at the correlation between what the critic thinks and what the audience thinks, or between what the critic thinks and
- box office revenue, or
- how many people are rating the movie. But it doesn't make a causal claim.
- We don't have this causal claim that the critic rating causes the audience rating.
- It's just that the critic rating and the audience rating are related. There may be a little causation, but we're not making a causal claim here.
- A causal model models the DGP more deeply to estimate causality.
- You have to be really careful with these. There's an entire field called causal inference that looks at
- how you rigorously do this when you can't run an experiment. The best way is to run an experiment, if you can.
- You can often get causality through what's called a randomized controlled trial.
- That is, you set up an experiment with a group that receives the treatment
- you want to test and a group that doesn't, and the groups are in all other ways equal.
- Randomization is a good way to achieve that. Then you can conclude that your intervention or your treatment caused the outcome.
- But randomized controlled trials are expensive, and oftentimes
- they're not even feasible to run, due to the nature of the phenomenon or
- due to its non-replicability,
- which is a part of its nature. It's difficult to do randomized controlled trials of, say, national or even state-level health policies.
- You can do some state-level things by comparing one state to another. But there's only one United States of America,
- so you can't ask what happens in the US with a policy and without it, because there are no replicates of the United States.
- You can't duplicate it, or, even more so, the world.
- But there are contexts and times in which you can infer causality from observables, and causal inference
- is the study of that. So the model necessarily simplifies the problem. We isolate the pieces we care about;
- we might need to have some other pieces involved in order to do what we call controlling for other effects.
- But the goal of the simplified model is not to completely represent the underlying reality.
- The goal is to capture enough that we can predict or infer usefully.
- And then we have the unpredicted value, which is what we call the residual or the error.
- And so think about our movie example, the movie data,
- and again, maybe we think about the genres. A movie exists, and it
- is in some genres.
- It also gets reviews. The genres might cause some of the reviewing behavior,
- but they're probably both caused by the movie itself: the movie is a Western, and the movie gets some reviews.
- And so we can look at the correlation with genre to see if there's a difference.
- But that's not the same thing as claiming the genre causes the review difference.
- To go more deeply, we would need to understand what causes the movie to be in a particular genre.
- Now, when we're modeling, we're also going to be talking about what we call parameters.
- We've introduced parameters before in the statistical sense.
- In modeling, a model parameter is a numeric value that is part of the model; in a linear model, the parameters are going to be our coefficients.
- We estimate these parameters from the data. For example, in the example before, we might estimate the mean critic rating for each genre.
- We then have a little challenge: how do we handle multiple genre memberships?
- But if the model captures the data generating process and it's an
- accurate reflection of a simplified version of the data generating process,
- then the model parameters will hopefully also correspond to underlying population parameters.
- We don't necessarily get that guarantee, but the goal of a good model for inferential purposes is to get these correspondences
- to underlying population parameters to describe how the mechanism works on the inside.
- So, for an example adapted from Wikipedia: children grow as they age, so older children tend to be taller.
- If one child is older than another, we expect that child to probably be taller.
- There's a lot of variation. There's a lot of variation in the height of five year olds.
- But if we randomly pick a five year old and a six year old, we probably expect the six year old to be a little taller.
- And so we can model height as a base offset plus a multiplier of the age.
- And this isn't perfect. As I said, there's a lot of variation in the heights of five year olds, but it's a starting point for prediction.
- And the residual is what the model can't account for.
- And we model the residuals as random.
- That doesn't mean they are random. But we can treat these unknowns as a random variable.
- And so we treat these residuals as a random variable.
- We're going to have random-variable ways of dealing with them. But this is what we get when we have a model.
- Older children are taller, so we can try to predict a child's height with their age.
- And then we have this leftover that we can't predict with the age.
- There may well be other things that we could throw in that would allow us to predict that. But we can't predict it with age.
- So to conclude, models simplify a process so we can understand at least pieces of it and predict them.
- They're simplifications, but they're related to the data generating process, so that we can try to learn about what is actually happening
- that's giving rise to the data that we see, and so we can use the data to learn about the world.
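
The height-and-age example can be sketched in a few lines. This is only an illustration under stated assumptions: the ages and heights are simulated with made-up numbers (a base offset of 75 cm, 6 cm per year, and random noise), and the line is fit with NumPy's `polyfit` rather than the StatsModels code used in the course notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated (not real) children: ages in years, heights in cm.
age = rng.uniform(2, 12, size=200)
height = 75 + 6 * age + rng.normal(0, 5, size=200)

# Fit height = intercept + slope * age by least squares.
# polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(age, height, deg=1)

# The prediction is the base offset plus a multiplier of the age.
predicted = intercept + slope * age

# The residual is what the model cannot account for.
residual = height - predicted

print(f"offset: {intercept:.1f} cm, growth: {slope:.2f} cm/year")
print(f"mean residual: {residual.mean():.6f}")
```

The estimated offset and multiplier are the model's parameters. Because the simulation really does follow a line plus noise, they land close to the values used to generate the data; with real data we do not get that guarantee.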

## 🎥 Single Regression

In this video, I introduce single-variable regression.

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
SINGLE REGRESSION
Learning Outcomes
Model a linear relationship between two variables
Estimate the parameters of this model
Identify the assumptions needed for the model’s inferential validity
Photo by Alessio Lin on Unsplash
Variables and Model
This is univariate (one-variable) regression
Regression Model
Regression Model
Note: I have extended this plot to have 0 at the left end of the x axis to highlight the intercept. Intercept is where the line crosses 0, not where it crosses the left Y axis.
DGP
Movie: *exists*
Critics: this is ok, I guess
Audience: 👏
Critic ratings do not cause audience ratings
Relationship does not need to imply causality
Fitting Lines
We can (almost) always fit a line
Resulting line is least squares
Can be used to predict
Inference makes assumptions:
Linear relationship
Independent observations
Normal residuals
Equal variance of residuals
Result: residuals are i.i.d. normal
Penguins
Explains 76% of variance
1mm flipper ⇔ 49.7g mass
Stats models warns us about condition number – not actually a problem here. But we can still make it go away.
Penguins
Wrapping Up
Linear regression predicts one variable with another using a linear relationship.
Inference makes several key assumptions.
Standardizing variables puts coefficients in units of standard deviations.
Photo by Gemma Evans on Unsplash

- In this video, I'm going to talk more about how single regression actually works.
- There's a notebook going with this sequence of videos that includes the regressions and gives you the code,
- so you can actually see how to run them and how to set up the data in order to get them to work.
- So our goal here is to be able to model a linear relationship between two variables,
- estimate the parameters of this model and identify the assumptions that are needed for the model to be what we call inferentially valid.
- So remember, we've got our dependent variable and our independent variables; we're trying to predict the outcome with the predictors.
- And this is a univariate regression. We're trying to predict our audience ratings
- with the all-critics rating. This won't be the only example I show you today.
- So when we do this with a regression model, I'm using here the StatsModels formula interface,
- which lets me write a little formula here that says I want to predict the audience ratings.
- This is the outcome. Then we've got this separator here, and then the predictors or features that I'm using to try to predict that outcome.
- So that's how the code is set up there.
- You've got the tilde; that's our separator. It means predict the stuff on the left-hand side of the tilde with the stuff on the right.
- Right now, there's just one variable, because as I said, this is univariate or single variable regression.
- And this gives us an intercept.
- Beta sub zero is the intercept. It also gives us a coefficient, beta sub one.
- For each of these coefficients, we get the coefficient itself,
- so ours is 0.1838.
- We can think of the intercept as a coefficient for a variable that's always one, and actually, internally,
- that's what StatsModels does. It augments your data with one more variable that's always one,
- and the intercept is the coefficient for that.
- We also have a standard error of the coefficient and a confidence interval of the coefficient.
- It's just like the confidence intervals that we've had before. It's an estimate of the precision of this coefficient.
- There's also a p-value, which tests the null hypothesis that the coefficient is zero.
- We have a p-value for the overall model. And we have
- R-squared, which, as I said previously, is
- the fraction of the variance that's explained. So audience ratings have some variance.
- 39.4 percent of that variance can be explained with the top critic rating.
- So if we take away the effect of the top critic rating, effectively what this means is
- the variance of the residuals will be about 60 percent of the variance of the original data,
- because we've explained about 40 percent of the variance. So we can draw our line here, and here I spread out the x-axis
- so we see it across the whole frame. Our intercept is right here,
- where the line crosses zero, and that's 2.28, and then we have a slope of 0.18.
- So if we look at where it crosses 10 and we look at where it crosses the intercept,
- there should be a difference of 1.8, because it's going to be 10 times 0.18.
- So that's the structure of this line. We have a slope and we have an intercept.
- And then the variance of the data itself
- is based on this whole height.
- So that's the variance of the data. But if you were to tilt your head, and the data, so that
- it's around this line, that's a smaller variance.
- And so that's what we're saying when we talk about the explained variance, that R-squared:
- the explained variance is the difference between the overall variance of the data and the variance
- after we have accounted for the effect that we're modeling.
- You can see that the variance is going to decrease if you look at it centered around this other line instead of just centered vertically.
- So we get the model, as I said: 2.28 plus 0.18 times the x-axis value.
- So the data generating process we're looking at here is: the movie exists, the critic gives it a rating, the audience gives it a rating.
- This is a DAG,
- a directed acyclic graph; we don't deal with cyclic causality.
- The movie produces critic ratings and produces audience ratings.
- There might be some slight causal pathway, like if people go watch more movies that are rated highly.
- But what we're measuring is the correlation between critic rating and audience rating.
- This is the underlying DGP. We're not saying critic ratings cause audience ratings.
- But we don't have to have causality for it to be a useful predictor, or for there to be a useful
- and meaningful relationship. When you want to take out other effects, it gets complicated and subtle.
- So to talk just a little bit about fitting these lines: we can basically always fit a line.
- There are a couple of degenerate edge cases where it doesn't work.
- But given two variables, you can fit a line through them, and the resulting line is the least squares line.
- It's the best linear predictor as measured by the least squares of the resulting error.
- And we can use it to make predictions. If we want to do inference, that is, if we want to use the model to tell us,
- "Oh, an increase of critics' ratings by one star increases audience ratings by 0.18 star,
- and that's a reliable effect," then we need to understand the inferential assumptions of the linear regression model.
- It assumes that there is a linear relationship. It assumes that our observations are independent.
- It assumes that the residuals are normally distributed, and that they have equal variance.
- Effectively, what these last three manifest as is that, after we have our linear effect,
- the residuals should be independent and identically distributed from a normal distribution.
- We'll have tests for pieces of that later. Now let's look at the penguin data.
- I'm going to try to predict a penguin's body mass using the length of its flipper.
- So we've got our penguin. It has some feet.
- It has a flipper. It has a head and a little beak. A penguin, kind of.
- I'm not very good at drawing, but there is a penguin, and we're trying to use its flipper length
- to predict its mass. We can do that here, and we're explaining 76 percent of the variance: R-squared is 0.759.
- And we have a coefficient of 49.68 or 49.69.
- What that means is: if a penguin has a one millimeter longer flipper, then it probably has almost 50 grams more mass.
- When we run this code, StatsModels also warns us about a condition number.
- The specific problem it's talking about, multicollinearity, is not a problem here because we only have one variable.
- We can still make it go away, though. Right now, we're regressing our raw values.
- We call this "regressing against": we're regressing body mass
- against the flipper length. That's just the way we talk about it.
- We're regressing the body mass in grams, its original units, against the flipper length in millimeters,
- so the resulting coefficient is interpretable in the original units.
- The resulting coefficient is in grams per millimeter. But we can also rerun the model normalized with what we call z-scores.
- A z-score is the value minus the mean,
- divided by the standard deviation. These are called z-normalized or standardized variables.
- The result is that they have a mean of zero and a standard deviation of one:
- z-bar is zero, and the standard deviation of z is one.
- And now the coefficients are in standard deviations. We have this coefficient of 0.87.
- What that means is that an increase in flipper length of one standard deviation results in an increase in body mass
- of 0.87 standard deviations. Depending on your particular inferential needs,
- regressing standardized variables can be more interpretable, because you're
- talking in terms of standard deviations rather than in terms of raw units.
- So to wrap up: linear regression predicts one variable with another using what we call a linear relationship,
- a sum of scalar multiplications. Inference using this makes several key assumptions that
- we'll be talking more about in a later video. And we can standardize variables, which results in a model where the coefficients are in
- units of standard deviation rather than in units of the underlying raw measurements.
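
The following is a sketch of the workflow described in this video, not the course notebook itself. It assumes simulated penguin-like data (the column names `flipper_mm` and `mass_g` and all numbers are made up for illustration) and uses the StatsModels formula interface, once on raw values and once on z-scores.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated stand-in for the penguin data (hypothetical values).
flipper = rng.normal(200, 14, size=300)
mass = -5800 + 50 * flipper + rng.normal(0, 400, size=300)
df = pd.DataFrame({"flipper_mm": flipper, "mass_g": mass})

# Outcome on the left of the tilde, predictor on the right.
raw_fit = smf.ols("mass_g ~ flipper_mm", data=df).fit()
print(raw_fit.params)      # Intercept and slope (grams per millimeter)
print(raw_fit.bse)         # standard errors of the coefficients
print(raw_fit.conf_int())  # confidence intervals
print(raw_fit.pvalues)     # p-values (null hypothesis: coefficient is zero)
print(raw_fit.rsquared)    # fraction of variance explained

# Standardize: subtract the mean, divide by the standard deviation.
z = (df - df.mean()) / df.std()
z_fit = smf.ols("mass_g ~ flipper_mm", data=z).fit()
print(z_fit.params)        # slope is now in standard deviations
```

With a single predictor, the standardized slope equals the correlation between the two variables, and R-squared is the same for both fits; only the units of the coefficients change. Calling `raw_fit.summary()` prints the full table discussed in the video.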

Slide Clarification

On slide 6, where I show the slope, intercept, and variance in a model, I have extended the plot to include 0 at the left end of the *x*-axis.
This is to highlight the meaning of the intercept. It is important to note that the intercept is where the line **crosses zero**, not where it crosses the left Y-axis.

Also, when discussing this slide, I am imprecise and make it sound like the unexplained variance is the remainder after *projecting* the data onto the line.
It is the variance remaining after *subtracting* the line.
A video in Week 9 provides more clarity on this relationship.
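
To make the "subtracting the line" point concrete, here is a small sketch on simulated data (the numbers are made up and only loosely echo the movie example). It computes the residuals by subtracting the fitted line from the outcome and compares their variance with the variance of the original data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated predictor and outcome (hypothetical, not the movie data).
x = rng.uniform(0, 10, size=500)
y = 2.28 + 0.18 * x + rng.normal(0, 0.4, size=500)

# Least-squares line; polyfit returns slope first, then intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: what is left after subtracting the line from the data.
resid = y - (intercept + slope * x)

# Unexplained fraction = residual variance / total variance.
unexplained = resid.var() / y.var()
r_squared = 1 - unexplained

print(f"R^2 = {r_squared:.3f}, unexplained = {unexplained:.3f}")
```

For a single-predictor least-squares fit, this `r_squared` matches the squared correlation between `x` and `y`.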

## 🎥 Prediction and Inference

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
PREDICTION AND INFERENCE
Learning Outcomes
Distinguish between prediction and inference
Understand different meanings of the word ‘inference’
Photo by Mark Boss on Unsplash
Prediction and Inference
Prediction tries to predict the future with the model.
Model does not need to be valid
Quality & strength: accuracy of predicting unseen data
Evaluation does need to be valid
Inference uses the model’s structure and parameters to learn.
Validity depends on assumptions
Quality & strength: R², p-value, assumption checks, coefficients
Inference
Assumptions:
Linearity — that this is actually a linear relationship
Independent observations
Normal residuals
Homoskedastic residuals (equal variance)
Prediction
Fitting assumptions: none
Testing assumptions: test data excluded from training
Train-Test (diagram): Data → Split → Train / Test; Train → Model; Model + Test → Predict
Split before exploratory analysis of features.
Supervised learning – we have a thing to predict
Splitting Data
Split before exploratory analysis
If testing multiple versions: split in 3
Train – train the model
Tune – optimize model, compare modeling decisions
Test – final validation, compare with current / previous system
Other Uses of Inference
Here: using analysis to understand (“infer”) things about the underlying world
Sometimes elsewhere: using a trained model to make predictions about specific instance (the “inference” stage of a machine learning pipeline)
Wrapping Up
Inference uses a model’s structure and parameters to understand an underlying phenomenon.
Prediction uses a model to predict future observations.
Inference has stronger requirements.
Photo by NeONBRAND on Unsplash
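
The train-test idea on these slides can be sketched as follows. This is an illustration under stated assumptions: the data are simulated (the `critic` and `audience` columns are made-up stand-ins for the movie ratings), the split uses pandas `sample` with an 80/20 ratio, and the model is a StatsModels OLS fit on the training rows only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Simulated ratings (hypothetical values, not the real movie data).
n = 500
critic = rng.uniform(0, 10, size=n)
audience = 2.3 + 0.18 * critic + rng.normal(0, 0.4, size=n)
data = pd.DataFrame({"critic": critic, "audience": audience})

# Split first: 80% of rows for training, the remaining 20% for testing.
train = data.sample(frac=0.8, random_state=7)
test = data.drop(train.index)

# Fit on the training data only; the model never sees the test rows.
fit = smf.ols("audience ~ critic", data=train).fit()

# Predict the test outcomes from the test features.
pred = fit.predict(test)
err = test["audience"] - pred

mae = err.abs().mean()              # mean absolute error
rmse = np.sqrt((err ** 2).mean())   # root mean squared error
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

When comparing several model variants, the same approach extends to three partitions (train, tune, test) by splitting the training rows once more.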

- For this video, we're going to talk about prediction and inference.
- The learning outcomes are for you to be able to distinguish between prediction and inference and understand different meanings of the word inference.
- We've talked about prediction and inference a little bit before. I talked about inferential validity.
- In this video, I'm going to talk more about what we mean by these concepts.
- So in prediction, we're trying to predict other values from the model. With our penguins,
- we might be asking: OK,
- if you give me the flipper length of a penguin in the future, can I accurately predict its body weight? For our movies,
- if you give me some critic ratings, can I predict the audience response?
- Or can I predict its box office proceeds? An important thing about prediction is that models don't need to be inferentially valid.
- We estimate the quality and strength of the prediction by actually trying to use it to predict data
- we haven't allowed it to see. It's important to note that the evaluation of the prediction
- accuracy does need to be statistically valid. Inference uses the model's structure and parameters to learn about the world.
- So we might want to learn about fundamental relationships in penguin physiology.
- Its validity depends on assumptions that we introduced in the last video, when
- we talked about single-variable regression, and then we have various measures of its quality and strength.
- R-squared is how much variance it explains; the p-value is statistical significance.
- We're going to have checks for its various assumptions, and we've got the coefficients, which look at the strength of the relationship.
- So when we're trying to do inference, we can make a statement that
- an increase of one point in the critic rating increases the audience rating by 0.18.
- That would be an inferential statement. It's making a claim about the relationship of these things.
- And we measure this with R-squared, which, as we saw in the last video, is 0.394.
- We've got a very small p-value, and we assume that it's a linear relationship.
- We assume that our observations are independent,
- that we've got these normal residuals, and also that the residuals have equal variance, or what we call homoskedastic.
- I'm never not going to think that homoskedastic and its opposite, heteroskedastic, are very strange words.
- So that's an inferential kind of claim. We're claiming a relationship, not necessarily causal,
- but we're claiming this correlational relationship. Prediction looks really similar.
- We can predict audience ratings using the model: 2.28 plus 0.18 times the top critic rating.
- And we measure this by asking, what's the mean absolute error?
- So if we take the errors, those epsilons, the residuals,
- take the absolute value, and take the average, it is 0.286. Or the root mean squared error is
- 0.359. Root mean squared error is a very common measurement of the accuracy of a predictive model.
- And we're not making any assumptions.
- We don't even have to assume linearity for this to be a useful predictor. It'll hopefully be a better predictor if the relationship is actually linear.
- But if all we care about is how accurate it is at predicting, then
- we don't have to worry about all of our inferential validity assumptions.
- But we also have to be very, very careful about the claims we make.
- We can claim it's useful for predicting, but we can't, from an invalid model, make claims about the underlying phenomenon that we care about.
- It does make the assumption that we have tested it on data that were not available for training.
- So what does that mean? In what we call a machine learning setting,
- we often have what we call a train-test split. So we take our input data
- and we split it into two pieces. This is our input.
- Say 80 percent of the data (80/20 is a common split) goes into our training pool and 20 percent
- goes into our testing pool. We then fit our linear regression, or whatever other model we're doing.
- We fit, or train, it on the training data. It does not get to see the test data.
- So our linear model does not include the test data in the fitting process.
- And then we ask the model to predict the test outcomes using the test features.
- What this does is test generalization. It's easy to predict data
- I've seen: I just memorize all the data I've seen. The test asks instead:
- Can I learn a relationship from training data that generalizes to data that was not in the training data?
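
The train-test procedure above can be sketched as follows. This is an illustration under stated assumptions, not the course notebook: the `movies` frame and its `critic` and `audience` columns are synthetic stand-ins, and the 80/20 split uses pandas `sample` and `drop`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Synthetic stand-in for the movie data (hypothetical columns and numbers)
n = 500
movies = pd.DataFrame({"critic": rng.uniform(0, 10, n)})
movies["audience"] = 2.0 + 0.2 * movies["critic"] + rng.normal(0, 0.4, n)

# 80% of rows for training; the remaining 20% are held out for testing
train = movies.sample(frac=0.8, random_state=42)
test = movies.drop(train.index)

# Fit on the training data only; the model never sees the test rows
fit = smf.ols("audience ~ critic", data=train).fit()

# Predict the test outcomes from the test features, then score the errors
pred = fit.predict(test)
err = test["audience"] - pred
test_mae = err.abs().mean()
test_rmse = np.sqrt((err ** 2).mean())
print(f"test MAE = {test_mae:.3f}, test RMSE = {test_rmse:.3f}")
```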
- In particular, we call the kind of learning that we're doing this week supervised learning, because we have a thing,
- this continuous outcome variable, that we're trying to predict. It's also important, though, to split before the exploratory analysis of features.
- I'm going to be talking more about train-test splits in a later video dedicated to experimental setup.
- But this is the basic idea of how we set up data to be able to test predictive accuracy.
- We don't want to test the model's memory. We want to test its ability to predict things it's never seen before.
- We need to split, as I said, before exploratory analysis,
- because we don't want knowledge of the distribution of our variables
- in our test data to affect our choice of variables in the model development process.
- And if we're going to be, say, trying different features,
- trying different variants of a model before we settle on the final one, we actually need to split into three partitions.
- We have our training data. We have tuning data for optimizing the model, testing the predictive accuracy of different candidates.
- And then we have test data that's saved for our final validation.
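
One way to produce the three partitions described above is to shuffle the row positions once and cut the shuffled order into pieces. The function `three_way_split` and the 80/10/10 proportions are assumptions made for this sketch; the lecture does not give specific sizes for the three-way case.

```python
import numpy as np
import pandas as pd


def three_way_split(df, tune_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows once, then cut into train / tune / test partitions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_test = int(round(len(df) * test_frac))
    n_tune = int(round(len(df) * tune_frac))
    test = df.iloc[order[:n_test]]                  # saved for final validation
    tune = df.iloc[order[n_test:n_test + n_tune]]   # for comparing candidate models
    train = df.iloc[order[n_test + n_tune:]]        # for fitting
    return train, tune, test


# Hypothetical data, only to show the partition sizes
data = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2.0})
train, tune, test = three_way_split(data)
print(len(train), len(tune), len(test))
```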
- Again, we're gonna be talking more in detail about why each of these splits is necessary in a later video.
- But I also said that I want you to understand the different uses of the term inference.
- So here we're using the word inference to talk about the use of a data analysis, or the
- use of a statistical model, to understand or infer things about the underlying world.
- So we're trying to understand how movie preference works.
- We're trying to understand penguin physiology or we're trying to understand the hydrodynamics of.
- Of a river or of an a quarter of a marine processing plant.
- And we're trying to use the data we collected and the statistical models and
- analysis that we do of it to draw conclusions about the underlying real world.
- There's some other uses of inference, though, particularly one that comes up in data science,
- in machine learning: sometimes we have a model we've trained.
- We've fit our linear regression, and we're using it to make predictions for new data points.
- Maybe you've trained up a statistical model that estimates the likelihood that a particular transaction is fraudulent.
- Now you're using that model as transactions come in to your online shopping portal, to determine how likely they are to be fraudulent.
- Sometimes that last stage is called inference,
- especially in the deep learning literature and some other machine learning literature: the first stage is the model training stage,
- and the inference stage is when you actually go use the model to make inferences about new data points that are coming in.
- So to sum up, inference uses a model structure and parameters to try to understand things about the world and draw generalizable knowledge.
- Prediction just uses a model to predict future observations.
- In particular, inference has stronger requirements for its validity than prediction has for its usefulness.
- It is important to note, though, that even if you can yolo the statistics inside a predictor when all you care about is its predictive accuracy,
- you still have to pay attention to the validity of the statistics you use to test whether the predictor is actually a better predictor than something else.
- You can't cheat on those. But if your only goal is predictive accuracy,
- the linear model you're using for it doesn't need to meet all the requirements that it needs to meet
- if you're trying to use it for inference.

## 🎥 Categorical Predictors

CATEGORICAL PREDICTORS
Learning Outcomes
Use categorical predictors
Understand dummy coding
Photo by v2osk on Unsplash
Categorical Variables
Unordered, discrete values (called levels)
Cannot do arithmetic
What’s gentoo * 3?
Even when integer-encoded, cannot use as model predictors!
Linear model coefficients measure change in outcome
Why should increasing user ID by 1 change outcome predictably?
One-Hot / Dummy Encoding
Convert levels to separate integer columns.
Encode with a 1 in the level’s variable, 0 elsewhere
N-1 Dummy Coding
Drop one level (typically first)
All 0 encodes the dropped level
Other levels encoded with a 1 in the proper place
Results
Categoricals are encoded as numeric variables
Can use in models (e.g. multiply by coefficients)
Wrapping Up
To use categorical variables in linear models, we need to encode them numerically.
Dummy-coding converts each level into a 0/1 variable indicating whether the observation has that value of the variable.
Photo by Hamed Daram on Unsplash

- In this video, I want to talk about how to use categorical variables as predictors in our linear models.
- The learning outcomes are for you to be able to use a categorical predictor and understand what dummy coding is and why we use it.
- Categorical variables, if you'll recall from earlier in the class, are
- variables that take on an unordered set of discrete values that are called levels.
- And one of the things about these is we can't do arithmetic. You can't say Gentoo penguin times three.
- That's not a meaningful concept to talk about.
- And even when we're using integers to encode our categorical variables, for example a movie ID or any other kind of ID,
- we can't use those as model predictors, because they don't have numeric meaning.
- It doesn't mean anything to multiply a movie ID by a coefficient.
- And so we need another way to encode categorical variables if we're going to use them in our linear models.
- And the solution is something called dummy coding. What we do is convert each level into a separate integer column.
- So our penguins dataset has three different penguin species: Adelie, Chinstrap, and Gentoo.
- And each of those becomes a column.
- The pandas get_dummies function will convert a categorical series into a data frame of dummy-coded values.
- So this will give us the dummy codings for the penguins,
- and we have one column for each species.
- This is also called one-hot coding, particularly when we have one column for every level.
- It's one-hot coding or dummy coding.
- We have a one in the Adelie column
- if that penguin is an Adelie, and a zero everywhere else, because we only have one value of the categorical variable per row.
- Often, though, we need to drop one of the variables,
- typically the first one, and we do this with the drop_first option to get_dummies.
- And the Adelie column now, you'll see, is gone. We just have Chinstrap and Gentoo.
- How it's encoded is that an Adelie
- penguin is all zeroes, but a Chinstrap penguin will have a one in the Chinstrap column and a Gentoo will have a one in the Gentoo column.
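
The two encodings just described look like this with pandas `get_dummies`. The four-penguin series is a made-up example, and `dtype=int` is added here only so the output prints as 0/1 rather than booleans.

```python
import pandas as pd

# A tiny hypothetical species column
species = pd.Series(
    ["Adelie", "Chinstrap", "Gentoo", "Adelie"], name="species"
)

# One column per level (one-hot coding): exactly one 1 in every row
one_hot = pd.get_dummies(species, dtype=int)

# Drop the first level: Adelie becomes the all-zero baseline
dummies = pd.get_dummies(species, drop_first=True, dtype=int)

print(one_hot)
print(dummies)
```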
- This is going to be particularly useful for linear regression, because if we have all of the levels,
- we're going to have problems fitting our linear regression models. If we have all but one of them,
- then effectively we're treating one of the levels as the default, and the other levels get the code for what
- happens when it's one of the other values instead of the first or default value that you chose.
- So the results of this are that the categoricals are encoded as numeric variables that we can then use in models.
- We can multiply them by coefficients, et cetera. We're going to see that in a model coming up.
- To wrap up: to use categoricals in our linear models or other numeric models, or wherever we're going to need to do numeric computations on a categorical value,
- we need to encode them numerically. Dummy coding
- lets us do this by converting each level into a zero/one variable that indicates whether or not the observation has that value of the variable.
- And often we drop one, so that for n levels we have n minus one dummy variables, and all zero is the dropped level.

## 🎥 Testing Assumptions

TESTING ASSUMPTIONS
Learning Outcomes
Know the assumptions of linear models
Identify violations of assumptions regarding residuals
Photo by Austin Kehmeier on Unsplash
Linear Model Assumptions
Linearity — outcome and predictor have linear relationship.
Independence — observations are independent of each other
Normal errors — residuals are normally distributed
Equal variance — residuals have constant variance (called homoskedasticity; violation is heteroskedasticity)
Last three result in i.i.d. normal residuals
Linearity
Fundamental assumption – we wouldn’t run a linear model for things we think aren’t linear.
Scatter plot shows the line seems to fit.
Independence
Property of data collection
Not directly checkable
Commonly violated by:
Time series
Groups or common environments
Repeated individuals
Solutions: correlated error models, hierarchical models, mixed effects models
Normal Errors
Errors must be (approximately) normally distributed.
Check with a Q-Q plot of residuals.
Violations mean:
Line still fits
P-values and CIs are unreliable
Equal Variance of Errors (homoskedasticity)
Residuals must have constant variance.
Check with residuals vs. fitted scatterplot or regplot.
Vertical positions should look like random noise.
Violation means model is failing to capture a systematic effect.
Heteroskedastic Errors
What to Do
Prediction? Might not matter.
Throw away the model – this might not work
Add, remove, or transform predictors
Transform outcome
Go to non-parametric, such as bootstraps or nonparametric regression (won’t have time for NPR)
Hierarchical or mixed-effects model
Wrapping Up
Linear models make four key assumptions necessary for inferential validity.
Plotting the residuals allows us to detect violations of some of them.
Photo by Neil Thomas on Unsplash

- Hello. In this video, I want to talk about how to test the assumptions that we make in a linear model.
- So we're going to review the linear model assumptions, and I want you to be able to identify violations of these assumptions,
- particularly the assumptions about how residuals are distributed.
- So to review, the assumptions that a linear model makes are, first, that it is linear:
- the outcome and the predictor have a linear relationship.
- The next assumption is that our observations are independent of each other,
- and then that the residuals are normally distributed and have equal variance.
- We call equal variance homoskedasticity: the same variance across the range of fitted values.
- A violation of the equal variance condition is called heteroskedasticity,
- or we say that the residuals are heteroskedastic.
- The last three of these are supposed to result in independent and identically distributed normal residuals.
- So let's start with linearity. This is the fundamental assumption of a linear model.
- We wouldn't run a linear model if we didn't think that there was a linear relationship to study.
- And so if we want to check this, or see whether a linear model is even going to be reasonable to try,
- what we can do is a scatterplot, quite possibly with the regression line, to see that it seems to fit.
- And that looks vaguely linear-ish. Independence is difficult to check.
- It's a property of data collection. We don't have tests or plots to say this is independent.
- Certain violations of it will manifest when we go and plot the results of training a linear model.
- But common ways that it's violated are, first, if we've got time series.
- If you're collecting data over time, it's often what we call autocorrelated, which means one day is correlated with the day before.
- They're not independent.
- If you've got data that comes in groups or comes from common environments, for example, if you've got data on patients from five different hospitals,
- patients within a hospital are going to be correlated because that hospital has its own practice, its own doctors, et cetera.
- Also, if you've got multiple measurements for repeated individuals, that is another example of non-independence.
- For example, this is what came up when we were working with the paired t-test or its related bootstrap tests,
- where you have two observations: say you've got midterm A and midterm B from the same student.
- They're not independent of each other, because a student who's doing well in the class is probably going to do well in both midterms,
- and that violates the independence of those two variables.
- There we have the student linked,
- and the student is just going to show up as one row in the list of observations that we train a linear model from.
- But if we have students showing up as multiple rows, for example, the ratings that a user gives to books or movies,
- the ratings by the same user are not fully independent of each other.
- The solutions to this are out of scope for this week for sure.
- But they involve correlated error models, sometimes hierarchical models or mixed effects models to be able to deal with the
- kinds of grouping dynamics that can cause some of the independence violations. Next, the errors are supposed to be normally distributed.
- And the way we check this is with a Q-Q plot of the residuals.
- When you fit a linear model, the resulting object gives you access to the residuals.
- This plot was generated by the notebook that I've linked to in the week's content information,
- so you can go see the code that I used to generate this residual plot from a linear model fit.
- But you get a Q-Q plot of residuals, and this looks pretty good. The values are pretty much tracking a pretty straight line
- right along there. A violation of normality means we still have a line that fits,
- but our p-values and our confidence intervals are going to be unreliable. We have to be careful using the model for inference.
- And these assumptions are not all or nothing. The data often aren't perfectly normal, but they're pretty close.
- The last one is the equal variance of errors, or homoskedasticity.
- And so the residuals are supposed to have constant variance.
- And the way we check this is we make a scatterplot where the x axis is the predicted or fitted values,
- so this is the results of our linear regression.
- That's the x axis; the y axis is the residuals.
- So these are our y hats, and these are our epsilons.
- What we're looking for is no visible patterns, particularly in the y axis.
- We might have patterns in the x axis:
- if there's a categorical variable involved, then the predicted values, the fitted values, are going to fall into some chunks.
- But what we don't want is any patterns in the y axis in response to those.
- This looks pretty much like noise. A couple of these are higher,
- another one is lower, but you expect an outlier or two here and there.
- The main body of these is pretty much the same width all the way across.
- There aren't bands or other patterns, and we aren't seeing a particular shift (which isn't necessarily a violation of the variance assumption,
- but we don't want to see angular shifts in our residuals either).
- If we see distinct patterns in this plot,
- what that means is that the model is failing to capture a systematic effect. There is a systematic error: when you predict low,
- the residuals do something different than when you predict high,
- which indicates there's an important feature that's not being
- taken account of in the model, and that is affecting the relationship of your prediction to the actual resulting value.
- Remember, the error epsilon
- equals y, the actual value of the outcome variable, minus y hat,
- our estimated outcome variable from the linear regression. So here is where we can see some violations of this.
- These residuals, from another linear model, have just a little bit more curve
- in the Q-Q plot than we would like to see. They're pretty close to normal,
- but there's a little more curve, indicating that the tails aren't quite fitting the normal distribution.
- Also, if we look at the residual versus fitted plot, we see kind of a funnel shape.
- For high fitted values, the variance of the residuals is much higher than for low fitted values.
- And that looks like a larger change in variance than can be explained just by the occasional outlier.
- So we're going to say that these residuals are exhibiting heteroskedasticity, and therefore we're not going to trust the model that these came from.
- So what do you do? If you're just doing prediction,
- if your only goal is to predict an outcome, then it might not matter, because these assumption violations
- are a problem for inference, not necessarily for prediction.
- You still want to check your prediction and also check the distribution of your prediction errors.
- It might be that it's mostly doing a pretty good job. But when you get to some edge cases, it pretty egregiously fails you.
- You might just not be able to make a linear model work.
- You might need to add, remove, or transform some of your predictors, or possibly transform the outcome: take a log, take a square root.
- Do some more feature engineering in order to get features that are going to produce a better model.
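
To illustrate one of the remedies mentioned above, transforming the outcome, here is a sketch on synthetic data whose noise is multiplicative, so the raw-scale residuals fan out into a funnel. The `spread_ratio` helper is a made-up, rough numeric stand-in for eyeballing the residuals-versus-fitted plot; writing `np.log(y)` inside the formula is supported by the formula interface.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic data: the spread of y grows with its level (a funnel on the raw scale)
n = 400
df = pd.DataFrame({"x": rng.uniform(0, 3, n)})
df["y"] = np.exp(0.5 + 0.7 * df["x"] + rng.normal(scale=0.3, size=n))

raw = smf.ols("y ~ x", data=df).fit()
logged = smf.ols("np.log(y) ~ x", data=df).fit()


def spread_ratio(fit):
    """Residual std. dev. for high fitted values divided by that for low ones."""
    high = fit.fittedvalues > fit.fittedvalues.median()
    return fit.resid[high].std() / fit.resid[~high].std()


# A ratio near 1 is what equal variance would look like
print(f"raw outcome: {spread_ratio(raw):.2f}")
print(f"log outcome: {spread_ratio(logged):.2f}")
```

Note that after the transform the coefficients describe changes in log y, so the interpretation of the model changes along with its residuals.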
- You might need to go to some nonparametric techniques, such as bootstraps.
- There's a thing called nonparametric regression that we're not going to have time to get to,
- but you can go study that if it might be useful for your problem.
- Sometimes hierarchical or mixed effects models are going to take care of some of the additional effects that are causing your assumption violations,
- particularly independence, but they can help in some other cases as well.
- So to wrap up: linear models make four key assumptions that are necessary for inferences from them to be valid.
- These assumptions aren't all or nothing. It's not a binary "it is or it isn't".
- It's a judgment call, which comes in part from experience and careful study, whether a violation is too strong. And plotting
- the residuals, both in a Q-Q plot, where we can check their normality, and versus the predicted values,
- allows us to detect violations of some of these assumptions and check the validity of our inferences.

## 🎥 Multiple Regression

MULTIPLE LINEAR REGRESSION
Learning Outcomes
Build models with 2 or more predictors
Compare models using the Adjusted R² and AIC
Photo by Jen Theodore on Unsplash
Base Model
Assumption Checks
Adding Species
Do different species have different flipper lengths?
Formula API automatically dummy-codes species.
Intercept: baseline Adelie mass
Species: increase/decrease in baseline mass for each species
Adding Species
Adjusted R² for multivariate regression
Adjusts for multiple variables
AIC compares models for the same outcome – lower is better fit
Combines estimation power with model complexity
Plot & Residuals
Interaction Effects
Interaction effects apply to products of variables
Categorical + Numeric: adjust slope by category
Plot & Residuals
Specifying Main and Interaction Effects
Main effect – effect of the variable(s) on their own
Specified with V1 + V2 + … + Vn
Interaction effect – interaction (product) of variables
Specified with V1:V2
Full effects – main and interaction
Specified with V1 * V2
Expands to V1 + V2 + V1:V2
Sexual Dimorphism
Flipper/species interaction not significant
After accounting for species and sex differences, data does not support different flipper/mass slopes for different species.
Simplified
Slightly better AIC
Same Adj. R²
Species & Sex have subtle interpretation
Most of the chinstrap difference is that male chinstraps are smaller
Gentoos and Adelies likely have the same mass dimorphism
Assumption Checks
Beware Correlated Predictors
Correlated predictors cause poor model fits
Problem is called multicollinearity
Diagram: predictors X1 and X2, outcome Y
X1 causes X2?
X2 causes X1?
X? causes X1 and X2?
Which coefficient gets the common effect?
Can even invert signs!
Pair Plot – useful for detecting correlations
Also use a correlation matrix.
Occam’s Razor
Given two explanations, prefer the one with fewer assumptions
Application to modeling:
Prefer simple models
Features and complexity need to earn their keep
AIC is one embodiment of this.
Wrapping Up
Linear models can extend to multiple variables.
We can look at separate (main) and interaction effects.
Prefer simple models.
Photo by Luís Eusébio on Unsplash

- In this video I'm going to introduce you to multiple linear regression, where we have more than one predictor variable in our regression.
- The learning outcomes are for you to be able to build models with two or more predictors and compare models using the adjusted R squared
- and AIC. We're also going to see some of the challenges that come about when we're trying to build multiple regression models.
- So our base model is the penguin model that I've used earlier in the week,
- where I'm predicting the penguin's mass by its flipper length.
- In this model, I'm using the normalized version. The scatterplot is showing the raw flipper length and body mass,
- but in my linear model, I'm using the standardized version,
- so this coefficient is in units of standard deviations.
- When we look at our assumption checks, we see a little bit more of a curve than we might like to see.
- It's pretty much on the line, but we can see there's definitely a curve to it.
- Also, when we look at our residual versus fitted plot,
- it does seem that there is a decrease in the variance there, and it almost looks like these two blocks are rotated a little bit.
- If we rotated this block on the left counterclockwise just a little, we would
- get more equal variance. So it indicates there's quite possibly something in here that we're not taking advantage
- of when we're trying to predict the body mass of a penguin.
- So one way we can go looking at this is we can start to add more variables.
- So what if we add species?
- Different species might have different flipper lengths, and different
- species might have different relationships of their flipper length to their body mass:
- shorter flippers for heavier bodies, or longer flippers. The model fit remarkably well,
- but there's still more work to do. We can do this by regressing the mass against the flipper and the species with the plus.
- The plus says regress against both of them. It's going to create a linear model where there's a term for flipper and there's a term for species.
- It's automatically going to dummy code the species for us when we use the formula API.
- This is coming from the statsmodels formula API, smf.
- I imported it from statsmodels.formula.api. You can see that in the notebook where I've got the code that does all of this.
- So it's going to dummy code the species, and it's going to drop one.
- We're actually going to have two species terms, one for Chinstrap and one for Gentoo.
- If they're both zero, the default, it is an Adelie. And so what we can interpret is that the intercept is the baseline Adelie mass,
- and the other two coefficients are the increase or decrease in baseline mass for each of these species.
- So the Chinstraps tend to be lighter than the Adelies, and the Gentoos tend to be heavier.
- And then the flipper coefficient is capturing what's left over in the flipper length.
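
A sketch of the model just described, using the formula API on a synthetic stand-in for the penguin data. The column names `mass`, `flipper`, and `species` and every number in the data-generating step are assumptions for illustration, not the lecture's dataset or results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(533)

# Synthetic stand-in for the penguin data; all numbers here are made up
n = 90
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], n // 3),
    "flipper": rng.normal(size=n),  # standardized flipper length
})
shift = penguins["species"].map({"Adelie": 0.0, "Chinstrap": -0.3, "Gentoo": 1.0})
penguins["mass"] = shift + 0.5 * penguins["flipper"] + rng.normal(scale=0.3, size=n)

# "+" adds a main-effect term for each variable; species is dummy-coded
# automatically, with the first level (Adelie) dropped as the baseline
fit = smf.ols("mass ~ flipper + species", data=penguins).fit()
print(fit.params)
```

The printed coefficients are an intercept (the baseline for Adelie), one offset each for Chinstrap and Gentoo, and a single flipper slope shared by all three species.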
- We're trying to predict mass by flipper length. Now, we can also see that the adjusted R squared is no longer equal to the R squared.
- R squared is great for one variable,
- but when we add variables, R squared just keeps going up.
- Adjusted R squared compensates for that, to allow us to get a more accurate estimate of the effectiveness of our overall model.
- We also have this value, the AIC.
- This is the Akaike information criterion, and it allows us to compare models for the same outcome, and a lower AIC
- is a better fit. So the lower the AIC for an outcome we're trying to predict with the same data
- (it only works for models trained on the same data), the better the fit.
- It's one way of assessing that the model is probably better, and it combines looking at the model's estimation power,
- how much of the variance it is explaining,
- the degree to which the model can explain the data, and it discounts that by model complexity, so it prefers models with fewer terms.
- Our previous model had an AIC of 487,
- and this model has an AIC of 455. So the AIC is indicating that this is a better fit for our data.
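
The comparison above can be reproduced in outline with the `rsquared_adj` and `aic` attributes of a fitted OLS result. The data below are synthetic, so the numbers will not match the 487 and 455 quoted in the lecture. The manual calculations use the usual textbook definitions, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) and AIC = 2k − 2 ln L, with p predictors and k = p + 1 estimated coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(533)

# Synthetic stand-in for the penguin data; all numbers here are made up
n = 90
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], n // 3),
    "flipper": rng.normal(size=n),
})
shift = penguins["species"].map({"Adelie": 0.0, "Chinstrap": -0.3, "Gentoo": 1.0})
penguins["mass"] = shift + 0.5 * penguins["flipper"] + rng.normal(scale=0.3, size=n)

base = smf.ols("mass ~ flipper", data=penguins).fit()
with_species = smf.ols("mass ~ flipper + species", data=penguins).fit()

for label, fit in [("flipper only", base), ("flipper + species", with_species)]:
    print(f"{label:>18}: adj. R2 = {fit.rsquared_adj:.3f}, AIC = {fit.aic:.1f}")

# The same two quantities computed by hand for the larger model
p = with_species.df_model   # number of predictors, excluding the intercept
nobs = with_species.nobs
manual_adj_r2 = 1 - (1 - with_species.rsquared) * (nobs - 1) / (nobs - p - 1)
manual_aic = 2 * (p + 1) - 2 * with_species.llf
```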
- If we look at a plot of this model (I'm going ahead and plotting three different lines,
- one for each species, since that's the way we've broken it down), we can see that the intercepts move.
- The slopes are the same; we haven't done anything to change the slope. And the residuals versus fitted plot is looking a little bit better.
- We've got a gap in the middle because of species.
- Since we have the breakdown by the categorical variable, it's going to predict an average value,
- say, around here for Adelies, then over here for Chinstraps because they're smaller,
- and over here for Gentoos. So it's going to cause things to cluster in the x axis.
- What we want to see is whether things are pretty independent in the y axis.
- And, you know, they aren't really quite there. Also, it looks like these orange ones really might have a different slope.
- So let's go look at interaction effects. Interaction effects apply to products of variables.
- I've changed the plus here to a star, to say I want flipper star species. What this does,
- since one of them is categorical, is allow the slope to adjust by category.
- What it expands to is: we've got our intercept, a coefficient for Chinstrap, and a coefficient for Gentoo.
- We've got a coefficient for flipper. That's these first terms.
- But then we have coefficients for interactions of our dummy variables for the category with the flipper length.
- So we've got another coefficient here
- that is applied to Chinstrap times flipper. Since Chinstrap
- is the dummy code for a categorical, what this is going to do is add an
- additional coefficient times flipper when it's a Chinstrap, and not when it's a Gentoo
- or an Adelie. And likewise, this one's going to apply when you've got a Gentoo penguin.
- You can also have an interaction between two numeric variables, in which case you're not going to have the expanded dummies,
- but the numeric product.
- One of the nice things with dummy coding, since one means yes, is that multiplying it by something basically becomes an if:
- if yes, then you include the other variable's value; otherwise, zero.
- Nice little trick. So we get this expanded linear model.
- And if we do this, we can see that our slopes are changing, and it is plotting a different slope for those Gentoo penguins.
- The residuals versus fitted plot, I think, is getting better,
- but it looks like there's possibly not as much variance there in the middle.
- Our AIC also went down just a little bit, down to 448.
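
Here is a sketch of the interaction model on synthetic data where the Gentoo slope really does differ. In the formula language, `a * b` is shorthand for `a + b + a:b`, so the two fits below should be the same model. The interaction term's name is looked up by searching for a colon rather than hard-coding it, since the exact label is chosen by the formula library.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Synthetic stand-in for the penguin data; every number here is made up
n = 150
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], n // 3),
    "flipper": rng.normal(size=n),  # standardized flipper length
})
slope = penguins["species"].map({"Adelie": 0.4, "Chinstrap": 0.4, "Gentoo": 0.9})
shift = penguins["species"].map({"Adelie": 0.0, "Chinstrap": -0.3, "Gentoo": 1.0})
penguins["mass"] = shift + slope * penguins["flipper"] + rng.normal(scale=0.3, size=n)

# Main effects plus interaction, written two equivalent ways
star = smf.ols("mass ~ flipper * species", data=penguins).fit()
spelled = smf.ols("mass ~ flipper + species + flipper:species", data=penguins).fit()

# Each dummy acts like an "if": the interaction coefficient is added to the
# flipper slope only for penguins of that species
names = list(star.params.index)
gentoo_term = [t for t in names if ":" in t and "Gentoo" in t][0]
adelie_slope = star.params["flipper"]
gentoo_slope = adelie_slope + star.params[gentoo_term]
print(f"Adelie slope = {adelie_slope:.2f}, Gentoo slope = {gentoo_slope:.2f}")
```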
- Now, when we want to specify these effects: the main effects are the separate effects of the individual variables,
- the effect of species, the effect of flipper length. To specify these,
- we have our variables
- and we put plus signs between them in the statsmodels formula language. The interaction effect is the product of the variables.
- If we just want to specify the interaction effect, we can use V1 colon
- V2, and to get the main and the interaction effects we can specify V1 star V2.
- This expands to V1 plus V2 plus the interaction. Now, a lot of animals exhibit sexual dimorphism.
- There is a difference in body sizes between the sexes of the animal.
- So what if we incorporate sex? We can say, well, actually,
- male penguins might be larger or smaller than female penguins. So what I'm going to do here is a larger model.
- We're going to have our flippers. We're going to have species, and species
- star sex (species, sex, and the interaction), to allow some species to have a larger
- difference in body mass between male and female penguins than other species.
- And then I'm also going to interact flipper with species.
- I'm not interacting flipper with sex here; then we get a lot of variables blowing up,
- and also, in my own testing, it didn't work very well. When we do this,
- we get p-values on these coefficients and we get standard errors, and we see that for both of the flipper interactions with species
- we've got a large p-value and a confidence interval that includes zero.
- So this indicates that after we account for sex differences in penguin body mass,
- there is no longer a difference between the different species of penguin in the slopes
- relating the flipper length to the body mass.
- Also, we're seeing our flipper coefficient go down quite a bit.
- We also see a very high p-value and an almost non-existent coefficient for one of our species-sex interactions.
- If we were not using the formula API, if we were building up all of our matrices ourselves, we could get rid of just that term, but that's annoying.
- I'll let you go look at the StatsModels documentation to see how to do that.
- We're going to see a version of it later on when we're using another library.
- But the coefficient it's learning is almost zero, so it's not really having a significant effect in our model,
- and the complexity of getting rid of just that particular level of the interaction between our categorical variables is probably not worth it.
- So let's look at our AIC here.
- AIC here is down to two sixty nine. This indicates a model that's doing a substantially better job of explaining the data.
- But let's go ahead and drop the flipper-species interaction, since it's not significant.
- This is one part of how we start to do feature selection. How do you figure out what features should be in your model?
- Well, one way is you put a bunch of them in and then you drop the ones that aren't significant.
- You can also compare models with and without a particular feature using a metric such as
- AIC in order to pick the one that's doing the better job of fitting your data.
- Those are both valid strategies. So we're going to drop that interaction.
- AIC went down slightly, adjusted R-squared is the same, and it's better than what we had for the simple models at the very beginning.
- We've got the slightly better AIC, so we made a good decision to drop that particular interaction.
- So now we have flipper, and then we have species, sex, and the species-sex interaction.
- Now, you have to be careful with interpreting this.
- Remember, for each of these categorical variables, there's one level that was picked as the default.
- In this case, it's the Adelie penguins and the female penguins. So all zeroes in our categoricals means female Adelie penguins.
- So what we can interpret here is that chinstraps tend to be larger than Adelies, except that's not statistically significant.
- We have to be careful: that one's not really a meaningful effect. Gentoos tend to be larger than Adelies.
- Male penguins tend to be larger than female. And then for the interactions, chinstrap and male has a significant effect.
- Gentoo and male does not. The way we should try to interpret this is that most of
- the difference between chinstraps and Adelies seems to be that male chinstraps
- are smaller than male Adelies, while the females are more comparable.
- And Gentoos and Adelies likely have the same mass dimorphism.
- So the difference in mass between a male and a female Gentoo penguin is
- probably the same as the difference in mass between a male and a female Adelie penguin,
- because the interaction coefficient is not significant.
- So it takes some subtlety to correctly interpret these models.
- But you can start to learn from the coefficients here. Now, let's go look at our assumption checks.
- These assumption checks look very, very good. We don't see a problematic pattern in our residuals versus fitted plot.
- Everything is falling in about the same band, with some outliers.
- We have a very, very nice straight fit along the Q-Q plot. The assumption checks hold.
- From what we can tell, it looks like we have a relatively good model for predicting a penguin's body mass.
- Now another thing we have to be careful with is correlated predictors:
- two or more predictor variables that are correlated with each other in addition to being correlated with the outcome variable.
- This problem is called multicollinearity.
- We're trying to predict Y using X1 and X2, and we're going to learn coefficients
- beta 1 and beta 2. But we've got a correlation between X1 and X2, and there are a few different possibilities: it might be that X1 causes X2.
- It might be that X2 causes X1. It might be that there's some third component that causes both of them.
- But regardless of which way it is, there's a common effect here.
- There's a component that is common to both X1 and X2.
- And the problem is: which coefficient, beta 1 or beta 2, should get this common effect?
- Ordinary least squares does not have a way to make that decision.
- It might split it. It might put it all on one, so that the other coefficient is small.
- It might put too much on one and then possibly even make the other coefficient negative to compensate.
- When we have multicollinearity, it's very difficult to interpret your coefficients.
- You might have a fine predictive model, although its generalizability might be hurt,
- but it's going to be very difficult to interpret your coefficients when you have
- substantial correlations between variables. It's a judgment call as to what substantial is, because you may well not have zero.
- But if you've got a correlation of point eight or point nine, then you're probably going to have a problem.
- There are a few solutions to this. One is to drop one of the variables.
- Just pick the one that's the better predictor and use it, since it's correlated with the other one.
- Another way is to try to pull out the common component into a separate variable and then remove its effect from X1 and X2.
- And then there's another technique called regularization, which we'll talk about in a week or two, that can also help with that
- when your purpose is prediction; regularization hurts interpretability for inference.
- So, a pair plot. This is a seaborn pair plot, and it's useful for detecting correlations.
- When you're doing multiple linear regression, it's often useful to do a pair plot of all of your predictor variables.
- This is what one looks like. It's sometimes called a scatterplot matrix.
- Our rows and our columns are all variables.
- On the diagonal, we've got the distributions of these variables.
- And then in the other cells, we have scatterplots. So this is the flipper length on the y and the body mass on the x.
- This is the transpose of that.
- We can also look at a correlation matrix, which shows us numerically the correlation coefficients between pairs of variables.
- The diagonal is one, because every variable is perfectly correlated with itself.
- We have a correlation of point five nine between body mass and bill length,
- and point eight seven between body mass and flipper length.
- So if you put your different predictors into a scatterplot matrix and into a correlation matrix,
- then you can see whether you've got some correlations there that you potentially need to be careful of.
- Now, for selecting our models, I talked about how the adjusted R-squared and particularly the AIC discount complex models and prefer simple ones.
- And there's a good principle for this.
- Ockham's razor says that given two explanations, we should prefer the one that requires fewer assumptions. It sometimes gets generalized
- to say that we should prefer simple explanations for things. Our application of this to modeling is that we should prefer simple models: fewer variables,
- less weird functions on the variables, et cetera. Features and complexity need to earn their keep.
- They need to provide a better prediction or a better estimate of the outcome in order to be worth keeping around.
- As I said, AIC encodes this in the definition of its value: if two models have the same ability to explain the data,
- AIC is going to give a lower value to the one that has fewer features, fewer predictors in it.
- To wrap up, linear models can be extended to multiple variables.
- You can take an entire class just on multiple regression analysis,
- so we only have time to introduce it
- so that you can start to use it. You can look at the separate, or main, effects,
- and also at interaction effects between different predictors. We want to prefer simple models, though.
- With too much complexity, our model is less likely to be valid;
- it also increases compute complexity and the complexity of interpreting the model.
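
The formula syntax and model comparison described in this video can be sketched as follows. This is a minimal illustration, not the lecture's notebook: the data frame is synthetic stand-in data (not the real penguin measurements), and the column names `mass`, `flipper`, `species`, and `sex` are assumptions made for the example. It fits a model with the flipper-species interaction and one without, then compares them by AIC and adjusted R-squared.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the penguin data (assumption: NOT real measurements).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "species": rng.choice(["Adelie", "Chinstrap", "Gentoo"], size=n),
    "sex": rng.choice(["female", "male"], size=n),
})
species_shift = df["species"].map({"Adelie": 0.0, "Chinstrap": 5.0, "Gentoo": 25.0})
df["flipper"] = 190 + species_shift + rng.normal(0, 5, size=n)
sex_effect = np.where(df["sex"] == "male", 500.0, 0.0)
df["mass"] = (-2000 + 30 * df["flipper"] + sex_effect
              + 20 * species_shift + rng.normal(0, 200, size=n))

# "species * sex" expands to species + sex + species:sex;
# "flipper:species" adds only the interaction term.
full = smf.ols("mass ~ flipper + species * sex + flipper:species", data=df).fit()
reduced = smf.ols("mass ~ flipper + species * sex", data=df).fit()

# Compare the two candidate models side by side (lower AIC is better).
comparison = pd.DataFrame({
    "n_params": [len(full.params), len(reduced.params)],
    "aic": [full.aic, reduced.aic],
    "adj_r2": [full.rsquared_adj, reduced.rsquared_adj],
}, index=["full", "reduced"])
print(comparison)

# p-values of the flipper-species interaction terms in the full model.
inter_terms = [t for t in full.params.index if "flipper" in t and "species" in t]
print(full.pvalues[inter_terms])

# Correlation matrix of the numeric columns, as a multicollinearity check.
print(df[["flipper", "mass"]].corr())
```

Because the synthetic data is generated with a single flipper slope for every species, the interaction terms would usually come out non-significant here; real data may behave differently.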

## 🎥 Measuring Prediction Accuracy

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
PREDICTION ACCURACY
Learning Outcomes
Compute the accuracy of a regression’s predictions
Understand the benefits of RMSE over MAE
Photo by Sheri Hooley on Unsplash
Prediction and Inference
Prediction tries to predict the future with the model.
Inference uses the model’s structure and parameters to learn.
Train-Test
Data
Train
Test
Model
Split
Predict
Split before exploratory analysis of features.
Supervised learning – we have thing to predict
Train-Test
Create test sample:
test = predictable.sample(frac=0.2)
Filter other data for training:
train_mask = pd.Series(True, index=predictable.index)
train_mask[test.index] = False
train = predictable[train_mask]
Evaluate
Fit model on train
Use fit’s predict method on test:
preds = pfit.predict(test)
Compare predictions to test outcomes
Mean Absolute Error
Root Mean Squared Error
Translating a Formula
Why RMSE?
Interpreting Prediction Accuracy
MAE and RMSE are both in original outcome variable’s units
“Good enough” depends on application
What’s needed?
What’s the “noise floor” (minimum possible error)?
Can compare error of multiple models on equivalent test data
Wrapping Up
We measure a predictor’s accuracy by comparing predictions to test data that wasn’t seen in the training process.
Photo by Charles Deluvio on Unsplash
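
The train-test split listed on the slides can be run end to end as a small sketch. The `predictable` frame below is synthetic and its columns `x` and `y` are placeholders; the `random_state` argument and the `drop`-based alternative are additions for illustration, not part of the slides.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data frame (assumption: no missing values).
rng = np.random.default_rng(0)
predictable = pd.DataFrame({
    "x": rng.normal(size=50),
    "y": rng.normal(size=50),
})

# Pick 20% of the rows as test data; the sample keeps the original index labels.
test = predictable.sample(frac=0.2, random_state=20)

# Start with a mask that selects every row, then switch off the test rows.
train_mask = pd.Series(True, index=predictable.index)
train_mask.loc[test.index] = False
train = predictable[train_mask]

# An equivalent shortcut: drop the test rows by index label.
train_alt = predictable.drop(index=test.index)

print(len(train), len(test))
```

The mask version makes the logic explicit; the `drop` version gives the same training rows in one step.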

- In this video, we're going to talk about how to compute prediction accuracy.
- I gave you a couple of examples back in the prediction and inference video, but we're going to see here how to actually do it.
- So we're going to compute the accuracy of our regression's predictions and understand the benefits of RMSE over MAE.
- The code for this, you're going to find in the updated version of the correlation notebook from the tutorial section.
- I've also linked to it in the resources links in this week's material.
- So remember, prediction tries to predict the future with the model.
- We don't actually care about how well the model's internal structures map to real-world structures,
- except insofar as we might hope that a model that's more directly connected to reality does a better job of prediction.
- That hope doesn't always hold true. The way we test prediction accuracy is through a train-test split.
- So we take our data and we split it into two parts. We've got the train part
- and we've got the test part. We're going to fit the model on the training data,
- and then we're going to ask it to predict the test data and see how well it does.
- To do the split,
- I've got a data frame, predictable, that has the values, and I've pruned out all of the rows where the prediction won't work,
- where one of the predictors is NA or the outcome is NA, because we just can't predict those with a linear model.
- So I create a test sample.
- I use the sample method on the data frame object, and I tell it I want a fraction of point two.
- This is going to pick 20 percent of the data to use as test data. So now I've got this test data, and it keeps the index from predictable.
- So even if it's just a range index, test has the index values that correspond to the rows it picked.
- Sampling doesn't change the indexes of your rows.
- To get the training data: the training data should be everything that's not in the test data.
- So I'm going to create a mask. I use mask as a name for a Boolean series that's going to select rows.
- I set it up as a series, and I initialize it to true to say, yes, we want the value, and I give it an index
- that's the same index as our original data frame. So it starts out saying we want to select every row.
- Then I use the test index to select items in the mask,
- and I set the values of the mask there:
- I set the mask to false everywhere
- we picked a test row. You can also use dot loc here. I'm not; this is a shortcut, but you can use dot loc there if you feel that's clearer.
- And then with this mask, I get our training data
- by passing the train mask as a logical series to index into predictable.
- That's going to give me my training data. So: compute your test sample,
- create a mask that's true, and set it false everywhere you picked a test row.
- Then if you pick the remaining trues, you get all of the other data, and that's going to be our training data.
- So then we fit the model on train. If we're using OLS, we set up our OLS on train and we call fit.
- It's going to give us a fit object. Then we call predict.
- A fit object from a StatsModels model has a method called predict,
- which we give another set of data, and it's going to try to predict the outcomes for it.
- So we ask it to predict our test data, and then we compare the predictions to the test outcomes.
- Now, something that I haven't quite specifically talked about yet is the pattern we follow when we're running this code.
- We have a model. So we say model equals
- OLS, and we give it our formula and our data.
- And then we say fit equals model dot fit.
- That's the StatsModels pattern: the model itself has the data and the setup of the model that we're going to be fitting.
- And then when we actually fit the model, we get another object that has the parameters that are the result of fitting the model.
- Now, if you've used scikit-learn before (we're going to be using scikit-learn later),
- this is a different setup. StatsModels uses a slightly different pattern than scikit-learn does: it has these separate model and fit objects.
- But the fit object allows us to compute predictions for the test values that we can then compare to our test outcomes.
- One way to compare them is by taking the mean absolute error. We take the absolute values of our errors,
- and then we take the mean of them:
- we sum them up and we divide by n. I'm going to let you think a little bit about why we don't want to take the mean error.
- I will note that if we've properly fit a least squares model, the mean of our residuals is zero.
- The root mean squared error squares the individual errors: rather than computing an absolute value, we compute the square.
- We then take their mean, and we take the square root of all of that. The square root puts it back in the units of the original values.
- So how do we compute the root mean squared error?
- To see how to compute the root mean squared error, I have broken it down and written the code here.
- If we have a variable called outcome in our test frame,
- we subtract the predictions from the outcome, and that computes y minus y hat.
- I've color-coded the code to match the parts of the formula.
- We then take the square: we multiply the error by itself, and that computes the squares.
- We then take the mean of all of that. That's the one over n times the sum: the sum divided by the count.
- And then finally, we take the square root. As you can see here, especially as you move past being able to use existing models,
- you're going to need to be able to implement formulas in terms of NumPy or SciPy or pandas.
- You've done that some already. But this, I hope, helps you see
- how you can take a formula that we can write out,
- like RMSE, and break it down into the different NumPy operations that you're going to need to perform in order to compute the formula.
- So now, why do we use RMSE instead of MAE? One reason is that it gives higher weight to larger errors.
- One is less than two, but one squared is much less than two squared.
- Also, and this is particularly important, the squared error is continuous and differentiable.
- If we plot the absolute value, it has a corner at zero: it is continuous, but its derivative is discontinuous there,
- so you can't take its derivative at zero. You can take the derivative of the square. A lot of optimization techniques
- for learning models that are going to fit well and have low error
- rely on the differentiability of the errors. And so RMSE
- gives us an error metric that has a derivative.
- To interpret the prediction accuracy: both of these metrics are in the units of the original outcome variable.
- But what is good enough depends on your application. How good do you actually need to be for the purposes you're deploying the model for?
- Also, in a lot of cases,
- there's a minimum possible error just due to the intrinsic variance and the intrinsic noise of the data.
- You can't predict noise. Well, you can maybe try to predict noise,
- but if part of the data is just random noise or random variation, that
- puts a floor on how low your prediction error can get.
- With equivalent test data, we can use MAE and RMSE to compare the prediction effectiveness of multiple different models.
- That's where they really become useful, as they allow us to say:
- is this model better than our current model, or better than some simple baseline model?
- The linear models that we're using this week
- are really powerful in their own right. They're also a really good baseline if you are using some more sophisticated model.
- It's useful to include a linear model and see: can you beat a linear model? If not, maybe use the linear one.
- So to wrap up, we measure a predictor's accuracy by comparing predictions to the actual outcome values on test data.
- And it's important that the test data was not seen during the fitting process.
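
The step-by-step translation of the error formulas described in this video can be sketched with a tiny hand-checkable example. In words, MAE is the mean of |y − ŷ| and RMSE is the square root of the mean of (y − ŷ)². The values below are made up for illustration, and the column name `outcome` follows the transcript's wording rather than any particular notebook.

```python
import numpy as np
import pandas as pd

# Made-up test outcomes and predictions (assumption: illustrative values only).
test = pd.DataFrame({"outcome": [3.0, 5.0, 2.0, 8.0]})
preds = pd.Series([2.0, 5.0, 4.0, 4.0], index=test.index)

err = test["outcome"] - preds   # y - y_hat: 1, 0, -2, 4

# Mean absolute error: absolute value of each error, then the mean.
mae = err.abs().mean()          # (1 + 0 + 2 + 4) / 4 = 1.75

# Root mean squared error, one formula piece at a time.
sq_err = err * err              # squared errors: 1, 0, 4, 16
mse = sq_err.mean()             # sum divided by count: 21 / 4 = 5.25
rmse = np.sqrt(mse)             # back in the outcome's original units

print(mae, rmse)
```

The single large error of 4 pulls RMSE above MAE, which is the "higher weight to larger errors" behavior described above.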

## 🎥 Instances and Sampling

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
INSTANCES AND SAMPLING
Learning Outcomes
Identify the instances in a data set
Determine the correct level at which to sample for the bootstrap or test data
Bootstrap a CI for a linear regression’s coefficients
Photo by Mael BALLAND on Unsplash
Instances
An observation of a single thing
Has one or more variables
The variable values go together – row 0 is one penguin
Covariance / correlation measure intra-instance relationships
Sampling
Goal: bootstrap a confidence interval for flipper/mass correlation
Sample: penguin instances
Each bootstrap sample has n of them
Compute: correlation coefficient
CI is 95% interval of the bootstrap sample correlation coefficient
Bootstrapping Linear Models
We can bootstrap any statistic
Say, a linear model coefficient!
Sample rows (penguin instances)
Fit the model to each sample
Extract statistic
Can do multiple statistics simultaneously
Bootstrap Code
Related Quantities
We often compute relationships between variables of an instance
Paired t-test
Correlation coefficient
Linear regression
Keep instances together is the common rule
Bootstrap Components
Two pieces to a bootstrap design:
Statistic(s) to compute
Sampling strategy
To bootstrap a CI, sample instances
For single-variable statistic, can just sample values
Sampling a variable & taking variable from sample are the same
Two or more variables – must sample instances to keep them together
Bootstrapping P-values
Sampling Review
Wrapping Up
A cleaned and ready-to-model data set consists of instances.
Most of our sampling procedures should sample instances.
Photo by Dušan S. on Unsplash
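
The bootstrap steps listed on the slides (sample rows, fit the model to each sample, extract the statistic) can be sketched as below. This is an illustration under stated assumptions, not the course's own code: the function names `boot_rows` and `flipper_coef` are hypothetical, the data is synthetic, the formula is a simplified stand-in for the lecture's model, and 200 replicates are used instead of 10,000 to keep it quick.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def boot_rows(df, statistic, nboot=200, seed=0):
    """Percentile-bootstrap a statistic by resampling whole rows (instances)."""
    rng = np.random.default_rng(seed)
    observed = statistic(df)
    boots = np.empty(nboot)
    for i in range(nboot):
        # Resample n rows with replacement so each row's variables stay together.
        sample = df.sample(n=len(df), replace=True,
                           random_state=int(rng.integers(0, 2**31 - 1)))
        boots[i] = statistic(sample)
    lo, hi = np.quantile(boots, [0.025, 0.975])
    return observed, lo, hi

def flipper_coef(df):
    """Fit the model and return the flipper coefficient (hypothetical helper)."""
    model = smf.ols("mass ~ flipper + sex", data=df)  # model object holds data + setup
    fit = model.fit()                                 # results object holds parameters
    return fit.params["flipper"]

# Synthetic stand-in data (assumption: NOT the real penguin measurements).
rng = np.random.default_rng(1)
n = 150
data = pd.DataFrame({
    "flipper": rng.normal(200, 10, size=n),
    "sex": rng.choice(["female", "male"], size=n),
})
data["mass"] = (-2000 + 30 * data["flipper"]
                + np.where(data["sex"] == "male", 500.0, 0.0)
                + rng.normal(0, 200, size=n))

observed, lo, hi = boot_rows(data, flipper_coef)
print(observed, lo, hi)
```

Passing the statistic as a function keeps the two design pieces separate: the sampling strategy lives in `boot_rows`, and the statistic lives in `flipper_coef`.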

- In this video, I want to talk about instances and about sampling.
- The learning outcomes are for you to be able to identify the instances in a data set (instance is the common term in machine learning; we've been calling them observations),
- to determine the correct level at which to sample for bootstrap or test data, and to bootstrap
- a confidence interval for a linear regression. In machine learning contexts and data science contexts,
- we often call the individual rows of our data frame, the individual pieces of data
- we're going to learn from, instances.
- An instance is an observation of a single thing. For example, in our penguin data,
- each of these rows represents one penguin, an instance of a penguin. An instance has one or more variables, and the variables go together.
- So the first row is one penguin, and we've got its species, we've got its bill length,
- and we've got its body mass. Covariance and correlation measure
- the relationship between variables on a penguin-by-penguin, or instance-by-instance, basis.
- So when we're sampling, for example,
- if our goal is to bootstrap a confidence interval for the correlation between a penguin's flipper length and its body mass,
- we sample penguin instances, and each instance is going to come with a flipper length and a body mass and maybe some other things.
- If we just sampled flipper lengths, then we would estimate the sampling distribution of flipper lengths.
- We could do the same for body masses. But that loses the relationships between flipper lengths and body masses.
- It samples them as if they were independent variables.
- If we are trying to test things about the relationship, we need to keep the relationships,
- so we need to keep each body mass with its corresponding flipper length.
- And so we sample penguin instances. Each bootstrap sample will have n of them.
- We then compute the correlation coefficient in each sample, and we compute the ninety-five percent interval of this correlation coefficient.
- We can also bootstrap linear model parameters. We can bootstrap any statistic,
- so we could bootstrap the coefficient from a linear model. We sample rows,
- we fit a model to the sample, and we extract the parameter. We can also do multiple statistics simultaneously.
- You'll see that in the code I've posted online.
- But for this one, I'm just going to show you the code for using bootstrapping to estimate a single coefficient.
- In this case, we're going to estimate the flipper coefficient from the model that we developed in the multivariate regression video.
- I have a bootstrap rows function that takes a data frame and a statistic.
- It computes that statistic for our observations, and then computes our bootstrap samples: for each one, it uses the data frame sample
- method to create a sample with replacement, passes that off to the statistic function,
- and that gives us the stat. Then it does our usual thing: we return the observed value, and we return the percentiles
- of the bootstrap distribution to get the confidence interval.
- We're then going to write a function that, given a data frame,
- will fit a model and return the flipper length coefficient. We set up our OLS model:
- we give it our formula and the data, and it gives us our model object.
- Remember, StatsModels has separate model and fit objects.
- We then call fit to get our results object that contains the results of fitting this linear model.
- It has an attribute, params,
- which is a series, and we get the flipper parameter out of it.
- That gives us the parameter value. If we do this 10,000 times, we're going to get 10,000 versions of this parameter.
- If we do this, we get our coefficient of point three five nine,
- and we get a confidence interval of point two six eight to point four five four.
- If you cross-reference with the results of just fitting the model once,
- you'll find that the coefficient matches pretty closely, because this model fits relatively well.
- But this lets you bootstrap a coefficient and get confidence intervals that don't make all of the assumptions that a confidence
- interval in a normal linear regression makes. It gives you some more flexibility, and shows the flexibility of being able to do a bootstrap.
- But again, we're bootstrapping the rows, the instances, in this case penguins, from our data frame.
- Now, we are often going to be computing relationships between variables of an instance.
- We're doing that with the linear regression. We did that with the correlation coefficient. We did that with the paired t-test.
- Whenever we're trying to compare variables on an instance-by-instance basis (we're trying to compare heights to weights, for example),
- and the two variables
- come from the same instances, keeping the instances together is the common rule for sampling. It applies to bootstrapping confidence intervals,
- and it applies to creating test data. There are two components to doing a bootstrap.
- We talked some about this in our interactive time: you have the statistic that you need to compute,
- and you have the sampling strategy for doing your bootstrap.
- Solving any bootstrapping problem really comes down to those two things. So, to bootstrap a confidence interval,
- you're always going to sample instances. For a single-variable statistic,
- there's no difference between sampling instances and just sampling out of the column, except the column is faster,
- because you're only getting one variable per instance anyway.
- But as soon as you have two or more variables, say you're doing a correlation coefficient or a difference between two variables,
- you have to sample instances to keep the variables together so that you preserve the
- relationship, because we're trying to bootstrap something about the relationship, such as: students tended to do better on test two, or
- audiences tend to be more favorable towards movies than critics, or there's a correlation between
- flipper length and body mass.
- The distinction, though, is that when you're bootstrapping to get a p-value, you have to change the sampling. For confidence intervals,
- just bootstrap: just sample instances. That's the rule. For p-values, though,
- you need to sample from a simulation of the null hypothesis. So for example, H0 is that two variables are uncorrelated,
- so X1 is no more related to Y1 than it is to Y7. Remember, I said
- the reason we sample instances is that we need to keep the relationship preserved.
- We can sample independently to destroy the relationship and pretend there is no relationship, which is the null hypothesis.
- That allows us to simulate the null hypothesis by sampling from the flipper length
- and the body mass independently, rather than sampling on a penguin-by-penguin basis.
- We pretend that there is no relationship: the null hypothesis.
- If we compute correlations between these resulting samples, what we get is the distribution of correlation coefficients under the null
- hypothesis that there's no relationship between a penguin's flipper length and its body mass.
- Then we can compare our observed correlation to that distribution
- and compute a p-value. But we only do this independent sampling here because we're trying to sample from a null hypothesis.
- If we have one set of instances and we're trying to do a confidence interval, or we're trying to do train-test sampling for accuracy,
- then we want to sample the instances. So sample instances in general.
- The exceptions are when you need a p-value, or sometimes you're going to need group-level sampling:
- you're going to need a multi-level sampling of some kind when you have values that fall into coherent groups.
- We haven't needed that yet. Also, here we're talking about when
- we have one collection of the same instances.
- If you do have independent things, like you're trying to compare the flipper length between Gentoos and Adelies,
- you're not trying to compare two variables on the same instances. You're trying to compare a variable on two different sets of instances.
- Then they are independent, and you sample from them separately.
- But think about what you are trying to compare. If it's two variables on the same set of instances,
- you're going to want to sample instances, unless you're looking for a p-value.
- So to wrap up: we've got data, and it's ready for us to do modeling on it.
- We usually think of it in terms of a set of instances that have different variables.
- Most of our sampling procedures should generally sample instances. And in practice,
- I find confidence intervals are often more useful than p-values anyway.
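
The p-value procedure described in this video, sampling the two variables independently to simulate the null hypothesis of no relationship, can be sketched as follows. The data is synthetic, the variable names are placeholders, and the two-sided comparison on the absolute correlation is one reasonable choice rather than something the lecture specifies.

```python
import numpy as np

# Synthetic stand-in data (assumption: NOT the real penguin measurements).
rng = np.random.default_rng(7)
n = 100
flipper = rng.normal(200, 10, size=n)
mass = 20 * flipper + rng.normal(0, 300, size=n)

# Observed statistic: correlation computed on the intact instances.
observed = np.corrcoef(flipper, mass)[0, 1]

# Simulate the null: resample each column independently, which destroys
# the pairing between a penguin's flipper length and its body mass.
nsim = 2000
null_rs = np.empty(nsim)
for i in range(nsim):
    f = rng.choice(flipper, size=n, replace=True)
    m = rng.choice(mass, size=n, replace=True)
    null_rs[i] = np.corrcoef(f, m)[0, 1]

# Two-sided p-value: how often is a null correlation at least as extreme?
p_value = np.mean(np.abs(null_rs) >= np.abs(observed))
print(observed, p_value)
```

For a confidence interval on the same correlation, the sampling would switch back to whole rows: one set of resampled indices applied to both arrays.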

## 🚩 Week 8 Quiz

Complete the Week 8 quiz in Canvas.

## 📃 StatsModels Examples and User Guide

The following StatsModels pages document its OLS model: