Week 10 — Classification (10/24–28)
This week we introduce classification as a prediction task, with methods for evaluating classifiers.
🧐 Content Overview
This week has 1h40m of video and 5650 words of assigned readings.
🎥 What is Classification?
In this video, I introduce the week and what classification is.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
WHAT IS CLASSIFICATION?
Learning Outcomes (Week)
Predict binary outcomes or classes for observed instances
Evaluate the accuracy of a classification model
Use Scikit-Learn for predictive modeling
Photo by Photos Hobby on Unsplash
Outcome Variables
Continuous — regression
Categorical — classification
Two-level category (or ordinal) is binary classification
Multi-level ordinal — ordinal regression
Multi-level categorical — multi-class classification
Sometimes we’ll simplify to binary problem!
“mostly fresh”: % Fresh greater than 50%
Groups and Classes
Inference
What is the difference?
t-tests etc.
Classification
Where does a new animal go?
Photos by The Lucky Neko and Jametlene Reskp on Unsplash
Classification and Probability
Wrapping Up
Classification allows us to predict discrete outcome variables.
There are many models. We will start with linear ones.
Photo by Adam Nieścioruk on Unsplash
- Hello and welcome to the tenth week of CS 533. This video introduces our week's topic,
- where we're going to pivot from regression to classification. Our learning outcomes for this
- week are for you to be able to predict binary outcomes or classes for observed instances, to
- evaluate the accuracy of a model for doing this classification, and also to use scikit-learn
- for predictive modeling in addition to the statsmodels code that we've been using so far.
- So if we think about our different outcome variables,
- we talked at the beginning of the class about different types of variables and that applies to outcomes as well as all the other variables.
- What we saw in the last couple of weeks is trying to predict continuous outcome variables, and we call this regression.
- It also applies to other numeric outcome variables like integers,
- but regression is where we're trying to predict these continuous numeric outcome variables.
- When we're trying to predict a categorical variable, we typically call it classification.
- And if we have a two level category or a two level ordinal,
- we call this binary classification because our goal is to classify instances into one of two different categories.
- There are also methods for dealing with multi-level ordinals, which are called ordinal regression.
- And if you have a multi-level categorical — you have more than two categories —
- then we have what we call multi-class classification.
- We're going to be focusing, particularly for now, on binary classification, where we have a yes or no outcome that we're trying to predict.
- We call it yes/no, positive/negative, true/false. We could call it A/B if we don't want to assign positive value to one.
- But our usual classification setup is going to be to estimate: is this in the positive class?
- Sometimes we'll simplify another problem to a binary problem so we can apply a classifier. For example, in the Rotten Tomatoes data,
- we may say we want to classify movies as mostly fresh
- if they have a percent fresh greater than 50 percent or some other threshold.
- You might categorize a rating as positive if it's greater than three, or greater than or equal to three. These are ways in which we
- will sometimes take outcomes that are richer than a binary class and convert them into one for the purposes of using a classifier.
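For instance, a quick sketch of this kind of binarization in pandas (with a made-up data frame and column names, not the actual Rotten Tomatoes data):

```python
import pandas as pd

# Hypothetical movie data with a percent-fresh score (0-100)
movies = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'pct_fresh': [85, 42, 61],
})

# Derive a binary outcome: "mostly fresh" means more than 50% fresh
movies['mostly_fresh'] = movies['pct_fresh'] > 50
```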
- Now, to relate this to the inference that we saw back in the fourth week.
- So if we've got two different groups or two different classes here,
- inference, as we saw, is looking at what is different about these two groups.
- One's in a basket, one's on stairs; one's kittens, one's puppies.
- We use things like t-tests or other statistical tests to look at the difference between the two groups.
- But our goal is to understand the groups and their relationships. That's what we're doing with inference.
- Just like with inference in the regression context, we're trying to understand the structure of the space and what drives changes.
- With classification, which is a type of prediction,
- what we're trying to do is: we're given a new animal, and we're trying to figure out which group it goes in.
- So inference is: given these groups of animals, what makes them different? Classification is: given a new animal,
- is it a cat or a dog? We're assuming everything is cats or dogs for now.
- If you bring in a rabbit or a ferret or a gerbil, the classifier is going to have a hard time.
- But the idea here is that classification is trying to figure out which group a new thing belongs in,
- whereas inference is trying to figure out what the differences between the groups are.
- These are gonna be very related because the differences between the groups are the basis for figuring out which group something belongs in.
- But the outcome is different, and the way we assess what we're doing is different. Classification, as with prediction,
- we assess on the basis of its ability to correctly predict or classify future unseen data.
- So we often encode our binary classes in zero one where one is the positive class.
- If you use a logical value — a logical in pandas — it will automatically convert to an integer zero/one where one is true.
- And remember that when we have a vector or a series of zeros and ones, if you take the mean of that,
- you get the fraction of trues, or the estimated probability of true.
- It's the probability of drawing an item from your data and it being true.
- But also, if your data is a representative sample, it is an estimate of the probability of the value being true
- when you draw from the population. So we can think about taking the mean of y for some fixed x.
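A tiny sketch of that fact with a hypothetical pandas series:

```python
import pandas as pd

admitted = pd.Series([True, False, True, True, False])

# Booleans act as 0/1, so the mean is the fraction of Trues --
# an estimate of P(admitted) if this is a representative sample
admitted.mean()   # 0.6
```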
- So when we're doing a regression, what we do is compute the expected y, the expected outcome variable,
- for some fixed x. Regression is going to find an estimate of the mean outcome.
- And if the regression properly captures what's going on in the data, then the output of the regression is going to be the mean outcome.
- Because remember, our residuals are normally distributed, homoscedastic, and centered at zero.
- So everywhere along your line, the mean residual is zero,
- which means that the line is the expected value of the outcome variable for that particular value of x.
- What we often do here is try to compute the conditional probability:
- what's the probability of y being true, given a particular value of x?
- So it's going to have this relationship to regression. In both cases, we're trying to compute this expectation or this probability.
- Fun sidebar fact: probability is the expected value of the characteristic function, where the characteristic function is one
- if something is true and zero if it's false. So we're trying to compute this conditional probability,
- which we can also frame as a conditional expectation.
- So to wrap up classification allows us to predict discrete outcome variables.
- There are many different models for doing this. We're gonna start with linear ones as we were using for regression.
🎥 Log-Odds and Logistics
In this video, I introduce log odds, along with the logistic function and its inverse,
the logit function.
Log odds are a useful concept in many situations!
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
LOG ODDS
Learning Outcomes
Convert between probability, odds, and log odds.
Understand the function of the logistic and logit functions.
Photo by Madara Parma on Unsplash
Odds
Probability and Odds
Odds
Logarithms
Very useful for compute
Avoid floating point over/underflow
Be careful though:
Addition and subtraction can compound errors
Odd Logs
Photo by Jennifer Ekstrand
Log Odds
Logistic
Diagram from Wikipedia
Wrapping Up
Odds are another way of representing probabilities.
The logistic and logit functions convert between probabilities and log-odds.
Photo by Praveen kumar Mathivanan on Unsplash
- Hello. In this video, I want to introduce the concept of odds and log odds;
- the learning outcomes are to be able to convert between probability, odds, and log odds,
- and to understand the role and purpose of the logistic and logit functions.
- So suppose we have a probability of success. Probability is not the only way to quantify how likely something is.
- We can also quantify it with what are called the odds. If you've heard of odds like ten to one against, or odds of two to one,
- this is what that's talking about. So if we have a probability of success —
- and remember, probability is going to be in the range zero to one,
- although we can't actually have probabilities of exactly zero or one when we're talking about odds;
- I'll get to that in a moment — the odds are the probability of success
- divided by the probability of failure, that is, the probability of success divided by one minus the probability of success.
- It gives us the ratio of the likelihood of one outcome versus the other.
- Odds express how much more likely
- success is than failure, or failure is than success if we're quantifying the odds of failure.
- So a couple of examples. Star Wars robots.
- A couple of them talk differently, some in some in probability and some odds.
- K-2SO says there's a ninety seven point six percent chance of failure.
- So our probability of failure is 0.976.
- And so our odds are 0.976 divided by 0.024,
- which is equal to 40.67. So effectively
- we have approximately forty-one to one odds: the odds against success — the odds of failure — are forty-one to one.
- So we can translate this ninety seven point six percent chance into odds, and we can also translate the other way around.
- C-3PO is being a little imprecise in his speech.
- He should say the odds — the odds against successfully navigating an asteroid field —
- but he says the possibility of successfully navigating. We'll forgive him, because he's talking about a ratio, 3720 to 1;
- he's clearly talking about odds. And so if the odds are 3720 over 1, that's equal to p over one minus p.
- If we solve that equation, converting these odds of 3720
- to one, we get a probability of failure of 0.9997, or 99.97 percent.
- And so we can convert back and forth between probability and odds,
- so long as our probability is not zero or one.
- If the probability of success is zero, then the odds will be zero.
- But if the probability is one, then the probability of failure will be zero,
- and so the odds will be infinite. Another thing about odds: if p is equal to 0.5, then the odds are equal to one.
- So one is even odds: just as likely as it is unlikely.
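A quick sketch of these conversions in plain Python, using the numbers from the two droid examples:

```python
# K-2SO: a 97.6% chance of failure, expressed as odds
p_fail = 0.976
odds_fail = p_fail / (1 - p_fail)   # about 40.7 -- roughly 41-to-1

# C-3PO: odds of 3720 to 1, converted back to a probability
odds = 3720
p = odds / (odds + 1)               # about 0.9997

# Even odds: a probability of 0.5 gives odds of 1
even = 0.5 / (1 - 0.5)              # 1.0
```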
- Now, hopefully you remember logarithms from high school algebra. Logarithms allow us to convert between exponents
- and ordinary numbers: if x is the log base b of y, that means y equals b to the x. Logarithms undo exponentiation.
- A few common bases: we have e, the natural log; that's the one we use
- by far the most. We have log base 10, which is really useful for orders of magnitude,
- and when we plot on a log axis, that log is almost always log 10.
- Logs in different bases are proportional to each other — they differ by just a constant multiplier.
- So we can use e for a lot of computations and base-10 logs for displaying log axes in plots.
- Sometimes you'll see base-2 logs, particularly when we're dealing with low-level operations in computer programming.
- NumPy has the function np.log, which is the natural log, and np.log10, which is the base-10 log.
- Logarithms have a variety of identities.
- If you're going to multiply two numbers, that's equivalent to adding their logarithms:
- log(xy) = log(x) + log(y). Exponentiation is multiplying the logarithm: log(x^y) = y log(x).
- These identities are useful for reasoning about what's happening with your logarithms.
- Logarithms are only defined for x greater than zero.
- In a moment, we're going to get to log odds.
- The probability can't be one if you want to convert to odds, because then you're going to have infinite odds,
- and it can't be zero if you want to convert to log odds, because then the odds would be zero and you can't take the log of zero.
- One common operation when we do have some zeros is to take the log of one plus x, and NumPy has a function that does that very,
- very precisely. If you need to do log of one plus x, don't just write log(1 + x); use log1p. Floating point is a weird beast;
- I'm going to post a reading or two for you to learn more about what's going on with floating point.
- But logarithms are very useful for computation because they help us avoid floating point under- or overflow.
- Very, very small probabilities will have logarithms that are still within the reasonable range that we can store.
- Also, though, you have to be a little bit careful, because in floating point arithmetic
- addition and subtraction are more efficient than multiplication and division.
- So if you do your computations in logarithms, they're more efficient and they stay in a more useful range of values many times.
- But you have to be a little bit careful, particularly with subtraction
- but also addition, in floating point, to avoid problems with precision and cancellation.
- Again, I'm going to post a reading or two about that.
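As a rough illustration of these floating-point points (a sketch using NumPy), here is what log1p and working in log space buy you:

```python
import numpy as np

x = 1e-12

# Computing 1 + x first throws away most of x's precision before the log
naive = np.log(1 + x)    # slightly wrong in the low digits
better = np.log1p(x)     # accurate: very close to 1e-12

# Log space also keeps tiny probabilities representable: the product of
# 1000 probabilities of 0.01 underflows to 0.0, but the sum of their
# logs is a perfectly ordinary number.
probs = np.full(1000, 0.01)
probs.prod()             # 0.0 (underflow)
np.log(probs).sum()      # about -4605.2
```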
- But we have our odds and we have our logarithms, so now we can get to log odds (or odd logs).
- The log odds is the log of the odds:
- it's log(p / (1 − p)), which is equal to log(p) − log(1 − p),
- because multiplication becomes addition and division becomes subtraction in log space. This function has a name:
- this log-odds function is called the logit, and the logit of p is equal to
- log(p) − log(1 − p), or the log of p over one minus p.
- The logit converts probability into log odds — it would convert K-2SO's probabilities into C-3PO's odds.
- To go back, the logistic function is the inverse of the logit.
- The logistic is 1 / (1 + e^(−x)); equivalently, it's e^x / (e^x + 1).
- It has a curve like this: the logistic of zero is 0.5; as x goes to infinity, the logistic approaches one;
- as x goes to negative infinity, it approaches zero.
- So it maps unbounded real values into the range zero to one.
- This is very useful because we have zero one outcomes.
- It allows us to have an unbounded value that predicts it, and it gets us into zero
- and one without having to do clamping or anything else that's not differentiable.
- Also, you might have a zero/one outcome variable,
- and the 0/1 values aren't normal, or probabilities often aren't normal, because they're bounded in the range zero to one,
- but it's not uncommon to find a situation where the logit of the probabilities is normal.
- And if the logit of your probabilities is normal — the log odds are normal —
- then that lets you apply things that work with normally
- distributed numbers to the log odds, and then convert them back into probabilities with the logistic function
- when you actually need a probability. The logistic function is also an example of what's called a sigmoid curve,
- an S-shaped curve. As I said, it converts log odds into probability.
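A minimal sketch of the logit and logistic functions in NumPy (SciPy also provides these as scipy.special.logit and scipy.special.expit):

```python
import numpy as np

def logit(p):
    """Convert a probability into log odds."""
    return np.log(p) - np.log(1 - p)

def logistic(x):
    """Convert log odds back into a probability (inverse of logit)."""
    return 1 / (1 + np.exp(-x))

logit(0.5)              # 0.0 -- even odds
logistic(0.0)           # 0.5
logistic(logit(0.976))  # 0.976 -- the round trip recovers the probability
```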
- So odds are another way of representing probabilities, and the logistic and logit functions convert between probabilities and log odds.
- Odds of one is even; log odds of zero is even —
- log odds are zero when odds equal one.
- Log odds are also more balanced: odds of two to one is two and odds of one to two is 0.5,
- but the log odds of two to one is just a little under one
- and the log odds of one to two is just a little over negative one.
- So log odds give you more symmetry, more balance, and they become very computationally useful to work with.
🎥 Logistic Regression
We’re now ready for our first classification model: logistic regression.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
LOGISTIC REGRESSION
Learning Outcomes
Compute a logistic regression to predict a binary value
Understand generalizing linear regression
Photo by Vishnu Prasad on Unsplash
Linear Regression
Classification Setup
General Linear Models (GLMs)
Logistic Regression
Training a GLM
Generalized linear model
Binomial family (logit link)
p-values test significance
Predicting with GLM
scores = lg_res.predict(test)
Results in predicted scores (estimated probabilities)
Convert to binary class w/ threshold:
decisions = scores > 0.5
Terms
Outcome — the true outcome of the data we seek to predict
Called ground truth
Score — our prediction score (logistic: estimated probability)
Decision — our binary decision
Usually done by thresholding score
Some models & software directly output predictions
Wrapping Up
General linear models allow linear predictions of other quantities.
Logistic regression lets us use linear models for binary classification.
Photo by Michèle Eckert on Unsplash
- In this video, we're going to take the concepts we introduced in the last video —
- the log odds and the logistic — and introduce logistic regression. The learning outcomes are for you to be able to compute a logistic
- regression to predict a binary value, and to understand generalizing linear regression.
- So recall linear regression, where we're trying to generate predictions y hat.
- We do that with an intercept and then the sum of scalar multiples of our feature values.
- We can generalize this in one way: we can transform feature values, which we can think of as applying a
- function f_j to each feature x_ij rather than just multiplying it by
- a coefficient. The full version of this is called a generalized additive model.
- But that's transforming our input features; we can also do that as part of our data cleaning process.
- But in addition to transforming input features, we can transform the output.
- We're going to see that as we look at our classification setup. So if Y is a zero/one logical random variable,
- one is going to be what we call our positive class: is admitted, is spam, is fraud, whatever it is we're trying to detect.
- We're going to say Y equals one is the positive class and Y equals zero is the negative class. In some cases,
- there's not going to be any hierarchy or moral value;
- we just have to pick one and say it's the positive class. And then we have our predictor variables X, just like we did in the linear regression.
- Our goal is to compute Y hat that predicts Y, just like we did in linear modeling,
- except now we don't have these continuous values, and it's not meaningful to
- subtract — what happens if you subtract true from false?
- So one way to do this is, rather than estimate the value of Y, we can estimate the probability that Y is one.
- Now, remember:
- when we have zero/one data, the probability of one and the mean
- are the same thing, so we can try to estimate the probability that Y is one.
- One way we can do this is with what's called a generalized linear model. A generalized linear model wraps the whole model in a function:
- we have a link function g, and we wrap the model in its inverse, g^(-1).
- There are different ways we can do this. We can do this for count data with what's called a Poisson
- regression, where the link function is the log. We can do it with binary data, or zero/one outcomes —
- that's what we're going to be focusing on here. This is called a logistic regression, and the link function is the logit,
- so g^(-1) is the logistic.
- It's called logistic regression because we wrap
- the result of our linear model — beta zero plus the sum of the
- beta_j x_j terms — in the logistic function.
- So the probability of Y equals one, for a particular x,
- is y hat, our predicted value,
- which is equal to the logistic — I made an error in the slide
- there — it is equal to the logistic of beta zero plus the sum of our beta_j x_j terms.
- So in this case, Y can be our variable 'admitted to grad school', and we have x1, x2, x3 as our GRE, GPA, and school rank.
- I'm going to provide a notebook that does this and we can try to predict that with our Y hats now.
- So we can do this. We have this code here where we call glm instead of ols —
- we ask for the generalized linear model, we say we want to predict admit with our three variables, and then we tell it that we have a family.
- Generalized linear models have what are usually called families.
- We're going to say it's the Binomial family, which gives us the logit link function.
- Then we fit the model and we get the results, and this is going to look kind of familiar.
- We don't have an R-squared, because we're predicting zero/one outcomes and it's not really meaningful to talk about their variance.
- We do have a log likelihood; we're going to see in a later video what that means.
- We also have our column of p-values, which gives us significance tests on our different coefficients.
- We can use them to drop non-significant predictor variables, like we would in linear regression.
- So this gives us the fitted linear model.
- The coefficients are harder to interpret —
- they're not impossible to interpret in logistic regression, but they're harder to interpret.
- The first important thing is that they are all in terms of log odds.
- And so an increase in school rank of one decreases the log odds of admission by
- about 0.5. This is the output of our logistic regression.
- We can then predict with a logistic regression using the predict method, just like we do with a linear regression in statsmodels.
- The predict method gives us predicted scores, and these are after applying the logistic,
- so they are estimated probabilities. We can then convert them into a binary class, so we can actually make a decision, with a threshold:
- we can say we're going to accept everything where the predicted probability is greater than 0.5 —
- it's more likely than not that this one would be
- a yes, according to our model. Different models and different tasks may require different thresholds, because there are going
- to be different costs for false positives and false negatives depending on our application.
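A sketch of the statsmodels workflow described here, assuming train and test data frames with admit, gre, gpa, and rank columns as in the grad-admissions example (the demo notebook has the authoritative version):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit a logistic regression as a GLM with the Binomial family (logit link)
lg_mod = smf.glm('admit ~ gre + gpa + rank', data=train,
                 family=sm.families.Binomial())
lg_res = lg_mod.fit()
print(lg_res.summary())    # coefficients (in log odds), p-values, log-likelihood

# predict returns estimated probabilities; threshold them for decisions
scores = lg_res.predict(test)
decisions = scores > 0.5
```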
- So a couple of terms that are going to be useful as we talk more about these.
- First, the outcome: this is the true outcome of the data we seek to predict; as in the linear model,
- we're going to call it Y. We call this the ground truth as well, and this is also why it's called a supervised learning problem,
- because we have this ground truth outcome data that we're trying to learn to predict.
- And for the purposes of building and evaluating our model, we assume the data is correct.
- Ground truth data can be biased in various ways as well;
- that affects what we learn from it, and it affects our ability to predict.
- It's also really, really important to note that the outcome variable needs to be the actual outcome we're looking for.
- Because you're predicting your outcome variable and if your outcome variable isn't what you think it is.
- Then you're not predicting the thing you think you're predicting.
- I mentioned this in class at one point, but Cathy O'Neil, author of Weapons of Math Destruction,
- pointed out on Twitter earlier this fall that in most cases we don't actually have crime data. If we want to predict crime —
- say, a high crime area, or whether a crime is going to happen in a particular area or at a particular time —
- We don't usually actually have crime data. None of our data tells whether a crime happened or not.
- Our data says whether a crime was reported. Our data says what police do.
- And so if we're training off of that data, we're not predicting crimes. We're predicting crime reports or police activity, which is not the same thing.
- And so it's important to know what our observed outcome variable
- actually is, and how that relates to the task that we're actually trying to solve.
- We then have a score, which is the prediction score that comes out of our model; in logistic regression,
- it's the estimated probability. If we just take the linear part of the logistic regression, it's the estimated log odds.
- And then we have our decisions. We use the score to make a decision, often by thresholding it.
- And that's what we decide to do.
- So if we got a spam detector, our outcome is whether or not the message is spam, as it's been labeled, maybe by our users, maybe by our spam experts.
- Our decision is whether or not our model says it's spam. And then what we're going to look for when we start to look at the accuracy of these models
- for their predictions is the extent to which those decisions match those outcome variables.
- When we say it's spam, is it spam? So to wrap up: generalized linear models allow linear predictions of other kinds of quantities, such as counts,
- and, particularly for our purposes, binary outcomes — yes or no, true or false.
- Logistic regression is the particular way that we use a linear model in order to do binary classification.
🎥 The Confusion Matrix
The confusion matrix describes the outcomes of a classification model and is the basis for computing effectiveness metrics.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
CONFUSION AND ACCURACY
Learning Outcomes
Count true and false positives and negatives
Understand the confusion matrix
Compute precision, recall, and accuracy
Photo by marianne bos on Unsplash
Decisions and Outcomes
Confusion Matrix
True positives & negatives — correct decisions.
False positive — incorrectly decide yes
False negative — incorrectly decide no
| | Truth = 1 | Truth = 0 |
| --- | --- | --- |
| Decision = 1 | True Positive | False Positive |
| Decision = 0 | False Negative | True Negative |
Accuracy
Precision
Recall
Why Not Just Use Accuracy?
Not all errors have equal costs
False positive in a preliminary cancer screen: run another test
False negative: miss a life-threatening illness
Imbalanced classes skew accuracy
90% cases positive: always guessing 1 has 90% accuracy
Confusion: any of these metrics sometimes called accuracy metrics.
Precision [P(Y=1|D=1)]
Captures correctness of positive decisions
When we say spam, is it spam?
When we say fraud, is it fraud?
When we say good loan, is it a good loan?
Also called Positive Predictive Value
Useful when FP cost is high, getting it right is important
Sensitive to prevalence of positive outcomes
Recall [P(D=1|Y=1)]
Captures detection rate of true outcomes
If it’s spam, do we find it?
If it’s fraud, do we find it?
If it’s cancer, do we find it?
Also called Sensitivity or True Positive Rate
Crucial when FN (missed detection) cost is high
False Positive Rate
False Negative Rate
Specificity
Metrics
We can derive many metrics from the confusion matrix
Metrics reflect different needs and consequences
Metrics reflect different stakeholder perspectives
FPR: how likely am I to be falsely accused?
Wrapping Up
The confusion matrix categorizes classification results and errors.
Ratios of various counts produce metrics for classifier accuracy.
Photo by Jon Tyson on Unsplash
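A sketch of computing these counts and metrics directly with NumPy, using hypothetical decision and truth arrays (scikit-learn's accuracy_score, precision_score, and recall_score compute the same quantities):

```python
import numpy as np

# Hypothetical decisions and ground-truth outcomes (1 = positive class)
decisions = np.array([1, 1, 0, 0, 1, 0, 1, 0])
truth     = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((decisions == 1) & (truth == 1))   # true positives
fp = np.sum((decisions == 1) & (truth == 0))   # false positives
fn = np.sum((decisions == 0) & (truth == 1))   # false negatives
tn = np.sum((decisions == 0) & (truth == 0))   # true negatives

accuracy  = (tp + tn) / len(truth)   # fraction of correct decisions
precision = tp / (tp + fp)           # P(Y=1 | D=1)
recall    = tp / (tp + fn)           # P(D=1 | Y=1), a.k.a. sensitivity / TPR
```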
📓 Logistic Regression Demo
The demo notebook for our initial logistic regression videos.
🎥 Baseline Models
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
BASELINE MODELS
Learning Outcomes
Understand the role of a baseline predictor
Compare a prediction quality metric against a suitable reference
Photo by Diana Polekhina on Unsplash
Reference Point
RMSE = 0.8567
Precision = 0.925
Are these good?
Unknown – we need more information!
Application specific
State-of-the-art specific
Photo by Diana Polekhina on Unsplash
How Good is Possible?
Noise inherent in observable outcomes may limit how good a predictor can get!
Noise component of bias/variance tradeoff
Sometimes called “magic barrier”
May be in:
Actual outcomes
Observation of outcomes
Or both
May not actually know this value.
How Bad is Possible?
Majority-Class Classifier
Classifies everything as having the most common label
Can you beat it?
If not, maybe use constant policy?
Works well for accuracy – less well for other metrics (depending on which class is majority)
Photo by Mahmud Ahsan on Unsplash
Random classifier
Randomly picks outcomes
Uniformly, or
Proportional to observations in training data
Another example of how bad you can get!
Works for all our confusion matrix metrics
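One way to get both of these naïve baselines is scikit-learn's DummyClassifier. A sketch, assuming feature and label data train_x, train_y, test_x, test_y like those used in the scikit-learn video later this week:

```python
from sklearn.dummy import DummyClassifier

# Majority-class baseline: always predicts the most common training label
majority = DummyClassifier(strategy='most_frequent')
majority.fit(train_x, train_y)
majority.score(test_x, test_y)   # accuracy of the constant policy

# Random baseline: guesses labels in proportion to the training data
random_clf = DummyClassifier(strategy='stratified', random_state=42)
random_clf.fit(train_x, train_y)
random_clf.score(test_x, test_y)
```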
More Sophisticated Baselines
Linear models (w/ a few predictors)
Decision trees
Widely-used, “simple” models
Application-specific, but there’s often something between “do nothing” and “state of the art”.
State of the Art
Many problems have an existing best practice
Model currently used to solve the problem
Good models from the research literature
Can you do better?
Recommended Comparison Set
Naïve baseline (mean, majority, or random)
Simple baseline models
Current state of the art
Your thing
Not all levels make sense for all problems
Wrapping Up
Most effectiveness metrics need context to be interpreted.
Baselines and state-of-the-art can provide that context.
Photo by Anne Nygård on Unsplash
📃 Floating Point
This is provided for reference.
📃 StatsModels Documentation
The following StatsModels page documents its logistic regression:
This is not an assigned reading - it is here for your reference.
🎥 Log Likelihood
This video describes the log likelihood that is the objective function used by logistic regression.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
LOG LIKELIHOOD
Learning Outcomes
Compute the log likelihood of data from a model estimating probabilities.
Understand the objective function of logistic regression.
Photo by Devin Avery on Unsplash
Logistic Regression
Probability of Observed
Conditioning on Parameters
Likelihood Function
Proportionality
Log Likelihood
Maximum Likelihood Estimator
Expanded Log Likelihood
Example
Can compare on same data
Different data ⇒ different likelihood
Likelihood and Posterior
Maximizing
Wrapping Up
Logistic function is trained by maximizing the log likelihood of the training data given the model.
Photo by Geran de Klerk on Unsplash
- In this video, I'm going to introduce the log likelihood measure that you see when you're training a logistic regression.
- We're going to see how that's computed, and we're going to talk about what it means to estimate parameters with a maximum likelihood estimator.
- So our learning outcomes for this video are
- for you to be able to compute the log likelihood of data and to understand the objective function of logistic regression.
- So recall, in logistic regression we're learning a model: y hat equals the logistic function of a linear model,
- and that y hat is trying to predict the probability that our outcome variable will be one — the probability of yes.
- That's what we're doing in our logistic regression. But we can also think about the probability of the data.
- What if we want to compute
- not the probability that Y is one, but the probability that it equals our observed value?
- What probability does our model, our logistic regression model, assign to the outcomes that we actually observe in the data?
- That's what's being captured in the log likelihood. We can compute this by using the scores —
- these estimated probabilities — and the actual values.
- We're going to take the score and raise it to the power of the outcome variable, which is in zero/one,
- and we're going to have one minus the score — remember, one minus the probability of something is the probability of not-something,
- so if y-hat_i is the estimated probability that y_i is one, then 1 − y-hat_i is the estimated probability that y_i is zero —
- and we're going to raise that to the power of one minus y_i. So the probability of observation i is y-hat_i^(y_i) × (1 − y-hat_i)^(1 − y_i).
- This is a trick. We saw in previous examples that, when we're adding things up, we can use multiplication by a one or a zero as a switch:
- if you multiply something by one, you get the thing; if you multiply it by zero, you get zero.
- So when you add those together, the ones and zeros turn different pieces of the computation on and off.
- For multiplication, we can use exponentiation — raising something to a power — to do the same thing,
- because, remember, x to the zero is one no matter what x is,
- and x to the one equals x.
- So if y_i is one,
- then y-hat_i to the one is y-hat_i, and one minus y-hat_i is raised to the one minus y_i — that is, to the zero — which is one.
- And if y_i equals zero, then the first factor is one
- and we keep one minus y-hat_i.
- So the zero or the one picks which of these two scores we actually use.
- If the observed value is one, then its probability is the result of our logistic regression,
- passed through the logistic function so it's actually a probability;
- but if the observed value is zero, then we need the negation of that probability, one minus it.
- The exponentiation turns into precisely the switching, the conditional, that we need there.
- It's a neat little trick: if you multiply a variable by zero, you get the additive identity, zero,
- and if you raise a variable to the power of zero, you get the multiplicative identity, one.
- So we can compute the probability of our observed data. We can also condition on the parameters.
- We've been thinking about the probability of Y equals one given X,
- but we can extend that to think about the probability of Y equals one given our parameters beta, too.
- What our model is really computing, if we generalize it over the betas
- as inputs as well, is the probability of one given X and given our parameter values beta,
- and that is equal to the logistic function of beta zero plus the sum of the beta_j's multiplied by the feature values.
- So, given data y and X — and I'm using boldface here to indicate these are vectors of data values as opposed to individual data values,
- and to make them a little bit distinct from random variables —
- the likelihood of the parameters given the data is the probability of the data y and X
- given our parameters. This is proportional to the probability of y
- and X given beta, which is proportional to the probability of y given X and beta,
- which in turn is equal to the product of the probabilities of our individual y_i's given x_i.
- We are assuming here that they are exchangeable, that there's no difference if we shuffle the order.
- Now this is the probability of the exact sequence that we observed.
- But we can renormalize it if we want the probability of observing these data in any sequence.
- This is the likelihood function. We'll talk a little bit more about what likelihood means and how it fits into a bigger picture.
- But I said here that it is proportional. So this operator here is the proportional to operator.
- What it means is that the left-hand side is equal to the right-hand side, multiplied by some scaling constant.
- And the reason we get this here is that, by the definition of conditional probability, the probability of y
- and X given beta is the probability of y given X and beta times
- the probability of X given beta. But we're not choosing X based on beta:
- y's probability is conditional on beta, but X is independent of our parameters beta.
- It's just the data that we have. So the probability of X given beta is equal to the probability of X. Also, because X is fixed —
- we just have X — we can treat it as an unknown constant, or we can treat it as one:
- the probability of having the data we have is one, is another way to think about it.
- And if we think of it as one, then this proportional-to becomes an equals.
- Either way, we get this proportionality, so we can move the X to the other side of the conditioning bar here in this specific case,
- because we're not using beta to choose X. So we can then do the last piece, and we can convert this to a log.
- The log likelihood is the log of the likelihood. So it's the log of P(X) times this big product —
- we want the probability of the first outcome and the second outcome and the third outcome, and so on,
- and since we're assuming all of our values are independent, that probability is equal to their product.
- The log, then, is log P(X) plus the sum of the logs of the individual probabilities, because multiplying values becomes sums in log space.
- So we can do this thing as a big sum. If we do it as a multiplication, the value is going to get vanishingly small —
- for the example dataset I've been using, something like ten to the negative eighty.
- But if we use logs, then it's going to be in a much more reasonable space, like negative one hundred and fifty-two.
- So logs let us do it with addition, and let us keep the probabilities in a much more reasonable range, so we can compute with much smaller probabilities.
- The probability of any specific sequence of a specific set of observations is relatively low.
- We can still talk about finding the parameters that give the data the most probability, but the probability is small because,
- well, you could have just shuffled the observations and gotten a very different sequence — on the order of N factorial
- different orderings. So we have this log likelihood: the sum of the logs of the individual probabilities,
- and we saw in an earlier slide how to compute those individual probabilities.
- The code that actually does this is in the logistic regression demo notebook.
- So we have this likelihood function, and we can use what's called a maximum likelihood estimator. Logistic regression,
- the way it actually trains, uses the log likelihood as a utility function —
- a utility function is the opposite of a loss function or a cost function — and it maximizes it.
- So it finds the parameter values that maximize this log likelihood.
- We can optimize many other models this way: we can optimize any model that produces a probability or a likelihood
- by maximizing the log likelihood of the training data given the model.
- As I said, this gives us a maximum likelihood estimator. Note:
- this is maximizing the log likelihood, not the likelihood itself, but the maximum of the log likelihood
- and the maximum of the likelihood have the same parameter values, because the log is a monotonic function.
- So if we expand this log likelihood, we get that the log likelihood of a point is equal to y_i times
- the log of y-hat_i, plus one minus y_i times the log of one minus y-hat_i.
- Because we're now adding in log space,
- and because of how you expand powers when you take a log,
- what was the multiplicative switch — the conditional we were encoding with the power — has now turned into an additive switch.
- As I said, this is applicable to any model where y hat is an estimate of the probability of the positive class.
- So to show you an example of computing this: for the first data point,
- the score y hat is 0.1828, but its admit outcome y is zero, so we're going to
- compute y hat to the zero
- times one minus y hat to the one, which is 0.8172.
- If you then take the log, you get a log likelihood of about negative 0.2.
- We sum up all these log likelihoods and we get the total log likelihood of the data for this model.
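A sketch of that computation in NumPy, with a few hypothetical scores and outcomes (the demo notebook computes it from the real model):

```python
import numpy as np

# y: observed 0/1 outcomes; scores: the model's estimated P(y = 1)
y = np.array([0, 1, 1, 0])
scores = np.array([0.1828, 0.7, 0.6, 0.3])

# Per-point log likelihood: y*log(yhat) + (1-y)*log(1-yhat)
point_ll = y * np.log(scores) + (1 - y) * np.log1p(-scores)
total_ll = point_ll.sum()   # the first point contributes about -0.2
```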
- We can compare these on the same data. But if we have
- a model that fits just as well but different data — even if it's just, say, half of this data —
- that's going to change the likelihood, because the likelihood is over the whole data set.
- So you can use the log likelihood to compare models on exactly the same data,
- but as soon as you change the data set,
- you cannot use the log likelihood to compare a model on one data set with
- the same or a different model on a different data set.
- It's only comparable within the exact same set of training data.
- So I said I was going to tell you a little bit about what it means to have a maximum likelihood estimate, beyond just 'we have a likelihood function,
- we're going to maximize it'. Remember Bayes' theorem; we can break Bayes' theorem down into a few pieces.
- We have the posterior, which is the probability of the parameters given the data — I'm writing the data
- as y, and folding X into y for now,
- and we have some model parameters theta. Oftentimes we want to be able to ask:
- given the data I have, what's the probability of a particular set of model parameters?
- And this is the heart of what we call Bayesian inference. Not all applications of Bayes' theorem are Bayesian inference,
- but Bayesian thinking is quite common in various machine learning applications.
- So we can think about: what's the probability of my parameters, given my particular data?
- And we do that with a few pieces. We have our prior:
- before I see any data, what probability do I assign to different portions of the parameter space?
- This might be uniform, it might be broad — like some broad normal — or it might be based on actual information.
- The likelihood function tells me, for a particular parameter set,
- how likely my data is. That's what we just saw computing for a logistic regression: for a particular parameter set,
- how likely is the data that we have seen? If this were the true value of the parameters,
- how likely is the particular data I would have seen? Then we have the probability of y, which effectively, for our purposes, is a scaling factor,
- because for a given data set, if the data set is not changing, its probability is not going to change.
- So if our goal is to find the theta that either maximizes the likelihood or maximizes the posterior,
- we can ignore it most of the time, because multiplying by a scalar doesn't change where the maximum point is;
- it just changes the value at that maximum point.
- So we treat it as a scaling factor; if you need the definition, it's the integral over all of the possible parameter values of the numerator there.
- So, in training a logistic regression,
- we maximize the likelihood. We call this a maximum likelihood estimator, or MLE, because it does exactly what it says:
- it maximizes the likelihood — it finds the parameter values
- for which the data is as likely as possible with our particular model.
- We can also maximize the posterior: we can find the theta that is the most likely given our data.
- That's often more computationally expensive, and when the prior is, say, uniform —
- constant across parameter space — there's no difference.
- But also, with lots and lots of data, as the amount of data you have in y increases,
- the relative importance of the likelihood increases and outweighs the prior.
- And so when you have a lot of data, the prior doesn't influence the posterior very much, so long as it's sufficiently broad over your parameter space,
- and the parameter values that maximize the likelihood will be very close to the parameter values that maximize the posterior.
- The exact relationship between those — and, in detail, when you can use MLE and when you really need to use MAP —
- you should see in more detail in either the machine learning or the computational statistics course.
- For our purposes right here, they're going to be very, very similar.
- To wrap up: logistic regression is trained by maximizing the log likelihood of the training data
- given the model with a particular set of parameters. You could implement this yourself,
- and if you want to practice, you could take what we did in last week's material to optimize a linear regression using
- the optimize function, and use that to optimize the parameters of a logistic regression.
- Note that maximizing the log likelihood is equivalent to minimizing the negative log likelihood, so you can just negate it.
- If you want to see this in action, I encourage you to open up one of the notebooks, go practice, and try using
- scipy.optimize to train yourself a logistic regression.
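If you want to try that exercise, here is one possible sketch (not the course's solution), assuming a numeric feature matrix train_x and 0/1 outcomes train_y; standardizing the features first will help the optimizer behave:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    # params[0] is the intercept, params[1:] are the coefficients
    log_odds = params[0] + X @ params[1:]
    probs = 1 / (1 + np.exp(-log_odds))
    ll = np.sum(y * np.log(probs) + (1 - y) * np.log1p(-probs))
    return -ll   # minimizing the negative log likelihood = maximizing the LL

X = np.asarray(train_x, dtype=float)
y = np.asarray(train_y, dtype=float)
res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1] + 1), args=(X, y))
res.x   # fitted intercept and coefficients (should be close to statsmodels')
```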
🎥 Scikit-Learn
This video introduces SciKit-Learn, and using it for a logistic regression.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
SCIKIT-LEARN
Learning Outcomes
Train and apply a logistic regression model with scikit-learn
Understand API differences between scikit-learn and statsmodels
Photo by Elijah Hail on Unsplash
Training a Model
feat_cols = ['gre', 'gpa', 'rank']
out_col = 'admit'
train_x = train[feat_cols]
train_y = train[out_col]
lg_mod = LogisticRegression(penalty='none')
lg_mod.fit(train_x, train_y)
Using a Model
test_d = lg_mod.predict(test_x)
np.mean(test_d == test_y)  # returned decisions
test_yhat = lg_mod.decision_function(test_x)
# returned log odds
test_prob = lg_mod.predict_proba(test_x)[:, 1]
Differences from Statsmodels
Works with matrices, not data frames
Data frames are matrices, but SciKit-Learn doesn’t know that
Column labels not used – everything based on position
Fit and predict column positions must match
Single model object updated by fit(…)
No separate results object
predict returns predicted class not score
Use decision_function for log odds ratio
Use predict_proba for probability
Use Cases
Statsmodels is good for inference
Reports statistical goodness-of-fit measures
Defaults to unregularized models
SciKit-Learn is good for prediction
Many models
APIs output actual predictions by default
Many data transformation features
Only saves parameter estimates
Wrapping Up
Scikit-learn provides many machine learning models, including logistic (and linear) regression.
It is more difficult to do inference with scikit-learn, but it has a broader selection for prediction.
Photo by Andy Hall on Unsplash
- So in this video I want to introduce you to scikit-learn, which is another toolkit for training models.
- With this video, I want you to be able to train and apply a logistic regression model with scikit-learn,
- and understand the API differences between scikit-learn and statsmodels. Code to accompany
- this is in the course notes for this week, so you'll be able to go see the scikit-learn version.
- It's the same problem as I use for the logistic regression demo with statsmodels, so
- you can directly compare a statsmodels solution and a scikit-learn solution.
- So to train a model with scikit-learn, we need to do a few things.
- I'm going to create here a couple of variables that store the names of columns, just to make it easy to extract
- the same columns in the same order from both my training data and my test data.
- I'm going to get my training x, my input features, by getting the feature columns —
- these are my predictor features.
- Then I'm going to grab the outcome variable. This is my outcome.
- And then I set up my logistic regression: I create a LogisticRegression model and just pass those values to fit,
- and it is going to train the model's parameters based on my data to fit my logistic regression model.
- And that's all there is to it. The structure is just a little bit different here. Notice that fit does not return a new results object;
- it actually just returns the model. There aren't two objects: in statsmodels, you create a model and then you fit it and you get results;
- in scikit-learn, you create a model and then you fit it,
- and the results are stored in the model object. Lots of other software in the Python ecosystem follows the scikit-learn API patterns,
- so if you're familiar with them, that's going to also help you with a lot of other software.
- TensorFlow follows similar patterns, and many other packages follow the same pattern as scikit-learn.
- So when we want to use the model, we use predict.
- In scikit-learn, where statsmodels' predict gave us the estimated probability,
- it gives us the actual class, the actual predicted class.
- So it gives us the decision. It makes the decision right away.
- One of the advantages of this is that all of the scikit-learn classifier APIs, the classifier models, do this.
- So predict is going to return the decision. There are other functions to give you
- the underlying scores: decision_function gives you the log odds,
- and you can use predict_proba to get the probabilities. It returns both the probabilities of zero and the probabilities of one,
- so we need to get the second column — the 'one' column — to get the probabilities of one.
- This is equivalent to the statsmodels output. But the nice advantage of scikit-learn's predict returning your classes is that it's easier to use directly:
- you don't have to manage thresholds yourself.
- And the other thing is that, since all of the different scikit-learn classifier models return decisions like that,
- you can write code to do classification that can use any model, and you can plug in different models with the same code.
- That makes it easier to exchange models as part of your workflow, and also to test the performance of different models in the same data workflow.
- So a few differences from statsmodels: scikit-learn works with matrices, not data frames.
- A data frame is a matrix — they're compatible, you can treat a data frame as a matrix — but scikit-learn doesn't know that it's a data frame.
- It just treats it as a matrix. One of the things this means is that your column labels are ignored.
- Everything is based on position. You need to have your columns in the same order for fit and for every call to predict:
- fit and predict require the same column positions.
- You also have your input feature matrix and your outcome variable as separate pieces:
- a matrix of input features and a vector of outcome values.
- That's why, when I was doing the training, I got our input matrix and I got
- our outcome column as a series, which can be treated as a NumPy array,
- which is what scikit-learn does. Scikit-learn doesn't know anything about your column names.
- Also, scikit-learn just has a single model object that's updated by fit. As I said, there's no separate results object,
- and predict returns the predicted class, not the score.
- So, use-case wise, statsmodels is good for inference, and it's easier to do inference
- with statsmodels than with scikit-learn. It reports a lot of statistical goodness-of-fit
- measures, and it defaults to unregularized models, which are often easier for inference.
- It also gives you things like the residuals: when you do your fit, you get your results,
- and the residuals come with them. With scikit-learn you have to go back and get those yourself, because scikit-learn is built for prediction —
- you don't need the residuals to make a prediction. You only need the residuals to understand your training process
- for inference. Scikit-learn, though, is really good for prediction. It has many models —
- many more models than statsmodels — and its APIs output
- the actual prediction, not just the underlying score, by default.
- It also has a bunch of capabilities for data transformation, preprocessing, post-processing,
- etc., and it only saves the parameter estimates. When you call LogisticRegression's fit,
- it learns coefficients, it learns an intercept, and that's all it saves.
- It doesn't save your fitted values, it doesn't save your residuals. You have to go back and get those by having it predict
- the outcomes for your training data; that's what you have to do yourself
- if you want those. So it's more work if you want to do inference, but it's fantastic for prediction.
- So, to wrap up: scikit-learn provides a lot of machine learning models, including logistic (and linear) regression.
- You can go back and do the linear regression things we've been doing with scikit-learn as well.
- It's more difficult to do inference with scikit-learn,
- but it's got a broader selection of models that are useful for prediction, and in most production predictive analytics tasks
- you're probably going to want scikit-learn instead of statsmodels.
🎥 Receiver Operating Characteristic
This video introduces the receiver operating characteristic (ROC) curve, and its use in evaluating classifiers and selecting tradeoffs.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
ROC CURVES
Learning Outcomes
Plot an accuracy curve
Compute and interpret a receiver operating characteristic curve.
Photo by Steve Halama on Unsplash
Metrics and Tradeoffs
Metrics from the confusion matrix are hard yes/no.
A single classifier often has tradeoff points.
Precision Curve
Plot precision at different thresholds.
See tradeoff.
Can do for any metric.
Receiver Operating Characteristic
Plot TPR (Recall) vs. FPR
What FPR must you tolerate for a certain recall?
Diagonal line is random
Area Under the Curve
Compute area under curve (AUC)
0.5 is random
Greater is good (1 is perfect)
< 0.5, invert classifier
AUC and Probability
Wrapping Up
ROC curves give us a way to see tradeoffs and compare the tradeoff curves of multiple classifiers.
Photo by David Talley on Unsplash
- In this video, I want to introduce you to ROC curves, which we can use to understand and visualize tradeoffs between
- different types of errors as we change the threshold for our logistic regression or other classifier.
- So the objectives here are for you to be able to plot an accuracy curve and to compute and interpret a receiver operating characteristic curve.
- So the matrix metrics from the confusion matrix are hard.
- Are based on hard. Yes no outcomes. They compare your decision. Yes.
- Or one or zero. Yes or no. To the action to the observed outcome in our ground.
- Truth data. One or zero. Yes or no. But a single classifier often has this tradeoff points.
- So you train a logistic regression, the default. This is what Saikat learned does is it uses point five probability or zero log odds as the threshold.
- And if it's more if one is more likely than zero, it returns one.
- But we could have it be more conservative to say require an 80 percent probability of one in order to classify as one.
- Or we could classify as one as soon as that are 20 percent probability,
- depending on the needs we have and depending on the specific costs of false positives and false negatives in our application.
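As a sketch of what that looks like in code — assuming a fitted scikit-learn classifier `model` and a feature matrix `X`, both hypothetical names — you can move the decision threshold yourself by working with the predicted probabilities instead of `predict`:

```python
# Probability of the positive class for each instance
p_yes = model.predict_proba(X)[:, 1]

default = (p_yes >= 0.5).astype(int)        # what model.predict(X) does by default
conservative = (p_yes >= 0.8).astype(int)   # require 80% probability to say "yes"
liberal = (p_yes >= 0.2).astype(int)        # say "yes" at 20% probability
```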
- So we can plot curves for various metrics. Here I've done precision; you could do it for recall, or for accuracy, at different thresholds.
- Here I have thresholds — these are in log odds — and the x-axis decreases as you go
- right. So as you go from left to right, we're decreasing our threshold and seeing what happens to the precision.
- It's wobbly at the higher end because we're only making a few positive classifications there,
- so it can wind up being more precise for a little bit.
- Then it starts stabilizing, and as we keep decreasing the threshold,
- we keep decreasing the precision of our system, because we're classifying more and more instances as
- yes — we're digging deeper and deeper into the barrel to find the ones we want to classify as
- yes. We're finding quite a few, but then we also wind up classifying quite a few no's as
- yes. Now, we also see here that near our zero cutoff, we can look at negative 0.5, and that actually has
- a higher precision than the default cutoff of zero.
- This is useful for actually setting the threshold value: you can plot your metric, be it precision,
- be it something else, at these different thresholds, and use that to gain insight into where you want to set your threshold.
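A minimal sketch of such a curve — sweeping probability thresholds rather than the log-odds thresholds shown on the slide, and assuming `y_true` labels and `p_yes` predicted probabilities as in the earlier sketches:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score

# Compute precision at a range of candidate thresholds
thresholds = np.linspace(0.05, 0.95, 19)
precisions = [precision_score(y_true, (p_yes >= t).astype(int)) for t in thresholds]

plt.plot(thresholds, precisions)
plt.xlabel('threshold (probability)')
plt.ylabel('precision')
plt.show()
```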
- Another curve that we sometimes use for evaluating classifiers is the receiver operating characteristic curve.
- In the ROC curve, what we do is plot the true positive rate on the y-axis and the false positive rate on the x-axis.
- What this lets us see is: as we increase our tolerance for false positives,
- what happens to the number of true positives we find? Remember, true positive rate and recall are the same thing.
- So if we want to find half of the yes cases, what false positive rate
- do we have to accept in order to do that? We have to accept about 0.2.
- And if we want to find 80 percent of the positives, then we have to accept around 0.43 or so.
- It lets us see what false positive rate we have to tolerate in order to achieve a certain recall in our classifier.
- You can also do other curves like this, plotting other pairs of metrics against each other:
- a precision–recall curve looks at the relationship between precision and recall
- as you change your threshold. Another thing that's important to note is that the diagonal line here is random.
- A random classifier is going to get the diagonal line's performance. If your curve is up over here to the left, then that's doing better.
- One of the things we can see, though, is that even with similar overall performance, curves can look quite different.
- We might have a curve that goes up quickly and then tails off;
- we might have a curve that trails off for a while and then gets better.
- This lets us see the different tradeoff points for different classifiers and determine which one has a curve that
- better aligns with the needs of our application — generally we want to pick up
- the true positives quickly. But with this curve here, as you see, as soon as you get over about
- 0.6 recall — so say you want 0.8
- recall — you have to accept a lot more false positives with this one than you do with the blue curve,
- because it doesn't cross 0.8 until over here, as opposed to here.
- Whereas if you want a recall of 50 percent, then you don't have to accept as many false positives with this one.
- So it really characterizes
- the tradeoffs of the different classifiers and lets you pick one that's going to be better aligned with the needs of your particular application.
- If you wind up with a classifier down here,
- it's worse than random — but invert it, so it classifies yes
- when it says no, and then you're going to get a relatively good classifier.
- So we're really paying attention to classifiers in the top-left triangle of the ROC plot.
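A minimal sketch of computing and plotting an ROC curve with scikit-learn, again assuming the `y_true` labels and `p_yes` predicted probabilities from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, p_yes)

plt.plot(fpr, tpr, label='classifier')
plt.plot([0, 1], [0, 1], linestyle='--', label='random baseline')  # the diagonal
plt.xlabel('false positive rate')
plt.ylabel('true positive rate (recall)')
plt.legend()
plt.show()
```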
- We can also compute the area under the ROC curve, and this gives us a metric, AUC.
- A random classifier — that diagonal line — is going to have an AUC of 0.5. Greater is good, because
- the only way you get greater than 0.5 is if you have mass up here:
- a curve that goes up and gives you more true positives for the false positives than you would get with a random classifier.
- If it's less than 0.5, again, you can invert your classifier. Also, the area under the curve is equal to the probability that,
- if you pick two items at random, your classifier put them in the correct order:
- it scored i above j when i is actually above j,
- or, vice versa, it scored i below j when i is actually below j.
- The probability of that is the same as the area under the curve, which becomes useful in a few applications, especially where
- what you care about is the relative ordering of things.
- This becomes important when you're building systems that rank their outputs,
- and the classification is "we're going to take the top ten as our good ones" —
- that's basically what a search engine does. Area under the curve gives you the probability that you put two things in the right order.
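Computing AUC is a one-liner in scikit-learn (same assumed `y_true` labels and `p_yes` scores as before):

```python
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_true, p_yes)
print(auc)   # 0.5 is roughly random; closer to 1 is better
```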
- So to conclude: ROC curves give us a way to see tradeoffs between false positives and true positives,
- and to compare the tradeoff curves of multiple classifiers.
- They also give us the area under the curve metric that we can use to quantify classifier performance.
✅ Practice
Load the Penguin data, and use a logistic regression to try to classify a penguin as Gentoo or Chinstrap using various measurements.
Delete the Adelie penguins first, so you have a binary classification problem.
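One possible way to get started — a sketch that assumes the seaborn copy of the Palmer Penguins data and its column names; adjust to whatever copy of the data you load:

```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

penguins = sns.load_dataset('penguins').dropna()
penguins = penguins[penguins['species'] != 'Adelie']     # delete the Adelie penguins first

features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[features]
y = (penguins['species'] == 'Gentoo').astype(int)        # 1 = Gentoo, 0 = Chinstrap

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(accuracy_score(y, model.predict(X)))               # training accuracy — just a starting point
```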
🎥 Biases and Assumptions
This video revisits sources of bias and discusses the assumptions underlying prediction.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
BIASES AND ASSUMPTIONS
Learning Outcomes
Reason about potential biases in classification inputs and outputs
Identify cases where building a classifier is not appropriate
Assumptions of Most Predictions
Outcome variable is unbiased
Features unbiased
Outcome variable matches target
Predicting this outcome with these features is reasonable
Think about what the predictor implies
A Few Sources of Bias (Week 2)
Selection bias
Some people more likely to be contacted
Response bias
Some people are more likely to respond
Measurement bias
Measurement skews one way or another
Error and Bias
Observations are often erroneous.
Bias is when they are systematically erroneous.
Tend high or low
May be different for different groups
Biased Features
If features are unbiased ⇒ errors roll into model uncertainty
If they have known bias ⇒ may be able to correct
Remove bias term
Normalize by groups?
Unknown bias ⇒ how severe? (may need to abandon!)
Assumptions
Example: SAT scores — differ by SES (socio-economic status)
What’s more likely?
Poor students are less academically capable
Poor students have less access to academic preparation
The SAT measures both academic capability and familiarity with middle- to upper-class social conventions
Outcome/Target Mismatch
Aren’t predicting what we think we are
Example: crime
Data: crime reports or arrests
Goal: predict crime level
Reality: predict crime report level or police activity
Photo by Woody Kelly on Unsplash
Claims and Evidence
Our claims must be supported by our evidence
Can’t claim a crime detector when our data is crime reports
This is why the goal ⇔ question ⇔ analysis chain is critical!
Task Reasonableness
What does trying to predict Y with X inherently assume?
Predict college GPA with SAT scores
Performance on a standardized test is legitimate basis for predicting future academic grades
Classifying ‘criminal’ based on photos
Facial features are a legitimate predictor of criminality
This is called ‘physiognomy’, and has been rejected for a century
Related: ‘phrenology’, connecting personality to skull shape
Crime/Face Correlation
What is crime?
Why would it be a physically observable characteristic?
Be careful what you assume
Theory drives research questions
Theory drives predictions
Label Dependencies
What to Do?
Assume data is biased (especially social data)
Understand how the data is collected and what labels actually are
Study what is known about those biases
Look for systematic variations in the data
Clarify and document your assumptions
Always be critical — does problem or outcome make sense?
Read broadly and critically
Wrapping Up
All analyses are based on assumptions — be clear about what yours are.
Data are biased. Study to see how.
There’s no magic bullet.
Photo by Parizad Shojaei on Unsplash
- Hello again.
- In this video, I want to talk some more about biases — we've been talking about them a little bit throughout the class — and also the assumptions that we make
- when we're doing a predictive modeling task.
- The learning outcomes are for you to be able to reason about potential biases in classification inputs and outputs, and
- also to identify some cases where building a classifier — or a predictor; this applies to regression too — is not appropriate.
- So I want to talk a little bit about the assumptions of most predictive modeling tasks.
- We assume that we have an outcome variable that is observed without bias.
- That's not the same as not erroneous. We assume that there's no systematic bias in our outcome variable or in our features, that
- our outcome variable actually matches the target of the thing that we're trying to classify,
- and that predicting this outcome with these features is reasonable.
- You need to think about what the fact that you are trying to do this prediction implies.
- One of the readings I've assigned talks in more detail about the assumptions of using prediction systems to make decisions.
- So recall from week two, I talked about a few sources of bias. Selection bias is when there's a discrepancy in who gets selected:
- your selection is not uniformly at random from the population;
- some instances are more likely to be selected than others. Response bias
- is when some selected instances are more likely to give you a response than others.
- This is super common when we're dealing with human data, because even if you are perfectly unbiased in who you select to ask a survey question,
- people aren't always going to respond,
- and there may be a correlation between whether or not they respond and what their response would be if they responded.
- One example of this that isn't someone refusing to respond to a survey is if you ask somebody to
- rate a movie — you're not having them watch it, just, hey,
- what did you think of The 5,000 Fingers of Dr. T? They're more likely to be able to answer that question for movies they've watched, and
- they're more likely to have watched movies that they think they're going to like.
- So you can think of this as a selection bias: users selecting movies to rate are more likely to pick movies they want to watch.
- But if you flip it around, so it's you asking a person, what do you think of The 5,000 Fingers of Dr. T?,
- they're going to be more likely to respond if it's a movie they thought they would like and have watched.
- You're not going to get very many respondents on The 5,000 Fingers of Dr. T. Then measurement bias is when
- you can get the response, but it skews one way or another in a systematic way,
- possibly based on sensitive or protected attributes of the data subjects.
- It's important to note that there's a difference between error and bias.
- Observations often have error in them, but bias comes in when they are systematically erroneous.
- When we talk about an unbiased estimator in statistics,
- an unbiased estimator is an estimator whose expected value is equal to the parameter.
- Bias comes in when the estimates, or the actual observations we're making, are systematically high or
- low, or when they trend differently for different groups. Maybe the overall mean is even,
- but one group tends to score higher on your measurement than another group, or you're more likely to mismeasure one group than another.
- So if you have features or outcome variables that are unbiased, then you can just roll your errors into your model uncertainty.
- Everything's uncertain; if the errors are independent and identically distributed,
- it's just more uncertainty in your model. If they have a known bias, you may be able to correct:
- you may be able to remove a bias term,
- or you may be able to make some assumptions, like saying, well, there's a difference in the score between these two groups,
- but we don't believe the groups are actually different, so we'll just normalize within groups.
- Some of these things are what happens in election forecasting.
- The election forecasters pool together polling data from a bunch of polling sources, along with other data that affects their forecast,
- and one of the things they have is a model of the bias of different polling houses.
- Some polling houses' sampling strategies might be more likely to contact certain people — there
- might be a Republican-leaning bias or a Democrat-leaning bias in their sampling strategies
- and their polling results. You can
- see this in some ways by their agreement with each other, and also by their historical agreement with election outcomes.
- If you assume that the house bias is relatively stable over time, and you've got good data to estimate those biases,
- you can use them to adjust your polling averages when you're pooling together multiple polling sources for election modeling.
- That's one example of deliberately trying to de-bias, where you've got an estimate of the bias.
- Then there's unknown bias, which is really common. You need to start to think about how severe it is.
- First, can you try to quantify the bias? But then also, what are the downstream applications?
- This is one of the reasons why we always start with our goal and then move to our question, because some biases will affect the question —
- the same bias may render some questions unanswerable and have negligible impact on other questions.
- So the problems that arise from bias in our data are not necessarily intrinsic to the data itself.
- They arise in combination with what we're going to do with it, and that needs to
- inform how we go about understanding the impact of a potential bias in the data.
- We also need to think about the assumptions we bring to our data, and some assumptions can sometimes help us get out of some problems.
- We need to document them and be clear about them, but assumptions can provide us some guidance.
- So, for example, we find that S.A.T. scores differ by socioeconomic status — that's relatively well established.
- But what causes this? Which is more likely: that poor students are intrinsically less academically
- capable, or that poor students have had less access to academic preparation?
- That can be formal academic preparation, such as S.A.T. prep courses or a greater selection of college prep courses in high school,
- and it may be less formal, such as a greater selection of reading materials when they were in elementary school.
- It's relatively well established that good access to reading —
- access to a good quantity of reading materials for young children, and engagement with reading —
- can really help them with educational outcomes down the road.
- There's one study that we cite in some of the research I've been involved
- with that found that if you engage children in what they called authentic literacy tasks —
- reading things besides a textbook —
- when they are in the first or second grade, then later on, around junior-high age or just a little bit younger,
- those students have higher learning outcomes in various STEM tasks.
- So that can be the kind of informal preparation where a student who has early access to good
- and diverse reading materials is going to do better academically,
- and there's good reason to believe they'll do better on the S.A.T.
- One of the research projects I'm involved with is
- looking at how we can use technology tools to make it easier
- for teachers to provide more and different reading materials for their students, with minimal cost,
- in a way that's accessible in low-resource educational settings. Then there's also the question of whether the S.A.T. measures some combination of academic
- capability and familiarity with middle- to upper-class conventions and expectations.
- So when you see a difference like this, is it really more likely that the students are intrinsically different, or that
- there's this difference in access to academic preparation, and that's what we're actually measuring?
- And if that's what we're actually measuring,
- what implications should that have for how we actually use the resulting numbers and what we do with them?
- So I also want to talk a little bit about outcome–target mismatch.
- What I mean by this is that you're trying to predict X, and you have a class label for X-prime
- that's not really X. So you're actually training a model to predict one thing when your goal is another.
- Sometimes you have to — sometimes all we have is a proxy, and that's a reasonable thing to do.
- We need to evaluate the quality and credibility of our proxy, because sometimes it's all we have.
- But sometimes the proxy is too disconnected from the target to be credible.
- For example, suppose we want to model crime — we want to be able to predict the crime level of a neighborhood —
- and the data we have is crime reports and/or arrests.
- What we're training the model to do is predict crime reports and police activity, which is not the same thing as actual crime,
- because crimes may go unreported in an area for a variety of reasons.
- One could be that the people in that neighborhood have less trust in the police,
- and so they're less likely to report minor crimes; or crimes aren't observed because that area isn't as heavily policed —
- crime is happening, but police aren't there to observe it and make arrests, and nobody's bothering to report it.
- So if you're trying to predict crime, but your prediction model is trained on
- crime reports or police activity, you're not predicting crime — you're predicting crime reports or police activity.
- You have to be really careful about the relationship between the labels you have and the actual target variable you're trying to measure.
- With that, our claims need to be supported by our evidence.
- We can't claim that we're detecting crime when we are training and testing our classifier on crime reports, because we're testing its ability to detect
- reported crime, which is not the same thing as crime that actually happened.
- We're missing all of the unreported crimes, and we also have reports of things that weren't actually crimes.
- This is why, early in the semester, I talked with you about the goal–question–analysis chain,
- and I encourage you to go back and review that material. We have a goal,
- we refine that into research questions, and then we connect the research questions to the analysis.
- At every step it needs to be clear: the analysis we do needs to directly illuminate the research question, and we need
- to revise the analysis and/or reframe the research question until we have a match —
- so the analysis is actually addressing the question (can we detect crime reports?).
- Then the question needs to advance our goal, and breaks anywhere in that chain
- reduce the ability of our data analysis to advance the goals that we're trying to use it for.
- We also need to think about the reasonableness of the task,
- because whenever we use X to predict Y, we are assuming that that's a reasonable thing to do.
- Predicting college performance with S.A.T. scores — that's kind of what the S.A.T. is built for.
- There are problems with the S.A.T.,
- but this isn't an inherently unreasonable task: performance on a standardized test as a legitimate basis for predicting future academic grades is,
- on its face, not a bad assumption. But every year or so,
- somebody or another gets the idea that they're going to take a deep learner and train it on a bunch of photos,
- and their outcome variable is going to be some attempted measure of criminality — has been arrested, or maybe has been convicted —
- trying to predict criminality from photos. What
- this assumes is that facial features are a legitimate predictor of criminality.
- So the question arises: why would you assume that? This idea is called physiognomy.
- It's been rejected for a good, solid century — it was kicking around in the eighteen hundreds, and then people realized that it was a bad idea.
- Its close cousin is phrenology. Physiognomy is when you're trying to predict attributes such as criminality or other personality traits
- from facial features; phrenology is when you're trying to do it based on the shape of the skull.
- It involves a lot of calipers and skull measurements.
- But the assumption here is that you can use these physical characteristics to predict criminality. You can probably predict "arrested" —
- you might even be able to predict charged or convicted of a crime — because the social process of
- constructing what is a crime and who gets arrested for a crime is going to have correlates with physical attributes.
- But at its base, we have to think about what crime is, and crime is actions
- that, as a society, we have decided are sufficiently aberrant that they deserve criminal treatment.
- Some of those are relatively uncontroversial, like theft. But if crime is an action in violation of our societal laws,
- why would that be a physically observable characteristic? What's the theory, what's the mechanism there?
- Because you can find all kinds of correlations, and theory — thinking about
- the assumptions and what theoretical constructs could motivate something —
- is one of the guiding points we can use to keep away from some of these morasses.
- So be careful what you assume. Theory drives our research questions. We don't just ask every question willy-nilly —
- that's a very inefficient use of science. Either we have
- theory that we want to clarify or that we want to apply to a new problem, or we propose theories that we want to evaluate,
- and we should first give them a smell test:
- is this a reasonable theory to try to evaluate? Theory also drives our predictions.
- We don't want to just throw a bunch of data at scikit-learn or TensorFlow or whatever and use whatever predictions come out.
- We want to put some thought into the process, and we want to think about: is this a reasonable prediction task?
- Is this a reasonable set of features? Is it legitimate to try to predict
- whether this person is a criminal based on what they look like, based on the picture you see from the CCTV camera?
- Even supposing we did have a reliable correlation, is that a societally legitimate or useful thing to do?
- Another problem we can have is label dependency: observations are often incomplete.
- For example, if a bank does not give someone a loan, it doesn't get to observe whether or not they would have paid that loan back.
- Machine learning researchers and statistics researchers like to give things clever names;
- this is called the apple tasting problem. Also, criminal databases only have the people who were caught by the justice system in some way;
- they don't have the people who committed crimes and didn't get caught.
- We also have to be careful of inverse probabilities: the probability of A given B and the probability of B given A are not the same thing.
- You have to be careful when you've got two groups and you look at their composition —
- that alone is not enough evidence to draw
- conclusions in the other direction.
- If you look at the racial makeup of basketball players, what you get is the probability of race given basketball player,
- but that doesn't give you the probability of skilled basketball player given race.
- So you have to be really careful about accidentally inverting your probability when you don't have the rest of the pieces of Bayes' theorem involved.
- This is one of the common traps in informal probabilistic reasoning.
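A tiny numeric sketch of that inversion, with entirely made-up numbers, just to show how Bayes' theorem connects the two directions:

```python
# Made-up illustrative numbers — not real data
p_b = 0.001         # P(B): being a basketball player is rare in the population
p_a = 0.20          # P(A): share of the population in some group
p_a_given_b = 0.60  # P(A | B): composition of basketball players

# Bayes' theorem: P(B | A) = P(A | B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # 0.003 — tiny, even though P(A | B) is large
```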
- You also have to be careful with pulling from different groups.
- One example: if you have a bunch of mug shots of people getting arrested and use those as your criminal cases,
- well, what are you using as a non-criminal face?
- If you're getting it from a different set, then are you really learning criminality, or are you learning
- the distinctive visual features of a mug shot? This also comes up in a variety of other settings.
- You need to pay attention to what your learner is actually learning when it tries to do a prediction.
- I read of a case a few years ago where a machine learning algorithm for examining X-ray photos
- was trying to learn to identify X-rays that indicated a particular medical condition.
- It had relatively good accuracy, but someone went and dug into what it was actually learning and which parts of the images it was looking at.
- It was looking at a code over on the side of the image that indicated where the X-ray was taken, because some of the X-ray pictures came from
- a hospital where people were far more likely to have the disease,
- and the others came from a more general hospital or X-ray lab.
- So it wasn't actually learning to identify the disease in the X-ray photo;
- it was learning that X-ray photos taken in a particular hospital's lab were more likely
- to have the disease, because that was where the more advanced cases were being sent.
- So you have to be really, really careful. Even if you've got a reasonable set of data —
- oh, I have a bunch of X-ray photos —
- there can be differences that you don't expect, and you need to be careful about what your system is actually learning.
- So what do you do about all of this? Unfortunately, there's no quick-fix solution.
- You can't just import sklearn.unbias,
- and if someone ever gives you an sklearn.unbias, be very, very skeptical.
- The starting point is to assume that we've got bias problems; we just don't necessarily know what they are, especially with social data.
- The question isn't "are the data biased?" The question is: how are they biased? How much are they biased?
- And what impact does that have on the conclusions and tasks to which we're trying to apply them?
- We then need to start understanding how the data is collected and what the labels actually are.
- This is why we spent so much time early in the class talking about how you actually describe data,
- and I had you read things about describing where your data came from, how it was collected, and why it was collected —
- because incentive structures can skew the data collection process — so that you have the information to start reasoning about it.
- Study what's known about the data biases in your domain. There may be a body of existing research
- you can draw on to understand what's going on in it.
- Also look for systematic variations in the data, especially if you have data from different groups of people
- or from different sites.
- Those alone aren't enough to tell you the drivers of different biases, but they give you a starting point for where to go looking,
- and then you can go look for research on what might cause the kind of difference between groups that you're seeing.
- Also, clarify and document all of your assumptions. We want to do a good job with our analyses;
- they will never be perfect, and at some point they need to be done,
- but clarify and document what you're assuming at each step of the pathway. Document why you're building this model.
- What's your theoretical justification for using these features to predict this outcome?
- What does that theoretical justification have to say about how you should use these features and how you should evaluate your model?
- Always be critical of your own work, and reasonably critical of the work of others as well.
- Does your problem make sense? Does your outcome make sense? Are the results too good to be true?
- If they seem too good to be true, they often are. Then also, read broadly and critically.
- This is one that's hard to give a quick fix on, too.
- But a lot of what I learn about the ways biases creep in, and how to deal with them in the data that I work
- with — and a lot of it's domain-specific and contextual — comes from reading widely, and reading deeply sometimes,
- but reading a lot of different things:
- not just the statistics research, the data science research, the computer science research, but also good pop science work,
- legal scholarship, and
- various other things that give a more holistic picture of what is going on in the domains that I'm trying to study.
- So to wrap up: all of our analyses are based on assumptions, and you need to be clear about what your assumptions are.
- You need to study how your data is biased. And there's no magic bullet for any of this.
- We're going to talk about some measures for how you measure bias in the outcomes of a system,
- but there's no magic bullet. It requires continuous critical thought and reflection on what it is that we're doing, and interrogation of what our
- system is doing, how its impacts are distributed, and what its underlying data and conceptual and theoretical bases are.
- This is also where the video about epistemologies from a few weeks ago comes in:
- this is one of the places where critical epistemologies can become very useful, because they give us a starting point for
- the ways in which our system could go wrong or might be a bad idea.
- That doesn't mean we just shut it down because someone said something,
- but it's something we need to reflect on, incorporating what we learn from reading critical scholarship
- and critical analysis into how we think about going about the work that we're trying to do.
📃 Prediction-Based Decisions
Read Sections 1 and 2 of the following paper:
We’ll come back to ideas here, but sections 1 and 2 describe the assumptions underlying most classification problems.
While the overall topic of the paper is fairness in making these decisions, I am not assigning it because it is a fairness paper;
rather, those first two sections provide a succinct description of the assumptions that we make when we undertake most
classification problems. They apply no matter what properties of a classification problem or model we care about.
If you would like to learn more, I recommend:
🚩 Week 10 Quiz
The Week 10 quiz will be posted to Canvas.
📃 Abolish the #TechToPrison Pipeline
Read Abolish the #TechToPrison Pipeline (the Medium reading time estimate includes the thorough — and valuable — footnotes and list of 2435 signatories).
This article probes in more detail the assumptions underlying classes of criminal justice data science applications.