Week 10 — Classification (10/24–28)#
This week we introduce classification as a prediction task, with methods for evaluating classifiers.
🧐 Content Overview#
Element | Length |
---|---|
🎥 What is Classification? | 6m39s |
🎥 Log-Odds and Logistics | 10m4s |
🎥 Logistic Regression | 9m7s |
🎥 The Confusion Matrix | 11m48s |
🎥 Baseline Models | 9m44s |
🎥 Log Likelihood | 16m54s |
🎥 Scikit-Learn | 6m42s |
🎥 Receiver Operating Characteristic | 7m25s |
🎥 Biases and Assumptions | 22m |
📃 Floating Point | 3650 words |
📃 StatsModels Documentation | 2000 words |
This week has 1h40m of video and 5650 words of assigned readings.
🎥 What is Classification?#
In this video, I introduce the week and what classification is.
- Hello, and welcome to the tenth week of CS 533. This video introduces our week's topic, where we're going to pivot from regression to classification.
- The learning outcomes for this week are for you to be able to predict binary outcomes or classes for observed instances, to evaluate the accuracy of a model for doing this classification, and to use scikit-learn for predictive modeling, in addition to the statsmodels code we've been using so far.
- So if we think about our different outcome variables: we talked at the beginning of the class about different types of variables, and that applies to outcomes as well as to all the other variables.
- What we saw in the last couple of weeks is trying to predict continuous outcome variables, and we call this regression. Regression also applies to other numeric outcome variables, like integers, but it is where we're trying to predict these numeric outcome variables.
- When we're trying to predict a categorical variable, we typically call it classification.
- If we have a two-level category, or a two-level ordinal, we call this binary classification, because our goal is to classify instances into one of two categories.
- There are also methods for dealing with multi-level ordinals, which are called ordinal regression. And if you have a multi-level categorical, with more than two categories, then we have what we call multiclass classification.
- We're going to focus, particularly for now, on binary classification, where we have a yes-or-no outcome that we're trying to predict.
- We can call it yes/no, positive/negative, true/false. We could call it A/B if we don't want to assign positive value to one of them. But our usual classification setup is going to be to estimate: is this in the positive class?
- Sometimes we'll simplify another problem to a binary problem so we can apply a classifier. For example, in the Rotten Tomatoes data, we may want to classify movies as "mostly fresh" if they have a percent-fresh greater than 50 percent, or some other threshold.
- We might categorize a rating as positive if it's greater than three (or greater than or equal to three). These are ways in which we will sometimes take outcomes that are richer than a binary class and convert them into one for the purposes of using a classifier.
- Now, to relate this to the inference we saw back in the fourth week: if we've got two different groups or classes (one's in a basket, one's on stairs; one's kittens, one's puppies), inference is looking at what is different about these two groups.
- We use things like t-tests, or other pairwise tests, to look at the difference between two groups. But our goal there is to understand the groups and their relationships; that's what we're doing with inference.
- Just like inference in the regression context, we're trying to understand the structure of the space and what drives changes. With classification, which is a type of prediction, what we're trying to do is: given a new animal, figure out which group it goes in.
- So inference is "given these groups of animals, what makes them different?" Classification is "given a new animal, is it a cat or a dog?" We're assuming everything is a cat or a dog for now; if you bring in a rabbit or a ferret or a gerbil, the classifier is going to have a hard time.
- The idea here is that classification is trying to figure out which group an instance belongs in, whereas inference is trying to figure out what the differences between the groups are.
- These are going to be very related, because the differences between the groups are the basis for figuring out which group something belongs in. But the outcome is different, and the way we assess what we're doing is different: classification, as with prediction generally, is assessed on its ability to correctly predict or classify future unseen data.
- So we often encode our binary classes in 0/1, where 1 is the positive class. If you use a logical value in pandas, it will automatically convert to an integer 0/1, where 1 is True.
- And remember that when we have a vector or series of zeros and ones, if you take the mean of it, you get the fraction of trues, the estimated probability of true. It's the probability of drawing an item from your data and it being true. But also, if your data is a representative sample, it is an estimate of the probability of an item being true when you draw from the population.
- So we can think about taking the mean of y for some fixed x. When we're doing a regression, what we do is compute the expected y, the expected outcome variable, for some fixed x: regression finds an estimate of the mean outcome.
- And if the regression properly captures what's going on in the data, then the output of the regression is going to be the mean outcome. Remember, our residuals are normally distributed, homoskedastic, and centered at zero. So everywhere along your line, the mean residual is zero, which means the line is the expected value of the outcome variable for that particular value of x.
- What we often do here, instead, is try to compute the conditional probability: what's the probability of Y being true, given a particular value of X?
- So it has this relationship to regression: in both cases, we're trying to compute an expectation or a probability. (Fun sidebar fact: probability is the expected value of the characteristic function, where the characteristic function is 1 if something is true and 0 if it's false.) We're trying to compute this conditional probability, which we can also frame as a conditional expectation. The sketch after this transcript makes the 0/1 encoding concrete.
- So to wrap up: classification allows us to predict discrete outcome variables. There are many different models for doing this; we're going to start with linear ones, as we were using for regression.
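Since the transcript leans on the 0/1 encoding, here is a minimal pandas sketch of those identities; the data is made up for illustration.

```python
import pandas as pd

# a logical series converts to integer 0/1, where 1 is True
y = pd.Series([True, False, True, True, False])
y.astype('int64')    # 1, 0, 1, 1, 0

# the mean of a 0/1 series is the fraction of trues,
# an estimate of P(Y = 1)
y.mean()             # 0.6

# the mean of y within each value of x estimates P(Y = 1 | X = x)
df = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'b'],
                   'y': [1, 0, 1, 1, 0]})
df.groupby('x')['y'].mean()
```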
🎥 Log-Odds and Logistics#
In this video, I introduce log odds, along with the logistic function and its inverse, the logit function. Log odds are a useful concept in many situations!
- Hello. In this video, I want to introduce the concepts of odds and log odds. The learning outcomes are to be able to convert between probability, odds, and log odds, and to understand the role and purpose of the logistic and logit functions.
- So suppose we have a probability of success. Probability is not the only way to quantify it; we can also quantify it with what's called the odds. If you've heard of odds of, say, ten to one against, or odds of two to one, this is what that's talking about.
- Remember, a probability is going to be in the range zero to one. (We can't actually use probabilities of exactly zero or one when we're talking about odds; I'll get to that in a moment.) The odds is the probability of success divided by the probability of failure: p / (1 − p).
- It gives us the ratio of the likelihood of one outcome versus the other: odds express how much more likely success is than failure, or failure than success, if we're quantifying the odds of failure.
- A couple of examples: Star Wars robots. A couple of them talk differently, some in probability and some in odds.
- K-2SO says there's a 97.6 percent chance of failure. So our probability of failure is 0.976, and our odds are 0.976 divided by 0.024, which is equal to 40.67. So we approximately have 41-to-1 odds: the odds against success, the odds of failure, are 41 to 1. That's how we translate this 97.6 percent chance into odds.
- We can also translate the other way around. C-3PO is being a little imprecise in his speech: he should say the odds against successfully navigating an asteroid field, but he says the possibility of successfully navigating. We'll forgive him, because he's talking about a ratio, 3,720 to 1, so he's clearly talking about odds. If the odds are 3,720 over 1, that's equal to p / (1 − p). If we solve that equation, converting these odds of 3,720 to 1, we get a probability of failure of 0.9997, or 99.97 percent.
- So we can convert back and forth between probability and odds, so long as our probability is not zero or one. If the probability is zero, then the odds will be zero. But if the probability is one, then the probability of failure will be zero, and so the odds will be infinite.
- Another thing about odds is that if p = 0.5, then the odds equal 1. So one is even odds: success is just as likely as it is unlikely.
- Now, hopefully you remember logarithms from high school algebra. Logarithms allow us to convert between exponents and ordinary numbers: if x is the log base b of y, that means y = b^x. Logarithms undo exponentiation.
- A few common bases: we have e, the natural log; that's the one we use by far the most. We have log base 10, which is really useful for reasoning about orders of magnitude, and when we plot on a log axis, that log is almost always log 10. Logs in different bases are proportional to each other, just a constant multiplier apart. So we can use e for computations and base-10 logs for displaying log axes in plots. Sometimes you'll see base-2 logs, particularly when dealing with low-level operations in computer programming.
- NumPy has the function log, which is the natural log, and log10, which is the base-10 log.
- Logarithms have a variety of identities. Multiplying two numbers is equivalent to adding their logarithms: log(xy) = log(x) + log(y). Exponentiation multiplies the logarithm: log(x^y) = y log(x). These identities are useful for reasoning about what's happening with your logarithms.
- Logarithms are only defined for x greater than zero.
- In a moment, we're going to get to log odds. For those, your probability can't be one when you convert to odds, because then you'd have infinite odds; and it can't be zero when you convert to log odds, because you'd have odds of zero, and you can't take the log of zero.
- One common operation when we do have some zeros is to take the log of 1 + x, and NumPy has a function that does that very precisely: if you need log(1 + x), don't just write log(1 + x), use log1p. Floating point is a weird beast; I'm going to post a reading or two if you'd like to learn more about what's going on with floating point.
- But logarithms are very useful for computation, because they help us avoid floating-point underflow and overflow. Very, very small probabilities will have logarithms that are still within the reasonable range of values we can store.
- Also, in floating-point arithmetic, addition and subtraction are more efficient than multiplication and division. So if you do your computations in logarithms, they're more efficient, and they stay in a more useful range of values, many times. But you have to be a little careful, particularly with subtraction, but also with addition, in floating point, to avoid problems with precision and cancellation. Again, I'm going to post a reading or two about that.
- So we have our odds and we have our logarithms; now we can get to log odds. The log odds is the log of the odds: log(p / (1 − p)), which is equal to log(p) − log(1 − p), because multiplication becomes addition and division becomes subtraction.
- This function has a name: the log-odds function is called the logit. The logit of p is log(p) − log(1 − p), or equivalently the log of p / (1 − p). It converts a probability into log odds.
- To go back (to convert K-2SO's odds into C-3PO's probabilities, so to speak), the logistic function is the inverse of the logit. The logistic is 1 / (1 + e^(−x)); equivalently, it's e^x / (e^x + 1).
- And it has an S-shaped curve. The logistic of zero is 0.5; as x goes to infinity, the logistic approaches one; as x goes to negative infinity, it approaches zero. So it maps unbounded real values into the range zero to one.
- This is very useful, because we have 0/1 outcomes: it allows us to have an unbounded value that predicts them, and it gets us into zero-one without having to do clamping or anything else that's not differentiable.
- Also, you might have a 0/1 outcome variable, and 0/1 isn't normal; probabilities often aren't normal, because they're bounded in the range zero to one. But it's not uncommon to find a situation where the logit of the probabilities is normal. If the log odds are normal, that lets you apply methods that expect normally distributed numbers to the log odds, and then convert back into probabilities with the logistic function when you actually need a probability.
- The logistic function is also an example of what's called a sigmoid curve, an S-shaped curve. As I said, it converts log odds into probability.
- So odds are another way of representing probabilities, and the logistic and logit functions convert between probabilities and log odds. Odds of one is even; log odds of zero is even (log odds are zero when odds equal one).
- Log odds are also more balanced: odds of two-to-one is 2, and odds of one-to-two is 0.5; but the log odds of two-to-one is just a little under 1 (about 0.69), and the log odds of one-to-two is just a little over −1 (about −0.69). So log odds give you more symmetry, more balance, and they become very computationally useful to work with. The sketch after this transcript shows these conversions in NumPy.
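To make the conversions concrete, here is a small NumPy sketch of the logit and logistic functions, using the K-2SO example from the transcript. (SciPy also provides these as scipy.special.logit and scipy.special.expit.)

```python
import numpy as np

def logit(p):
    # log odds: log(p / (1 - p)); log1p(-p) is a precise log(1 - p)
    return np.log(p) - np.log1p(-p)

def logistic(x):
    # inverse of the logit: maps log odds back to a probability
    return 1 / (1 + np.exp(-x))

p = 0.976             # K-2SO's probability of failure
odds = p / (1 - p)    # about 40.67, roughly 41-to-1
x = logit(p)          # about 3.71 in log odds
logistic(x)           # back to 0.976
```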
🎥 Logistic Regression#
We’re now ready for our first classification model: logistic regression.
- In this video, we're going to take the concepts we introduced in the last video, the log odds and the logistic, and introduce logistic regression. The learning outcomes are for you to be able to use a logistic regression to predict a binary value, and to understand how logistic regression generalizes linear regression.
- So recall linear regression, where we're trying to generate predictions ŷ. We do that with an intercept, and then the sum of scalar multiples of our feature values.
- We can generalize this in one way by transforming feature values, which we can think of as applying a function f_j to each feature x_ij, rather than just multiplying it by a coefficient. The full version of this is called a generalized additive model. That's transforming our input features; we can also do that as part of our data cleaning process.
- But in addition to transforming input features, we can transform the output. We're going to see that as we look at our classification setup. If Y is a 0/1 logical random variable, 1 is going to be what we call our positive class: is admitted, is spam, is fraud, whatever it is we're trying to detect. We'll say Y = 1 is the positive class and Y = 0 is the negative class. In some cases there's not going to be any hierarchy or moral value; we just have to pick one and call it the positive class. Then we have our predictor variables X, just like we did in linear regression.
- Our goal is to compute ŷ to predict y, just like we did in linear modeling, except now we don't have continuous values, and subtraction isn't meaningful: what happens if you subtract true from false?
- So one way to do this is, rather than estimating the value of Y, we can estimate the probability that Y is 1. Remember, when we have 0/1 data, the probability of a one and the mean are the same thing, so we can try to estimate the probability that Y is 1.
- We can do this with what's called a generalized linear model. A generalized linear model wraps the whole linear model in a function: we have a link function g, and we wrap the model in its inverse, g⁻¹.
- There are different ways to do this. We can do it for count data with what's called Poisson regression, where the link function is the log. We can do it with binary data, 0/1 outcomes, and that's what we're going to focus on here: this is called logistic regression, and the link function is the logit, so g⁻¹ is the logistic.
- It's called logistic regression because we wrap the result of our linear model, β₀ plus the sum of the βⱼ xⱼ, in the logistic function. So the probability that Y = 1 for a particular x, which is ŷ, our predicted value, is equal to the logistic of β₀ + Σⱼ βⱼ xⱼ.
- So in this case, y can be our variable "admitted to grad school," and we have x₁, x₂, x₃ as our GRE score, GPA, and school rank. I'm going to provide a notebook that does this, and we can try to predict admission with our ŷ. (A hedged sketch of this code also follows the transcript.)
- In the code here, we call GLM, asking for a generalized linear model that predicts admit with our three variables, and then we tell it the family. Generalized linear models have what are usually called families; we're going to say it's the Binomial family, which gives us the logit link function. Then we fit the model and get the results.
- This is going to look kind of familiar. We don't have an R², because we're predicting 0/1 outcomes, and it's not really meaningful to talk about their variance. We do have a log likelihood; we're going to see in a later video what that means. We also have our column of p-values, which gives us significance tests on our different coefficients. We can use them to drop non-significant predictor variables, like we would in linear regression.
- The coefficients are harder to interpret. They're not impossible to interpret in logistic regression, but they are harder. The first important thing is that they are all in terms of log odds: an increase in rank of one decreases the log odds of admission by roughly 0.5, per the coefficient in the results.
- That's the output of our logistic regression. We can then predict with a logistic regression using the predict method, just like we do with a linear regression in statsmodels. The predict method gives us predicted scores, and these are after applying the logistic, so they are estimated probabilities.
- We can then convert them into a binary class, actually make a decision, with a threshold: we can say we're going to accept everything where the predicted probability of acceptance is greater than 0.5, meaning it's more likely than not that this one would be a yes, according to our model. Different models and different tasks may require different thresholds, because there are going to be different costs for false positives and false negatives depending on our application.
- A couple of terms are going to be useful as we talk more about these. First, the outcome: this is the true outcome of the data, the thing we seek to predict with the model; we call it y. We also call this the ground truth, and it's why this is called a supervised learning problem: we have ground-truth outcome data that we're trying to learn to predict. For the purposes of building and evaluating our model, we assume the data is correct.
- Ground-truth data can be biased in various ways, which can affect what we learn from it and our ability to predict. It's also really, really important to note that the outcome variable needs to be the actual outcome we're looking for. You're predicting your outcome variable, and if your outcome variable isn't what you think it is, then you're not predicting the thing you think you're predicting.
- I mentioned this in class at one point, but Cathy O'Neil, author of Weapons of Math Destruction, pointed out on Twitter earlier this fall that in most cases we don't actually have crime data. If we want to predict crime, say whether an area is high-crime, or whether a crime is going to happen in a particular area at a particular time, we don't usually have crime data: none of our data tells us whether a crime happened or not. Our data says whether a crime was reported; our data says what police do. So if we're training off that data, we're not predicting crimes; we're predicting crime reports or police activity, which is not the same thing.
- So it's important to know what our observed outcome variable actually is, and how it relates to the task we're actually trying to solve.
- We then have a score, which is the prediction score that comes out of our model. In logistic regression, it's the estimated probability; if we just take the linear part of the logistic regression, it's the estimated log odds.
- And then we have our decisions. We use the score to make a decision, often by thresholding it, and that's what we decide to do.
- So if we've got a spam detector, our outcome is whether or not the message is spam, as it's been labeled, maybe by our users, maybe by our spam experts. Our decision is whether or not our model says it's spam. And what we're going to look for, when we start to look at the accuracy of these models and their predictions, is the extent to which those decisions match those outcomes: when we say it's spam, is it spam?
- So to wrap up: generalized linear models allow linear prediction of nonlinear quantities, such as counts and, particularly for our purposes, binary outcomes: yes or no, true or false. Logistic regression is the particular way we use a linear model to do binary classification.
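As a concrete reference for the code the video walks through, here is a minimal statsmodels sketch. The data frame here is a made-up stand-in for the grad admissions data in the demo notebook, and the variable names are mine.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# synthetic stand-in for the grad admissions data in the demo notebook
rng = np.random.default_rng(0)
n = 200
admissions = pd.DataFrame({
    'gre': rng.integers(300, 800, n),
    'gpa': rng.uniform(2.0, 4.0, n).round(2),
    'rank': rng.integers(1, 5, n),
})
logodds = -3 + 0.004 * admissions['gre'] + 0.8 * admissions['gpa'] - 0.5 * admissions['rank']
admissions['admit'] = (rng.random(n) < 1 / (1 + np.exp(-logodds))).astype('int64')

# logistic regression: P(admit = 1) = logistic(b0 + b1*gre + b2*gpa + b3*rank)
model = smf.glm('admit ~ gre + gpa + rank', data=admissions,
                family=sm.families.Binomial())
results = model.fit()
print(results.summary())   # log likelihood, coefficients (in log odds), p-values

scores = results.predict(admissions)   # estimated probabilities
decisions = scores > 0.5               # threshold into binary decisions
```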
🎥 The Confusion Matrix#
The confusion matrix describes the outcomes of a classification model and is the basis for computing effectiveness metrics.
Resources#
The Wikipedia article has a very good diagram of the confusion matrix and its derived metrics.
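All of the derived metrics start from the four cells of the matrix. Here is a minimal sketch of computing it with scikit-learn; the labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# hypothetical ground truth and classifier decisions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# with 0/1 labels, rows are true classes and columns are decisions:
# [[TN, FP],
#  [FN, TP]]
confusion_matrix(y_true, y_pred)
```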
📓 Logistic Regression Demo#
The demo notebook for our initial logistic regression videos.
🎥 Baseline Models#
📃 Floating Point#
This is provided for reference.
📃 StatsModels Documentation#
The following StatsModels page documents its logistic regression:
This is not an assigned reading - it is here for your reference.
🎥 Log Likelihood#
This video describes the log likelihood that is the objective function used by logistic regression.
- Hello. In this video, I'm going to introduce the log likelihood measure that you see when you're training a logistic regression. We're going to see how it's computed, and we're going to talk about what it means to estimate parameters with a maximum likelihood estimator.
- The learning outcomes for this video are for you to be able to compute the log likelihood of data and to understand the objective function of logistic regression.
- So recall that in logistic regression, we're learning a model ŷ = logistic(linear model), and ŷ is trying to predict the probability that our outcome variable will be one, the probability of yes. That's what we're doing in our logistic regression.
- But we can also think about the probability of the data. What if we want to compute, not the probability that Y is one, but the probability that Y equals our observed value? What probability does our model, our logistic regression model, assign to the outcomes that we actually observe in the data?
- That's what's being captured in the log likelihood. We can compute this using the scores, these estimated probabilities, and the actual values.
- We take the score and raise it to the power of the outcome variable, which is in 0/1, and multiply by one minus the score raised to the power of one minus the outcome: P(yᵢ) = ŷᵢ^(yᵢ) × (1 − ŷᵢ)^(1 − yᵢ). Remember, one minus the probability of something is the probability of not-something: if ŷᵢ is the estimated probability that yᵢ is one, then 1 − ŷᵢ is the estimated probability that yᵢ is zero.
- This is a trick for multiplication. We saw in previous examples that, when we're adding things, we can use multiplication by a one or a zero as a conditional: if you multiply something by one, you get that something, and if you multiply it by zero, you get zero, so adding those together effectively turns different pieces of the computation on and off. For multiplication, we can use exponentiation, raising something to a power, to do the same thing; remember that x⁰ = 1 no matter what x is, and x¹ = x.
- So if yᵢ = 1, then ŷᵢ^1 = ŷᵢ and (1 − ŷᵢ)^0 = 1: the product is ŷᵢ. And if yᵢ = 0, then ŷᵢ^0 = 1 and (1 − ŷᵢ)^1 = 1 − ŷᵢ: the product is 1 − ŷᵢ. So the one-or-zero exponent picks which of the two scores we actually use.
- That's exactly what we need, because if the observed value is one, then its probability is the result of our logistic regression, passed through the logistic function, so it's actually a probability; but if the observed value is zero, then we need the negation of that probability, one minus it. The exponentiation and the multiplication turn into precisely the switching, the conditional, that we need there.
- It's a neat little trick: multiplying a variable by zero gives the additive identity (zero), and raising a variable to the zero gives the multiplicative identity (one).
- So we can compute the probability of our observed data. We can also condition on the parameters. We've been thinking about the probability that Y = 1 given X, but we can extend that to the probability that Y = 1 given X and our parameters β. What our model is really computing, if we generalize it so the betas are inputs as well, is P(Y = 1 | x, β) = logistic(β₀ + Σⱼ βⱼ xⱼ).
- This gives us, for data y and X (I'm using boldface on the slide to indicate vectors of data values, as opposed to individual values, and to make them a little distinct from random variables), the likelihood of the data given the parameters: L(β) = P(y, X | β), which is proportional to P(y | X, β), which in turn is equal to the product over i of the probabilities of the individual yᵢ.
- We are assuming here that the observations are exchangeable: there's no difference if we shuffle the order. This is the probability of the exact sequence we observed; we could renormalize it if we wanted the probability of observing these data in any sequence. But this is the likelihood function. We'll talk a little more about what likelihood means and how it fits into a bigger picture shortly.
- I said that it is proportional: the ∝ operator here means the left-hand side is equal to the right-hand side multiplied by some scaling constant. The reason we get that is that, by the definition of conditional probability, P(y, X | β) = P(y | X, β) × P(X | β). But we're not choosing X based on β: y's probability is conditional on β, but X is independent of our parameters; it's just the data we have. So P(X | β) = P(X).
- And because X is fixed, we can treat its probability as an unknown constant, or we can treat it as one: the probability of having the data we have is one, is one way to think about it. If we treat it as one, the proportional-to becomes equals. Either way, we get this proportionality, so we can move the X to the other side of the conditioning bar in this specific case, because we're not using β to choose X. So we can then do the last piece and convert this to a log likelihood.
- The log likelihood is the log of the likelihood: the log of P(X) times this big product, because we want the probability of the first outcome and the second outcome and the third outcome, and since we're assuming all of our values are independent, that probability is equal to their product. Its log is log P(X) plus the sum of the logs of the individual probabilities, because multiplying values becomes sums in log space.
- So we can compute this as a big sum. If we did it as a multiplication, the value would get vanishingly small: in the example data set I've been using, it's on the order of 10⁻⁸⁰. But if we use logs, then it's in a much more reasonable space, like −152. Logs let us do the computation with addition and keep the values in a much more reasonable range, so we can compute with much smaller probabilities.
- The probability of any specific sequence, any specific set of observations, is relatively low. We can still talk about finding the parameters that give the data the most probability; the probability is just small because you could have shuffled the observations and gotten a very different sequence, on the order of n-factorial different orderings.
- So we have this log likelihood: the sum of the logs of the individual probabilities, and we saw in an earlier slide how to compute those individual probabilities. The code that actually does this is in the logistic regression demo notebook.
- So we have this likelihood function, and we can use what's called a maximum likelihood estimator. The way logistic regression actually trains is that it uses the log likelihood as a utility function (a utility function is the opposite of a loss function or cost function) and maximizes it: it finds the parameter values that maximize this log likelihood.
- We can optimize many other models the same way: we can optimize any model that produces a probability or a likelihood by maximizing the log likelihood of the training data given the model. This gives us, as I said, a maximum likelihood estimator. Note that it's maximizing the log likelihood, not the likelihood itself; but the maximum of the log likelihood and the maximum of the likelihood are at the same parameter values, because they have a monotonic relationship.
- If we expand this log likelihood, we get that it is equal to Σᵢ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]. Because we're now adding in log space, and because of how powers expand when you take a log, what was a multiplicative switch, the conditional we implemented with exponents, has now turned into an additive conditional.
- As I said, this is applicable to any model where ŷ is an estimate of the probability of the positive class.
- To show you an example of computing this: take the first data point. Its score ŷ is about 0.1828, but its admit value y is 0, so we compute ŷ⁰ × (1 − ŷ)¹ = 0.8172¹ = 0.8172, and that's the probability assigned to this observation. If we then take the log, we get a log likelihood of about −0.2 for this data point.
- We sum up all these per-point log likelihoods, and we get the total log likelihood of the data for this model.
- We can compare these on the same data. But if we have a model that fits just as well, but different data, even if it's just, say, half of this data, that changes the likelihood, because the likelihood is over the whole data set. So you can use the log likelihood to compare models on exactly the same data; but as soon as you change the data set, you can't. You cannot use the log likelihood to compare a model on one data set with the same or a different model on a different data set. It's only comparable within the exact same set of training data.
- I said I was going to tell you a little about what it means to make a maximum likelihood estimate by writing down a likelihood function and maximizing it. If we look at Bayes' theorem, we can break it down into a few pieces: P(θ | y) = P(y | θ) × P(θ) / P(y), where y is our data (I'm folding X into y for now) and θ is our model parameters.
- We have the posterior, P(θ | y). Oftentimes we want to be able to ask: given the data I have, what's the probability of a particular set of model parameters? This is the heart of what we call Bayesian inference. Not all applications of Bayes' theorem are Bayesian inference, but Bayesian thinking is quite common in various machine learning applications. So we can think about the probability of my parameters given my particular data, and we compute that from a few pieces.
- We have our prior, P(θ): before I see any data, what probability do I assign to different portions of the parameter space? This might be uniform; it might be broad, like some broad normal; it might be based on actual information.
- The likelihood function, P(y | θ), tells me, for a particular parameter set, how likely my data is. That's what we just saw computed for a logistic regression: if this is the true value of the parameters, how likely is the particular data I would have seen?
- Then we have P(y), which for our purposes is effectively a scaling factor: for a given data set, if the data set is not changing, its probability is not going to change. So if our goal is to find the θ that maximizes either the likelihood or the posterior, we can ignore it most of the time, because multiplying by a scalar doesn't change where the maximum point is; it only changes the value at that maximum. We treat it as a scaling factor; if you need the definition, it's the integral, over all possible parameter values, of the numerator.
- In training a logistic regression, we maximize the likelihood. We call this a maximum likelihood (ML) estimator because it does exactly what it says: it finds the parameter values for which the data is as likely as possible under our particular model.
- We can also maximize the posterior: find the θ that is most likely given our data. That's often more computationally expensive. When the prior is, say, uniform, constant across the parameter space, there's no difference. Also, as the amount of data in y increases, the relative importance of the likelihood increases and outweighs the prior; with lots and lots of data, the prior doesn't influence the posterior very much, so long as it's sufficiently broad over your parameter space. So the parameter values that maximize the likelihood will be very close to the parameter values that maximize the posterior. The exact relationship between them, and in detail when you can use ML and when you really need to use MAP (maximum a posteriori), you should see in more detail in either the machine learning or the computational statistics course. For our purposes right here, they're going to be very, very similar.
- To wrap up: logistic regression is trained by maximizing the log likelihood of the training data given the model with a particular set of parameters. You could implement this yourself. If you want practice, you could take what we did in last week's material to optimize a linear regression with the optimize function, and use that to optimize the parameters of a logistic regression. Note that maximizing the log likelihood is equivalent to minimizing the negative log likelihood, so you can just tack on a negation. If you want to see this in action, I encourage you to open up one of the notebooks and try using the optimizer to train yourself a logistic regression; a sketch of the idea follows.
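Here is a minimal sketch of that exercise, using scipy.optimize.minimize on the negative log likelihood. The data is synthetic and all names are mine; it illustrates the idea rather than reproducing the notebook's code.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    # params[0] is the intercept, params[1:] are the coefficients
    linear = params[0] + X @ params[1:]
    p = 1 / (1 + np.exp(-linear))      # logistic: estimated P(y = 1)
    # sum of y*log(p) + (1-y)*log(1-p), negated so we can minimize
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# synthetic data: 100 instances, 3 features, known true coefficients
rng = np.random.default_rng(20201024)
X = rng.normal(size=(100, 3))
true_p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.25]))))
y = (rng.random(100) < true_p).astype('float64')

res = minimize(neg_log_likelihood, np.zeros(4), args=(X, y))
res.x       # fitted intercept and coefficients (the MLE)
-res.fun    # the maximized log likelihood
```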
🎥 Scikit-Learn#
This video introduces SciKit-Learn, and using it for a logistic regression.
- In this video, I want to introduce you to scikit-learn, which is another toolkit for training models. I want you to be able to apply a logistic regression model with scikit-learn, and to understand the API differences between scikit-learn and statsmodels. The code to accompany this is in the course notes for this week, so you'll be able to go see the scikit-learn notebook. It's the same problem I used for the logistic regression demo with statsmodels, so you can directly compare a statsmodels solution and a scikit-learn solution. (A hedged sketch also follows this transcript.)
- To train a model with scikit-learn, we need to do a few things. I'm going to create a couple of variables that store the names of columns, just to make it easy to extract the same columns, in the same order, from both my training data and my test data. I get my training X, my input features, by getting the feature columns: these are my predictor features. Then I grab the outcome variable: this is my outcome. Then I set up my logistic regression, creating a LogisticRegression model, and I just pass those values to fit. It trains the model's parameters based on my data to fit my logistic regression model. And that's all there is to it.
- The structure is just a little different here. Notice that fit does not return a new results object; it actually just returns the model. There are not two objects: in statsmodels, you create a model, and then you fit it and get results. In scikit-learn, you create a model and then you fit it, and the results are stored in the model object. Lots of other software in the Python ecosystem follows the scikit-learn API patterns, so if you're familiar with them, that's going to help you with a lot of other software. TensorFlow follows the same patterns; many other packages follow the same patterns as scikit-learn.
- When we want to use the model, we use predict. Where statsmodels' predict gave us the estimated probability, in scikit-learn it gives us the actual predicted class. It gives us the decision; it makes the decision right away. One of the advantages of this is that all of the scikit-learn classifier APIs, the classifier models, do this: predict returns the decision.
- There are other functions to give you the underlying scores: decision_function gives you the log odds, and you can use predict_proba to get the probabilities. It returns both the probabilities of zero and the probabilities of one, so we need to get the second column, the "1" column, for the probabilities of one. That is equivalent to the statsmodels output.
- The nice advantage of scikit-learn's predict returning classes is that it's easier to use directly: you don't have to manage thresholds yourself. And since all of the different scikit-learn classifier models return decisions like that, you can write classification code that can use any model, and you can plug different models into the same code. That makes it easier to exchange models as part of your workflow, and to test the performance of different models in the same data workflow.
- A few differences from statsmodels. Scikit-learn works with matrices, not data frames. A data frame is a matrix, so they're compatible: you can treat a data frame as a matrix. But scikit-learn doesn't know that it's a data frame; it just treats it as a matrix. One thing this means is that your column labels are ignored: everything is based on position. You need to have your columns in the same order for fit and for every call to predict; fit and predict require the same column positions.
- You also have your input feature matrix and your outcome variable as separate objects: a matrix of input features and a vector of outcome variables. That's why, when I was doing the training, I got our input matrix, and I got our outcome column as a series, which can be treated as a NumPy array; that's what scikit-learn does. Scikit-learn doesn't know anything about your column names.
- Also, scikit-learn just has a single model object that's updated by fit; as I said, there's no separate results object. And predict returns the predicted class, not the score.
- Use-case-wise, statsmodels is good for inference, and it's easier to do inference with statsmodels than with scikit-learn. It reports a lot of statistical goodness-of-fit measures, and it defaults to unregularized models, which are often easier for inference. It also does things like: when you do your fit and get your results, it gives you all the residuals. With scikit-learn, you have to go back and get those yourself, because scikit-learn is built for prediction: you don't need the residuals to do a prediction; you only need the residuals to understand your training process, for inference.
- Scikit-learn, though, is really good for prediction. It has many models, many more than statsmodels, and its API outputs the actual prediction, not just the underlying score, by default. It also has a bunch of capabilities for data transformation, preprocessing, post-processing, etc. And it only saves the parameter estimates: when you call the logistic regression's fit, it learns coefficients, it learns an intercept, and that's all it saves. It doesn't save your fitted values; it doesn't save your residuals. You have to go back and get those by having it predict the outcomes for your training data yourself, if you want them. So it's more work if you want to do inference, but it's fantastic for prediction.
- So to wrap up: scikit-learn provides a lot of machine learning models, including logistic and linear regression; you can go back and do the linear regression work we've been doing with scikit-learn as well. It's more difficult to do inference with scikit-learn, but it's got a broader selection of models that are useful for prediction, and in most production predictive analytics tasks, you're probably going to want scikit-learn instead of statsmodels.
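Here is a minimal sketch of the scikit-learn workflow the video describes. It assumes pandas data frames named train and test with the admissions columns from the earlier demo; those names are mine, not the notebook's.

```python
from sklearn.linear_model import LogisticRegression

feat_cols = ['gre', 'gpa', 'rank']   # same columns, same order, for every call

# train and test are assumed data frames with these columns
train_x = train[feat_cols]           # feature matrix
train_y = train['admit']             # outcome vector

model = LogisticRegression()
model.fit(train_x, train_y)          # results are stored on the model itself

test_x = test[feat_cols]
model.predict(test_x)                # hard 0/1 decisions
model.predict_proba(test_x)[:, 1]    # column 1: estimated P(admit = 1)
model.decision_function(test_x)      # underlying scores, in log odds
```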
📓 SciKit-Learn Logistic Regression#
The SciKit Logistic notebook demonstrates training and using sklearn.linear_model.LogisticRegression.
🎥 Receiver Operating Characteristic#
This video introduces the receiver operating characteristic (ROC) curve, and its use in evaluating classifiers and selecting tradeoffs.
- Hello. In this video, I want to introduce you to ROC curves, which we can use to understand and visualize tradeoffs between different types of errors as we change the threshold for our logistic regression or other classifier.
- The objectives here are for you to be able to plot a metric, such as precision, as a curve over thresholds, and to compute and interpret a receiver operating characteristic curve.
- The metrics from the confusion matrix are based on hard yes/no outcomes: they compare your decision, one or zero, yes or no, to the observed outcome in our ground-truth data, one or zero, yes or no. But a single classifier often has tradeoff points. When you train a logistic regression, the default (this is what scikit-learn does) is to use 0.5 probability, or zero log odds, as the threshold: if one is more likely than zero, it returns one. But we could be more conservative, say requiring an 80 percent probability of one in order to classify as one; or we could classify as one as soon as there's a 20 percent probability, depending on the needs we have and the specific costs of false positives and false negatives in our application.
- So we can plot curves for various metrics. Here I've done precision; you could do it for recall, or for accuracy, at different thresholds. Here I have thresholds, in log odds, on the x-axis, and the axis decreases as you go right. So as we go from left to right, we're decreasing our threshold and seeing what happens to the precision.
- It's wobbly at the top, wobbly at the higher end, because as we make a few more classifications it can wind up being more precise for a little bit. Then it starts stabilizing, and as we keep decreasing the threshold, we keep decreasing the precision of our system, because we're classifying more and more instances as yes: we're digging deeper and deeper into the barrel to find the ones we want to classify as yes, and we wind up classifying quite a few noes as yes.
- We also see here that, compared with our zero cutoff, a threshold of about −0.5 actually has a higher precision than the default cutoff of zero. This is useful for actually setting the threshold value: you can plot your metric, be it precision or something else, at these different thresholds, and use that to gain insight and think about where you want to set your threshold.
- Another curve we sometimes use for evaluating classifiers is the receiver operating characteristic (ROC) curve. In the ROC curve, we plot the true positive rate on the y-axis and the false positive rate on the x-axis. What this lets us see is: as we increase our tolerance for false positives, what happens to the number of true positives we find? Remember, the true positive rate and recall are the same thing.
- So if we want to find half of the yes cases, what false positive rate do we have to accept in order to do that? On this curve, we have to accept about 0.2. And if we want to find 80 percent of the positives, we have to accept around 0.43 or so. It lets us see what false positive rate we have to tolerate in order to achieve a certain recall with our classifier.
- You can also do other curves like this for other pairs of metrics, plotted against each other: a precision-recall curve looks at the relationship between precision and recall as you change your threshold.
- Another thing that's important to note is that the diagonal line here is random.
- A random classifier is going to get the diagonal line's performance. If your curve is up over here to the left, it's doing better.
- Even at the same overall performance, one classifier's curve might go up more quickly and then level off, while another trails off for a while and then gets better. This lets us see the different tradeoff points for different classifiers, and determine which one has a curve that better aligns with the needs of our application; generally, we want to pick up the true positives quickly.
- But with this curve here, as soon as you want more than about 0.6 recall, say you want 0.8 recall, you have to accept a lot more false positives with this one than you do with the blue curve, because it doesn't cross 0.8 until much further right. Whereas if you want a recall of 50 percent, you don't have to accept as many false positives with this one. So it characterizes the tradeoffs of the different classifiers and lets you pick one that's better aligned with the needs of your particular application.
- If you wind up with a classifier down here, below the diagonal, it's worse than random; but invert its classifications, say yes when it says no, and you'll get a relatively good classifier. So we're really paying attention to classifiers in the top-left triangle of the ROC plot.
- We can also compute the area under the ROC curve, and this gives us a metric, the AUC. A random classifier, that diagonal line, is going to have an AUC of 0.5. Greater is better, because the only way to get greater than 0.5 is to have a curve that goes up faster, giving you more true positives for the false positives than you would get with a random classifier. If it's less than 0.5, again, you can invert your classifier.
- Also, the area under the curve is equal to a probability: if you pick a positive item and a negative item at random, the AUC is the probability that your classifier put them in the correct order, that it scored the positive item above the negative one rather than vice versa. This becomes useful in applications where what you care about is the relative ordering of things. It's important when you're building systems that rank their outputs, where the classification is "we're going to take the top ten as our good ones"; that's basically what a search engine does. The area under the curve gives you the probability that you put two things in the right order.
- So to conclude: ROC curves give us a way to see tradeoffs between false positives and true positives, and to compare the tradeoff curves of multiple classifiers. They also give us the area under the curve metric, which we can use to quantify classifier performance. (A short plotting sketch follows.)
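A minimal sketch of plotting an ROC curve and computing the AUC with scikit-learn; the labels and scores are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# hypothetical ground truth and classifier scores (e.g. from predict_proba)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, scores)
plt.plot(fpr, tpr, label='classifier')
plt.plot([0, 1], [0, 1], linestyle='--', label='random')  # the diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.legend()
plt.show()

roc_auc_score(y_true, scores)   # area under the curve
```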
✅ Practice#
Load the Penguin data, and use a logistic regression to try to classify a penguin as Gentoo or Chinstrap using various measurements. Delete the Adelie penguins first, so you have a binary classification problem.
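If you're not sure where to start, here is one possible setup. It assumes the Palmer Penguins data is available as penguins.csv with a species column and measurement columns (adjust the names to your copy of the data); the modeling itself is left to you.

```python
import pandas as pd

penguins = pd.read_csv('penguins.csv')

# drop Adelie so only Gentoo and Chinstrap remain: a binary problem
two_species = penguins[penguins['species'] != 'Adelie'].dropna()

# encode Gentoo as the positive class (1) and Chinstrap as 0
two_species['is_gentoo'] = (two_species['species'] == 'Gentoo').astype('int64')
```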
🎥 Biases and Assumptions#
This video revisits sources of bias and discusses the assumptions underlying prediction.
- Hello again.
- In this video, I want to talk some more about biases, which we've been discussing a little throughout the class, and also about the assumptions we make when we're doing a predictive modeling task.
- The learning outcomes are for you to be able to reason about potential biases in classification inputs and outputs, and also to identify some cases where building a classifier or a predictor (this applies to regression too) is not appropriate.
- I want to talk a little about the assumptions of most predictive modeling tasks. We assume that we have an outcome variable, observed unbiased; that's not the same as not erroneous. We assume that there's no systematic bias in our outcome variable or in our features, that our outcome variable actually matches the target of the thing we're trying to classify, and that predicting this outcome with these features is reasonable.
- Think about what the fact that you are trying to do this prediction implies. One of the readings I've assigned talks in more detail about the assumptions of using prediction systems to make decisions.
- So recall from week two that I talked about a few sources of bias. Selection bias is when there's a discrepancy in who gets selected: your selection is not uniformly at random from the population; some instances are more likely to be selected than others. Response bias is when some selected instances are more likely to give you a response than others. This is super common when we're dealing with human data, because even if you are perfectly unbiased in whom you select to ask a survey question, people aren't always going to respond, and there may be a correlation between whether or not they respond and what their response would be if they responded.
- One example of this, one that isn't survey refusal: suppose you ask somebody to rate a random movie, one they may have never seen, and you're not having them watch it, just "hey, what did you think of The 5,000 Fingers of Dr. T?" They're more likely to be able to answer that question for movies they watched, and they're more likely to have watched movies that they think they're going to like. You can think of this as a selection bias: users selecting movies to rate are more likely to pick movies they wanted to watch. But if you flip it around, so it's you asking a person what they think of The 5,000 Fingers of Dr. T, they're going to be more likely to respond if it's a movie they thought they would have liked and watched. You're not going to get very many respondents on The 5,000 Fingers of Dr. T.
- And there's measurement bias: you get the response, and it skews one way or another in a systematic way, possibly based on sensitive attributes, on protected-group membership, of the data subjects.
- So it's important to notice that there's a difference between error and bias.
- Observations often have error in them; bias comes in when they are systematically erroneous.
- When we talk about an unbiased estimator in statistics,
- an unbiased estimator is an estimator whose expected value is equal to the parameter it estimates.
- Bias comes in when the estimates, or the actual observations we're making, are systematically higher or lower,
- or when they trend differently for different groups. Maybe the overall means are even,
- but one group tends to score higher on your measurement than another group, or you're mis-measuring one group more often than another.
- So if you have features or outcome variables that are unbiased, then you can just roll your errors into your model uncertainty:
- if the errors are independent and identically distributed, they're just more uncertainty in your model.
- If the features have a known bias, you may be able to correct for it.
- You may be able to remove a bias term.
- Or you may be able to make some assumptions, like saying: there's a difference in scores between these two groups,
- but we don't believe the groups are actually different, so we'll just normalize the scores within groups (a minimal sketch of this appears after the transcript).
- Some of these things happen in election forecasting.
- Election forecasters pool together polling data from a bunch of polling sources, along with other data that affects their forecast.
- And one of the things they have is a model of the bias of different polling houses.
- Some polling houses' sampling strategies might have a Republican-leaning or a Democrat-leaning bias,
- and that shows up in their polling results.
- You can see this in some ways through their agreement with each other, and also through their historical agreement with election outcomes.
- If you assume that a house's bias is relatively stable over time, and you've got good data to estimate that bias,
- you can use those estimates to adjust the individual polls when you're pooling together multiple polling sources for election modeling.
- That's one example of deliberately trying to de-bias, where you've got an estimate of the bias (a toy sketch of this also follows the transcript).
- Then there's unknown bias, which is really common. You need to start thinking about how the data are biased, and how severe it is.
- First, can you try to quantify the bias? But then also: what are the downstream applications?
- This is one of the reasons why we always start with our goal and move from it to our question, because some biases will affect the question.
- The same bias may render some questions unanswerable and have negligible impact on other questions.
- So the problems that arise from bias in our data are not necessarily intrinsic to the data itself.
- They arise in combination with what we're going to do with the data, and that needs to
- inform how we go about understanding the impact of a potential bias in the data.
- We also need to think about the assumptions we bring to our data; some assumptions can help us get out of some problems.
- We need to document them and be clear about them, but assumptions can provide us some guidance.
- So, for example, we find that S.A.T. scores differ by socioeconomic status; that's relatively well established.
- But what causes this? Which is more likely: that poor students are intrinsically less academically
- capable, or that poor students have had less access to academic preparation?
- That preparation can be formal, such as S.A.T. prep courses or a greater selection of college-prep courses in high school.
- And it can be less formal, such as a greater selection of reading materials when they were in elementary school.
- It's relatively well established that good access to reading
- (access to a good quantity of reading materials for young children, and engagement with reading)
- can really help with educational outcomes down the road.
- There's one study that we cite in some of the research I've been involved with, which found that if you engage children in what they called authentic literacy tasks,
- that is, reading things besides a textbook,
- when they are in the first or second grade, then later on, around junior-high age or a little younger,
- those students have higher learning outcomes on various STEM tasks.
- So that's the kind of informal preparation I mean: a student who has early access to good
- and diverse reading materials is going to do better academically,
- and that's a good reason to believe they'll do better on the S.A.T.
- One of the research projects I'm involved with is looking at how we can use technology tools to make it easier
- for teachers to provide more and different reading materials for their students, with minimal cost,
- so that it's accessible in low-resource educational settings.
- Then there's also the question: does the S.A.T. measure some combination of academic
- capability and familiarity with the conventions and expectations of the middle to upper class?
- So when you see a difference like this, what's more likely: that the students are intrinsically different, or that
- there's a difference in access to academic preparation, and that's what we're actually measuring?
- And if that's what we're actually measuring,
- what implications should that have for how we actually use the resulting numbers and what we do with them?
- So I also want to talk a little bit about outcome-target mismatch.
- What I mean by this is: you're trying to predict X, but you have a class label for X', which is not really X.
- So you're actually training a model to predict one thing when your goal is another.
- Sometimes you have to; sometimes all we have is a proxy, and that's a reasonable thing to do.
- We need to evaluate the quality and the credibility of our proxy, because sometimes a proxy is all we have.
- But sometimes the proxy is too disconnected from the target to be credible.
- For example, suppose we want to model crime: we want to be able to predict the crime level of a neighborhood,
- and the data we have are crime reports and/or arrests.
- What we're training the model to do is predict crime reports and police activity, which is not the same thing as actual crime,
- because crimes can go unreported in an area for a variety of reasons.
- One could be that the people in that neighborhood have less trust in the police,
- so they're less likely to report minor crimes; or crimes aren't observed because that area isn't as heavily policed,
- so crime is happening, but police aren't there to observe it and make arrests, and nobody's bothering to report it.
- If you're trying to predict crime, but your prediction model is trained on
- crime reports or police activity, you're not predicting crime; you're predicting crime reports or police activity.
- And so you have to be really careful about the relationship between the labels you have and the actual target variable that you're trying to measure.
- With that, our claims need to be supported by our evidence.
- We can't claim that we're detecting crime when we are testing our classifier on crime reports, because we're testing its ability to detect
- reported crime, which is not the same thing as crime that actually happened.
- We're missing all of the unreported crimes, and we also have reports of things that weren't actually crimes.
- This is why, early on in the semester, I talked with you about the goal-question-analysis chain,
- and I encourage you to go back and review that material. We have a goal,
- we refine that into research questions, and then we connect the research questions to the analysis.
- At every step, the connection needs to be clear: the analysis we do needs to directly illuminate the research question, and we need
- to revise the analysis and/or reframe the research question until we have a match,
- so that the analysis is actually addressing the question (for example: can we detect crime reports?).
- Then the question needs to advance our goal. Breaks anywhere in that chain
- reduce the ability of our data analysis to advance the goals that we're trying to use it for.
- We also need to think about the reasonableness of the task,
- because whenever we use X to predict Y, we are assuming that that's a reasonable thing to do.
- Predicting college performance with S.A.T. scores? That's kind of what the S.A.T. is built for.
- There are problems with the S.A.T.,
- but this isn't an inherently unreasonable task: performance on a standardized test as a basis for predicting future academic grades is,
- on its face, not a bad assumption. But every year or so,
- somebody or another gets the idea that they're going to take a deep learner and train it on a bunch of photos,
- with an outcome variable that is some attempted measure of criminality: has been arrested, or maybe has been convicted.
- They're trying to predict criminality from photos, and what
- this assumes is that facial features are a legitimate predictor of criminality.
- And so the question arises: why would you assume that? This mechanism is called physiognomy.
- It's been rejected for a good, solid century; it was kicking around in the eighteen hundreds, and then people realized that it was a bad idea.
- Its close cousin is phrenology. Physiognomy is when you're trying to predict attributes such as criminality or other personality traits
- from facial features; phrenology is when you're trying to do it based on the shape of the skull.
- It involves a lot of calipers and skull measurements.
- The assumption here is that you can use these physical characteristics to predict criminality. You can probably predict "arrested";
- you might even be able to predict "charged" or "convicted of a crime", because the social process of
- constructing what is a crime and who gets arrested for a crime is going to have correlates with physical attributes.
- But at its base, we have to think about what crime is. Crime is actions
- that, as a society, we have decided are sufficiently aberrant that they deserve criminal treatment.
- Some of those are relatively uncontroversial, like theft. But if crime is an action in violation of our societal laws,
- why would that be a physically observable characteristic? What's the theory, what's the mechanism there?
- Because you can find all kinds of correlations. Theory, and thinking about
- our assumptions and what theoretical constructs could motivate something,
- are one of the guiding points we can use to keep away from some of these morasses.
- So be careful what you assume. Theory drives our research questions. We don't just ask every question willy-nilly.
- That's a very inefficient use of science. Whether it's theory we have that we want to clarify or apply to a new problem,
- or proposed theories that we want to evaluate,
- we should first give them a smell test:
- is this a reasonable theory to try to evaluate? Theory also drives our predictions.
- We don't want to just throw a bunch of data at scikit-learn or at TensorFlow or whatever and use whatever predictions come out.
- We want to put some thought into the process, and we want to think about: is this a reasonable prediction task?
- Is this a reasonable set of features? Is it legitimate to try to predict
- whether this person is a criminal based on what they look like, based on the picture from a CCTV camera?
- And even supposing we did have a reliable correlation, is that a societally legitimate or useful thing to do?
- Another problem we can have is label dependency: our observations are often incomplete.
- For example, if the bank does not give someone a loan, it doesn't get to observe whether or not they would have paid that loan back.
- Machine learning and statistics researchers like to give things clever names;
- this is called the apple tasting problem (a tiny sketch of the loan case follows the transcript). Also, criminal databases only contain those who were caught by the justice system in some way;
- they don't have the people who committed crimes and didn't get caught.
- We also have to be careful of inverse probabilities: the probability of A given B and the probability of B given A are not the same thing.
- If you've got two groups and you look at their composition,
- that alone is not enough evidence to draw conclusions in the other direction.
- If you look at the racial makeup of basketball players, what you get is the probability of race given basketball player,
- but that doesn't give you the probability of skilled basketball player given race.
- You have to be really careful about accidentally inverting your probability when you don't have the rest of the pieces of Bayes' theorem.
- This is one of the common traps in informal probabilistic reasoning (a short worked example follows the transcript).
- You also have to be careful with pooling data from different sources.
- One example: if you have a bunch of mug shots of people getting arrested, and you use those as your criminal cases,
- well, what are you using as a non-criminal face?
- If you're getting it from a different set, then are you really learning criminality, or are you learning
- the distinctive visual features of a mug shot? This also comes up in a variety of other settings.
- You need to pay attention to what your learner is actually learning when it tries to make a prediction.
- I read of a case a few years ago where a machine learning algorithm for examining X-ray photos
- was trying to learn to identify X-ray photos that indicated a particular medical condition.
- It had relatively good accuracy, but someone dug into what it was actually learning and what parts of the images it was looking at.
- It was looking at a code over on the side that indicated where the X-ray was taken, because some of the X-ray pictures came from
- a hospital where people were far more likely to have the disease,
- and the others came from a more general hospital or X-ray lab.
- So it wasn't actually learning to identify the disease in the X-ray photo;
- it was learning that X-ray photos taken in a particular hospital's lab were more likely
- to show the disease, because that was where the more advanced cases were being sent.
- So you have to be really, really careful. Even if you've got a reasonable set of data ("oh, I have a bunch of X-ray photos"),
- there can be differences that you don't expect, and you need to be careful about what your system is actually learning.
- So what do you do about all of this? Unfortunately, there's no quick-fix solution.
- You can't `import sklearn.unbias`,
- and if someone ever gives you `sklearn.unbias`, be very, very skeptical.
- The starting point is that we assume we've got bias problems; we just don't necessarily know what they are, especially if you've got social data.
- The question isn't, are the data biased? The question is: how are they biased? How much are they biased?
- And what impact does that have on the conclusions and tasks to which we're trying to apply them?
- We then need to start understanding how the data were collected and what the labels actually are.
- This is why we spent so much time early in the class talking about how you actually describe data,
- and why I had you read about describing where your data came from, how it was collected, and why it was collected:
- incentive structures can skew the data collection process, and you need that information to start reasoning about it.
- Study what's known about the data biases in your domain. There may be a body of existing research
- you can draw on to understand what's going on.
- Also look for systematic variations in the data, especially if you have data from different groups of people
- or from different sites.
- Those variations alone aren't enough to tell you the drivers of different biases, but they give you a starting point for where to go looking.
- Then you can go look for research on what might cause the kind of difference you're seeing between groups.
- Also, clarify and document all of your assumptions.
- We want to do a good job with our analysis. Our analyses will never be perfect;
- at some point they need to be done. But clarify and document what you're assuming at each step of the pathway. Document why you're building this model.
- What's your theoretical justification for using these features to predict this outcome?
- What does that theoretical justification have to say about how you should use these features and how you should evaluate your model?
- Always be critical of your own work, and reasonably critical of the work of others as well.
- Does your problem make sense? Does your outcome make sense? Are the results too good to be true?
- If they seem too good to be true, they often are. But then, also read broadly and critically.
- This is one that's hard to give a quick fix on, too.
- But a lot of what I learn about the ways biases creep in, and how to deal with them in the data that I work with
- (much of it domain-specific and contextual), comes from reading widely, sometimes reading deeply,
- but reading a lot of different things:
- not just the statistics research, the data science research, and the computer science research, but good pop-science work,
- legal scholarship, and
- various other things that give a more holistic picture of what's going on in the domains that I'm trying to study.
- So to wrap up: all of our analyses are based on assumptions, and you need to be clear what your assumptions are.
- You need to study how your data are biased, and there's no magic bullet for any of this.
- We're going to talk about some measures for how you can quantify bias in the outcomes of a system.
- But there's no magic bullet. It requires continuous critical thought and reflection on what it is that we're doing, and interrogation of what our
- system is doing, how its impacts are distributed, and what its underlying data and conceptual and theoretical bases are.
- This is also a place where the video about epistemologies from a few weeks ago becomes relevant.
- It is one of the places where critical epistemologies can become very useful, because they give us a starting point for
- the ways in which our system could go wrong and/or might be a bad idea.
- That doesn't mean we just shut the work down because someone said something,
- but it's something we need to reflect on, incorporating what we learn from reading critical scholarship
- and critical analysis into how we think about going about the work that we're trying to do.
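To make the within-group normalization idea from this video concrete, here is a minimal sketch; the DataFrame, its `group` and `score` columns, and all values are hypothetical, and the step is only defensible under the documented assumption that the groups do not truly differ on the underlying quantity:

```python
# Minimal sketch of within-group normalization with pandas.
# The 'group' and 'score' columns and all values are hypothetical.
# This only makes sense under the assumption that the groups do not
# actually differ on the underlying quantity being measured.
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'score': [10.0, 12.0, 11.0, 20.0, 22.0, 21.0],
})

# Standardize each score within its group (z-score per group)
df['score_z'] = df.groupby('group')['score'].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(df)
```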
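The poll-pooling adjustment can be sketched the same way; the houses, poll numbers, and bias estimates below are all invented for illustration, and real forecasters use far richer models:

```python
# Toy sketch of adjusting polls for estimated house bias before pooling.
# Houses, poll numbers, and bias estimates are all made up.
import pandas as pd

polls = pd.DataFrame({
    'house':     ['H1', 'H2', 'H3', 'H1'],
    'dem_share': [0.52, 0.48, 0.51, 0.53],
})
# Estimated house bias (positive = historically leans Democratic),
# e.g. from comparing a house's past polls to election outcomes
house_bias = {'H1': 0.02, 'H2': -0.01, 'H3': 0.00}

polls['adjusted'] = polls['dem_share'] - polls['house'].map(house_bias)
print('raw average:     ', polls['dem_share'].mean())
print('adjusted average:', polls['adjusted'].mean())
```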
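The loan example of label dependency can also be made concrete; the columns and values in this tiny sketch are hypothetical:

```python
# Tiny sketch of the selective-labels ('apple tasting') problem:
# repayment is only observed for applicants the bank approved.
# Columns and values are hypothetical.
import numpy as np
import pandas as pd

applicants = pd.DataFrame({
    'approved': [True, True, False, True, False],
    'repaid':   [True, False, np.nan, True, np.nan],  # NaN: never observed
})

# Training only on labeled rows silently drops everyone who was denied
train = applicants.dropna(subset=['repaid'])
n_missing = applicants['repaid'].isna().sum()
print(f'{n_missing} of {len(applicants)} outcomes were never observed')
```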
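Finally, the probability-inversion trap has a short worked version. The numbers are made up, but the arithmetic shows why P(A|B) and P(B|A) can be wildly different when B is rare:

```python
# Worked example: P(A|B) vs. P(B|A) via Bayes' theorem.
# All probabilities here are made up for illustration.
# Bayes: P(B|A) = P(A|B) * P(B) / P(A)
p_b = 0.0001          # base rate of B (e.g., being a pro basketball player)
p_a = 0.13            # base rate of A (e.g., membership in some group)
p_a_given_b = 0.70    # composition of B: P(A | B)

p_b_given_a = p_a_given_b * p_b / p_a
print(f'P(A|B) = {p_a_given_b:.2f}')
print(f'P(B|A) = {p_b_given_a:.6f}')  # tiny, because B is rare
```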
📃 Prediction-Based Decisions#
Read Sections 1 and 2 of the following paper:
Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, Kristian Lum. 2018. Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv:1811.07867 [stat.AP].
We’ll come back to ideas here, but sections 1 and 2 describe the assumptions underlying most classification problems. While the overall topic of the paper is fairness in making these decisions, I am not assigning it because it is a fairness paper; rather, those first two sections provide a succinct description of the assumptions that we make when we undertake most classification problems. They apply no matter what properties of a classification problem or model we care about.
If you would like to learn more, I recommend:
🚩 Week 10 Quiz#
The Week 10 quiz will be posted to Canvas.
📃 Abolish the #TechToPrison Pipeline#
Read Abolish the #TechToPrison Pipeline (the Medium reading time estimate includes the thorough — and valuable — footnotes and list of 2435 signatories). This article probes in more detail the assumptions underlying classes of criminal justice data science applications.
📩 Assignment 5#
Assignment 5 is due November 6.