Week 11 — More Modeling (10/31–11/4)#

This week, we’re going to learn more about model building, which will be useful in Assignment 5:

Feature engineering
SciKit-Learn pipelines and workflows
Regularization
Analyzing model results

🧐 Content Overview#

Element	Length
🎥 Intro and Context	4m39s
🎥 Feature Transforms	21m3s
🎥 Workflow and Iteration	14m29s
🎥 Pipelines	7m19s
🎥 Regularization	15m4s
🎥 Models and Depth	7m23s
🎥 Inference and Ablation	14m55s
📃 Statistical Significance Tests for Comparing Machine Learning Algorithms	3400 words
🎥 Dates	8m34s

This week has 1h33m of video and 3400 words of assigned readings. This week’s videos are available in a Panopto folder.

🎥 Intro & Context#

In this video, I review where we are at conceptually, and recap the ideas of estimating conditional probability and expectation.

Video (4m39s)

Slides

In this video, I'm going to introduce our week's topic of building and evaluating models, talking in more detail about how we go about doing that.
The learning outcomes for the week are for you to be able to build and refine a predictive model,
construct features for that model, apply regularization to control features and their interactions and to give us models that
generalize better, and measure a model's effectiveness and its other behavior.
So where we're at right now: we've seen linear regression, which lets us do continuous prediction,
where we want to predict a continuous outcome or target variable.
We've seen logistic regression that lets us take the concept of linear modeling and move it into the realm of binary classification,
where rather than having a continuous outcome variable, we have a binary outcome such as defaulted on the loan or is spam or fraud.
We've also seen the idea of optimizing objective functions. We might minimize a loss function such as the squared error.
We might maximize a utility function such as log likelihood. These are equivalent to each other:
if you've got a utility function and a minimizer, you can minimize the negative of the utility function.
We've also seen that we can think about what we're doing with modeling as conditional estimation.
So in a regression model, we're trying to estimate the conditional expectation: given a particular set of values for my input features X,
what's the expected value of Y? We might do some transformations to all these variables,
but we're trying to compute this conditional expectation function:
what's the expected value of Y conditioned on my feature values X? In classification,
we're trying to solve a conditional probability problem:
what's the probability of a particular outcome, given that I have some particular feature values X?
There's another thing in here that's useful to think about:
logistic regression at its heart is trying to model the probability of your data.
So what we've been doing with statsmodels is
we call the model's predict and we get some scores, and then we use the scores to make a decision, because internally,
the logistic regression is mathematically solving this problem of maximizing the log likelihood.
Mathematically, what the logistic regression is doing is trying to build a probabilistic model of the data, and the
parameters are estimated based on their ability to accurately model probabilities in your training data.
We then use these output probabilities to make decisions. So we'll say success if y-hat is greater than 0.5.
scikit-learn's logistic regression directly classifies by using a threshold of 0.5.
But you can get the underlying scores out of it with the decision function (in scikit-learn, decision_function returns log odds and predict_proba returns the estimated probabilities).
This is important to note.
The log likelihood that you get out of a logistic regression is not based on the actual decisions that it's making.
It's based on its ability to model probabilistically what the labels look like in your training data.
It's the probability that it assigns to those labels with the final fitted versions of the parameters.
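To make the distinction between probabilities and decisions concrete, here is a minimal sketch in plain Python rather than statsmodels or scikit-learn. The coefficients `intercept` and `slope` and the tiny dataset are made up for illustration; they do not come from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score (log odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters and toy data, chosen only for illustration.
intercept, slope = -1.0, 0.8
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

# Estimated probabilities P(y = 1 | x) from the linear score.
probs = [sigmoid(intercept + slope * x) for x in xs]

# Decisions come from thresholding the probabilities at 0.5.
decisions = [1 if p > 0.5 else 0 for p in probs]

# The log likelihood uses the probability assigned to each true label,
# not the thresholded decision.
log_lik = sum(math.log(p if y == 1 else 1.0 - p) for p, y in zip(probs, ys))
```

Here every decision matches its label, yet the log likelihood is still well below zero, because the model is not fully confident in any of them. Two models with identical decisions can therefore have different log likelihoods.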
I want to mention briefly again a trick that I mentioned, I believe, last week:
expected value and probability are closely related. The expected value is the integral or the sum of values weighted by their probabilities.
But also, if we have an indicator function I, which is one if
x is in the set and zero if it is not,
then basically, given a value, it decides whether or not it's in the set. If that set is an event, it says whether or not the event happened, and
the probability of the event and the expected value of the indicator function are the same thing.
So we can think about estimating conditional expectation and conditional probability, but we can also think about everything as estimating conditional expectation.
When we're estimating a probability, we're estimating the conditional expectation of the characteristic or indicator function.
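Written out for a discrete variable, the identity takes only a few lines. This is a sketch using notation assumed here ($I_A$ for the indicator of a set or event $A$), which may differ from the slides:

```latex
\begin{aligned}
I_A(x) &= \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases} \\
\mathrm{E}[I_A(X)] &= \sum_x I_A(x)\, \mathrm{P}(X = x)
  = \sum_{x \in A} \mathrm{P}(X = x) = \mathrm{P}(X \in A) \\
\mathrm{P}(Y = 1 \mid X = x) &= \mathrm{E}[\, I_{\{1\}}(Y) \mid X = x \,]
\end{aligned}
```

The last line is the classification case: the conditional probability of a positive label is the conditional expectation of the label's indicator.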
So to wrap up, we're building models that estimate conditional probability and expectation.
We've been doing this in a variety of ways. We use these models to make decisions.
This week we're gonna see more. So we've got the idea of doing the modeling. This week, we're looking more at how do we build inputs for these models?
And how do we evaluate the outputs that we get out of them?

🎥 Feature Transforms#

What are some useful techniques for engineering features in an application?

Video (21m3s)

Slides

In this video I want to talk with you about transforming features.
The learning outcomes are for you to be able to transform individual features, and also derive new features and combine features.
What we're talking about here applies to both classification and regression models.
We've seen a few different things we can do with features already,
just a little bit, such as dealing with categorical features by dummy coding them.
But I'm going to start by refreshing on some discrete feature transforms. If we have one feature and it's a discrete feature,
there's a few things we can do with it. This is not an exhaustive list. We can recode it.
We can rename codes: it might be that we've got the codes, but the names aren't very useful, so we want to rename them.
It might be that we want to merge codes, because some distinctions are irrelevant.
So, for example, on one of my datasets for completeness and being able to track coverage across each stage of the data integration pipeline,
there are four or five different ways a value can be unknown.
But when it comes to doing my final computations, I just care if it's unknown.
So I merge all of those codes into one unknown code. So that's one thing you might want to do:
merge some codes so that your model doesn't have as many different codes to
work with, because you don't care about the distinctions between some of them.
You may want to convert a value to a logical or a 0/1 numeric. That's a recoding
where you pick one value, or one or two values, that are going to be true, and everything else is going to be false.
You may also want to threshold values. For example, if you've got an ordinal,
maybe you have some ratings and you want to collapse that into a logical feature of "rated positively",
where if they gave it more than three out of five stars, you say it was rated positively.
This can be really useful because people are really noisy in their inputs; not everyone uses the rating scale the same way.
You can reduce some of that noise by saying, you know,
we don't care how good they said the movie was, we just care if they said it was good.
This is kind of what Rotten Tomatoes is doing with its percent fresh.
They take each rating and they convert it into whether the person rated it positively or not.
And they look at the fraction of critics who rated the movie positively, and that becomes a feature.
You can also dummy code your values. If you've got a categorical variable with more than two levels,
then you can expand that out into multiple features that dummy code, or one-hot encode, your variable.
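The discrete transforms above can be sketched in a few lines of plain Python. The code values and ratings here are invented for illustration, and the helper names (`merge_unknown`, `rated_positively`, `dummy_code`) are hypothetical, not from any library.

```python
# Hypothetical raw codes: several distinct ways a value can be "unknown".
UNKNOWN_CODES = {"missing", "not-recorded", "ambiguous", "unparseable"}

def merge_unknown(code):
    """Collapse every unknown-like code into a single 'unknown' level."""
    return "unknown" if code in UNKNOWN_CODES else code

def rated_positively(stars):
    """Threshold an ordinal 1-5 rating: more than 3 stars counts as positive."""
    return stars > 3

def dummy_code(values):
    """One-hot encode a categorical list: one 0/1 column per observed level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}

codes = ["fiction", "missing", "poetry", "ambiguous", "fiction"]
merged = [merge_unknown(c) for c in codes]
dummies = dummy_code(merged)

# Fraction of ratings that were positive, in the spirit of "percent fresh".
positive = [rated_positively(s) for s in [5, 3, 4, 1]]
fraction_positive = sum(positive) / len(positive)
```

This sketch keeps one column per level. In a linear model with an intercept, one level is usually dropped so the dummy columns are not perfectly collinear with the intercept.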
You can also do a number of things with continuous features. You can take a log. You can take a square root.
Both of these are useful for reducing skew. Sometimes you might want to take a square or a
higher-order power, building out more polynomial features:
you've got a feature, its square, and its cube, and that lets you learn more complex, nonlinear functions using a linear model.
You can also standardize and center variables, so they're what we call mean-centered.
If you take the mean of a feature and you subtract it from all the feature values,
the resulting mean is zero and the data is now mean-centered. You can also discretize it.
You can convert it into more than one bin, or you can threshold it to convert it to
a binary value based on whether it's above or below a threshold. But how do you think about when you want to do this?
There's a couple of things that particularly drive when we might want to transform features.
One is when the feature has a nonlinear relationship to our outcome variable.
Transformation of the feature and/or the outcome variable can make the relationship linear.
And now all of our linear modeling techniques work again. Another is when the feature is not normally distributed.
This isn't inherently a problem, but close-to-normally-distributed features often work better.
They're often more likely to have a linear relationship.
So if we have a feature that's very far from normal and there's a simple transformation that can make it normal,
that's often going to make it work better as a feature that's input into, particularly, a linear model.
But other kinds of models as well. So, for one example,
if we want to standardize variables, what we want to do is subtract the
mean feature value. That's going to make the new mean on the training data zero.
What this means is that if you've got this mean-centered variable and you use it as a feature in a linear model,
then when the value is average, you're just going to have the intercept, or the intercept
plus the other features, and the coefficient describes the change in the outcome
as the variable goes above or below average, rather than with respect to zero in its natural units.
Mean centering can also result in more interpretable intercepts,
because if all of your features are mean-centered, then the intercept of your linear model is the average value.
If your features aren't mean-centered, then the intercept is the average,
but corrected for the averages of your different features.
So mean centering
makes the model far more interpretable.
It's also useful for dealing with sparse data, because if you mean-center your values, then it's a lot more reasonable to treat missing values as zero.
It's still a form of imputation, but if you mean-center your values and you have something come in that doesn't have a value,
you can say, well, we don't know anything about it, so we're going to assume it's zero, which is to say we're going to assume it's average.
Since you've mean-centered the value, average means that feature plays no role in the outcome prediction.
That's also very important. So mean centering really gives you this way to allow missing data to not have an effect
in your model, because you're going to assume it's average, and the average value is zero, so that feature's term contributes nothing.
The outcome is going to be based completely on the values of the other features in the observation.
It's not a perfect solution for all systems. You can't just blindly assume it's going to work, but it is a really useful technique.
The other thing we do in a full standardization is divide by the standard deviation.
We have a value x_i in our input feature; we subtract the mean, divide by the standard deviation, and we get the transformed value.
Now the coefficients on this in a linear model are going to be in units of standard deviations.
So if x changes by one standard deviation, how much does that change the output?
If we standardize, this also makes our coefficients more directly comparable.
Because if all of our features are standardized, or all of our numeric features are standardized,
then all of the coefficients are in terms of standard deviations.
You can say if this feature moves by one standard deviation, or
if that feature moves by one standard deviation, and you don't have to deal with "this one was in millimeters and this one's in grams,
how do we think about the relative impact of these two features?" That's hard if they're in their natural units.
It's much easier if they're in terms of standard deviations.
And this applies to both inference and predictive modeling.
Now, it's important to note that the parameters here, the mean that we're going to shift by
and the scaling factor, are parameters that we learn from the training data.
And when we want to transform the test data, we need to transform them by the training data's parameters,
because effectively this is part of the training process. You learn:
okay, I've got my coefficients, but also I normalize this feature by subtracting seven and dividing by three.
Well, that's going to change a little bit with different training data.
So you treat these as parameters and use the same values for transforming your test data.
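A minimal sketch of that idea, using only the standard library: the shift and scale are learned once from training data and then reused unchanged on test data. The numbers are made up so that the learned parameters come out to the "subtract seven, divide by three" of the example; `fit_standardizer` and `standardize` are hypothetical helper names.

```python
import statistics

def fit_standardizer(train_values):
    """Learn the shift (mean) and scale (standard deviation) from training data."""
    mean = statistics.mean(train_values)
    sd = statistics.stdev(train_values)  # sample standard deviation
    return mean, sd

def standardize(values, mean, sd):
    """Apply previously learned parameters; never refit on the data passed here."""
    return [(v - mean) / sd for v in values]

train = [4.0, 7.0, 10.0]   # made-up training feature values
test = [7.0, 13.0]         # made-up test feature values

mean, sd = fit_standardizer(train)       # learned from training data only
train_z = standardize(train, mean, sd)
test_z = standardize(test, mean, sd)     # reuses the training parameters
```

A test value of 13 becomes 2.0, meaning two training standard deviations above the training mean. scikit-learn's `StandardScaler` follows the same fit-then-transform pattern, though it divides by the population rather than the sample standard deviation.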
I want to show you an example of why you might want to do a log transform, even for a binary outcome.
So here's what I've done. Here's our outcome variable,
and I've shown a box plot of our input variable as it changes for the two values of the output variable.
We've got some much larger values here. What's going to happen with these larger values?
Yes, it's useful that the mean or the median is higher,
so for higher values, it's more likely to be a one on the outcome.
But we have some very, very large values.
Say we get one of these values that's way up here at fifty.
You put that in your linear model, and on the off chance that it might be a zero,
it's going to push the model output so far
that it's impossible for any other features to do anything. These larger values just completely dominate your computation.
So if you log transform, you're going to significantly decrease the skew.
It's not going to make our distributions perfectly symmetric, but there's going to be substantially less skew.
There's going to be a substantially smaller range. These values are a lot more comparable to each other.
With the log, we are looking at a difference of two here rather than about eight.
And we don't have the massively large values: the top value here is only as large as four.
And so it really is going to make the values a lot better distributed.
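The compression can be seen on a handful of numbers. This sketch uses made-up, right-skewed values (the 50 echoes the extreme value in the example) and compares the range before and after a log and a square root transform.

```python
import math

# Made-up, right-skewed feature values.
values = [1.0, 2.0, 3.0, 5.0, 8.0, 50.0]

logged = [math.log(v) for v in values]   # requires strictly positive values
rooted = [math.sqrt(v) for v in values]

def value_range(xs):
    """Spread between the largest and smallest value."""
    return max(xs) - min(xs)

raw_range = value_range(values)    # 49.0
log_range = value_range(logged)    # about 3.9
sqrt_range = value_range(rooted)   # about 6.1

# Both transforms are monotonic, so the ordering of observations is preserved.
order_preserved = logged == sorted(logged) and rooted == sorted(rooted)
```

If the feature can be zero, as counts often can, `math.log1p` (the log of one plus the value) is one common way to keep the transform defined.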
If you've got heavily skewed data, it's worth trying both a log transform and a square root transform.
Depending on how precisely the data is skewed, one might work better than the other, but either way you decrease the skew and you decrease the range.
The values are a lot more contained. The extreme values are collapsed down to a much more manageable range.
But we haven't lost the fact that a difference in this value is going to correlate with a difference in outcome.
We haven't lost the ability to use it to distinguish. We've just condensed that ability down into a more reasonable range of values.
So the resulting mathematical model is going to be better behaved. Another example is discretizing.
Sometimes this might come from outside knowledge.
One of the hints I've given you in the assignment is that you might want to discretize term,
so that greater than or equal to two hundred and forty months is considered a real estate loan.
We know that's a reasonable thing to do from the reading that came with the data set, which explains what happens with the real estate loans.
You might also, as I said earlier, go from "greater than three stars" to "I liked it".
It can reduce the noise in the rating data. One sign that you might want to discretize
is if there's a nonlinear response with a sharp change. It might be that rather than
trying to fit a continuous model, you say: OK, there's this really sharp jump, a really sharp change at a particular point.
Let's turn that into a binary feature. Sometimes what you might want to do is just split the data in half,
so your median becomes your threshold. Or you might want to look for an inflection point or a sharp increase in the response curve of some other variable to it.
One thing to note, though, is that discretization can have very subtle effects on your model performance.
You have to be careful with it. Whatever you're measuring about your model,
make sure you measure it again after you change your discretization to see how the results change.
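Both kinds of threshold, one from domain knowledge and one from the data, can be sketched briefly. The loan terms are invented, and `is_real_estate` and `median_split` are hypothetical helper names; the 240-month cutoff is the one from the assignment hint.

```python
import statistics

def is_real_estate(term_months):
    """Domain-knowledge threshold from the assignment hint: term >= 240 months."""
    return term_months >= 240

def median_split(values):
    """Binary feature: 1 if the value is above the median, else 0."""
    threshold = statistics.median(values)
    return threshold, [1 if v > threshold else 0 for v in values]

terms = [60, 84, 120, 240, 300]   # made-up loan terms in months
real_estate = [is_real_estate(t) for t in terms]
threshold, above = median_split(terms)
```

A data-driven threshold such as the median is a learned parameter, so it would be computed on the training data and then reused as-is on tuning and test data.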
There are other transformations you can do, too. A Box-Cox
or power transformation learns a monotonic function of the data to transform your points into something that's close to normal.
Effectively, you learn the parameters of this transformation by optimizing an objective function
that minimizes the distance from normality of the distribution of the resulting data.
scikit-learn gives you methods for doing Box-Cox and other power transformations. Spline functions
allow you to learn complex functions of a single variable that don't have to be monotonic.
We're not going to touch on them. I just want you to know that they exist.
But also, sometimes we're going to need to deal with multi-feature normalization, where we have a group of related features.
It might be that some of our features are counts of different tags,
like how often users have tagged the item with different tags.
Sometimes it's going to be useful to normalize those together so that either they sum to one or they form a unit vector,
where the sum of squares is equal to one. For the unit vector,
it makes it so that all of them are on a unit hypersphere, so you can compute similarities between them easily,
and distances become more normalized. Say it's word counts or tag counts:
you've got ten different tags and you allow users to tag it
with those tags, or you have the Facebook emoji responses and users can respond with five or six different emojis,
and your features are how often they've responded with each of these emojis.
Well, there are two components of each of those features.
One component is how many people have interacted with this. If a million people interact with one message
and a thousand people interact with another message, or ten people interact with another message,
and the same fraction of them use the wow emoji,
the feature values are going to be dominated by the popularity, not by
how much it's wow versus how much it's care or how much
it's cry.
And so if you make them sum to one, or if you turn them into a unit vector, then these features are no longer proxies for popularity.
They are specifically what fraction of the interactions are wow or cry or heart or care.
You probably also want to keep popularity as a feature, and
there's a good chance you want to take the log of it. But it really changes the meaning of the feature:
how many cares is a different feature from what fraction of responses were care.
Depending on your modeling task, one of those might be a more useful feature than the other.
And so you get this multi-feature normalization that you want to do together.
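As a sketch of normalizing a group of count features together, the following uses invented reaction counts for one very popular message and one niche message with the same mix of reactions. `to_fractions` and `to_unit_vector` are hypothetical helper names.

```python
import math

def to_fractions(counts):
    """L1-normalize: divide by the total so the values sum to one."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def to_unit_vector(counts):
    """L2-normalize: divide by the Euclidean norm so the squares sum to one."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else [0.0] * len(counts)

# Made-up reaction counts in the order (wow, cry, heart, care).
popular = [200_000, 100_000, 600_000, 100_000]   # a million reactions
niche = [2, 1, 6, 1]                             # ten reactions

# The raw counts differ by five orders of magnitude; the fractions match.
popular_frac = to_fractions(popular)
niche_frac = to_fractions(niche)
popular_unit = to_unit_vector(popular)

# Popularity is kept as its own feature, on a log scale.
popular_log_total = math.log10(sum(popular))
niche_log_total = math.log10(sum(niche))
```

After normalization the two messages have identical reaction profiles, and the difference in audience size lives only in the separate log-total feature.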
Another thing you can do with multiple features is an interaction term, which we've seen: a product.
You can combine the effects of two features by multiplying them together. If they're numeric, then it's the product of them.
If one of them is logical, or it's the dummy for a categorical, then effectively what it does is turn the other feature on or off.
It gives you an "if" in your linear model.
So if x1 is logical, β12 is the additional influence of x2 when x1 is equal to one.
But if x1 is equal to zero, it just uses the default term,
β2 x2. The effects are additive.
If we have β2 x2 and we have β12 x1 x2,
and x1 is equal to one, then the total coefficient
on x2 is β2 plus β12.
And that's how you'd interpret it when you go to interpret the coefficients.
This is a really useful way to allow a feature to have an
additional linear effect for some of your data points and not for others.
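The interpretation of the interaction coefficient can be checked numerically. This sketch assumes a model of the form b0 + b1·x1 + b2·x2 + b12·x1·x2 with x1 as a 0/1 feature; the coefficient values are hypothetical.

```python
def predict(x1, x2, b0, b1, b2, b12):
    """Linear model with an interaction: b0 + b1*x1 + b2*x2 + b12*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

def slope_on_x2(x1, b2, b12):
    """Effective coefficient on x2 for a given value of the 0/1 feature x1."""
    return b2 + b12 * x1

# Hypothetical coefficients, for illustration only.
b0, b1, b2, b12 = 1.0, 0.5, 2.0, 3.0

slope_off = slope_on_x2(0, b2, b12)   # just b2
slope_on = slope_on_x2(1, b2, b12)    # b2 + b12

# A one-unit increase in x2 changes the prediction by exactly that slope.
delta_off = predict(0, 4.0, b0, b1, b2, b12) - predict(0, 3.0, b0, b1, b2, b12)
delta_on = predict(1, 4.0, b0, b1, b2, b12) - predict(1, 3.0, b0, b1, b2, b12)
```

With these numbers, x2 has a slope of 2 when x1 is off and 5 when x1 is on, which is the "if" behavior described above.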
Another thing you can do is compute a ratio or a fraction.
One example of this is if we're trying to model something that happens when students are in a class. Again, this is a useful thing to do
when you've got something that's dominated by popularity. You have to be careful with counts.
Counts from user activity often become really heavily skewed.
And if you've got anything that's a count of something, it's going to be dominated by popularity.
Oftentimes it's going to be more effective to separate popularity as its own feature
from how much of that popularity is being allocated to different things in your model.
So if x_a is the number of students in a class and x_f is the number of first-year students in a class,
and we want to understand something about
the impact of a class or the dynamics of a class based on how many first-years are taking it,
well, a small class is going to have a small number of first-years and a small number of students.
So we might take a ratio or proportion.
We might take x_ff, for fraction first-year, which is the number of first-years divided by the total students.
That's going to give us the fraction of students who are first-years. And that allows us to build a model that, rather than
having two proxies for popularity, number of students and number of first-years,
allows us to separate what is the influence of having a lot of students in the class, as a separate component of our predictive
modeling, from what is the influence of having a large portion of the class being first-year students.
It also reduces our collinearity, because x_a and x_f are going to be quite highly correlated,
but x_a and x_ff probably won't be.
At least, they likely won't be unless larger classes are more likely to have a higher fraction of first-years.
So it's another piece of the toolbag. In Assignment 5,
I give you a suggestion to possibly consider computing a ratio or a fraction feature.
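The collinearity point can be illustrated with a few invented classes. The sketch computes the fraction feature and compares its correlation with class size against the correlation of the raw first-year count; the data and the `pearson` helper are assumptions for illustration, and real data may behave differently.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, written out to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up class sizes (x_a) and first-year counts (x_f).
total_students = [10, 20, 40, 80, 160]
first_years = [3, 4, 14, 20, 52]

# x_ff: the fraction of each class that is first-year students.
frac_first_year = [f / a for f, a in zip(first_years, total_students)]

corr_counts = pearson(total_students, first_years)
corr_fraction = pearson(total_students, frac_first_year)
```

In this toy data the two counts are almost perfectly correlated, while the fraction is only weakly related to class size, so the model can estimate a size effect and a composition effect separately.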
There are other combinations we can do too, such as the difference. If you have a list of actions, say
forum posts or account transactions, and you take the activity date minus the account's creation date,
then what you get is the account age at the time of activity. And if you're trying to classify something, or you're trying to understand
whether this is more likely to happen with new accounts, it allows the feature, rather than being "when did the thing happen",
to be "how established was the account when it happened".
You then might want to discretize: is this account at least a year old?
It might be that established accounts are going to have different behavior than new accounts.
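A difference feature followed by a discretization might look like the following sketch. The dates are invented, the helper names are hypothetical, and "a year" is approximated as 365 days.

```python
from datetime import date

def account_age_days(activity_date, created_date):
    """Difference feature: how old the account was when the activity happened."""
    return (activity_date - created_date).days

def is_established(age_days, min_days=365):
    """Discretized version: was the account at least (about) a year old?"""
    return age_days >= min_days

# Made-up records: (account creation date, activity date).
records = [
    (date(2020, 1, 15), date(2022, 3, 1)),
    (date(2022, 2, 20), date(2022, 3, 1)),
]

ages = [account_age_days(activity, created) for created, activity in records]
established = [is_established(a) for a in ages]
```

Both activities happened on the same day, but the derived features separate a long-standing account from one created nine days earlier.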
You can also combine these
with single-feature transforms, so you can do a product where one of the features is logged or square-rooted or whatever.
So you can combine these things in arbitrary ways to get the final set of features that you need.
Actually figuring out which of these you need to do takes practice and creativity.
I've given you a few hints. Look to try to make things normal. Look to try to build linear relationships.
Look for hard jumps: if you plot an x value against a y response and you see a jump at some point,
that suggests discretization might be useful. One thing that's super useful, though, is to read other data
scientists working in your domain and working in other domains. If you're doing research, you should be reading papers.
Pay attention to what they do in their feature engineering. What features do they pick?
Why do they pick them? That can give you a lot of good ideas for what to go do when you're doing your own projects.
Also, it's important to note that you do all this feature exploration and design on the training data.
You don't get to look at your testing data while you're doing your feature exploration and development.
So to wrap up: transforming and building features is a really important and powerful part of model building.
One of the things some deep learning models can do is some of their own feature engineering:
they can work with raw features and learn sophisticated functions of them somewhat on their own.
That works very well for some domains, but for a lot of simpler models,
the model can only work with the features you give it.
We're going to see some techniques for automatic feature selection, but they can't generate new features.
If you give them a product, they can decide whether or not it's actually helping the model.
But they can't create the product if you didn't give them the product feature to start with.
So it's important to get your features right and to give your model a good set of features to work with,
even when you're doing automated feature selection.

🎥 Workflow#

How do you do feature engineering and model selection in a machine learning workflow? What is the iterative process involved?

Video (14m29s)

Slides

CS 533 Intro to Data Science, Michael Ekstrand
Workflow and Iteration

Learning Outcomes
Properly split training, tuning, and evaluation data.
Understand what is and is not “cheating” for evaluating a predictive model.
Photo by Tam DV on Unsplash

Split the Testing Data
Diagram labels: The Data; Training Data; Testing Data; One Way; Train model; Experiment with different model designs; Experiment with different features; Select hyperparameters; Evaluate Effectiveness; Refine

Motivation
Purpose: build models that can process new data
Eval goal: simulate model processing new data
Method: hide some data and pretend it’s new
Violation: allowing “new” data to affect the model design

Iterative Modeling
Work with the train data:
Exploratory analysis
Try different features and transforms
Try different hyperparameters
Try different models (logistic, random forest, etc.)
Test effectiveness with tuning set (another test set held out from the training data)
Or cross-validation (e.g. LogisticRegressionCV)

Applying to Test Data
Apply feature transforms / combos to test data
Otherwise the model won’t work
Apply them, but don’t use test data to assess if they’re useful
If you transform target in train, do that in test too!
Use trained model to predict test data
Measure accuracy / precision / whatever
Outcome of model process: one model (or one from each family) to evaluate for effectiveness.

Dos and Don’ts
Do:
Split training data into further subsets (tune data) to test model concepts
Iteratively refine model’s predictive quality w/ tuning data
Explore and test features on training data
Don’t:
Go back to fix the model if it performs poorly on test data
Use test data to inform model or feature decisions

Production Systems
Production systems often have new streams of test data.
New data arrives tomorrow!
Knowledge from today’s test data can be used for tomorrow’s modeling.

Carrying Knowledge Forward
Use what you learned on your test data for the next project
May have new data ⇒ no problem
Same data set ⇒ technical violation; less problematic w/ new test sample
Over-reuse of data is a problem, ramifications not fully known.

Wrapping Up
Train/test splits are to help us test the ability of a model to predict future, unseen data.
Using test-data knowledge to inform modeling decisions breaks that down.
Photo by Alexandre Debiève on Unsplash

In this video, I want to talk more about the workflow and the iterative process of model building and refinement.
We'll talk about how to properly split training, tuning, and evaluation data,
and understand better what is and is not cheating for evaluating a predictive model.
So here's our setup. We split off our testing data: we have our main dataset, all the data,
and we split it into training data and testing data. Then on our training data, we train our model.
We experiment with different model designs and different features. We select hyperparameters.
We can do this based on the model's internal goodness-of-fit statistics.
If you're training a linear regression model, you can be looking at your R-squared
or your adjusted R-squared. You can be looking at your AIC. For a logistic regression model,
you can be looking at your log likelihood. Or you can do it by running a classifier evaluation metric on some tuning data,
so you further subdivide your training data into train and tune.
Or you may do cross-validation, where you split your training data into five or ten pieces,
and for each piece you train on the rest of the data, predict that piece, and measure your metric.
You can do all of these things, basically, so long as you don't touch
your testing data. You can do whatever you want with your training data to improve and understand your model.
Well, not all things are reasonable to do, but you're not cheating with whatever you do there in your training data.
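The splitting scheme can be sketched with the standard library alone. The data here is just a list of integers standing in for observations, and `split` and `k_folds` are hypothetical helpers written for this sketch; in practice a library routine such as scikit-learn's `train_test_split` would typically do this job.

```python
import random

def split(items, test_fraction, seed):
    """Shuffle a copy of the data and split off a held-out fraction."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_held_out = int(round(len(shuffled) * test_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]

def k_folds(items, k):
    """Yield (train_part, held_out_part) pairs for k-fold cross-validation."""
    for i in range(k):
        held_out = items[i::k]
        rest = [x for j, x in enumerate(items) if j % k != i]
        yield rest, held_out

data = list(range(100))   # stand-in for 100 observations

# Split once; the test set is then left alone until the final evaluation.
train, test = split(data, test_fraction=0.2, seed=42)

# Option 1: carve a tuning set out of the training data only.
fit_part, tune = split(train, test_fraction=0.25, seed=43)

# Option 2: 5-fold cross-validation, again using the training data only.
folds = list(k_folds(train, k=5))
```

Every modeling decision would be made by fitting on `fit_part` and scoring on `tune`, or by averaging a metric across `folds`. The `test` list appears nowhere in that loop.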
What you can't do is use knowledge from the testing data to refine
your modeling process, and this includes exploratory analysis of the testing data.
The motivation of what we're trying to do with this predictive
modeling is to build models that are going to be able to process new data.
So predicting our testing data isn't the point. If you're training something to detect
fraudulent transactions in your online gaming platform,
the purpose of your model is never to predict the fraudulent transactions in your historical data.
The purpose of the model is to be able to run it and, as new transactions happen, categorize them as likely fraud or not.
And so the goal of our evaluation is to simulate the model's ability to generalize to new data that it hasn't seen yet.
And the way we do this is we hide some of the data and pretend it's new.
As soon as you let this data that's supposed to be new influence your modeling,
when you're simulating what's going to happen if you run this for a week and try to classify the new transactions,
what you're doing is giving the model, or the modeling process, data that it's not going to be allowed to have in real life.
We call this leakage. Information leaks into the model than its actual application.
It's not going to be able to have in some ways, it's the opposite of the problem that we have when we're trying to give you tests in class and tests.
We say you can't have a textbook, you can't have notes, you can't use the Internet, answer these questions.
But in real life, you can use all of the reference material you want.
Anytime you want have to actually solve that problem. In practice, there's still value in internalizing.
A lot of it's that you can detect when you because you need if you haven't internalized a lot
of the knowledge that it's hard to detect when you're going to go need to look something up.
If you don't know that overfitting is a problem, then you don't know when you need to go read more about Overfitting.
You just don't even think about it.
But when it comes to actually doing things about things, you have all these resources available in machine learning.
We have the opposite problem. Because in real life,
the model is not going to have access to the test data because you're trying to use it to classify new transactions as they come in.
You're trying to use it to predict the purchasing behavior of users as they come in.
You're trying to use it to forecast the load that's going to be on your power grid or on your transportation network for a time in the future.
And you don't get to look ahead and see any of that information. So in the real world,
your model does not have access to any information about what it's trying to predict other than what it can learn from historical data.
And so if you do anything with your test data that leaks information about the unknown
it's supposed to be predicting into your model building process,
either learning the model itself or the process of figuring out what features, hyperparameters, values, and whatever are going to be useful for your model,
then you effectively allow the model to cheat.
It's going to get better performance than it actually will in reality,
and you reduce the ability of the evaluation process to simulate what you actually care about.
Can my model effectively predict how much traffic is going to be on the freeway in December,
based on previous Decembers and on the data from earlier in the year? Let's say we've got 10 years of traffic data.
Can I accurately predict what the freeway data is going to be this December?
You don't get to look at this December. If you do, I think the physics department would like to have a word with you.
So within this setup, we have an iterative modeling process.
With our training data, we can do exploratory analysis. We can try features and transforms.
We can try different hyperparameters. Speaking of hyperparameters:
the parameters are what we learn in the model. Your logistic regression coefficients, those are parameters.
We learn them from the data. Hyperparameters are additional values that control how the model learning process works.
We can try different models, like a logistic regression or a random forest.
We can test effectiveness with the tuning set, so we can take our training set and split it into tuning and real training.
We can do cross-validation, as I talked about, where we split into many separate pieces.
Some of the scikit-learn models have built-in selection for some of their hyperparameters using cross-validation.
Once you see regularization,
you can tell LogisticRegressionCV to automatically find the regularization strength using cross-validation on your training data.
Then, though, you need to apply it to your test data. And a couple of things here.
First, you need to apply your feature transformations and combinations to your test data.
You have to apply them to the test data because your model, your linear model or whatever, is built on
these transformed features, these combined features, all of your feature engineering.
The results of that are what the model is trained on. If you just try to apply the model to the raw data, it's not going to work.
It's not going to have the features it needs. But the difference is you apply the feature transformations to the test data,
but you don't use the test data to assess which feature transformations are useful.
You did all of that in the training data. And you take that as a pattern or a recipe.
I'm going to show you in a future video how you can chain scikit-learn pipelines together to automate some of this.
If you aren't using scikit-learn, you might write a function that, given raw data,
will return transformed features, or will return data with the final set of features.
That's a very good design as well. You just take this as a pre-canned recipe and you apply it to your test data.
Then you run the model, predict the test outcome data, and you measure accuracy, precision, area under the ROC curve,
whatever measure of your model's effectiveness you're going to use, on those results.
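The "recipe" idea above can be sketched as one plain function that is applied unchanged to both training and test data, followed by a one-time evaluation. The function `build_features`, the made-up columns (`amount`, `age`), and the data generator are all hypothetical, chosen only to make the example runnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

def build_features(raw):
    """Hypothetical recipe: raw columns in, final model features out."""
    amount, age = raw[:, 0], raw[:, 1]
    log_amount = np.log1p(amount)
    return np.column_stack([log_amount, age, log_amount * age])

rng = np.random.default_rng(0)

def make_raw(n):
    # Made-up data standing in for historical records.
    amount = rng.exponential(scale=100.0, size=n)
    age = rng.uniform(0, 5, size=n)
    score = 0.8 * np.log1p(amount) - 0.5 * age - 2.5
    y = (rng.uniform(size=n) < 1 / (1 + np.exp(-score))).astype(int)
    return np.column_stack([amount, age]), y

raw_train, y_train = make_raw(800)
raw_test, y_test = make_raw(200)

# The recipe was designed on the training data; the model is fit on its output.
model = LogisticRegression(max_iter=1000).fit(build_features(raw_train), y_train)

# The same recipe is applied to the test data, and we measure once.
test_features = build_features(raw_test)
pred = model.predict(test_features)
prob = model.predict_proba(test_features)[:, 1]
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred, zero_division=0)
auc = roc_auc_score(y_test, prob)
print(f"accuracy={acc:.3f} precision={prec:.3f} AUC={auc:.3f}")
```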
The outcome of the iterative modeling process in the preceding slide, though,
is one model, or possibly one model from each of three or so different families,
that you want to finally evaluate for effectiveness using the test data.
So, a few dos, to kind of synthesize what I've been talking about here.
It's fine to split the training data into smaller, additional subsets.
You can do train-test things within your training process, as an iterative process, to figure out:
does this feature, does this feature transformation, give me a more accurate classifier?
Do a train-tune split. Add the feature. Measure the accuracy on the tuning data.
Does it help? Does it not? Does a square root give me better classifications, or does a log give me better classifications?
Do that on splits of your training data and leave the test data alone.
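A minimal sketch of the square-root-versus-log comparison on a train/tune split might look like this. The data is invented (one skewed positive feature and a binary outcome), and the names `candidates` and `tune_accuracy` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Made-up *training* data: one skewed, positive feature and a binary outcome.
x = rng.lognormal(mean=3.0, sigma=1.0, size=1000)
p = 1 / (1 + np.exp(-1.5 * (np.log(x) - 3.0)))
y = (rng.uniform(size=1000) < p).astype(int)

# Split the training data into fitting and tuning portions.
x_fit, x_tune, y_fit, y_tune = train_test_split(x, y, test_size=0.25, random_state=1)

# Candidate transformations to compare on the tuning data.
candidates = {"raw": lambda v: v, "sqrt": np.sqrt, "log": np.log}
tune_accuracy = {}
for name, transform in candidates.items():
    model = LogisticRegression(max_iter=1000)
    model.fit(transform(x_fit).reshape(-1, 1), y_fit)
    tune_accuracy[name] = model.score(transform(x_tune).reshape(-1, 1), y_tune)

for name, acc in tune_accuracy.items():
    print(f"{name:>4}: tuning accuracy = {acc:.3f}")
```

Whichever transform looks best here, the decision was made without touching any test data.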
Go put it on the shelf, lock it in the cupboard, whatever you're going to do with it. You can iteratively refine the model's predictive quality.
You can explore and test all of your features, as I said, using cross-validation or
using train-tune splits of your training data.
That allows you to use predictive accuracy as part of your decision for what features to include, how to transform them,
how to construct new combinations, etc. Now the don'ts: once you take your model and run it on your test data,
you don't get to go back and fix the model if it performs poorly on the test data.
That's what you need the tuning data for,
to do all of those fixes, because as soon as you say, oh, it didn't perform well on the testing data, let me go back and fix something,
then you're giving your model development process access to information it doesn't have in reality.
And testing on that test data is no longer a reliable test of what's going to happen when your model meets new data in the field.
You also can't use the test data to inform model or feature decisions, at least within the scope of one project.
Now, you've got your project, you run on your test data, and you're going to learn things from that.
You're going to publish a paper,
if you're doing this for graduate research, and the results of that learning you're going to carry into the next project.
Arguably, that can induce a little bit of leakage, because you or someone else is going to use those results, and they might work on the same dataset
or a different dataset. Arguably, it's a little bit of leakage if they read your paper on your test data and say:
OK, we have these results on the test data; I'm going to make a new train-test split of the same dataset and I want to do things.
Arguably, we have some leakage. We can't plug all the leaks. The goal is not to be perfect.
The goal is to have a good and credible emulation of the actual production environment for what
we're trying to do, so that we have an effective test of our model's ability to do its job.
And its job is almost never to classify preexisting data.
Predicting the test data is how we study the model's effectiveness,
but that's not how we deploy the model to improve our lives and improve our businesses.
Production systems often have new streams of test data coming in every day.
If you're running an online
shopping site, if you are monitoring quality control processes in a chip fab, you've got new test data:
things keep running next week, next month. And so you can use knowledge from today's test data.
So you run things. You predict the month of October. You predict last December.
You're doing these tests of your model's effectiveness. You run it.
You're predicting this coming December. It's fine to use what you learn about predicting this December for next December.
There's no cheating. You aren't seeing next December.
You're using the year-over-year, month-over-month, etc. trends to be able to predict what's going to happen next December.
So in the industrial setting, because we can continually acquire new test data,
some of the iterative problems that happen in academic research with a static dataset are significantly less of a problem.
OK, so I learned something from this project. What am I going to do in the next
project? Well, in the next project you have new test data, because your chip plant has been running for another two months.
You can use some of that data along with what you learned from the data you captured before. That isn't cheating at all.
It's a problem in academic research, where we have a static dataset.
The MovieLens data, the TREC data, MNIST, pick your dataset.
We're all working with test sets on the same dataset. I read your paper,
and that's effectively a form of leakage. That's not a hole, as I said, that we can completely plug.
So we carry knowledge forward, and we use what we learn from our test data for the next
project, in production with new test data that comes from new runs of the system.
As I said, it's technically a violation. But if we pick a new test sample, it's less of a problem.
If we're all using the same test set, then we have a real problem.
But we don't often have a choice when we're trying to do academic research with these datasets.
So to wrap up: train-test splits are to help us test the model's ability to predict future unseen data that it didn't have a chance to learn from.
What we've been doing here with this predictive modeling is machine learning: the machine, the system, is learning about the data
so we can generate predictions for future data. And we test it by giving it data it hasn't been able to see. Using test data knowledge going back
is like making a loop: using knowledge from our test data to inform our modeling decisions breaks down that barrier.
And it means testing on our test data is no longer an effective test of the model's ability to generalize to data
it wasn't able to learn from. This isn't a problem within your training process.
With the train-tune split, it's OK to try something: train on the training data, test it on the tuning data,
go back, and keep doing that with the same tuning data, because we're not using the
tuning data results as the conclusive evidence of our model's effectiveness.
We're just using them for our own internal debugging, to see whether this model works better.
But once we do that with the test data,
we're not allowed to go back, because otherwise we're optimizing for our ability to predict that specific set of test data.
We don't care about predicting that specific set of test data for its own sake; what we want is for predicting that test
data to be representative of predicting data that we haven't been able to see yet.

🎥 SciKit Pipelines#

In this video, I introduce SciKit pipelines that put multiple transformations together.

Video (7m19s)

Slides

In this video, I'm going to introduce scikit-learn transformers and pipelines, which are going to allow
you to put your feature transformation and your modeling process into one pipeline that's
reproducible across your training and your test data. The learning objective is for you to
be able to use a scikit-learn pipeline to combine feature transforms and prediction.
So our data processing often takes the form, conceptually, of a pipeline.
We're going to transform some features, then we're going to fit a model.
And then in prediction, we need to transform the features and generate the predictions.
In both of these steps you have the transformation, and the transformation may have parameters. For example, in the standardization that we talked about,
we have to learn the mean and the standard deviation that we're going to subtract and scale by in order to do the transformation.
And in the test data, we want to transform by the training data properties, not the test data properties, for two reasons.
One, they might be different. Two, in actual production,
you don't necessarily have a whole batch of test data.
If you've got a new user coming to your online shop and you want to
predict which of the three specials they're most likely to be interested in,
you just have this one customer coming, and you need to be able to transform their features to put them in the model.
The way you do that is you use the transformation parameters that you learned from the training data.
So scikit-learn has an object called a Pipeline that allows us to create a sequence of models.
Typically this is one or more transformer algorithms or models, followed finally by a regressor or a classifier or some other kind of model
to produce the output.
There are other things you can put in the middle, like matrix decompositions and other things like that, that we're going to see a little bit of later.
But if you put this into a pipeline and then you fit the pipeline (the pipeline exposes the fit method),
it will fit its inner models and it will transform the data through them in sequence.
So if you've got a pipeline,
and it has a transform, and it has a classifier,
when you call fit, what it's going to do is, one, fit the transform to the data;
two, transform the data using the parameters it just fit;
and three, fit the classifier on the transformed data.
So it automates this process of managing your data pipelines.
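The three steps just listed can be sketched with a two-stage pipeline. The step names (`"standardize"`, `"classify"`) and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("standardize", StandardScaler()),    # transformer stage
    ("classify", LogisticRegression()),   # final model stage
])

# fit: (1) fit the scaler on X_train, (2) transform X_train with it,
# (3) fit the classifier on the transformed training data.
pipe.fit(X_train, y_train)

# predict: transform X_test with the *training* mean and scale, then classify.
test_pred = pipe.predict(X_test)
print("test accuracy:", pipe.score(X_test, y_test))
```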
So to talk a little bit about transformers: the scikit-learn use case we've seen so far is that we train something on data with fit.
We give it our input features, we give it our output class, and then we generate predictions with predict. Transformers
add another method, another function, to this paradigm:
transform. Some classes can do both transformation and prediction, but transform returns a copy of your input data with the features adjusted.
So for the standardization transformer,
what fit does is it computes x-bar and s, and then transform
returns x minus x-bar, over s.
And it does this separately for each column:
for each of your input features, it's going to learn a separate mean and a separate scale.
That's what fit and transform do.
So if you fit the transformer and then you transform the data, you can then use the transformed data as input to the next stage in the pipeline,
another transformer or your final classification or regression model.
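To make fit and transform concrete, here is a small sketch with `StandardScaler`: the means and scales are learned from a tiny made-up training array and then applied to a single new row, as would happen for one incoming customer. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up training data: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
new_row = np.array([[2.5, 1000.0]])  # a single new observation

# fit learns one mean and one scale per column, from the training data only.
scaler = StandardScaler().fit(train)
print("means: ", scaler.mean_)
print("scales:", scaler.scale_)

# transform applies (x - mean) / scale using those training parameters.
scaled = scaler.transform(new_row)

# The same computation by hand. StandardScaler uses the population standard
# deviation (ddof=0), which is also NumPy's default for std.
manual = (new_row - train.mean(axis=0)) / train.std(axis=0)
print(scaled, manual)
```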
If you want to transform your columns differently: if you have a transformer, it's going to apply to every column in the dataset. ColumnTransformer
allows you to apply different transformations to different columns in your input data.
It's also one of the few scikit-learn classes that actually knows about Pandas data frames.
And so you give it a list of triples, (name, transformer, columns) triples,
and it will learn this transformer for these columns and that transformer for those columns.
And then there's a remainder option, where you can give either a transformer to apply to all of the remaining columns, or drop, or there are some other options as well.
And so what you can do is, if you've got, say, three different categorical columns you want to do something to,
and you have a number of numerics, you can say:
OK, here's one transformer for one of the categorical columns, a transformer
for another one, and then for the remainder, just standardize all my numeric variables.
It lets you do that conveniently. I'm going to refer you to the documentation;
I've got links to the documentation in the notes for this week.
I'm going to refer you to that to learn more about how to apply column transformers, but they allow you to transform columns differently.
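A small sketch of the (name, transformer, columns) triples and the remainder option, using an invented data frame; the column names (`plan`, `region`, `visits`, `spend`) are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data frame: two categorical columns and two numeric columns.
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "team"],
    "region": ["us", "eu", "eu", "us"],
    "visits": [3.0, 10.0, 1.0, 7.0],
    "spend": [0.0, 25.0, 0.0, 60.0],
})

columns = ColumnTransformer(
    [
        # (name, transformer, columns) triples
        ("plan", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
        ("region", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    remainder=StandardScaler(),  # standardize every column not listed above
    sparse_threshold=0,          # always return a dense array
)

features = columns.fit_transform(df)
# 3 plan dummies + 2 region dummies + 2 standardized numeric columns
print(features.shape)
```

A `ColumnTransformer` like this can itself be the first stage of a `Pipeline`, so the same column-wise recipe is reused on test data.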
Some of the useful transformers that scikit-learn gives you: StandardScaler standardizes variables.
There's a PowerTransformer that does Box-Cox style power transformations. Binarizer
converts numeric data to zero/one by applying a threshold: it's one if the value is greater than the threshold. OneHotEncoder
will take a categorical variable. A transformer is not limited to just returning one output column;
it can expand a column into multiple columns. So OneHotEncoder
will take your categorical column and it will return multiple columns
by dummy-encoding the categorical variable. And then there's FunctionTransformer:
you can give it an arbitrary function, and it will use that to transform the data.
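Three of the transformers named above can be tried on tiny made-up arrays; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, FunctionTransformer, OneHotEncoder

amounts = np.array([[0.5], [3.0], [12.0]])
colors = np.array([["red"], ["green"], ["red"]])

# Binarizer: 1 where the value is greater than the threshold, else 0.
flags = Binarizer(threshold=1.0).fit_transform(amounts)

# OneHotEncoder: one output column per category (categories are sorted,
# so the columns here are "green", "red"). The default output is sparse.
dummies = OneHotEncoder().fit_transform(colors).toarray()

# FunctionTransformer: wrap an arbitrary function as a transformer.
logged = FunctionTransformer(np.log1p).fit_transform(amounts)

print(flags.ravel())
print(dummies)
print(logged.ravel())
```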
The transformers, though, only apply to features. If you also need to transform your outcome variable,
the TransformedTargetRegressor class is what you need to use, and it does not go into a pipeline as a transform step.
You could use it as the last stage of a pipeline,
but it wraps an underlying predictor.
You pass a predictor and a transformer in its constructor parameters, and it transforms the target before calling the fit method of that predictor.
And then when you call predict, it un-transforms the results, so you get the results back out in the original scale.
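A minimal sketch of wrapping a regressor so that the target is log-transformed for fitting and un-transformed for prediction. The data is constructed so that log(y) is exactly linear in x; the `func`/`inverse_func` pair is one way to supply the transformation.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Made-up data where log(y) is exactly linear in x.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(0.3 * X[:, 0] + 1.0)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),  # the wrapped, underlying predictor
    func=np.log,                   # applied to y before fitting
    inverse_func=np.exp,           # applied to predictions afterwards
)
model.fit(X, y)

pred = model.predict(X)                 # back on the original scale of y
slope = model.regressor_.coef_[0]       # fitted on the log scale
print(slope, pred[:3], y[:3])
```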
So to wrap up: pipelines let us combine multiple data steps into a single operation.
One of the things this is really useful for is being able to apply your training data transforms to your test data.
You fit the whole pipeline, and the transforms are going to learn their parameters from the training data.
You then go apply them to the test data and it just does the right thing for you automatically.
Now, one thing you have to do throughout your work with scikit-learn is pay very close attention to defaults.
The defaults are not always what you expect.
You need to pay close attention to them in order to make sure that the model is doing exactly what you think it's doing.

📃 SciKit Learn Pipelines#

Read the SciKit-Learn User Guide chapter on pipelines.

📃 SciKit Learn Preprocessing#

Read the SciKit-Learn User Guide chapter on pre-processing.

🎥 Regularization#

This video introduces regularization: ridge regression, lasso regression, and the elasticnet. Lasso regression can help with (semi-)automatic feature selection.

Video (15m4s)

Slides

Now it's time for a topic that I've mentioned a few times, and we're actually going to learn what it is: regularization.
So the goal here is for you to understand the function of a regularization
term in a loss function, apply regularization to your logistic regression models, and then finally tune regularization parameters.
I want to start by reviewing multicollinearity. So remember that if we have correlated predictors, that can cause poor model fit.
So if we've got x1 and x2 and they cause y, and we've got this correlation between them,
we don't know particularly where the common effect goes.
So we can factor this out as x12 plus x1 plus x2,
except we don't actually have x12. It's hidden behind the wall.
Where does its value go in the coefficients?
Does it go on x1, does it go on x2, do you split it between them?
The linear model itself has no way to determine where the common component should actually be allocated.
And so one way we can deal with this and several other problems is by introducing what we call regularization.
So rather than just solving the problem "minimize the loss function",
where, if this is a linear regression, the loss might be squared loss, the sum of squared errors,
or it might be negative log likelihood (log likelihood is a utility function and its values are negative,
so the negative log likelihood is a positive value and works as a loss function; you want to minimize your negative log likelihood),
what we do is we add to that another term, which we call the regularization term.
And all it is, is a parameter, the regularization strength, times
the magnitude of our parameters. The loss function now has two terms: the error and the magnitude of the coefficients.
When we're using the squared magnitude here, we call it ridge regression.
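Written out, the regularized objective described above might look like the following. The symbols are assumptions, since the slide itself is not reproduced in the transcript: \(\beta\) for the coefficient vector, \(L(\beta)\) for the error loss (squared error or negative log likelihood), and \(\lambda\) for the regularization strength.

```latex
\hat{\beta} = \operatorname*{argmin}_{\beta} \; L(\beta) + \lambda \lVert \beta \rVert_2^2,
\qquad
\lVert \beta \rVert_2^2 = \sum_j \beta_j^2
```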
So, a quick detour on some of the notation. I'm using a norm as a measure of the magnitude of a vector.
So when we write the norm of x with the subscript two, that's the L2 norm; it's called the L2 norm or the Euclidean norm.
What it is, is the square root of the sum of the squares of the elements of the vector.
If you take the L2 norm of y minus x, that's the Euclidean distance between y and x.
If they're two-dimensional vectors, it's the straight-line distance between them.
So if you've got y and you've got x,
it's the straight-line distance between them.
And then we can square it. So subscript two means L2 norm; superscript two means squared.
And that's the sum of the squares of the elements. So we get rid of the square root and we get the sum of the squares, which is really useful.
It simplifies the computation just a little bit, and it's how the ridge regression regularization is defined. The L1 norm,
subscript one, is the sum of the absolute values, and we call this the Manhattan or taxicab distance,
because it's the distance you would have to travel if you could only travel in straight lines along the axes, like city blocks.
So if you want to go from x to y, it's the total length of that path.
It's also useful.
So L1 is the sum of absolute values; L2 is the square root of the sum of the squares.
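A quick numeric check of these norms with NumPy, on two invented points chosen so the arithmetic is easy to follow:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
d = y - x  # the difference vector (3, 4)

l2 = np.linalg.norm(d, ord=2)   # sqrt(3^2 + 4^2): Euclidean distance
l2_squared = np.sum(d ** 2)     # 3^2 + 4^2: squared L2 norm, no square root
l1 = np.linalg.norm(d, ord=1)   # |3| + |4|: Manhattan / taxicab distance

print(l2, l2_squared, l1)  # 5.0 25.0 7.0
```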
You can generalize to get other norms as well. But this is what this notation means:
the magnitude of the vector. And so when we build up our regularized model, think about how we increase
this component of the loss.
Remember, one of the tools we want to use for understanding a metric is: how do you make it change?
How do you increase it or decrease it? And the way you increase or decrease this part of the loss function is
you increase or decrease the coefficients. A large coefficient can come from having a strong relationship,
or it can come from putting more of the common factor on one feature than another. So when you have this multicollinearity,
one thing the ridge regression is going to do is it's going to encourage the
model to distribute the influence of the common factor between the different correlated features.
Because if it puts it all on one, that would increase the sum of squares more than if it divides it evenly between the two;
the way you minimize the squares is you divide the common component evenly between the two features.
And it gives us a solution. With multicollinearity, our system is underdetermined.
We don't have enough information to know where the coefficient goes. By adding regularization to our loss function,
we introduce this additional loss that
tells it where to put it, by making the least expensive solution be the one where it's evenly distributed between all of the correlated features.
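The even-split behavior can be checked with a small NumPy sketch that solves the ridge problem in closed form for two identical features. The helper `ridge_fit` and the made-up data are assumptions for illustration; it omits the intercept to keep the algebra short.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1])  # two perfectly correlated (identical) features
y = 2.0 * x1 + rng.normal(scale=0.1, size=200)

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 (no intercept), in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Without the penalty X.T @ X is singular here: any split of the total
# effect between the two columns fits equally well.
coefs = ridge_fit(X, y, lam=1.0)
print(coefs)  # both coefficients are close to 1, sharing the total effect of 2

# Why the even split is cheapest: the same total effect costs less when shared.
penalty_all_on_one = np.sum(np.array([2.0, 0.0]) ** 2)  # 4.0
penalty_split = np.sum(np.array([1.0, 1.0]) ** 2)       # 2.0
print(penalty_all_on_one, penalty_split)
```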
So now we have this loss function: we have our error loss plus our coefficient strength.
We can minimize this in two ways. We can minimize it by decreasing our error, and we can minimize it by having small coefficients.
And effectively what that means is, in order for a coefficient value to be large, it has to earn its keep, and it has to earn its keep
by decreasing training error. If you've got a particular value
and we try to increase the coefficient,
that might give us a lower error. But we only get a lower total loss if it decreases the error by more than it increases
the coefficient penalty, after taking into account the square and our regularization strength term.
And so it encourages the coefficients to be small values unless a large value contributes significantly
to decreasing the model's error on the training data: decreasing its squared error, or increasing its log likelihood
if we're talking about a logistic regression.
The regularization parameter lambda is what we call a hyperparameter, because we don't learn lambda from the data in general;
within a single linear model, we don't learn lambda from the data.
It has to come in from outside. The exact impact of the value depends somewhat on implementation details, such as whether
the loss function itself is a mean or a sum; different scikit-learn models actually differ on this.
You can't just take a regularization strength from one scikit-learn model and use it for another,
even if they're both doing L2, because other details of the loss function mean the value doesn't transfer.
If it's using a sum of squared errors,
then the regularization strength needs to depend on the data size, because for the same amount of average error,
the sum of squared errors is going to be larger just from having more data.
If it's a mean, then your regularization term is not going to depend on your data size.
Some scikit-learn models instead use a parameter C, which is one over lambda,
and it's multiplied by the error instead of being multiplied by the coefficients.
So C is an inverse strength parameter: an increased value of lambda or a decreased value of C results in stronger regularization,
and a coefficient has to contribute more to the model's performance to earn its keep for a large value than it does with weaker regularization.
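To see the direction of the C parameter, one could fit the same logistic regression with a few values of C and compare the size of the coefficient vector. This is a sketch on synthetic data; the specific C values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=3
)

coef_norms = {}
for C in [0.001, 1.0, 1000.0]:
    # Small C = strong regularization; large C = weak regularization.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    pipe.fit(X, y)
    coef_norms[C] = np.linalg.norm(pipe[-1].coef_)

for C, norm in coef_norms.items():
    print(f"C={C:g}: L2 norm of coefficients = {norm:.3f}")
```

The coefficient vector should be much smaller at C=0.001 than at C=1000, which matches reading C as one over lambda.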
Now, one good way to learn a good value for lambda is to optimize it with a train-tune split of the training data.
Scikit-learn will do this automatically if you use the right class.
A lot of the regressors also have a CV class: there's LogisticRegressionCV, there's RidgeCV,
and quite a few others have a CV variant.
And what happens with the CV variant is it will learn values for one or more hyperparameters when you call fit with training data.
It will cross-validate on the training data to learn them, and you can give it
a range of hyperparameter values to consider, a list of them.
It will do the cross-validation to automatically learn good values, the best values it can, for these regularization parameters.
There is also a class GridSearchCV that allows you to do cross-validation to search for good hyperparameter values
for any hyperparameter of a scikit-learn model.
I encourage you to go play with that at some point. But LogisticRegressionCV will do that automatically, just in the fit call, within
itself; it'll find a good regularization strength value.
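Both approaches can be sketched side by side on synthetic data. The candidate list `candidate_C` is an arbitrary assumption, and `np.ravel` is used when reading `C_` so the sketch does not depend on whether that attribute is an array or a scalar in a given scikit-learn release.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4, random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

candidate_C = [0.01, 0.1, 1.0, 10.0, 100.0]

# Built-in search: cross-validates over candidate_C inside fit.
cv_model = LogisticRegressionCV(Cs=candidate_C, cv=5, max_iter=2000)
cv_model.fit(X_train, y_train)
best_C = float(np.ravel(cv_model.C_)[0])

# General-purpose search: works for any hyperparameter of any estimator.
search = GridSearchCV(
    LogisticRegression(max_iter=2000), param_grid={"C": candidate_C}, cv=5
)
search.fit(X_train, y_train)

print("LogisticRegressionCV chose C =", best_C)
print("GridSearchCV chose", search.best_params_)
```

Both searches use only the training data; the held-out `X_test` would be used once, afterwards.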
So, the lasso regression. This looks very, very similar, except every place we had the squared L2 norm, the sum of squares, we replace it
with the L1 norm: we're now looking at the sum of the absolute values. The squared L2 norm encourages values to be small,
but if a value is close to zero, it's fine with it being close to zero.
One of the effects the L1 norm has is that it doesn't like small nonzero values.
If a coefficient value is small, the L1 norm is going to push it to zero.
And what this does is it makes the coefficients what we call sparse; sparse data is data with a lot of zeros.
And so if a coefficient is not contributing very much to classification, it's going to go to zero.
And you can use that to see which features are actually being used in the classification.
It effectively becomes an automatic feature selection technique, because it's going to push the
coefficients for features that don't contribute very much to decreasing your training error to zero.
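The sparsity effect can be seen by fitting lasso and ridge regressions to the same made-up data, where only the first two of ten features actually matter. The data and the `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 influence the outcome; the other eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Note: alpha is not on the same scale in these two classes (Lasso averages
# the squared error, Ridge sums it), so the values are not comparable.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("lasso coefficients:", np.round(lasso.coef_, 2))
print("ridge coefficients:", np.round(ridge.coef_, 2))
print("exact zeros: lasso =", lasso_zeros, ", ridge =", ridge_zeros)
```

The lasso fit should set the noise features' coefficients exactly to zero, while ridge typically leaves them small but nonzero.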
You can then put them together in what's called the elastic net, which combines L1 and L2 regularization.
You have an overall regularization strength lambda that controls your regularization, or C, which is one over lambda,
which multiplies the loss function. And then we have L1 regularization and L2 regularization,
and they're balanced with this parameter rho.
You could parameterize it differently, so you just have your L1 strength and your L2 strength.
But most elastic net implementations have a regularization strength,
your lambda or your C (some of the scikit-learn docs use alpha for this),
and then you have a balance that says how much of the regularization to put on L1
and how much to put on L2. These parameters both need to be chosen by cross-validation;
that's really the only way to find good values.
So LogisticRegression and LogisticRegressionCV can do elastic net.
There are also ElasticNet and ElasticNetCV classes. By default, if you use LogisticRegressionCV,
it's going to default to L2 regularization and search for the regularization strength.
If you want elastic net, you change the penalty option.
You also have to change the solver, because logistic regression can use several solvers to learn the logistic regression parameters,
and only one of them supports elastic net. And then you're going to need some additional options in order to tell it to also search for that L1 ratio.
But it can do all of that for you.
I refer you to the documentation for the LogisticRegression and LogisticRegressionCV classes to see how to do that.
You're going to find it useful in Assignment 5. I'm also going to be giving you an example in the synchronous session that deals with some of this.
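As a hedged sketch of the options mentioned above: this assumes a scikit-learn release in which `LogisticRegressionCV` accepts `penalty="elasticnet"` together with `solver="saga"` and an `l1_ratios` list (newer releases may warn about or rename some of these options, so check the documentation for your version). The data and candidate grids are invented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=15, n_informative=3, random_state=11
)

l1_ratios = [0.1, 0.5, 0.9]  # candidate L1/L2 balances (assumed grid)
model = make_pipeline(
    StandardScaler(),  # standardize before regularizing
    LogisticRegressionCV(
        penalty="elasticnet",        # switch from the default L2 penalty
        solver="saga",               # the solver that supports elastic net
        l1_ratios=l1_ratios,         # also search over the L1 ratio
        Cs=[0.01, 0.1, 1.0, 10.0],   # candidate inverse strengths
        cv=5,
        max_iter=5000,
    ),
)
model.fit(X, y)

clf = model[-1]
chosen_C = float(np.ravel(clf.C_)[0])
chosen_l1_ratio = float(np.ravel(clf.l1_ratio_)[0])
print("chosen C:", chosen_C, "chosen L1 ratio:", chosen_l1_ratio)
```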
So some notes on applying regularization, though.
Regularization really works best when you're numeric variables are standardized because the coefficients.
It's it's looking at the total magnitude of your coefficient vector.
And if one of your coey if one of your features is in units of millimeters and one of your features is in units of KG's,
the coefficient values have nothing to do with each other. And so looking at the total magnitude, treating them as elements of a vector,
it becomes really difficult and it's going to penalize one just for having to have a larger range because of the underlying units.
If you standardize your numeric variables, then each one is in terms of standard deviation.
The coefficients become a lot more directly comparable with each other and your regression is going to be better be your your regularization
is going to be better behaved than you want to select your hyper parameters based on performance and the tuning getter and the CV classes,
as I said, get help with this. I'm giving you an example and one of the notebooks that does so give you an example,
a notebook that does uses logistic regression CV to do hyper parameter search for L2 regularization.
So you can see that in action with a simple example. So to conclude.
So to conclude: regularization imposes costs on the model for large coefficient values, either large squared values or large absolute values.
Squared costs, which we call L2 or ridge regularization, encourage values to be small.
Absolute value costs, which we call L1 or lasso regularization, encourage small values to be zero.
If you put those together, it encourages values to be either zero or large enough to be meaningful, but not super large.
L2 or ridge regularization is useful for controlling the effects of multicollinearity.
And together they're useful for decreasing your model complexity, making coefficient values earn their keep.
Another thing that they do is, if everything's standardized or at least mean-centered, then small coefficients result in small effects.
And effectively what it means is: assume everything's average unless we have enough evidence,
enough data, to justify stronger beliefs and stronger relationships that
are justified in terms of their ability to reduce our error on the training data.
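The standardization and hyperparameter-search advice from this video can be sketched roughly as follows. This is a minimal illustration and not the course notebook: the synthetic data, the rescaled column, and the grid `c_grid` are assumptions made up for the example. It standardizes features inside a pipeline and lets `LogisticRegressionCV` pick the L2 strength by cross-validation on the training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real data set (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)
# Put one feature on a very different scale, like millimeters vs. kilograms.
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Candidate values of C, the inverse regularization strength
# (smaller C means a stronger penalty).
c_grid = np.logspace(-3, 3, 7)

# Standardize first so the penalty treats all coefficients comparably,
# then search over C with 5-fold cross-validation (default penalty is L2).
reg_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=c_grid, cv=5, max_iter=1000),
)
reg_pipe.fit(X_train, y_train)

# np.ravel handles C_ being stored either as an array or as a scalar.
chosen_c = float(np.ravel(reg_pipe.named_steps["logisticregressioncv"].C_)[0])
test_acc = reg_pipe.score(X_test, y_test)
print(f"chosen C = {chosen_c:g}, test accuracy = {test_acc:.3f}")
```

Putting the scaler inside the pipeline means it is fit on the training data only and then applied unchanged to the test data.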

📓 Pipeline and Regularization#

This notebook demonstrates pipelines and \(L_2\) regression, and performs a significance test of classifier improvement.

It also shows a training of a decision tree (next video).

📓 Advanced Pipelines#

The Advanced Pipelines notebook demonstrates a much more advanced SciKit-Learn pipeline.

🎥 Models and Depth#

What does the world look like beyond logistic regression? Can a model output be a feature?

Video (7m23s)

Slides

So in this video, I want to move beyond logistic regression to talk about some additional classification
models and also introduce the idea of putting models in as features for other models.
So learning outcomes are to do exactly what I just said.
So far, we've been estimating the probability of Y equals one by using a linear model: y hat equals the logistic function
of a linear score. And we can use any estimate of this probability, or we can just use models that output decisions.
These may be based on scores, including scores that aren't estimated probabilities.
For example, a support vector machine uses distance from the hyperplane as its score.
So we're not limited to just using a logistic regression, of course.
So for one model, a decision tree is a tree of nodes where each node is a decision point.
I made a little decision tree here for the grad student admissions example.
At the first node, it's going to check if the GPA is less than or equal to 3.435.
If it's less, it's going to go to the left-hand side. There are extra nodes here, but it's going to deny admission.
If it's greater than 3.435, it's then going to look at
their school rank, and if their school rank is less than 1.5,
it's going to admit. And if it's greater than 1.5, it's going to deny.
Really simple model. It would be absolutely terrible to actually use this model for admissions decisions.
But here we aren't trying to build a model that will admit; we're trying to build a model
that's going to predict whether someone is going to get admitted, and for that it might work. This illustrates how the decision tree actually works.
Decision trees can learn complex interaction effects on their own, because the thresholds
and what happens with the features change as you go down the nodes. One of the problems, though, is that they have high variance:
they can effectively memorize all of the training data by building themselves a lookup table that looks up the outcomes for training data
by the feature values. You can get extremely good training accuracy.
I trained one on this data with unlimited depth, and I got training accuracy of over 99 percent
and test accuracy of 0.52.
But a random forest: what a random forest does is take bootstrap samples. By default, scikit-learn's random forest
will take complete bootstrap samples. You can tell it to take smaller ones;
then it's not actually a bootstrap sample, but it's a subsample of the dataset. And it fits a decision tree to that sample.
And then it does that 100 times, or however many times, so you get one hundred decision trees.
And then for a final classification, when you tell it to predict, what it's going to do is ask all of the decision trees to vote.
It's building up this random forest of happy trees. They're happy because they have a functioning democracy.
They all get to vote on the final outcome. And the random forest takes the vote and returns the majority classification.
Or if the individual trees are producing scores, then it might average the scores and use that as an output.
So you decrease the variance
that you would get from training a decision tree on one set of data versus training a decision tree on another set of data, by training
decision trees on a bunch of sets of data, by subsampling your training data, and then averaging over that in order to produce your final output.
Random forest is one of the classifiers that I want you to use in your assignment.
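The contrast between a single unlimited-depth tree and a random forest can be sketched like this. The data here is synthetic with deliberately noisy labels (an assumption for illustration, not the admissions data from the video), so the exact numbers will differ from the ones quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels randomized, so memorizing cannot generalize.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A single tree with no depth limit can effectively memorize the training data.
tree = DecisionTreeClassifier(max_depth=None, random_state=0)
tree.fit(X_train, y_train)
tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)

# A forest fits 100 trees to bootstrap samples and combines their votes.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
forest_test_acc = forest.score(X_test, y_test)

print(f"tree:   train={tree_train_acc:.3f}  test={tree_test_acc:.3f}")
print(f"forest: test={forest_test_acc:.3f}")
```

The `max_samples` parameter of `RandomForestClassifier` is the option for drawing smaller subsamples instead of full-size bootstrap samples.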
Another thing, though, that I want to introduce is that features don't have to come directly from data.
A lot of our features are going to come from data.
But sometimes they come from other models. Sometimes they're a transformation model, some kind of what we call unsupervised learning,
where it's computing things
but doesn't have a known output class that it's trying to predict, or they're prediction models for other tasks.
For example, in LinkedIn's job ad recommender, the last I knew, just a few years ago, at a high level
it was a logistic regression. You go to LinkedIn. It says, here's a job ad for you.
Well, that's coming from a logistic regression.
But that logistic regression has very complex features, some of which are the outputs of other machine learning models.
And so you're going to get features from the job text, the job description, and features from the user's profile.
One particularly interesting feature they use is a transition probability estimate.
So they have a model, another statistical model, that tries to predict:
if you are currently working in Boise as a data scientist,
what's the likelihood that you would transition to a job title of senior data scientist in Salt Lake City?
It takes into account job transitions: data
scientists might move to senior data scientist, software engineers to staff software engineer or principal software engineer.
It takes into account current migration patterns in the industry and various things like that to get this:
how likely are you to even make this move? Someone in a staff
software engineering position is unlikely to take a job where the title is junior software engineer.
And the output of this transition probability model is one of the input features to their logistic regression
that's estimating: would you like to see this job ad for a senior data scientist in Salt Lake City?
You also get features that might come from some kind of a deep learning model:
a deep learning object detection mechanism, a deep learning image similarity mechanism.
Pinterest gets a lot of mileage out of doing nearest-neighbor calculations where the nearest neighbor
is defined by a deep learning model for assessing whether two images that are being pinned are similar.
So there are many different models that we can look at.
Linear models with their extensions: generalized linear models and the logistic regression that we've been seeing, and generalized additive models.
There's also the support vector machine, which is another linear model, but it's not a regression model.
The naive Bayes classifier, which we're going to see later. A neural net,
whether shallow or deep. A lot of neural nets
do a similar thing to logistic regression: they're computing a score, and then you pass it through a logistic function or some other sigmoid in
order to convert the model score to probabilities for making your final classification decisions.
So to wrap up: there are many different models for classification and for regression.
My goal in this class is to teach you what regression and
classification are and how to get started with applying them and evaluating them,
not to teach you a bunch of models in depth.
The machine learning class is going to go into a lot more about how these different models work and how to get them to work
well. Model outputs can also be used as input features for other models, often linear,
not always, though. And so you can get models that build on top of other models.
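As a rough sketch of a model output becoming a feature, the example below trains a sub-model on one block of columns and feeds its predicted probability into a top-level logistic regression. Everything here is hypothetical: the data is synthetic, the column split (`direct_cols`, `sub_cols`) is invented, and, unlike the LinkedIn example, the sub-model predicts the same target as the top-level model to keep the sketch short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=800, n_features=10, n_informative=5,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

direct_cols = [0, 1, 2, 3, 4]   # features the top-level model sees directly
sub_cols = [5, 6, 7, 8, 9]      # features only the sub-model sees

sub_model = RandomForestClassifier(n_estimators=50, random_state=1)

# Out-of-fold scores for the training rows, so the top-level model never
# sees a sub-model score for a row the sub-model was fit on.
train_score = cross_val_predict(
    sub_model, X_train[:, sub_cols], y_train, cv=5, method="predict_proba"
)[:, 1]

# For test rows, fit the sub-model on all training data and score them.
sub_model.fit(X_train[:, sub_cols], y_train)
test_score = sub_model.predict_proba(X_test[:, sub_cols])[:, 1]

# The sub-model's score becomes one more input column.
top_train = np.column_stack([X_train[:, direct_cols], train_score])
top_test = np.column_stack([X_test[:, direct_cols], test_score])

top_model = LogisticRegression(max_iter=1000).fit(top_train, y_train)
top_acc = top_model.score(top_test, y_test)
print(f"top-level test accuracy: {top_acc:.3f}")
```

Using out-of-fold predictions for the training rows is one design choice for limiting leakage between the two models; other arrangements, such as training the sub-model on separate data, are also possible.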

🎥 Inference and Ablation#

How do we understand, robustly, the performance of our system? What contributes to its performance?

Video (14m55s)

Slides

CS 533 INTRO TO DATA SCIENCE, Michael Ekstrand: INFERENCE AND ABLATION
Learning Outcomes: Make inferences about model accuracy. Understand interplay of cross-validation and inference. Use ablation studies to make inferences about feature or sub-model importance. (Photo by Siora Photography on Unsplash)
Train/Test: Split The Data; Training Data; Testing Data; One Way; Train model; Experiment with different model designs; Experiment with different features; Select hyperparameters; Evaluate Effectiveness; Significant?
Testing Effectiveness: For test data, we have: individual classifications, right or wrong; single metric value (accuracy, precision, etc.) for each classifier. Can't significance test a single value!
Testing Objectives: Does my classifier perform better than benchmark value? What is the precision of my estimated classifier accuracy? (Confidence interval.) Does classifier A perform better than B? (P-value; confidence interval for the difference.)
Test Samples: Confidence Intervals. Solution 1: treat each test item as a binary measurement. If metric denominator from test data: Wilson confidence interval; statsmodels: proportion_confint (with method='wilson'). Works for accuracy, FPR, FNR, recall, specificity. Any metric: bootstrap the test samples; compute metric from bootstrap samples.
P-Value for Accuracy
Testing Regression: For regression, each sample is a continuous measurement of the model's prediction error. Use paired t-test or appropriate bootstrap.
Repeated Testing: With repeated cross-validation, we can compute a t-statistic. Run 5 times; each time, do 2-fold cross-validation. See reading. Simple cross-validation not great: too much non-independence. Repeated test sampling unreliable: too much non-independence.
Cross-Validation and Train/Test Split: Cross-validation sometimes used for final eval. Allows data leakage: what did you do your model & feature selection on? Good for: limited engineering (just see how well the model works); model and feature design (when done on training data).
Understanding Performance & Behavior: Suppose you are detecting spam with: text features; metadata features; URL features; URL reputation model; sender reputation model. What makes it work?
Ablation Studies: An ablation study examines impact of individual components. Turn each off in turn. Measure classification performance. Lets you see how much each component contributes. Do use results for production decisions, future work. Do not use results to revisit model design (in this trial).
Wrapping Up: Inference for classifier performance is not immediately straightforward. Several techniques helpful. Be careful about data leakage. Sometimes tradeoffs are needed. (Photo by Bianca Ackermann on Unsplash)

So in this video, I want to talk with you about inference for model effectiveness and introduce the idea of an ablation study.
Our goals are for you to be able to make inferences about model accuracy and
understand a little bit better the interplay of cross-validation and inference,
remembering that we can't be perfect. The goal is to do a good and a credible job.
And then also to be able to use an ablation study to make inferences about the particular
contributions and value of different features or subcomponents of your model.
So remember, we've got this train/test split.
We have the training data, where we're doing all of our iterative process.
It's a big, loopy thing. And then we evaluate our effectiveness.
One thing we haven't talked about yet is: is the effectiveness significant?
So for our test data, we have a few outputs.
We have the individual classifications or predictions: for classifications, we have whether they're right or wrong; for predictions,
we have the error. And then we have a metric value (accuracy, precision, etc.) for each classifier.
One of the challenges, though, is that for the classifier and the test data, we just have "accuracy is 0.99" or "precision is 0.4".
We can't significance test that value.
In order to set up how we can significance test, or otherwise do inference, I should say, because with significance
testing, as we discussed earlier, a lot of times we might actually care about an effect size estimate with confidence intervals
a lot more than we care about a significance test,
there are a few questions that we want to answer as the results of an evaluation. One: does my classifier perform better than some benchmark value?
We might have a value we want to beat, say, a value we know is good enough,
and we want to know if my classifier performs better than that value.
We might want to get an estimate of our classifier's accuracy or precision or recall, or pick our metric, that has a confidence interval on it,
so we know how precise this estimated performance measure is.
And then we may also want to answer the question: does classifier A perform better than B?
Maybe B is our current system, or B is the existing known state of the art,
and we want to know if A does better. We might want a p-value.
We might want a confidence interval for the improvement or the difference in performance between A and B.
So, to get started, one way we can compute a confidence interval is
to treat each test item as a binary measurement.
Say you've got a hundred thousand test items because you've got a very large dataset:
it's a 10 percent split, and
you've got a million data points, so 100,000 test points. For each of these,
you have the true value, yes or no, and you have the prediction, yes or no.
This works if the metric denominator comes from the test data. For accuracy, it definitely does, because accuracy is correct over all.
You can also do this for false positive rate, false negative rate,
recall, specificity: anything where the denominator is completely determined by the test data, not by the classifier results.
You can use a Wilson confidence interval.
Statsmodels does this with proportion_confint, and a Wilson confidence interval is a confidence interval for a proportion.
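To make the Wilson interval concrete, here is a small sketch that computes it by hand for accuracy. The labels and predictions are made up for illustration, and `wilson_interval` is a hypothetical helper written for this example; in practice the statsmodels function `proportion_confint` with `method='wilson'`, mentioned on the slide, is the library route.

```python
import numpy as np
from scipy.stats import norm


def wilson_interval(n_success, n_total, alpha=0.05):
    """Wilson score interval for a binomial proportion."""
    z = norm.ppf(1 - alpha / 2)          # critical value, about 1.96 for 95%
    p_hat = n_success / n_total
    denom = 1 + z**2 / n_total
    center = (p_hat + z**2 / (2 * n_total)) / denom
    half_width = z * np.sqrt(
        p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2)
    ) / denom
    return center - half_width, center + half_width


# Made-up test results: 200 test items, the classifier is wrong on 30 of them.
y_true = np.tile([0, 1], 100)
y_pred = y_true.copy()
y_pred[:30] = 1 - y_pred[:30]

# Each test item is a binary measurement: was the prediction correct?
n_correct = int(np.sum(y_true == y_pred))
n_total = len(y_true)

ci_low, ci_high = wilson_interval(n_correct, n_total)
print(f"accuracy = {n_correct / n_total:.3f}, "
      f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

For recall or specificity the same function applies, with the denominator restricted to the actual positives or actual negatives in the test data.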
For any metric, you can bootstrap: you can take your test samples,
take bootstrap samples of them, and then compute your classifier metric over your bootstrap samples.
Now, you have to be careful when you're doing your bootstrap samples to make sure that
you keep the ground truth labels and the classifier outputs together.
And if you're doing multiple classifiers, you have to keep all the classifier outputs together as you're computing these bootstrap samples.
You can bootstrap from your test data and get a confidence interval for any of your classifier performance metrics.
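A percentile bootstrap for precision might look like the following sketch. The labels and classifier outputs are simulated (an assumption for illustration), and the number of bootstrap replicates is an arbitrary choice. The key point from the video is that one set of resampled indices is used for both the labels and the predictions, so each pair stays together.

```python
import numpy as np
from sklearn.metrics import precision_score

rng = np.random.default_rng(20221031)

# Simulated test set: a hypothetical classifier that agrees with the
# truth roughly 80% of the time.
n_test = 500
y_true = rng.integers(0, 2, size=n_test)
flipped = rng.random(n_test) < 0.2
y_pred = np.where(flipped, 1 - y_true, y_true)

point_estimate = precision_score(y_true, y_pred)

n_boot = 1000
boot_precision = np.empty(n_boot)
for b in range(n_boot):
    # Resample test items with replacement; indexing both arrays with the
    # same idx keeps each label paired with its prediction.
    idx = rng.integers(0, n_test, size=n_test)
    boot_precision[b] = precision_score(y_true[idx], y_pred[idx],
                                        zero_division=0)

# Percentile interval from the bootstrap distribution.
ci_low, ci_high = np.quantile(boot_precision, [0.025, 0.975])
print(f"precision = {point_estimate:.3f}, "
      f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With several classifiers, the same `idx` would index every classifier's output array within a replicate.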
You can also compute a p-value for the accuracy metric. This specific technique only works for accuracy.
It does not work for any of our other classifier metrics. But you can get a p-value for the null hypothesis that the two classifiers have the same
accuracy by using what's called a contingency table. In a contingency table for this purpose,
you go from the classifications to whether or not each one was right or wrong.
So here we have the number of times both classifiers were right, and here, the number of times they were both wrong.
And here we have where classifier one was right and classifier two was wrong.
How often did that happen? We can do the same the other way around.
And then we do what's called a McNemar test, and it uses these values: \(n_{YN}\) is the number of times classifier one is right
and two is wrong, and \(n_{NY}\) is the number of times one is wrong and two is right.
And so we take the squared difference in their wrongnesses and divide it by the sum of their wrongnesses, and this gives us a statistic,
\(M = (n_{YN} - n_{NY})^2 / (n_{YN} + n_{NY})\), the test statistic. Under \(H_0\), the null hypothesis, \(M\) follows what's called
a chi-squared distribution with one degree of freedom, a probability distribution;
you can get its CDF from statsmodels or from SciPy.
And you can use that to compute a p-value: what's the probability of having an \(M\) statistic at least this large?
You don't have to deal with absolute values on it because it's a non-negative statistic in a non-negative distribution.
There is something called a proportion test, but the proportion test is for independent proportions and independent samples.
But we don't have independent samples. We have one sample of our test data.
And for each test point, we have two measurements: classifier one and classifier two.
So we can't use a proportion test.
But the McNemar test basically does this paired proportion test kind of thing and allows us to get a p-value for
whether the classifiers have the same accuracy or not.
And in this one, the p-value does not allow us to reject the null hypothesis that they have the same accuracy.
The p-value is about one.
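The McNemar computation described above can be sketched in a few lines. The test labels and the two classifiers' outputs are fabricated toy arrays, and `mcnemar_test` is a hypothetical helper name; the sketch uses the uncorrected statistic (squared difference over sum) with a chi-squared distribution on one degree of freedom from SciPy.

```python
import numpy as np
from scipy.stats import chi2


def mcnemar_test(y_true, pred_1, pred_2):
    """Uncorrected McNemar test for whether two classifiers have equal accuracy."""
    right_1 = pred_1 == y_true
    right_2 = pred_2 == y_true
    n_yn = int(np.sum(right_1 & ~right_2))   # 1 right, 2 wrong
    n_ny = int(np.sum(~right_1 & right_2))   # 1 wrong, 2 right
    if n_yn + n_ny == 0:
        # The classifiers never disagree on correctness: no evidence of a difference.
        return 0.0, 1.0
    stat = (n_yn - n_ny) ** 2 / (n_yn + n_ny)
    p_value = chi2.sf(stat, df=1)            # P(M >= stat) under H0
    return stat, p_value


# Toy data: 100 test items whose true label is 0.
y_true = np.zeros(100, dtype=int)
pred_1 = y_true.copy()
pred_2 = y_true.copy()
pred_2[:15] = 1      # classifier 2 wrong, classifier 1 right (15 items)
pred_1[15:20] = 1    # classifier 1 wrong, classifier 2 right (5 items)
pred_1[20:30] = 1    # both wrong (10 items)
pred_2[20:30] = 1

stat, p_value = mcnemar_test(y_true, pred_1, pred_2)
print(f"M = {stat:.2f}, p = {p_value:.4f}")
```

Only the two disagreement cells enter the statistic; the items where both classifiers are right or both are wrong carry no information about which one is better.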
We can also test regression. Each sample is a continuous measurement of the model's prediction error.
So we have \(y_i - \hat{y}_i\)
from model A, and we have \(y_i - \hat{y}_i\)
from model B. Those are two different measurements of the same test point, so
we can use a paired t-test, or we can use an appropriate bootstrapping mechanism, in order to assess the accuracy of a regression model.
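A paired t-test on per-item regression errors could be sketched as below. The data is synthetic, the two models (all features versus the first three) are arbitrary stand-ins, and comparing squared errors rather than, say, absolute errors is an assumption of this example.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=6, n_informative=6,
                       noise=10.0, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# Model A uses every feature; model B only sees the first three.
model_a = LinearRegression().fit(X_train, y_train)
model_b = LinearRegression().fit(X_train[:, :3], y_train)

# One error measurement per test item and per model, kept in the same order
# so that the i-th entries of both arrays refer to the same test point.
sq_err_a = (y_test - model_a.predict(X_test)) ** 2
sq_err_b = (y_test - model_b.predict(X_test[:, :3])) ** 2

result = ttest_rel(sq_err_a, sq_err_b)
print(f"mean squared error A = {sq_err_a.mean():.1f}, B = {sq_err_b.mean():.1f}")
print(f"paired t = {result.statistic:.2f}, p = {result.pvalue:.4g}")
```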
Now, when we do cross-validation: one technique you sometimes see is to do, say, 10-fold cross-validation.
That gives you 10 accuracies for each classifier, and you can compute a paired t-test.
Each of your folds in your cross-validation
is one data point in your sample, so you've got N equals 10.
You can do a t-test, but that actually doesn't work very well, because your samples are not independent
if you're doing k-fold cross-validation.
Also, if you just repeatedly draw a 10 percent sample, and do that, say, 30 times,
you have the same problem: the same data points are going to show up in multiple of your samples.
Also, your classifiers are being trained on the same data too much.
The ideal is to be able to draw, say, 30 completely independent training and testing sets from your big population.
But if you can't do that and you're trying to simulate it with cross-validation,
the non-independence causes the resulting statistical test to not be reliable.
One thing you can do is repeated cross-validation, where five times you do a two-fold cross-validation.
I'm going to refer you to one of the readings I put in the notes for a lot more details on this.
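As a rough illustration of "five times, do a two-fold cross-validation", the sketch below collects per-fold accuracy differences and computes the 5x2cv paired t statistic as I understand it from the literature (first difference divided by the root of the mean per-repetition variance, compared to a t distribution with 5 degrees of freedom). The data, the two models, and the seed are all assumptions for the example; check the reading before relying on the exact formula.

```python
import numpy as np
from scipy.stats import t as t_dist
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=7)
model_a = LogisticRegression(max_iter=1000)
model_b = RandomForestClassifier(n_estimators=50, random_state=7)

# 5 repetitions of 2-fold cross-validation: 10 train/test splits in total,
# yielded repetition by repetition.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=7)
diffs = []
for train_idx, test_idx in cv.split(X, y):
    acc_a = clone(model_a).fit(X[train_idx], y[train_idx]).score(
        X[test_idx], y[test_idx])
    acc_b = clone(model_b).fit(X[train_idx], y[train_idx]).score(
        X[test_idx], y[test_idx])
    diffs.append(acc_a - acc_b)

# Row i holds the two fold-level differences from repetition i.
diffs = np.array(diffs).reshape(5, 2)

# Variance estimate within each repetition, then the 5x2cv t statistic.
rep_mean = diffs.mean(axis=1, keepdims=True)
rep_var = ((diffs - rep_mean) ** 2).sum(axis=1)
t_stat = diffs[0, 0] / np.sqrt(rep_var.mean())
p_value = 2 * t_dist.sf(abs(t_stat), df=5)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```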
I just wanted to bring it up so that you know it's there. Cross-validation is sometimes used for final evaluation.
You'll find this in papers sometimes.
One of the problems, though, is this allows data leakage, because
you're testing on data that was available in your training set.
This can be a significant problem. If we've got a large enough data
set that we can just use a single test split, or maybe two or three test splits,
that's going to allow us to avoid leakage and much better simulate what's going to happen
when we put the model in production. Cross-validation is really useful in a couple of contexts.
One is where you're not doing much model design or feature engineering. You have data,
you want to take a model, apply it, and see how it works. Cross-validation is great for that.
You don't have the iterative process of "how am I really getting this model to work?"
You can cross-validate. If you've got hyperparameter search,
do a hyperparameter search separately for each fold; make it part of your training process.
The LogisticRegressionCV kinds of things help with that.
But if you've got a model and just want to see how well it works on the data, cross-validation can work pretty well.
Also, when you are doing cross-validation on the training data to iteratively improve your model and feature design,
that can work really well. The problem arises when you're doing a lot of engineering on your model
and you get access to the test data, which you effectively have in a cross-validation setup.
Because even if, say, you do 10-fold cross-validation and you pick one of the folds
to be what you're really doing your development on, all of your other test data is in
this initial development part.
So you're effectively using the test data as part of your tuning process, for your hyperparameter selection,
as part of your exploratory data analysis. And that is a cause of leakage.
Again, though, we can never be perfect, but it's important to be aware of it as a cause of leakage.
I really recommend having the designated test set that you hold out
and don't touch. That's the basis of your evaluation, even if it makes the statistical inference a little bit harder.
Now, another thing I want to talk about:
suppose you've got a complex model. Say we're detecting spam, and we're working for, say, a
telecommunications company detecting text message spam, or we're detecting email spam for an email company.
We have text features. We've got metadata features. When they're sending URLs,
we've got features of the URL itself. Maybe we even hit the server.
Let's say we've got another couple of sophisticated models that score
URLs by their reputation and also score senders
by a reputation score. Large email antispam efforts, such as the one built into Gmail,
do this. I'm not just making that up. A part of antispam at scale is building reputations for URLs and senders.
And let's say our spam detector works well: precision of 99.5 or 99.9 percent,
recall of 80 percent. But what makes it work?
Which of these features is contributing how much to its success?
The answer is to do what's called an ablation study. An ablation study takes our model:
we take our whole model, we see how accurate it is, but then we turn off individual features of it.
So we might turn off the sender reputation. How exactly you turn it off depends on the model design.
If it's a neural net, you might just take that part out of your neural net graph. In a logistic model, if everything's well standardized,
you can put in zeroes for the feature and not retrain, or you can take that term out of the model,
retrain on your training data, and try to predict your testing data. You probably want to do that,
just to make sure the parameters are being tuned without the piece. What this lets you see, though, is how much each component contributes.
You can say, OK, my model gets 99 percent precision on spam, and it gets 98 percent precision if I turn off the sender reputation.
Well, that lets you see, OK, the sender reputation is responsible for one percent of my precision.
Now, it's important to be careful how you use this. You can use this for production decisions and for future work.
You do this ablation study and you discover, OK, the sender reputation is only contributing one percent,
or maybe it's contributing 0.1 percent, and it's really expensive in terms of compute time and engineer time to maintain. Maybe stop using it.
You could also use it for your future research work. What you can't do, particularly within the scope of one study,
is use the results of your ablation study to go back and revisit your model design. That gets you your leakage again.
Again, as I said, in the academic setting,
where we're doing multiple studies on the same data, we do get some leakage, and we carry it forward to the next study.
Again, we can't be perfect.
But there's a difference between the ablation study and the feature engineering. In the feature engineering, I'm trying a bunch of things, keeping some things,
not keeping other things, doing it with the tuning data. Things are feeding back; I'm not keeping careful firewalls.
In the ablation study, I have my top-line performance number. Here's my model, I ran it,
it got 99 percent precision, and then I'm trying to understand:
well, what are the drivers of that? I'm not putting it iteratively back into my loop,
going back and rerunning my stuff on my training data with it. I'm just using it to get knowledge to carry forward.
That doesn't cause leakage within the context of the specific study we're talking about.
It is accepted practice,
and it's a very useful practice for understanding the contributing factors to the performance of a complex model.
So to wrap up: inference for classifier performance is not immediately straightforward.
There are several helpful techniques that I've pointed you to here and in the readings. Be careful about data leakage.
But again, sometimes tradeoffs are needed.
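The retrain-without-one-component version of an ablation study can be sketched as a loop over feature groups. Everything concrete here is invented for illustration: the data is synthetic, and the group names and column assignments in `feature_groups` are hypothetical stand-ins for the spam detector's text, metadata, and sender-reputation features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=9, n_informative=6,
                           random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=3)

# Hypothetical grouping of columns into model components.
feature_groups = {
    "text": [0, 1, 2],
    "metadata": [3, 4, 5],
    "sender_reputation": [6, 7, 8],
}


def fit_and_score(cols):
    """Retrain on the training data using only `cols`; return test precision."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train[:, cols], y_train)
    return precision_score(y_test, model.predict(X_test[:, cols]),
                           zero_division=0)


all_cols = sorted(c for cols in feature_groups.values() for c in cols)

# Top-line number first, then turn each component off in turn.
results = {"full model": fit_and_score(all_cols)}
for name, cols in feature_groups.items():
    kept = [c for c in all_cols if c not in cols]
    results[f"without {name}"] = fit_and_score(kept)

for label, precision in results.items():
    print(f"{label:28s} precision = {precision:.3f}")
```

The drop from the full model's precision to each "without" row is the estimated contribution of that component; per the video, those numbers are for understanding and future decisions, not for going back and redesigning the model in the same study.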

📃 Statistical Significance Tests#

Read Statistical Significance Tests for Comparing Machine Learning Algorithms.

Note

In the Week 9 activity, we used the paired t-test for comparing the output of two regression models. Our use of this test did not violate the guidance in this reading — why is that?

For further reading, you can also see Approximate Statistical Tests.

🎥 Dates#

This video discusses how to work with dates in Pandas.

Video (8m34s)

Slides

CS 533 INTRO TO DATA SCIENCE, Michael Ekstrand: DATES
Learning Outcomes: Parse and transform dates. Adjust dates using date offsets. (Photo by J. on Unsplash)
Dates and Representations: Time moves forward at a constant rate (generally…). How we record it changes. Daylight savings time: the same hour happens twice. Key insight: time is different from representation. Typically: store time in monotonic form, translate for presentation.
Numeric Representations: Unix timestamps: time since (or before) midnight Jan 1, 1970; seconds, milliseconds, nanoseconds; reference point is UTC. Julian day numbers: days since January 1, 4713 BC; floating point stores time as fraction of a day. “1900 system” (Excel): days since Jan. 1, 1900.
String Representations: ISO: 2020-11-03; alphabetic sorts by date (for AD, until 10000). Localized numeric: US: 11/03/2020; EU: 03/11/2020. Longer: November 3, 2020.
datetime64: The Pandas datetime64 type stores dates (and times). Construct from number, units, and epoch: pd.to_datetime(230481083, unit='s') (seconds since Unix epoch); pd.to_datetime(3810401, unit='D', origin='julian') (days since Julian epoch). Construct from string and format: pd.to_datetime('2020-11-03') (convert from ISO); pd.to_datetime('Nov 3 2020', format='%b %d %Y').
timedelta: The Pandas timedelta type stores time offsets. Create from number + units or string: '1 day 00:30:22' is one day, 30 minutes, and 22 seconds. Mark advances in linear time.
DateOffset: DateOffset type stores date offsets to adjust calendar days. Create from number + units: pd.DateOffset(months=240). Correctly offset dates, even with underlying nonlinearities: months don't have the same length; DST, leap years, leap seconds. Not directly supported by Series; can use in 'apply': month_series.apply(lambda m: pd.DateOffset(months=m)).
Date Arithmetic: datetime + timedelta = datetime; datetime + DateOffset = datetime; DateOffset * num = DateOffset.
Comparisons: DateTime supports comparison operators (==, <, etc.). Need to create DateTimes on both sides.
Wrapping Up: Dates and times are typically stored internally using offsets from an origin. Pandas provides several date and time features, including datetime, timedelta, and DateOffset. (Photo by Bundo Kim on Unsplash)

So in this video, I want to talk with you about dates.
Learning outcomes are for you to be able to parse and transform dates and adjust dates using date offsets.
So first, I want to talk just briefly about the difference between a date as we say it, like,
OK, it's November the 3rd, 2020, and underlying time.
Dates do all kinds of funny things. When we change to or from Daylight Savings Time, we skip an hour or we repeat an hour.
But the underlying time stream doesn't repeat.
It's just that our way of mapping that to the way we write it down repeats.
So we can think of underlying time as moving forward at a constant rate.
Generally; there's relativity and all of those things.
But the time is moving forward, and how we record it changes and is complex and subject to a lot of rules.
The key thing is, like with text, where the text content is different from its encoding,
time is different from its representation.
One of the implications of this is that we typically store time in a monotonic form, like seconds since a particular date, UTC.
And then we translate for presentation. So you'll store the time as an offset in UTC, and then you'll translate
that to the local time zone, with all of the daylight savings rules and everything,
when you go to actually display it.
So internally, there are a few ways we can represent time numerically, and sometimes you'll need to do this yourself.
So one one is Unix timestamps, which is time since or before that can be negative.
Midnight, UTC, January 1st, 1970. Often this is stored in seconds.
Pythons like not in pandas or not pie, but Python, the Pathfinder standard library tends to do time and second floating point seconds since midnight.
The reference point, as I said, the reference point for this is UTC. You can also store at milliseconds or nanoseconds since that time.
If you have a data, if you have a data file that has a file, a column that's labeled as a timestamp and it's an end.
It's a number. There's a very good chance it's a Unix timestamp. That's very common way to store dates and times.
We can also store Julian Day numbers, which are days since January 1st.
Forty seven 13 B.C. And you can you you can store a time by using a floating point numbers,
it might be twenty two million, three hundred and seventy five point eight days.
There's also other origins. You can use a lot of different origins. Pandas actually lets you specify arbitrary origins.
But the nineteen hundred system that's used by Excel and other spreadsheets stores days since January 1st.
Nineteen hundred. So we can also store data strings, so the ISO format is year, month, day.
This has the nice advantage that at least until the year ten thousand.
It sorts by date. If you sorted Alphabet Alpha numerically, it's it's going to sort the resulting dates by date.
So if you're going to name files after dates, this with dates at the beginning of the file name.
This is super useful.
There are also localized numeric forms such as 11/3/2020, which is how we write dates in the United States. Europe and the UK
generally write day, month, year: 3/11/2020.
So if you see a date that's two digits, two digits, year, that's not enough information to know which day we're talking about.
Are we talking about November 3rd or are we talking about March 11th?
You need to know the country or locale the date came from to know how to correctly interpret it.
Sometimes you can infer it by looking for, say, November twenty-eighth,
because 28 isn't a valid month number, so that'll let you figure out which one you're dealing with.
But in this localized form, if you just get a date, it's often ambiguous.
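One way to see the ambiguity is to try both layouts and keep whichever ones parse. This is only a sketch: `interpretations` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def interpretations(text):
    """Return every reading of a numeric date that parses (hypothetical helper)."""
    layouts = [("month/day/year", "%m/%d/%Y"), ("day/month/year", "%d/%m/%Y")]
    readings = {}
    for label, fmt in layouts:
        try:
            readings[label] = pd.to_datetime(text, format=fmt)
        except ValueError:
            # This layout does not fit, for example a "month" of 28.
            pass
    return readings

# Both layouts parse, so the string alone cannot tell us the date.
ambiguous = interpretations("11/3/2020")
print(ambiguous)

# Only day/month/year parses, because 28 is not a valid month.
unambiguous = interpretations("28/11/2020")
print(unambiguous)
```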
You could also have longer string forms, like if you write out November 3rd,
2020. Pandas provides a type called datetime64 that allows you to store dates and times.
Even if you just have a date, you usually store it as a datetime with midnight.
At least that's how you work with it in pandas; pandas doesn't have a time-free date type.
You can create a datetime from a number, units, and an origin.
So you can say we want two hundred and thirty million
seconds since the Unix epoch, or we want three point eight million days since the Julian origin.
This function also accepts more than one value: the number can be a series or an array in addition to a single number,
so you can create a series or an array of pandas datetime objects.
You can also convert from a string. This also can be a series or array of strings.
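The number, units, and origin form described above can be sketched with `pd.to_datetime`. The Julian day number and the 1960 origin below are assumptions picked for illustration; the Julian value was chosen to fall inside pandas's default nanosecond timestamp range.

```python
import pandas as pd

# 230 million seconds since the Unix epoch (the default origin).
from_unix = pd.to_datetime(230_000_000, unit="s")
print(from_unix)

# A Julian day number; origin="julian" is used together with unit="D".
from_julian = pd.to_datetime(2459156.5, unit="D", origin="julian")
print(from_julian)

# An arbitrary origin: days counted from an assumed reference date.
from_custom = pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
print(from_custom)

# A Series of numbers converts to a Series of datetimes.
from_series = pd.to_datetime(pd.Series([0, 86400]), unit="s")
print(from_series)
```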
So in an assignment, if you've got a column that's string dates,
you can convert that to a column of datetimes by using these functions.
By default, it's going to parse the time from ISO format.
But you can also tell it to parse other time formats by providing a format string that describes how the time is laid out.
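A minimal sketch of both cases, using made-up string columns: ISO strings parse without extra hints, and a written-out form is parsed with an explicit format string. The `%B %d, %Y` layout is an assumption about how the strings are written, and it relies on English month names.

```python
import pandas as pd

# ISO-formatted strings need no format hint.
iso_strings = pd.Series(["2020-11-03", "2021-01-15"])
iso_dates = pd.to_datetime(iso_strings)
print(iso_dates)

# Written-out dates: describe the layout with a format string.
written = pd.Series(["November 3, 2020", "January 15, 2021"])
written_dates = pd.to_datetime(written, format="%B %d, %Y")
print(written_dates)
```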
There's a link in the pandas documentation to the way these format strings work; I also provide
that link in the notes that go with this video. So we've got datetimes. Pandas also has an object called a timedelta,
which stores a difference between two times. If you subtract one datetime from another,
what you're going to get is a timedelta.
You can create one from a number plus units, or a string that describes it; for example, I can create the timedelta one day,
thirty minutes and twenty-two seconds. The timedelta marks advances in linear time.
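Here is a short sketch of the ways to get a timedelta mentioned above: subtracting two datetimes, a number plus units, and a descriptive string. The timestamps are illustrative values.

```python
import pandas as pd

# Subtracting two datetimes yields a Timedelta.
start = pd.Timestamp("2020-11-03 08:00:00")
end = pd.Timestamp("2020-11-04 08:30:22")
diff = end - start
print(diff)

# From a number plus units.
ninety_minutes = pd.Timedelta(90, unit="min")
print(ninety_minutes)

# From a string, or from keyword components.
from_string = pd.Timedelta("1 days 00:30:22")
from_parts = pd.Timedelta(days=1, minutes=30, seconds=22)
print(from_string == from_parts)
```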
You can't create a timedelta, for example, of one month. The date offset is what you use to get one month.
You can create it from a number and units, and it correctly offsets the dates;
it knows whether it needs to extend by 30 days or 31 or 28.
It handles daylight saving time, it handles leap years, it handles leap seconds, and deals with being able to offset dates properly.
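As a sketch of the difference, the example below adds `pd.DateOffset(months=1)` to a few illustrative dates and compares it with a fixed 30-day timedelta.

```python
import pandas as pd

one_month = pd.DateOffset(months=1)

# Calendar-aware: the end of January moves to the end of February.
leap = pd.Timestamp("2020-01-31") + one_month      # leap year
non_leap = pd.Timestamp("2021-01-31") + one_month  # ordinary year
mid_month = pd.Timestamp("2020-03-15") + one_month
print(leap, non_leap, mid_month)

# A timedelta is a fixed length of time, so 30 days overshoots February.
fixed = pd.Timestamp("2020-01-31") + pd.Timedelta(days=30)
print(fixed)
```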
Date offset does not natively support series. Datetimes and timedeltas, pandas natively supports both in series.
You can, however, create an object series that contains date offsets.
So if the month series is a numeric series that contains numbers of months, then we can use apply.
It's a little slow, because it's effectively doing a Python loop.
But we can use apply to convert these numbers of months into date offset objects.
We get a series of those, which we can then add to a series of datetimes in order to produce offset datetimes.
For example, if we've got a column that has the term, the number of months the loan is for, and a column for when the loan was issued,
we can convert the months to a date offset, add it to the issue date, and find when the loan is due.
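The loan example might be sketched like this. The frame and its column names (`issue_date`, `term_months`, `due_date`) are hypothetical, and the warning filter is there because pandas may warn that adding an object series of offsets is not vectorized.

```python
import warnings
import pandas as pd

# Hypothetical loan data: issue date and term in months.
loans = pd.DataFrame({
    "issue_date": pd.to_datetime(["2020-01-31", "2020-06-15"]),
    "term_months": [36, 60],
})

# apply runs a Python-level loop, turning each number into a DateOffset.
offsets = loans["term_months"].apply(lambda m: pd.DateOffset(months=int(m)))

with warnings.catch_warnings():
    # Silence the possible "not vectorized" performance warning.
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    loans["due_date"] = loans["issue_date"] + offsets

print(loans)
```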
When you're doing arithmetic with dates, if you add a datetime and a timedelta,
you're going to get a datetime. A datetime plus a date offset is also a datetime. You can subtract as well as add.
As I said, if you subtract two datetimes, you're going to get a timedelta.
You can also multiply a date offset by a number, and it's going to give you another date offset that's multiplied.
So if you've got two months, you can multiply it by five and you'll get 10 months.
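These arithmetic rules can be checked with a few scalar values. This is a sketch with an arbitrary start date, chosen mid-month so that no end-of-month clipping is involved.

```python
import pandas as pd

start = pd.Timestamp("2020-01-15")

# datetime + timedelta -> datetime
later = start + pd.Timedelta(days=10)

# datetime +/- date offset -> datetime
shifted = start + pd.DateOffset(months=2)
earlier = start - pd.DateOffset(months=2)

# datetime - datetime -> timedelta
gap = later - start

# date offset * number -> date offset (two months, five times over)
ten_months = pd.DateOffset(months=2) * 5
much_later = start + ten_months

print(later, shifted, earlier, gap, much_later)
```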
You can also compare datetimes using comparison operators.
You do need to have datetimes on both sides. If you've got something that's, say, strings, you need to convert it to a datetime object.
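A small sketch of that conversion step, with an assumed cutoff string and made-up dates: the string is turned into a datetime first, and then the comparison operator is applied to the series.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-10-31", "2020-11-03", "2020-11-28"]))

# Convert the string to a datetime so both sides are datetimes.
cutoff = pd.to_datetime("2020-11-01")

# Element-wise comparison gives a boolean mask we can filter with.
mask = dates >= cutoff
on_or_after = dates[mask]
print(on_or_after)
```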
So then you can do the comparison. So in conclusion, dates and times are typically stored internally using offsets from an origin.
Usually we store them in UTC and then we translate them to local time
when we go to display. Pandas provides a number of functions and types for working with dates and times.
In addition, NumPy provides some of its own. I generally work with pandas, but NumPy does provide
timedelta and datetime objects that work just a little bit differently. Python also has them in its standard library. For our purposes,
I recommend generally sticking with the pandas ones.

CS 533 Fall 2022
