Week 11 — More Modeling (10/31–11/4)#
In this week, we’re going to learn more about model building, that will be useful in Assignment 5:
Feature engineering
SciKit-Learn pipelines and workflows
Regularization
Analyzing model results
🧐 Content Overview#
Element | Length |
---|---|
🎥 Video | 4m39s |
🎥 Video | 21m3s |
🎥 Video | 14m29s |
🎥 Video | 7m19s |
🎥 Video | 15m4s |
🎥 Video | 7m23s |
🎥 Video | 14m55s |
📃 Statistical Significance Tests for Comparing Machine Learning Algorithms | 3400 words |
🎥 Video | 8m34s |
This week has 1h33m of video and 3400 words of assigned readings. This week’s videos are available in a Panopto folder.
🎥 Intro & Context#
In this video, I review where we are at conceptually, and recap the ideas of estimating conditional probability and expectation.
- In this video, I'm going to introduce our week's topic of building and evaluating models, talking in more detail about how we go about doing that.
- The learning outcomes for the week are for you to be able to build and refine a predictive model,
- construct features for that model, apply regularization to control features and their interactions and give us models that generalize better,
- and to measure a model's effectiveness and its other behavior.
- So, where we're at right now: we've seen linear regression, which lets us do continuous prediction,
- where we want to predict a continuous outcome or target variable.
- We've seen logistic regression, which lets us take the concept of linear modeling and move it into the realm of binary classification,
- where rather than having a continuous outcome variable, we have a binary outcome, such as defaulted on the loan, or is spam or fraud.
- We've also seen the idea of optimizing objective functions: we might minimize a loss function such as the squared error,
- or we might maximize a utility function such as the log likelihood. These are equivalent to each other:
- if you've got a utility function and a minimizer, you can minimize the negative of the utility function.
- We've also seen that we can think about what we're doing with modeling as conditional estimation.
- In a regression model, we're trying to estimate the conditional expectation: given a particular set of values for my input features X,
- what's the expected value of Y? We might do some transformations to all these variables,
- but we're trying to compute this conditional expectation function:
- what's the expected value of Y conditioned on my feature values X? In classification,
- we're trying to solve a conditional probability problem:
- what's the probability of a particular outcome, given that I have some particular feature values X?
- There's another thing in here, though, that's useful to think about:
- logistic regression, at its heart, is trying to model the probability of your data.
- What we've been doing with statsmodels is
- we fit the model, we call predict, we get some scores, and then we use the scores to make a decision. But internally,
- the logistic regression is mathematically solving this problem of maximizing the log likelihood.
- Mathematically, what the logistic regression is doing is trying to build a probabilistic model of the data, and the
- parameters are estimated based on their ability to accurately model probabilities in your training data.
- We then use these output probabilities to make decisions: we'll say "success" if y-hat is greater than 0.5.
- SciKit-Learn's logistic regression classifies directly by using the threshold of 0.5,
- but you can get the estimated probabilities out of it with predict_proba (and raw scores with decision_function).
- This is important to note:
- the log likelihood that you get out of a logistic regression is not based on the actual decisions that it's making.
- It's based on its ability to model, probabilistically, what the labels look like in your training data.
- It's the probability that it assigns to those labels with the final fitted values of the parameters.
- I want to briefly mention again a trick that I mentioned, I believe, last week.
- Expected value and probability are closely related: the expected value is the integral (or the sum) of values weighted by their probabilities.
- But also, if we have an indicator function, which is one if
- X is in the set and zero if it is not,
- then, given a value, it decides whether or not that value is in the set. If that set is an event, it says whether or not the event happened,
- and the probability of the event and the expected value of the indicator function are the same thing.
- So we can think about everything as estimating a conditional expectation:
- when we're estimating a probability, we're estimating the conditional expectation of the characteristic or indicator function.
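Written out in symbols (my notation; the video only states this verbally), the identity behind this trick is:

```latex
\mathrm{E}[\mathbf{1}_A(X)] = \mathrm{P}(X \in A),
\qquad
\mathrm{P}(Y = 1 \mid X = x) = \mathrm{E}[\mathbf{1}\{Y = 1\} \mid X = x]
```

so a classifier that estimates a conditional probability is estimating the conditional expectation of an indicator.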
- So to wrap up: we're building models that estimate conditional probability and expectation.
- We've been doing this in a variety of ways, and we use these models to make decisions.
- This week we're going to see more. We've got the idea of doing the modeling; this week, we're looking more at how we build inputs for these models,
- and how we evaluate the outputs that we get out of them.
🎥 Feature Transforms#
What are some useful techniques for engineering features in an application?
- In this video I want to talk with you about transforming features.
- The learning outcomes are for you to be able to transform individual features, and also to derive new features and combine features.
- What we're talking about here applies to both classification and regression models.
- We've seen a few different things we can do with features already,
- just a little bit, such as dealing with categorical features by dummy-coding them.
- But I'm going to start by refreshing on some discrete feature transforms. If we have one feature and it's a discrete feature,
- there are a few things we can do with it. This is not an exhaustive list. We can recode it:
- we can rename codes; it might be that we've got the codes, but the names aren't very useful, so we want to rename them.
- It might be that we want to merge codes, because some distinctions are irrelevant.
- For example, in one of my datasets, for completeness and being able to track coverage across each stage of the data integration pipeline,
- there are four or five different ways a value can be unknown.
- But when it comes to doing my final computations, I just care if it's unknown.
- So I merge all of those codes into one "unknown" code. That's one thing you might want to do:
- merge some codes, so that your model doesn't have as many different codes to
- work with, because you don't care about the distinctions between some of them.
- You may want to convert a value to a logical or a zero/one numeric: maybe you're doing a recoding
- where you pick one value, or two values, that are going to be true, and everything else is going to be false.
- You may also want to threshold values. For example, if you've got an ordinal feature,
- maybe you have some ratings, and you want to collapse that into a categorical or logical feature of "rated positively",
- where if they gave it more than three out of five stars, you say it was rated positively.
- This can be really useful, because people are really noisy in their inputs,
- and you can reduce some of that noise by saying, you know,
- we don't care how good they said the movie was, we just care if they said it was good.
- This is kind of what Rotten Tomatoes is doing with its "percent Fresh":
- they take each rating and convert it into "did the person rate it positively or not",
- and they look at the fraction of critics who rated the movie positively, and that becomes the score.
- You can also dummy-code your values: if you've got a categorical value with more than two levels,
- then you can expand that out into multiple features that dummy-code, or one-hot encode, your variable (there's a small sketch below).
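Here is a minimal sketch of a few of these discrete transforms in Pandas; the data frame and column names are made up for illustration, not taken from the assignment.

```python
import pandas as pd

# hypothetical example data
ratings = pd.DataFrame({
    'status': ['Complete', 'Unknown', 'N/A', 'Missing', 'Complete'],
    'stars': [5, 2, 4, 3, 1],
})

# merge several "unknown-ish" codes into a single code
ratings['status'] = ratings['status'].replace({'N/A': 'Unknown', 'Missing': 'Unknown'})

# threshold an ordinal rating into a logical "rated positively" feature
ratings['liked'] = ratings['stars'] > 3

# dummy-code (one-hot encode) the categorical column
ratings = pd.get_dummies(ratings, columns=['status'])
```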
- You can also do a number of things to continuous features. You can take a log. You can take a square root.
- Both of these are useful for reducing skew. Sometimes you might want to take a square, or a higher-order power, or a higher-order polynomial,
- building out more features:
- you've got a feature, and its square, and its cube, and that lets you learn more complex, nonlinear functions using a linear model.
- You can also standardize and center variables, so they're what we call mean-centered:
- if you take the mean of a feature and you subtract it from all the feature values,
- the resulting mean is zero and the data is now mean-centered. And you can also discretize the feature:
- you can convert it to bins, or you can threshold it to convert it to a binary value
- based on whether it's above or below a threshold. So how do you think about when you want to do this?
- There are a couple of things that particularly drive when we might want to transform features.
- One is when the feature has a nonlinear relationship to our outcome variable:
- transformation of the feature and/or the outcome variable can make the relationship linear,
- and now all of our linear modeling techniques work again. Another is when the feature is not normally distributed.
- This isn't inherently a problem, but close-to-normally-distributed features often work better;
- their relationships are more likely to be linear.
- So if we have a feature that's very not normal, and there's a simple transformation that can make it close to normal,
- that's often going to make it work better as a feature, particularly as input to a linear model,
- but for other kinds of models as well. So, for one example,
- if we want to standardize variables, what we do first is subtract the
- mean feature value, which is going to make the new mean on the training data zero.
- What this means is that if you've got this mean-centered variable and you use it as a feature in a linear model,
- then when the value is average, you just have the intercept
- plus the other features, and the coefficient describes the change in the outcome
- as the variable goes above or below average, rather than as it goes with respect to zero, its natural value.
- Mean centering can also result in more interpretable intercepts,
- because if all of your features are mean-centered, then the intercept of your linear model is the average outcome value.
- If your features aren't mean-centered, then the intercept is that average,
- but corrected for the averages of your different features. So mean centering
- can make the model more interpretable.
- It's also useful for dealing with sparse data, because if you mean-center your values, then it's a lot more reasonable to treat missing values as zero.
- It's still a form of imputation, but if you mean-center your values and you have an observation come in that doesn't have a value,
- you can say, well, we don't know anything about it, so we're going to assume it's zero, which means we're going to assume it's average.
- Since you've mean-centered, a value of average means the coefficient on that feature plays no role in the outcome prediction.
- That's also very important: mean centering really gives you this way to allow missing data to not have an effect
- in your model, because you're assuming it's average, and average has a zero contribution;
- the prediction is going to be based completely on the values of the other features in the observation.
- It's not a perfect solution for all systems, and you can't just blindly assume it's going to work, but it is a really useful technique.
- The other thing we do in a full standardization is divide by the standard deviation.
- We have a value x_i in our input feature; we subtract the mean and divide by the standard deviation, and we get the transformed value z_i = (x_i - x̄) / s.
- Now the coefficients on this in a linear model are going to be in units of standard deviations:
- if the feature changes by one standard deviation, how much does that change the output?
- If we standardize, this also makes our coefficients more directly comparable,
- because if all of our features, or all of our numeric features, are standardized,
- then all of the coefficients are in terms of standard deviations.
- You can say how much the outcome changes if this feature moves by one standard deviation,
- and how much it changes if that feature moves by one standard deviation, and you don't have to deal with "this one was in millimeters and this one's in grams;
- how do we think about the relative impact of these two features?" You can't easily do that when they're in their natural units;
- it's much easier if they're in terms of standard deviations.
- And this applies to both inference and predictive modeling.
- Now, it's important to note that the parameters here, the mean that we're going to shift by
- and the scaling factor, are parameters that we learn from the training data.
- And when we want to transform the test data, we need to transform it by the training data's parameters,
- because effectively this is part of the training process: you learn,
- okay, I've got my coefficients, but also I normalize this feature by subtracting seven and dividing by three.
- Well, that's going to change a little bit with different training data.
- So you treat these as learned parameters and use the same values for transforming your test data (see the sketch below).
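A minimal sketch of this with SciKit-Learn's StandardScaler, using made-up synthetic data; the point is that fit_transform learns the mean and standard deviation on the training data, and transform reuses those parameters on the test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.exponential(scale=3.0, size=(500, 2))          # made-up, skewed features

train_X, test_X = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_Xs = scaler.fit_transform(train_X)   # learns the mean and SD from the training data
test_Xs = scaler.transform(test_X)         # reuses the *training* parameters on the test data
```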
- I want to show you an example of why you might want to do a log transform, even for a binary outcome.
- Here, the outcome variable is on one axis,
- and I've shown a box plot of our input variable as it changes for the two values of the outcome variable.
- We've got some much larger values in one group. What's going to happen with these larger values?
- Yes, it's useful that the mean, or the median, is higher for one class,
- so for higher values of the feature, the outcome is more likely to be a one.
- But we have some very, very large values,
- and one of these values is way up at around fifty.
- You put that in your linear model, and if, on the off chance, that observation is a zero,
- it's going to push the model output so far
- that it's impossible for any other features to do anything; these larger values can completely dominate your computation.
- If you log-transform, what you're going to do is significantly decrease the skew.
- It's not going to make our distributions perfectly symmetric, but there is going to be substantially less skew
- and a substantially smaller range; the values are a lot more comparable to each other.
- On the log scale, we are looking at a difference of about two here rather than about eight,
- and we don't have the massively large values; the top value here is only about four.
- So it really is going to make the values a lot better distributed.
- If you've got heavily skewed data, it's worth trying both a log transform and a square root transform;
- depending on how precisely the data is skewed, one might work better than the other. Either way, you decrease the skew, you decrease the range,
- and the values are a lot more contained; the extreme values are collapsed down to a much more manageable range.
- But we haven't lost the fact that a difference in this value is going to correlate with a difference in outcome.
- We haven't lost the ability to use it to distinguish; we've just compressed that ability down into a more reasonable range of values,
- so the resulting mathematical model is going to be better behaved.
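A small sketch of comparing these transforms on made-up skewed data (the lognormal values here are just a stand-in for the feature in the box plot):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
vals = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=1000))   # made-up, heavily skewed

log_vals = np.log1p(vals)     # log1p tolerates zeros; use np.log if values are strictly positive
sqrt_vals = np.sqrt(vals)

# the skew should drop substantially under either transform
print(vals.skew(), log_vals.skew(), sqrt_vals.skew())
```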
- Another example is discretization. Sometimes this might come from outside knowledge.
- One of the hints I've given you in the assignment is that you might want to discretize the loan term,
- so that a term greater than or equal to 240 months is considered a real estate loan.
- We know that's a reasonable thing to do from the reading that came with the data set, which explains what happens with the real estate loans.
- You might also, as I said earlier, go from "greater than three stars" to "I liked it";
- it can reduce the noise in the rating data. One sign that you might want to discretize is
- if there's a nonlinear response with a sharp change. It might be that, rather than
- trying to fit a continuous model, when there's this really sharp jump, a really sharp change at a particular point,
- you turn that into a binary feature. Sometimes what you might want to do is just split the data in half,
- so your median becomes your threshold; or you might look for an inflection point or a sharp increase in the response curve with respect to some other variable.
- One thing to note, though, is that discretization can have very subtle effects on your model performance.
- You have to be careful with it. Whatever you're measuring about your model,
- make sure you measure it after you change your discretization, to see how the results change.
- There are other transformations you can do as well. A Box-Cox or power transformation learns a monotonic function of the data
- that transforms your points into something that's close to normal.
- Effectively, the parameters of this transformation are learned by optimizing an objective function
- that minimizes the distance from normality of the distribution of the resulting data.
- SciKit-Learn gives you methods for doing Box-Cox and other power transformations. Spline functions
- allow you to learn complex functions of a single variable that don't have to be monotonic.
- We're not going to touch on them; I just want you to know that they exist.
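For reference, a minimal sketch of SciKit-Learn's PowerTransformer on made-up positive, skewed data; Box-Cox requires strictly positive values, while the default 'yeo-johnson' method also handles zeros and negatives.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.random.default_rng(1).exponential(size=(200, 1)) + 0.1   # made-up, strictly positive

pt = PowerTransformer(method='box-cox')
X_t = pt.fit_transform(X)    # learns the power parameter, then transforms toward normality
```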
- Sometimes, though, we're going to need to deal with multi-feature normalization, where we have a group of related features.
- These might be counts of different tags,
- like how often users have tagged an item with each of several different tags.
- Sometimes it's going to be useful to normalize those together, so that they either sum to one or they form a unit vector,
- where the sum of squares is equal to one. For the unit-vector version,
- it puts all of them on a unit hypersphere, so you can compute similarities between them easily,
- and distances become more normalized. Think about word counts or tag counts:
- say you've got ten different tags and you allow users to tag an item
- with those tags, or you have the Facebook emoji responses and users can respond with five or six different emojis,
- and your features are how often people have responded with each of these emojis.
- Well, there are two components to each of those features,
- and one of those components is how many people have interacted with the item at all, because if a million people interact with one message,
- a thousand people interact with another message, and ten people interact with a third,
- and the same fraction of them use the "wow" emoji,
- the feature values are going to be dominated by the popularity, not by
- how much it's "wow" versus how much it's "care" or how much it's "cry".
- If you make them sum to one, or if you turn them into a unit vector, then these features are no longer proxies for popularity;
- they are specifically "what fraction of the interactions are wow, or cry, or heart, or care?"
- You probably also want to keep popularity as a feature, and
- there's a good chance you want to take the log of it. But normalizing really changes the meaning of the feature:
- "how many cares" is a different feature from "what fraction of responses were care",
- and depending on your modeling task, one of those might be a more useful feature than the other.
- And so you can get this multi-feature normalization that you want to do across a group of features together (there's a small sketch below).
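A minimal sketch of this kind of row-wise normalization in Pandas; the reaction counts and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical per-item reaction counts
reacts = pd.DataFrame({'wow': [120, 3, 1], 'care': [30, 1, 2], 'cry': [50, 2, 8]})

totals = reacts.sum(axis=1)
fractions = reacts.div(totals, axis=0)                               # each row now sums to 1
unit_vecs = reacts.div(np.sqrt((reacts ** 2).sum(axis=1)), axis=0)   # each row is a unit vector

popularity = np.log1p(totals)    # keep (log) popularity around as its own separate feature
```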
- Another thing you can do with multiple features is an interaction term, which we've seen: a product.
- You can combine the effects of two features by multiplying them together. If they're both numeric, then it's just the product of them.
- If one of them is logical, or is the dummy for a categorical, then effectively what it does is turn the other feature on or off.
- In your linear model,
- the interaction coefficient is the additional influence of x2 when x1 is equal to one;
- when x1 is equal to zero, the model just uses the base term
- for x2. The effects are additive: when x1 is one,
- the result includes both the base term for x2 and the interaction term,
- so the total coefficient
- on x2 is the base coefficient plus the interaction coefficient.
- And that's how you'd interpret the coefficients.
- This is a really useful way to allow a feature to have an additional
- linear effect for some of your data points and not for others (the formula below writes this out).
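Writing the interaction model out in symbols (my notation for the coefficients the video describes):

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2
```

When x1 is a 0/1 indicator, the slope on x2 is β2 for the observations with x1 = 0, and β2 + β12 for the observations with x1 = 1.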
- Another thing you can do is compute a ratio or a fraction.
- One example of this: suppose we're trying to model something about what happens with students in a class. Again, this is a useful thing to do
- when you've got something that's dominated by popularity. You have to be careful with counts:
- counts from user activity often become really heavily skewed,
- and if you've got anything that's a count of something, it's going to be dominated by popularity.
- It's often going to be more effective to separate popularity out as its own feature,
- distinct from how much of that popularity is being allocated to different things in your model.
- So if x_S is the number of students in a class and x_F is the number of first-year students in a class,
- and we want to understand something about
- the impact or dynamics of a class based on how many first-years are taking it,
- well, a small class is going to have a small number of first-years just because it has a small number of students.
- So we might take a ratio or proportion:
- we might compute x_FF, for "fraction first-year", which is the number of first-years divided by the total students.
- That gives us the fraction of students who are first-years, and it allows us to build a model that, rather than having
- two proxies for popularity (number of students and number of first-years),
- separates them: what is the influence of having a lot of students in the class, as one component of our predictive
- model, and, separately, what is the influence of having a large portion of the class be first-year students?
- It also reduces collinearity, because x_S and x_F are going to be quite highly correlated,
- but x_S and x_FF probably won't be,
- at least not unless larger classes are more likely to have a higher fraction of first-years.
- So it's another piece of the tool bag. In Assignment 5,
- I give you a suggestion to possibly consider computing a ratio or a fraction feature (see the sketch below).
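A minimal sketch of a ratio feature in Pandas, with made-up class-roster columns (not the assignment's actual fields):

```python
import numpy as np
import pandas as pd

classes = pd.DataFrame({'n_students': [250, 30, 80], 'n_first_year': [200, 3, 10]})

classes['frac_first_year'] = classes['n_first_year'] / classes['n_students']
classes['log_students'] = np.log1p(classes['n_students'])   # keep popularity as its own (logged) feature
```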
- There are other combinations we can do too, such as the difference. If you have a list of actions, say
- forum posts or account transactions, and you take the activity date minus the account's creation date,
- then what you get is the account age at the time of the activity. If you're trying to classify something, and you're trying to understand,
- okay, does this happen more with new accounts, it allows the feature to be, rather than "when did the thing happen",
- "how established was the account when it happened".
- You then might want to discretize that: is this account at least a year old?
- It might be that established accounts are going to have different behavior than new accounts.
- You can also combine these
- with single-feature transforms, so you can take a product where one of the features is logged or square-rooted or whatever.
- You can combine these things in arbitrary ways to get the final set of features that you need.
- Actually figuring out which of these you need to do takes practice and creativity.
- I've given you a few hints: look to try to make things normal; look to try to build linear relationships;
- look for hard jumps, like if you plot an X value against the Y response and you see a jump at some point,
- that suggests discretization might be useful. One thing that's super useful, though, is to read other data
- scientists working in your domain and in other domains. If you're doing research, you should be reading papers.
- Pay attention to what they do in their feature engineering. What features do they pick?
- Why do they pick them? That can give you a lot of good ideas for what to do when you're doing your own projects.
- Also, it's important to note that you do all of this feature exploration and design on the training data.
- You don't get to look at your testing data while you're doing your feature exploration and development.
- So to wrap up: transforming and building features is a really important and powerful part of model building.
- One thing some deep learning models can do is some of their own feature engineering;
- they can work with raw features and learn sophisticated functions of them somewhat on their own.
- That works very well for some domains, but for a lot of simpler models,
- the model can only work with the features you give it.
- We're going to see some techniques for automatic feature selection, but they can't generate new features:
- if you give them a product, they can decide whether or not it's actually helping the model,
- but they can't create the product if you didn't give them the product feature to start with.
- So it's important to get your features right and to give your model a good set of features to work with,
- even when you're doing automated feature selection.
🎥 Workflow#
How do you do feature engineering and model selection in a machine learning workflow? What is the iterative process involved?
- In this video, I want to talk more about the workflow and the iterative process of model building and refinement.
- We'll talk about how to properly split training, tuning, and evaluation data,
- and understand better what is and is not cheating when evaluating a predictive model.
- So here's our setup. We have our main dataset, all the data,
- and we split it into training data and testing data. Then, on our training data, we train our model.
- We experiment with different model designs and different features. We select hyperparameters.
- We can do this based on the model's internal goodness-of-fit statistics:
- if you're training a linear regression model, you can be looking at your R-squared,
- your adjusted R-squared, or your AIC; for a logistic regression model,
- you can be looking at your log likelihood. Or you can do it by testing, by running a classifier evaluation metric on some tuning data:
- you further subdivide your training data into train and tune,
- or you may do cross-validation, where you split your training data into five or ten pieces,
- and for each piece, you train on the rest of the data, predict that piece, and measure your metric.
- You can do all of these things, basically, so long as you don't touch
- your testing data. You can do whatever you want with your training data to better improve and understand your model.
- Well, not all things are reasonable to do, but you're not cheating with whatever you do there in your training data (a small sketch follows).
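A minimal sketch of that setup with SciKit-Learn, on made-up synthetic data: hold out test data first, then tune using either a train/tune split or cross-validation inside the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=42)   # stand-in data

# hold out the test data first, and leave it alone until the very end
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)

# option 1: a further train/tune split inside the training data
tr_X, tune_X, tr_y, tune_y = train_test_split(train_X, train_y, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000).fit(tr_X, tr_y)
print('tuning accuracy:', model.score(tune_X, tune_y))

# option 2: cross-validation on the training data only
print('CV accuracy:', cross_val_score(LogisticRegression(max_iter=1000), train_X, train_y, cv=5).mean())
```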
- What you can't do is use knowledge from the testing data to refine
- your modeling process, and this includes exploratory analysis of the testing data. The idea here,
- the motivation of what we're trying to do with this predictive
- modeling, is to build models that are going to be able to process new data.
- Predicting our testing data isn't the point. If you're training something to detect
- fraudulent transactions in your online gaming platform,
- the purpose of your model is never to predict the fraudulent transactions in your historical data.
- The purpose of the model is to be able to run it and, as new transactions happen, categorize them as likely fraud or not.
- So the goal of our evaluation is to simulate the model's ability to generalize to new data that it hasn't seen yet,
- and the way we do this is we hide some of the data and pretend it's new.
- As soon as you use this data that's supposed to be new,
- if you're simulating what's going to happen if you run this for a week and try to classify the new transactions,
- what you're doing is giving the model, or the modeling process, data that it's not going to be allowed to have in real life.
- We call this leakage: information leaks into the model that, in its actual application,
- it's not going to be able to have. In some ways, it's the opposite of the problem we have when we give you tests in class.
- In tests, we say you can't have a textbook, you can't have notes, you can't use the Internet; answer these questions.
- But in real life, you can use all of the reference material you want,
- anytime you actually have to solve that problem. In practice, there's still value in internalizing
- a lot of it, because if you haven't internalized a lot
- of the knowledge, it's hard to detect when you're going to need to look something up.
- If you don't know that overfitting is a problem, then you don't know when you need to go read more about overfitting;
- you just don't even think about it.
- But when it comes to actually doing things, you have all these resources available. In machine learning,
- we have the opposite problem. Because in real life
- the model is not going to have access to the test data: you're trying to use it to classify new transactions as they come in,
- you're trying to use it to predict the purchasing behavior of users as they come in,
- you're trying to use it to forecast the load that's going to be on your power grid or on your transportation network at a time in the future.
- And you don't get to look ahead and see any of that information. So in the real world,
- your model does not have access to any information about what it's trying to predict, other than what it can learn from historical data.
- And so, if you do anything with your test data that leaks information about the unknowns
- it's supposed to be predicting into your model building process,
- either learning the model itself or the process of figuring out what features, parameters, and values are going to be useful for your model,
- then you effectively allow the model to cheat.
- It's going to get better performance in the evaluation than it actually will in reality,
- and you reduce the ability of the evaluation process to simulate what you actually care about:
- can my model effectively predict how much traffic is going to be on the freeway in December,
- based on previous Decembers and on the dates earlier in the year? Let's say we've got ten years of traffic data:
- can I accurately predict what the freeway traffic is going to be this December?
- You don't get to look at this December. If you do, I think the physics department would like to have a word with you.
- So, within this setup, we have an iterative modeling process.
- With our training data, we can do exploratory analysis. We can try features and transforms.
- We can try different hyperparameters. Speaking of hyperparameters:
- parameters are what we learn in the model (your logistic regression coefficients are parameters;
- we learn them from the data), while hyperparameters are additional values that control how the model learning process works.
- We can try different models, like a logistic regression or a random forest.
- We can test effectiveness with a tuning set, so we can take our training set and split it into tuning data and the rest of the training data.
- We can do cross-validation, as I talked about, where we split the training data into many separate pieces.
- Some SciKit-Learn models have built-in selection for some of their hyperparameters using cross-validation;
- once you've seen regularization,
- you can tell LogisticRegressionCV to automatically find the regularization strength using cross-validation on your training data.
- Then, though, you need to apply the model to your test data, and there are a couple of things here.
- First, you need to apply your feature transformations and combinations to your test data.
- You have to apply them to the test data because your model, your linear model or whatever, is built on these
- transformed features, these combined features, all of your feature engineering;
- the results of that are what the model is trained on. If you just try to apply the model to the raw data, it's not going to work;
- it's not going to have the features it needs. But the difference is: you apply the feature transformations to the test data,
- but you don't use the test data to assess which feature transformations are useful.
- You did all of that on the training data, and you take it as a pattern or a recipe.
- I'm going to show you in a future video how you can chain SciKit-Learn pipelines together to automate some of this.
- If you aren't using SciKit-Learn, you might write a function that, given raw data,
- will return transformed features, or will return data with the final set of features.
- That's a very good design as well. You just take this as a pre-canned recipe and you apply it to your test data.
- Then you run the model, predict the test outcome data, and you measure accuracy, precision, area under the ROC curve,
- whatever measure of model effectiveness you're going to use, on those results.
- The outcome of the iterative modeling process in the preceding slide, though,
- is one model, or possibly one model from each of a few different families,
- that you want to finally evaluate for effectiveness using the test data (the sketch below shows the recipe-function idea).
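Here's a minimal sketch of the recipe-function idea for when you aren't using pipelines. The 'amount' column and the exact features are made up for illustration, apart from the 240-month term threshold mentioned in the assignment hint; the key point is that any learned parameters come from the training data and are reused on the test data.

```python
import numpy as np
import pandas as pd

def prepare_features(df, params=None):
    """Apply the feature recipe; learn parameters on training data, reuse them on test data."""
    feats = pd.DataFrame(index=df.index)
    feats['log_amount'] = np.log1p(df['amount'])
    feats['is_real_estate'] = df['term'] >= 240
    if params is None:   # training pass: learn the standardization parameters
        params = {'mean': feats['log_amount'].mean(), 'std': feats['log_amount'].std()}
    feats['log_amount'] = (feats['log_amount'] - params['mean']) / params['std']
    return feats, params

# train_feats, params = prepare_features(train_df)
# test_feats, _ = prepare_features(test_df, params)   # same recipe, training parameters
```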
- So, to synthesize what I've been talking about here:
- it's fine to split the training data into additional subsets.
- You can do train/test splits within your training process, as an iterative process, to figure out:
- does this feature, or this feature transformation, give me a more accurate classifier?
- You do a train/tune split, add the feature, and measure the accuracy on the tuning data.
- Does it help? Does it not? Does a square root give me better classifications, or does a log give me better classifications?
- Do that on splits of your training data and leave the test data alone.
- Put it on the shelf, lock it in the cupboard, whatever you're going to do with it. You can iteratively refine the model's predictive quality,
- and you can explore and test all of your features using, as I said, cross-validation or
- tuning splits of your training data.
- That allows you to use predictive accuracy as part of your decision for which features to include, how to transform them,
- how to construct new combinations, etc. Then, once you take your model and run it on your test data,
- you don't get to go back and fix the model if it performs poorly on the test data.
- You needed
- to do all of those fixes earlier, because as soon as you say, oh, it didn't perform well on the testing data, let me go back and fix something,
- then you're giving your model development process access to information it doesn't have in reality,
- and testing on that test data is no longer a reliable test of what's going to happen when your model meets new data in the field.
- You also can't use the test data to inform model or feature decisions, at least within the scope of one project.
- Within your project, you evaluate on your test data and you learn things from that;
- you're going to publish a paper
- (if you're doing this for graduate research), and the results of that learning you're going to carry into the next project.
- Arguably, that can induce a little bit of leakage, because you or someone else might work on the same dataset:
- they read your paper, which drew on your test data, and then they make a new train-test split of the same dataset and want to do things.
- Arguably, we have some leakage there. We can't plug all the leaks. The goal is not to be perfect;
- the goal is to have a good and credible emulation of the actual production environment for what
- we're trying to do, so that we have an effective test of our model's ability to do its job.
- And its job is almost never to classify preexisting data.
- The train-test split is how we study the model's effectiveness,
- but that's not how we deploy the model to improve our lives and improve our businesses.
- Production systems often have new streams of test data coming in every day.
- If you're running an online
- shopping site, or if you are monitoring quality control processes in a chip fab, once you've used today's test data,
- things keep running next week and next month, and so you can use knowledge from today's test data.
- So you run things: you predict the month of October, you predict last December,
- you do these tests of your model's effectiveness. Then you run it,
- and you're predicting this coming December. It's fine to use what you learn about predicting this December for next December.
- There's no cheating: you aren't seeing next December.
- You're using the year-over-year, month-over-month, etc. trends to be able to predict what's going to happen next December.
- So in the industrial setting, because we can continually acquire new test data,
- the iterative problems that happen in academic research with a static dataset are significantly less of a problem, because,
- okay, I learned something from this project; what am I going to do on the next
- project? Well, for the next project you have new test data, because your chip plant has been running for another few weeks.
- You can use some of that data, along with what you learned from the data you captured before; that isn't cheating at all.
- It is a problem in academic research, where we have a static dataset.
- The MovieLens data, the RecSys data, or whatever it is; pick your dataset:
- we're all working with test sets on the same data set. I read your paper,
- and that's effectively a form of leakage. That's not a hole, as I said, that we can completely plug.
- So we carry knowledge forward, and we use what we learn from our test data for the next
- project; in production, that next project has new test data that comes from new runs of the system.
- As I said, it's a technical violation, but if we pick a new test sample, it's less of a problem.
- If we're all using the same test set, then we have a real problem,
- but we don't often have a choice when we're trying to do academic research with these datasets.
- So, to wrap up: train-test splits are there to help us test the model's ability to predict future, unseen data that it didn't have a chance to learn from.
- What we're doing here with this predictive modeling is machine learning: the system is learning about the data
- so that we can generate predictions for future data, and we test it by giving it data it hasn't been able to see. Using test data knowledge and going back,
- making a loop and using knowledge from our test data to inform our modeling decisions, breaks down that barrier,
- and it means testing on our test data is no longer an effective test of the model's ability to generalize to data
- it wasn't able to learn from. This isn't a problem within your training process,
- with the train/tune split: it's fine to train, try something on the training data, see how it does on the tuning data,
- go back, and keep doing that with the same tuning data, because we're not using the
- tuning data results as the conclusive evidence of our model's effectiveness;
- we're just using them for our own internal debugging, where we want to see which model works better.
- Once we do that with the test data,
- we're not allowed to go back, because otherwise we're optimizing for our ability to predict that specific set of test data,
- and what we want is for predicting that specific set of test
- data to be representative of predicting data that we haven't been able to see yet.
🎥 SciKit Pipelines#
In this video, I introduce SciKit pipelines that put multiple transformations together.
- In this video, I'm going to introduce SciKit-Learn transformers and pipelines, which are going to allow
- you to put your feature transformation and your modeling process into one pipeline that's
- reproducible across your training and your test data. The learning objectives are for you to
- be able to use a SciKit-Learn pipeline to combine feature transforms and prediction.
- Our modeling often takes the form, conceptually, of a pipeline:
- we're going to transform some features, then we're going to fit a model,
- and then, in prediction, we need to transform the features and generate the predictions.
- In both of these stages you have the transformation, and the transformation may have parameters; for example, the standardization that we talked about:
- we have to learn the mean and the standard deviation that we're going to subtract and scale by in order to do the transformation.
- And in the test data, we want to transform by the training data properties, not the test data properties, for two reasons.
- One, they might be different. Two, in actual production,
- you don't necessarily have a whole batch of test data.
- If you've got a new user coming to your online shop, and you want to
- predict which of the three specials they're most likely to be interested in,
- you just have this one customer coming in, and you need to be able to transform their features to put them into the model;
- the way you do that is you use the transformation parameters that you learned from the training data.
- So, SciKit-Learn has an object called a Pipeline that allows us to create a sequence of models.
- Typically this is one or more transformer algorithms or models, followed finally by a regressor or a classifier or some other kind of model
- that produces the output.
- There are other things you can put in the middle, like matrix decompositions, that we're going to see a little bit later.
- If you put this into a pipeline and then you fit the pipeline (the pipeline exposes the fit method),
- it will fit its inner models and it will transform the data through them in sequence.
- So if you've got a pipeline,
- and it has a transformer, and it has a classifier,
- when you call fit, what it's going to do is: one, fit the transformer to the data;
- two, transform the data using the parameters it just fit;
- and three, fit the classifier on the transformed data.
- So it automates this process of managing your data pipelines (see the sketch below).
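A minimal sketch of such a pipeline on made-up synthetic data; fitting it runs exactly the fit / transform / fit sequence just described, and scoring on test data applies the training-data scaling automatically.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)   # stand-in data
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),          # transformer: fit learns the mean/SD, then transforms
    ('classify', LogisticRegression()),   # final estimator: fit on the transformed features
])
pipe.fit(train_X, train_y)                # fit scaler -> transform -> fit classifier
print(pipe.score(test_X, test_y))         # test features are scaled with the *training* parameters
```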
- To talk a little bit about transformers: the SciKit-Learn use case we've seen so far is that we train something on data with fit,
- where we give it our input features and our output class, and then we generate predictions with predict. Transformers
- add another function to this paradigm:
- transform. Some models can do both transformation and prediction, but transform returns a copy of your input data with the features adjusted.
- So for the standardization transformer,
- fit computes the mean and the standard deviation, and then transform
- returns (x minus the mean) divided by the standard deviation.
- It does this separately for each column:
- for each of your input features, it's going to learn a separate mean and a separate scale.
- That's what fit and transform do.
- So if you fit the transformer and then you transform the data, you can then use the transformed data as input to the next stage in the pipeline,
- another transformer or your final classification or regression model.
- What if you want to transform your columns differently? If you have a transformer, it's going to apply to every column in its input. The ColumnTransformer
- allows you to apply different transformations to different columns in your input data.
- It's also one of the few SciKit-Learn classes that actually knows about Pandas data frames.
- You give it a list of triples: name, transformer, and columns,
- and it will learn this transformer for these columns and that transformer for those columns.
- Then there's a remainder option, where you can say either a transformer to apply to all remaining columns, or drop them; there are some other options as well.
- So what you can do, if you've got, say, three different categorical features you want to do something
- to, and you have a number of numerics, is say:
- okay, here's one transformer for one of the categorical columns, another transformer
- for another one, and then, for the remainder, just standardize all my numeric variables.
- The ColumnTransformer lets you do that conveniently. I'm going to refer you to the documentation;
- I've got links to the documentation in the notes for this week.
- I'm going to refer you to that to learn more about how to apply column transformers, but they allow you to transform columns differently (see the sketch below).
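A minimal sketch of a ColumnTransformer; the data frame and column names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical data frame
df = pd.DataFrame({
    'purpose': ['auto', 'home', 'auto', 'other'],
    'state': ['ID', 'OR', 'ID', 'WA'],
    'amount': [5000.0, 250000.0, 12000.0, 800.0],
})

ct = ColumnTransformer(
    [('cats', OneHotEncoder(handle_unknown='ignore'), ['purpose', 'state'])],
    remainder=StandardScaler(),   # everything not listed (the numerics) gets standardized
)
X = ct.fit_transform(df)
```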
- Some of the useful transformers that SciKit-Learn gives you: the StandardScaler, which standardizes variables;
- the PowerTransformer, which does Box-Cox-style power transformations; the Binarizer, which
- converts numeric data to zero/one by applying a threshold (one if the value is greater than the threshold); and the OneHotEncoder,
- which will take a categorical variable. A transformer is not limited to just returning one output column;
- it can expand a column into multiple columns. So the OneHotEncoder
- will take your categorical column and return multiple columns,
- dummy-encoding the categorical variable. And then there's the FunctionTransformer:
- you can give it an arbitrary function that it will use to transform the data.
- These transformers, though, only apply to features. If you also need to transform your outcome variable,
- the TransformedTargetRegressor class is what you need to use, and it does not go into a pipeline as a transformer step.
- You could use it as the last stage of a pipeline,
- but really it wraps an underlying predictor:
- you pass a predictor and a transformer in its constructor parameters, and it transforms the target before calling the fit method;
- then, when you call predict, it un-transforms the results, so you get the results back out on the original scale (see the sketch below).
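A minimal sketch of TransformedTargetRegressor, fitting a linear model to the log of a made-up positive target and getting predictions back on the original scale:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=3, noise=5.0, random_state=0)
y = np.exp(y / 100)    # contrive a positive, skewed target for the sketch

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log, inverse_func=np.exp,   # fit on log(y); predict un-transforms automatically
)
model.fit(X, y)
preds = model.predict(X)                # already back on the original scale
```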
- So, to wrap up: pipelines let us combine multiple data steps into a single operation.
- One of the things this is really useful for is being able to apply your training data transforms to your test data.
- When you fit the whole pipeline, the transformers are going to learn their parameters from the training data;
- you then go apply the pipeline to the test data, and it just does the right thing for you automatically.
- Now, one thing you have to do throughout your work with SciKit-Learn is pay very close attention to defaults.
- The defaults are not always what you expect.
- You need to pay close attention to them in order to make sure the model is doing exactly what you think it's doing.
📃 SciKit Learn Pipelines#
📃 SciKit Learn Preprocessing#
🎥 Regularization#
This video introduces regularization: ridge regression, lasso regression, and the elasticnet. Lasso regression can help with (semi-)automatic feature selection.
- Now it's time for a topic that I've mentioned a few times, and now we're actually going to learn what it is: regularization.
- The goals here are for you to understand the function of a regularization
- term in a loss function, apply regularization to your logistic regression models, and then finally tune regularization parameters.
- I want to start by reviewing multicollinearity. Remember that if we have correlated predictors, that can cause poor model fit.
- If we've got x1 and x2 and they both affect Y, and we've got this correlation between them,
- we don't know where their common effect belongs.
- We can factor this out as a common component x12 plus individual components x1 and x2,
- except we don't actually observe x12; it's hidden behind the wall.
- Where does its value go in the coefficients?
- Does it go on x1? Does it go on x2? Do you split it between them?
- The linear model itself has no way to determine where the common component should actually be allocated.
- So one way we can deal with this, and several other problems, is by introducing what we call regularization.
- Rather than just solving the problem "minimize the loss function",
- where, if this is a linear regression, the loss might be the squared error,
- or it might be the negative log likelihood (log likelihood is a utility function, so we take the negative log likelihood to get a positive value,
- because the negative of a utility function is a loss function: you want to minimize your negative log likelihood),
- what we do is we add to that another term, which we call the regularization term.
- All it is, is a regularization-strength parameter times
- the magnitude of our parameters. The loss function now has two terms: the error, and the magnitude of the coefficients.
- When we use the squared magnitude here, we call it ridge regression (the formula is sketched below).
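Written out (my notation, matching the description above; the intercept is typically left out of the penalty), the ridge objective adds λ times the squared L2 norm of the coefficients to the error loss:

```latex
\hat{\beta} = \arg\min_{\beta} \; \sum_i \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_2^2,
\qquad
\lVert \beta \rVert_2^2 = \sum_j \beta_j^2
```

For logistic regression, the first term is the negative log likelihood instead of the squared error.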
- A quick detour on some of the notation: I'm using a norm as a measure of the magnitude of a vector.
- The L2 norm, which is indicated with the subscript two, is also called the Euclidean norm.
- What it is, is the square root of the sum of the squares of the elements of the vector.
- If you take the L2 norm of y minus z, that's the Euclidean distance between y and z;
- if they're two-dimensional vectors, it's the straight-line distance between them.
- So if you've got y, and you've got z,
- it's the straight-line distance between them.
- And then we can square it: subscript two means the L2 norm, superscript two means squared,
- and that's the sum of the squares of the elements. We get rid of the square root and we just get the sum of the squares, which is really useful;
- it simplifies the computation just a little bit, and it's how the ridge regression regularization is defined. The L1 norm,
- subscript one, is the sum of the absolute values, and we call this the Manhattan or taxicab distance,
- because it's the distance you would have to travel if you could only travel along the grid in straight lines.
- If you want to go from x to y, it's the total length of that path.
- But it's also useful simply as a sum of absolute values:
- L1 is the sum of absolute values; L2 is the square root of the sum of the squares.
- You can generalize to get other norms as well, but this is what this notation means: the magnitude of the vector.
- So when we build up our regularized model, think about the way we increase
- this component of the loss.
- Remember, one of the tools we want to use for understanding a metric is to ask: how do you make it change?
- How do you increase it or decrease it? The way you increase or decrease this part of the loss function is
- you increase or decrease the coefficients. That can happen because a feature has a strong relationship,
- or it can happen by putting more of the common factor on one feature than another. So when you have this multicollinearity,
- one thing the ridge regression is going to do is encourage the
- model to distribute the influence of the common factor between the different correlated features,
- because putting it all on one would increase the sum of squares more than dividing it evenly between the two;
- the way you minimize the sum of squares is to divide the common component evenly between the two features.
- And it gives us a solution: with multicollinearity, our system is under-determined;
- we don't have enough information to know where the coefficient should go. By adding regularization to our loss function,
- we introduce this additional loss that
- tells it where to put it, by making the least expensive solution be the one where it's evenly distributed between all of the correlated features.
- So we have this loss function: our error loss plus our coefficient strength.
- We can minimize this in two ways: we can minimize it by decreasing our error, and we can minimize it by having small coefficients.
- Effectively, what that means is that in order for a coefficient value to be large, it has to earn its keep, and it earns its keep
- by decreasing training error. If you've got a particular coefficient value
- and you try to increase it,
- that might give us a lower error, but we only get a lower total loss if it decreases the error by more than it increases
- the coefficient term, after taking into account the square and our regularization strength.
- So it encourages the coefficients to be small values, unless a large value contributes significantly
- to decreasing the model's error on the training data (its squared error), or to increasing its log likelihood
- if we're talking about a logistic regression.
- The regularization parameter lambda is what we call a hyperparameter, because we don't learn lambda from the data, in general;
- within a single linear model, we don't learn lambda from the data.
- It has to come in from outside. The exact impact of the value depends somewhat on implementation details, such as
- whether the loss function itself is a mean or a sum; different SciKit-Learn models actually differ here.
- You can't just take a regularization strength from one SciKit-Learn model and use it for another,
- even if they're both doing L2, because other details of the loss function mean the value doesn't transfer:
- if a model is using a sum of squared error,
- then the regularization strength needs to depend on the data size, because for the same amount of average error,
- the sum of squared error is going to be larger just for having more data.
- If it's a mean, then your regularization term is not going to depend on your data size.
- Some SciKit-Learn models also use a parameter C, which is one over lambda,
- and it's multiplied by the error instead of being multiplied by the coefficients.
- Both are strength parameters: an increased value of lambda, or a decreased value of C, results in stronger regularization;
- a coefficient has to contribute more to the model's performance to earn its keep for a large value than it does with weaker regularization.
- Now, one good way to learn a good value for lambda is to optimize it with a training and tuning split of the training data.
- scikit-learn will do this automatically if you use one of the CV estimator classes.
- A lot of the regressors and classifiers have a CV class, such as LogisticRegressionCV and RidgeCV,
- and quite a few others have a CV variant.
- What happens with the CV variant is that it will learn values for one or more hyperparameters by cross-validating when you call fit with training data.
- It will cross-validate on the training data to learn them, and you can give it a range of
- hyperparameter values to consider, or a list of them.
- It will do the cross-validation to automatically learn good values, the best values it can, for these regularization parameters.
- There is also a class, GridSearchCV, that allows you to use cross-validation to search for good hyperparameter values,
- for any hyperparameter of any scikit-learn model.
- I encourage you to go play with that at some point. But LogisticRegressionCV will do this automatically, just in the fit call:
- within itself, it'll find a good regularization strength value.
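As a minimal sketch of what that looks like (not from the course notebooks; it assumes a feature matrix `X` and a binary outcome `y` already exist):

```python
# LogisticRegressionCV cross-validates over candidate regularization strengths
# inside fit(), then refits on all of the training data with the best one.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

logit_cv = LogisticRegressionCV(
    Cs=np.logspace(-4, 4, 20),  # candidate values of C = 1/lambda to consider
    cv=5,                       # 5-fold cross-validation on the training data
    max_iter=1000,
)
logit_cv.fit(X, y)
print('selected C:', logit_cv.C_)
```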
- So, lasso regression. This looks very, very similar, except it replaces the squared L2 norm, the sum of squares,
- with the L1 norm: we're now looking at a sum of absolute values. The squared L2 norm encourages values to be small,
- but if a value is close to zero, that's fine with it; it doesn't push it all the way down.
- One of the effects the L1 norm has is that it doesn't like small nonzero values:
- if a coefficient value is small, the L1 norm is going to push it to zero.
- And what this does is it makes the coefficients what we call sparse; sparse data is data with a lot of zeros.
- So if a coefficient is not contributing very much to classification, it's going to go to zero,
- and you can use that to see which features are actually being used in the classification.
- It effectively becomes an automatic feature selection technique, because it's going to push the
- coefficients for features that don't contribute very much to decreasing your training error to zero.
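A hedged sketch of that sparsity effect, using the plain linear-regression Lasso and assuming a numeric feature matrix `X` and continuous target `y`:

```python
# L1 regularization drives unhelpful coefficients exactly to zero, so the
# nonzero entries of coef_ show which features the model actually uses.
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)            # alpha is the L1 regularization strength
lasso.fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # indices of features with nonzero coefficients
print('features kept:', kept)
```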
- You can then put them together in what's called the elastic net, which combines L1 and L2 regularization.
- You have an overall regularization strength lambda that controls your regularization, or C, which is one over lambda
- and multiplies the error part of the loss function. Then we have L1 regularization and L2 regularization,
- and they're balanced with a mixing parameter rho.
- You could parameterize it so that you just have your L1 strength and your L2 strength,
- but most elastic net implementations have a regularization strength,
- your lambda or your C (some of the scikit-learn docs use alpha for this),
- and then a balance that says how much of the regularization to put on L1
- and how much to put on L2. These parameters both need to be chosen by cross-validation;
- that's really the only way to find good values. If you use logistic regression,
- both LogisticRegression and LogisticRegressionCV can do elastic net,
- and there are also ElasticNet and ElasticNetCV classes. By default, if you use LogisticRegressionCV,
- it's only going to default to L2 regularization and search for the regularization strength.
- If you want elastic net, you change the penalty option.
- You also have to change the solver, because of the several solvers LogisticRegression can use to learn its parameters,
- only one of them supports elastic net. And then you're going to need some additional options in order to tell it to also search for that L1 ratio.
- But it can do all of that for you.
- I refer you to the documentation for the LogisticRegression and LogisticRegressionCV classes to see how to do that.
- You're going to find it useful in Assignment 5. I'm also going to be giving you an example in the synchronous session that deals with some of this.
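A sketch of those options put together (consult the scikit-learn documentation for the authoritative details; `X` and `y` are assumed training data):

```python
# Elastic net in LogisticRegressionCV needs penalty='elasticnet', the 'saga'
# solver (the only one that supports it), and a list of L1 ratios to search
# alongside the regularization strengths.
from sklearn.linear_model import LogisticRegressionCV

enet_cv = LogisticRegressionCV(
    penalty='elasticnet',
    solver='saga',
    l1_ratios=[0.1, 0.5, 0.9],   # candidate balances between L1 and L2
    Cs=10,                       # 10 candidate regularization strengths
    cv=5,
    max_iter=5000,
)
enet_cv.fit(X, y)
print(enet_cv.C_, enet_cv.l1_ratio_)
```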
- So, some notes on applying regularization.
- Regularization really works best when your numeric variables are standardized, because the penalty
- is looking at the total magnitude of your coefficient vector.
- If one of your features is in units of millimeters and one of your features is in units of kilograms,
- the coefficient values have nothing to do with each other, and so looking at the total magnitude, treating them as elements of a vector,
- becomes really difficult, and it's going to penalize one feature just for having a larger range because of its underlying units.
- If you standardize your numeric variables, then each one is in terms of standard deviations,
- the coefficients become a lot more directly comparable with each other, and your regularization
- is going to be better behaved. Then you want to select your hyperparameters based on performance on the tuning data, and the CV classes,
- as I said, help with this. I'm giving you an example in one of the notebooks:
- a notebook that uses LogisticRegressionCV to do a hyperparameter search for L2 regularization,
- so you can see that in action with a simple example.
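One way to keep the standardization tied to the model is to put both steps in a pipeline; a minimal sketch, assuming `train_X` and `train_y` are already split out:

```python
# Standardize inside the pipeline so the penalty sees features on a common scale,
# and so the scaler is fit only on the training portion of each CV fold.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

model = Pipeline([
    ('scale', StandardScaler()),
    ('logit', LogisticRegressionCV(cv=5, max_iter=1000)),
])
model.fit(train_X, train_y)
```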
- So to conclude: regularization imposes costs on the model for large coefficient values, either large squared values or large absolute values.
- Squared costs, which we call L2 or ridge regularization, encourage values to be small.
- Absolute value costs, which we call L1 or lasso regularization, encourage small values to be zero.
- If you put them together, it encourages values to be either zero or large enough to be meaningful, but not super large.
- L2 or ridge regularization is useful for controlling the effects of multicollinearity,
- and together they're useful for decreasing your model complexity by making coefficient values earn their keep.
- Another thing that they do is that, if everything's standardized or at least mean-centered, then small coefficients result in small effects.
- Effectively, what it means is: assume everything's average unless we have enough evidence,
- enough data, to justify stronger beliefs, and the stronger relationships
- are justified in terms of their ability to reduce our error on the training data.
📓 Pipeline and Regularization#
This notebook demonstrates pipelines and regularization.
It also shows training a decision tree (next video).
📓 Advanced Pipelines#
The Advanced Pipelines notebook demonstrates a much more advanced SciKit-Learn pipeline.
🎥 Models and Depth#
What does the world look like beyond logistic regression? Can a model output be a feature?
- So in this video, I want to move beyond logistic regression to talk about some additional classification
- models, and also introduce the idea of using models as features for other models.
- The learning outcomes are to do exactly what I just said.
- So far, we've been estimating the probability of Y equals one by using a linear model: y-hat equals the logistic function of the linear score.
- We can use any estimate of this probability, or we can just use models that output decisions;
- these may be based on scores, scores that aren't estimated probabilities.
- For example, a support vector machine uses distance from the separating hyperplane as its score.
- But we're not limited to just using logistic regression, of course.
- So, for one model: a decision tree is a tree of nodes where each node is a decision point.
- I made a little decision tree here for the grad student admissions example.
- At the first node, it's going to check if the GPA is less than or equal to 3.435.
- If it's less, it's going to go to the left-hand side, and (there are extra nodes here) it's going to deny admission.
- If it's greater than 3.435, it's then going to look at their
- school rank, and if their school rank is less than 1.5,
- it's going to admit, and if it's greater than 1.5, it's going to deny.
- Really simple model. It would be absolutely terrible to actually use this model for admissions decisions,
- but here we aren't trying to build a model that will admit people; we're trying to build a model
- that's going to predict whether someone is going to get admitted, and for that it might work. But this illustrates how a decision tree actually works.
- They can learn complex interaction effects on their own, because the thresholds
- and the features used can change as you go down the nodes. Now, one of the problems, though: they have high variance.
- They can effectively memorize all of the training data by building themselves a lookup table that looks up the outcomes for training data
- by their feature values, and you can get extremely good training accuracy.
- I trained one on this data with unlimited tree depth and I got training accuracy of over 99 percent,
- and I got test accuracy of 0.52.
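A sketch of how you might check that gap yourself (placeholder names like `train_X` stand in for your own train/test split):

```python
# A depth-unlimited decision tree keeps splitting until its leaves are pure,
# which nearly memorizes the training data but can generalize poorly.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier()   # no max_depth limit
tree.fit(train_X, train_y)
print('train accuracy:', accuracy_score(train_y, tree.predict(train_X)))
print('test accuracy:', accuracy_score(test_y, tree.predict(test_X)))
```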
- But a random forest: what a random forest does is it takes bootstrap samples. By default, scikit-learn's random forest
- will take complete bootstrap samples; you can tell it to take smaller ones,
- in which case it's not actually a bootstrap sample but a subsample of the dataset. It fits a decision tree to that sample,
- and then it does that 100 times, or however many times, to get a bunch; you get one hundred decision trees.
- Then for a final classification, when you tell it to predict, what it's going to do is ask all of the decision trees to vote.
- It's building up this random forest of happy trees. They're happy because they have a functioning democracy:
- they all get to vote on the final outcome, and the random forest takes the vote and returns the majority classification.
- Or, if the individual trees are producing scores, it might average the scores and use that as the output.
- So you decrease the variance
- that you would get from training a decision tree on one set of data versus training it on another set of data, by training
- decision trees on a bunch of sets of data, made by subsampling your training data, and then averaging over them to produce your final output.
- Random forest is one of the classifiers that I want you to use in your assignment.
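A minimal sketch of fitting one (again with placeholder train/test names):

```python
# Each of the 100 trees is fit to a bootstrap sample of the training data;
# predict() lets them vote and returns the majority class.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100)
rf.fit(train_X, train_y)
print('test accuracy:', accuracy_score(test_y, rf.predict(test_X)))
```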
- Another thing, though, that I want to introduce is that features don't have to come directly from data.
- A lot of our features are going to come from data,
- but sometimes they come from other models: sometimes a transformation model, some kind of what we call unsupervised learning,
- where it's computing things
- but it doesn't have a known output class that it's trying to predict, or prediction models for other tasks.
- For example, LinkedIn's job ad recommender, the last I knew just a few years ago, was, at a high level,
- a logistic regression. You go to LinkedIn, it says, here's a job ad for you;
- well, that's coming from a logistic regression.
- But that logistic regression has very complex features, some of which are the outputs of other machine learning models.
- So you're going to get features from the job text and description, and features from the user's profile.
- One particularly interesting feature they use is a transition probability estimate.
- They have another model, a statistical model, that tries to predict:
- if you are currently working in Boise as a data scientist,
- what's the likelihood that you would transition to a job title of senior data scientist in Salt Lake City?
- It takes into account job transitions, like data
- scientists might move to senior data scientist, software engineers to staff software engineer or principal software engineer.
- It takes into account current migration patterns in the industry and various things like that to get this:
- how likely are you to even make the move? Someone in a staff
- software engineering position is unlikely to take a job where the title is Junior Software Engineer.
- And the output of this transition probability model is one of the input features to their logistic regression
- that's estimating: would you like to see this job ad for a senior data scientist in Salt Lake City?
- You also get features that might come from some kind of deep learning model,
- a deep learning object detection mechanism, a deep learning image similarity mechanism.
- Pinterest gets a lot of mileage out of doing nearest-neighbor calculations where nearness
- is defined by a deep learning model that assesses whether two images being pinned are similar.
- So there are many different models that we can look at:
- linear models with their extensions, the generalized linear models and the logistic regression that we've been seeing, generalized additive models.
- There's also the support vector machine, which is another linear model, but it's not a regression model;
- the naive Bayes classifier, which we're going to see later; and neural nets,
- whether shallow or deep. A lot of neural nets
- do a similar thing to logistic regression: they're computing a score, and then you pass it through a logistic function or some other sigmoid in
- order to convert the model score to probabilities for making your final classification decisions.
- So, to wrap up: there are many different models for classification and for regression.
- My goal in this class is to teach you what regression and
- classification are and how to get started with applying them and evaluating them,
- not to teach you a bunch of models in depth.
- The machine learning class is going to go into a lot more about how these different models work and how to get them to work well.
- Model outputs, though, can also be used as input features for other models, often linear,
- though not always. And so you can get models that build on top of other models.
🎥 Inference and Ablation#
How do we understand, robustly, the performance of our system? What contributes to its performance?
- So in this video, I want to talk with you about inference from model effectiveness and introduce the idea of an ablation study.
- Our goals are for you to be able to make inferences about model accuracy, to
- understand a little bit better the interplay of cross-validation and inference,
- remembering that we can't be perfect; the goal is to do a good and a credible job.
- And then also to be able to use an ablation study to make inferences about the particular
- contributions and value of different features or subcomponents of your model.
- So remember, we've got this train-test split.
- We have the training data, where we're doing all of our iterative process;
- it's a big, loopy thing. And then we evaluate our effectiveness.
- One thing we haven't talked about yet is: is the effectiveness significant?
- When we go and evaluate on our test data, we have a few outputs.
- We have the individual classifications or predictions; for classifications we have whether they're right or wrong, and for predictions
- we have the error; and then we have a metric value, accuracy, precision, etc., for each classifier.
- One of the challenges, though, is that for a classifier and the test data, we just have "accuracy is 0.99" or "precision is 0.4".
- We can't significance-test that single value.
- So I want to set up how we can significance test, or otherwise do inference, I should say, because with
- significance testing, as we discussed earlier, a lot of times we might actually care about an effect size estimate with confidence intervals
- a lot more than we care about a significance test.
- There are a few questions that we want to answer as the results of an evaluation. Does my classifier perform better than some benchmark value?
- We might have a value we want to beat, say a value we know is good enough,
- and we want to know if my classifier performs better than that value.
- We might want to get an estimate of our classifier's accuracy or precision or recall (pick our metric) that has a confidence interval on it,
- so we know how precise this estimated performance measure is.
- And then we may also want to answer the question: does classifier A perform better than B?
- Maybe B is our current system, or B is the existing known state of the art,
- and we want to know if A does better. We might want a p-value;
- we might want a confidence interval for the improvement or the difference in performance between A and B.
- So, to get started: one way we can compute a confidence interval
- is to treat each item as a binary measurement. For each test item,
- and say you've got a hundred thousand test items because you've got a very large dataset,
- a hundred thousand in a 20 percent split, or a 10 percent split
- of a million data points giving one hundred thousand test points, for each of these
- you have the true value, yes or no, and you have the prediction, yes or no.
- If the metric's denominator comes from the test data (for accuracy it definitely does, because accuracy is correct over all test items,
- and you can also do this for the false positive rate, the false negative rate,
- recall, specificity, anything where the denominator is completely determined by the test data, not by the classifier results),
- then you can use a Wilson confidence interval.
- statsmodels does this with proportion_confint; a Wilson confidence interval is a confidence interval for a proportion.
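A sketch of that for accuracy, assuming arrays `test_y` of true labels and `preds` of predicted labels:

```python
# Treat each test item as a binary correct/incorrect measurement, then compute
# a Wilson confidence interval for the proportion correct.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

correct = np.asarray(preds) == np.asarray(test_y)
lo, hi = proportion_confint(correct.sum(), len(correct), alpha=0.05, method='wilson')
print(f'accuracy {correct.mean():.3f}, 95% CI ({lo:.3f}, {hi:.3f})')
```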
- Any metric, you can bootstrap: you can take your test samples,
- take bootstrap samples of them, and then compute your classifier metric over your bootstrap samples.
- Now, you have to be careful when you're doing your bootstrap samples to make sure that,
- when you're doing the bootstrap, you keep the ground truth labels and the classifier outputs together.
- And if you're doing multiple classifiers, you have to keep all the classifier outputs together as you're computing these bootstrap samples.
- You can bootstrap from your test data and get a confidence interval for any of your classifier performance metrics.
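A sketch of bootstrapping a metric this way, resampling test items so each label stays paired with its prediction (same assumed `test_y` and `preds`):

```python
# Resample test item *indices*, so the true label and the prediction for an item
# always travel together, then recompute the metric on each bootstrap sample.
import numpy as np
from sklearn.metrics import precision_score

truth = np.asarray(test_y)
pred = np.asarray(preds)
rng = np.random.default_rng(20201104)

boot = []
for _ in range(1000):
    idx = rng.integers(0, len(truth), len(truth))
    boot.append(precision_score(truth[idx], pred[idx]))
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f'precision 95% bootstrap CI: ({lo:.3f}, {hi:.3f})')
```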
- You can also compute a p-value for the accuracy metric. This specific technique only works for accuracy;
- it does not work for any of our other classifier metrics. But you can get a p-value for the null hypothesis that two classifiers have the same
- accuracy by using what's called a contingency table, and in a contingency table for this purpose,
- you go from the classifications to whether each classifier was right or wrong on each test item.
- One cell has the number of times both classifiers were right, and another the number of times they were both wrong.
- Another has the cases where classifier 1 was right and classifier 2 was wrong,
- how often did that happen, and we do the same the other way around.
- Then we compute what's called a McNemar test. It uses the two disagreement counts: n_WR, where classifier 1
- is wrong and classifier 2 is right, and n_RW, where classifier 1 is right and classifier 2 is wrong.
- We take the squared difference of these wrongness counts and divide it by their sum, and this gives us a statistic:
- M = (n_WR - n_RW)^2 / (n_WR + n_RW). Under H0, the null hypothesis, M follows what's called
- a chi-squared distribution with one degree of freedom. That's a probability distribution
- whose CDF you can get from statsmodels or from SciPy,
- and you can use that to compute a p-value: what's the probability of having an M statistic at least this large?
- You don't have to deal with absolute values on it, because it's a non-negative statistic and a non-negative distribution.
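A sketch of that computation, assuming boolean arrays `right1` and `right2` that mark which test items each classifier got correct:

```python
# McNemar's test uses only the disagreement counts between the two classifiers.
import numpy as np
from scipy.stats import chi2

n_wr = np.sum(~right1 & right2)   # classifier 1 wrong, classifier 2 right
n_rw = np.sum(right1 & ~right2)   # classifier 1 right, classifier 2 wrong
m = (n_wr - n_rw) ** 2 / (n_wr + n_rw)
p_value = chi2.sf(m, df=1)        # upper tail of the chi-squared(1) distribution
print(m, p_value)
# statsmodels.stats.contingency_tables.mcnemar offers a packaged version of this test.
```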
- We can't just use a proportion test. There is something called a proportion test, but it is for independent proportions and independent samples,
- and we don't have independent samples. We have one sample, our test data,
- and for each test point we have two measurements, the output of classifier 1 and of classifier 2.
- So we can't use a proportion test.
- But the McNemar test basically says, do this paired-proportion-test kind of thing, and it allows us to get a p-value for whether
- the classifiers have the same accuracy or not.
- In this case, the p-value does not allow us to reject the null hypothesis that they have the same accuracy;
- the p-value is about one.
- We can also test regression models. Each sample is a continuous measurement of the model's prediction error:
- we have y_i minus y-hat_i
- from model A, and we have y_i minus y-hat_i from model B.
- Those are two different measurements on the same test point, so
- we can use a paired t-test, or we can use an appropriate bootstrapping mechanism, in order to compare the accuracy of two regression models.
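A sketch of the paired comparison (assuming arrays `y_true`, `pred_a`, `pred_b`; squared error is used here as one reasonable per-item error measure):

```python
# The test is paired because both error measurements come from the same test items.
from scipy.stats import ttest_rel

err_a = (y_true - pred_a) ** 2
err_b = (y_true - pred_b) ** 2
t_stat, p_value = ttest_rel(err_a, err_b)
print(t_stat, p_value)
```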
- Now, about cross-validation: one technique is that sometimes you do cross-validation, say tenfold cross-validation.
- That gives you 10 accuracies for each classifier, and you can compute a paired t-test,
- where each of your folds in your cross-validation is
- one data point in your sample, so you've got N equals 10.
- You can do a t-test. That actually doesn't work very well, because your samples are not independent
- if you're doing k-fold cross-validation.
- Also, if you just repeatedly draw a 10 percent sample, and draw another 10 percent sample, and do that, say, 30 times,
- you have the same problem: the same data points are going to show up in too many of your samples.
- Also, your classifiers are being trained on too much of the same data.
- The ideal would be to draw, say, 30 completely independent training and testing sets from your big population,
- but if you can't do that and you're trying to simulate it with cross-validation,
- the non-independence just causes the resulting statistical test to not be reliable.
- One thing you can do is repeated cross-validation, where five times you do a two-fold cross-validation.
- I'm going to refer you to one of the readings I put in the notes for a lot more details on this.
- I just wanted to bring it up so that you know it's there. Cross-validation is sometimes used for final evaluation;
- you'll find this in papers sometimes.
- One of the problems, though, is this allows data leakage, because you're testing on data that was available while you were building the model;
- you're testing on all of the data that was available in your training set.
- This can be a significant problem. If we've got a large enough data
- set that we can just use a single test split, or maybe two or three test splits,
- that's going to allow us to avoid leakage and much better simulate what's going to happen when
- we put the model in production. Cross-validation is really useful in a couple of contexts.
- One is where you're not doing much model design or feature engineering: you have data, you just
- want to take a model, apply it, and see how it works. Cross-validation is great for that.
- You don't have the iterative process of "how am I really getting this model to work?"
- You can cross-validate if you've got a hyperparameter search,
- so long as you do the hyperparameter search separately for each fold, making it part of your training process;
- the LogisticRegressionCV kinds of things help with that.
- But if you've got a model and just want to see how well it works on the data, cross-validation can work pretty well.
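A sketch of that pattern: cross-validating an off-the-shelf model whose hyperparameter search happens inside each fold (`X` and `y` assumed):

```python
# The outer cross_val_score estimates performance; the inner LogisticRegressionCV
# tunes the regularization strength separately within each training fold.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegressionCV

scores = cross_val_score(LogisticRegressionCV(cv=5, max_iter=1000), X, y, cv=10)
print(scores.mean(), scores.std())
```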
- Also, when you are doing cross-validation on the training data to iteratively improve your model and feature design,
- that can work really, really well as well. The problem arises when you're doing a lot of engineering on your model
- and you get access to the test data, which you effectively have in a cross-validation setup.
- Because even if, say, you do 10-fold cross-validation and you pick one of the folds to be
- the one you're really using for your development, well, all of your other test data is in this
- initial development part.
- So effectively you're using the test data as part of your tuning process, for your hyperparameter selection,
- as part of your exploratory data analysis, and that is a cause of leakage.
- Again, we can never be perfect, but it's important to be aware of this as a cause of leakage.
- I really recommend having a designated test set that you hold out
- and don't touch; that's the basis of your evaluation, even if it makes the statistical inference a little bit harder.
- Now, another thing, though, I want to talk about: suppose
- you've got a complex model. Let's say we're detecting spam; we're working for, say, a
- telecommunications company and detecting text message spam, or we're detecting e-mail spam for an e-mail company.
- We have text features, we've got metadata features about when they're sending URLs,
- we've got features of the URL itself; maybe we even hit the server.
- Let's say we've also got another couple of sophisticated models that score
- URLs by their reputation, and also score senders by a
- reputation score. Large anti-spam efforts, such as the one built into Gmail,
- do this; I'm not just making that up. Part of anti-spam at scale is building reputations for URLs and senders.
- And let's say our spam detector works well: precision of 99.5 or 99.9 percent,
- recall of 80 percent. But what makes it work?
- Which of these features is contributing, and how much, to its success?
- The answer is to do what's called an ablation study, and an ablation study takes our model:
- we take our whole model, we see how accurate it is, but then we turn off individual features or components of it.
- So we might turn off the sender reputation. How exactly you turn it off depends on the model design.
- If it's a neural net, you might just take that part out of your network graph; in a logistic model, if everything's well standardized,
- you can put in zeros for the feature and not retrain, or even just take that term out of the model,
- retrain it on your training data, and try to predict your testing data. (You probably want to retrain, just in case,
- just to make sure the parameters are being tuned without that piece.) What this lets you see is how much each component contributes.
- You can say, OK, my model gets 99 percent precision on spam, and it gets 98 percent precision if I turn off the sender reputation.
- Well, that lets you see: OK, the sender reputation is responsible for one percent of my precision.
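A sketch of what that loop might look like in code; the feature-group columns and the `make_model()` helper here are hypothetical placeholders for your own pipeline:

```python
# Ablation: refit the model with one feature group removed at a time and compare
# the test metric to the full model's.
from sklearn.metrics import precision_score

feature_groups = {
    'sender_reputation': ['sender_rep'],     # hypothetical column names
    'url_reputation': ['url_rep'],
}

full = make_model().fit(train_X, train_y)    # make_model() is a placeholder factory
base = precision_score(test_y, full.predict(test_X))
print('full model precision:', base)

for name, cols in feature_groups.items():
    model = make_model().fit(train_X.drop(columns=cols), train_y)
    prec = precision_score(test_y, model.predict(test_X.drop(columns=cols)))
    print(f'without {name}: {prec:.3f} (change {prec - base:+.3f})')
```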
- Now, it's important to be careful how you use this. You can use this for production decisions and for future work:
- you do this ablation study, you discover, OK, the sender reputation is only contributing one percent,
- or maybe it's contributing 0.1 percent, and it's really expensive in terms of compute time and engineer time to maintain; maybe stop using it.
- You could also use it for your future research work. What you can't do, particularly within the scope of one study,
- is use the results of your ablation study to go back and revisit your model design; that gets you your leakage again.
- Again, as I said, in the academic setting,
- where we're doing multiple studies on the same data, we do get some leakage and we carry it forward to the next study.
- Again, we can't be perfect, but
- there's a difference between the ablation study and feature engineering. In feature engineering, I'm trying a bunch of things and deciding what to keep,
- and I'm doing it with my tuning data, with things going back into the model; I'm not keeping my careful firewalls.
- In the ablation study, I have my top-line performance number: here's my model, I ran it,
- it got 99 percent precision, and then I'm trying to understand:
- well, what are the drivers of that? I'm not putting it iteratively back into my
- pipeline, going back and rerunning things on my training data with it; I'm just using it to get knowledge to carry forward.
- That doesn't cause leakage within the context of the specific study we're talking about,
- and it is an acceptable practice,
- and a very, very useful practice for understanding the contributing factors to the performance of a complex model.
- So, to wrap up: inference for classifier performance is not immediately straightforward.
- There are several helpful techniques; I've pointed you to a few here and in the readings. Be careful about data leakage,
- but again, sometimes there are tradeoffs.
📃 Statistical Significance Tests#
Read Statistical Significance Tests for Comparing Machine Learning Algorithms.
Note
In the Week 9 activity, we used the paired t-test for comparing the output of two regression models. Our use of this test did not violate the guidance in this reading — why is that?
For further reading, you can also see Approximate Statistical Tests.
🎥 Dates#
This video discusses how to work with dates in Pandas.
- So in this video, I want to talk with you about dates.
- The learning outcomes are for you to be able to parse and transform dates, and to adjust dates using date offsets.
- So first, I want to talk just briefly about the difference between a date as we say it, like,
- "OK, it's November the 3rd, 2020," and underlying time.
- Dates do all kinds of funny things. When we change to or from Daylight Saving Time, we skip an hour or we repeat an hour,
- but the underlying time stream doesn't repeat;
- it's just that our way of mapping it to the way we write it down repeats.
- So we can think of underlying time as moving forward at a constant rate.
- (Generally; there's relativity and all of those things.)
- But time is moving forward, and how we record it changes and is complex and subject to a lot of rules.
- The key thing is that, like with text, where the text content is different from its encoding,
- time is different from its representation.
- One of the implications of this is that we typically store time in more of its monotonic form, like seconds since a particular date, in UTC,
- and then we translate it for presentation. So you'll see the time stored as an offset from UTC, and then you'll translate
- it to the local time zone, with all of the daylight saving rules and everything,
- when you're going to actually display it.
- So internally, there are a few ways we can represent time numerically, and sometimes you'll need to do this yourself.
- One is Unix timestamps, which are time since (or before; they can be negative)
- midnight UTC, January 1st, 1970. Often this is stored in seconds.
- Python (not pandas or NumPy, but the Python standard library) tends to work with time as floating-point seconds since that midnight;
- the reference point, as I said, is UTC. You can also store milliseconds or nanoseconds since that time.
- If you have a data file that has a column that's labeled as a timestamp and it's a number,
- there's a very good chance it's a Unix timestamp. That's a very common way to store dates and times.
- We can also store Julian day numbers, which are days since January 1st,
- 4713 BC. You can store a time by using a floating-point number of days;
- for a date in 2020, that's around 2.46 million days.
- There are also other origins; you can use a lot of different origins, and pandas actually lets you specify arbitrary origins.
- The 1900 system that's used by Excel and other spreadsheets stores days since January 1st, 1900.
- We can also store dates as strings. The ISO format is year, month, day.
- This has the nice advantage that, at least until the year ten thousand,
- it sorts by date: if you sort it alphanumerically, it's going to sort the resulting dates by date.
- So if you're going to name files after dates, doing this with dates at the beginning of the file name
- is super useful.
- There are also localized numeric forms, such as 11/3/2020, which is how we write dates in the United States. Europe and the UK
- generally write day, month, year: 3/11/2020.
- So if you see a date that's two digits, two digits, year, that's not enough information to know what date we're talking about.
- Are we talking about November 3rd, or are we talking about March 11th?
- You need to know the country or locale the date came from to know how to correctly interpret it.
- Sometimes you can infer it by looking at, say, a November 28th date,
- because 28 isn't a valid month number, and that will let you figure out which convention you're dealing with.
- But this localized form, if you just get a date, is often ambiguous.
- You can also have longer string forms, like writing out "November 3rd, 2020".
- Pandas provides a data type called datetime64 that allows you to store dates and times.
- Even if you just have a date, you usually store it as a date-time with midnight as the time;
- at least, that's how you work with it, since pandas doesn't have a time-free date type.
- You can create a date-time from a number, units, and an origin:
- so you can say we want two hundred and thirty million
- seconds since the Unix epoch, or we want a certain number of days since the Julian origin.
- The conversion function also supports the number being a series or an array in addition to a single number,
- so you can create a series or an array of pandas date-time objects.
- You can also convert from a string, and this can also be a series or an array of strings.
- So in an assignment, if you've got a column of string dates,
- you can convert that to a column of date-times by using these functions.
- By default, it's going to parse times in ISO format,
- but you can also tell it to parse other formats by providing a format string that describes how the time is laid out.
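A sketch of those conversions with pd.to_datetime (the data frame and column names here are made up for illustration):

```python
import pandas as pd

# From numbers, units, and an origin:
pd.to_datetime(230_000_000, unit='s')                  # seconds since the Unix epoch
pd.to_datetime(2_459_000, unit='D', origin='julian')   # days since the Julian origin

# From strings (also works on whole series/columns):
df['issued'] = pd.to_datetime(df['issue_date'])                     # ISO format by default
df['issued_us'] = pd.to_datetime(df['us_date'], format='%m/%d/%Y')  # explicit format string
```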
- There's a link in the pandas documentation to the way these format strings work; I've also provided
- that link in the notes that go with this video. So we've got date-times; pandas also has an object called a Timedelta,
- which stores the difference between two times. If you subtract one date-time from another,
- what you're going to get is a Timedelta.
- You can create one from a number plus units, or from a string that describes it, like creating the time delta of one day,
- thirty minutes, and twenty-two seconds. The Timedelta marks advances in linear time.
- You can't create a time delta of, for example, one month; the DateOffset is what you use to get one month.
- You can create it from a number and units, and it correctly offsets the dates:
- it knows whether it needs to extend by 30 days or thirty-one or twenty-eight.
- It handles Daylight Saving Time, it handles leap years, it handles leap seconds, and it deals with being able to offset dates properly.
- DateOffset does not natively support series: date-times and time deltas pandas natively supports in series,
- but for date offsets you can only create an object series that contains them.
- So if we have a numeric series that contains numbers of months, then we can use apply,
- and it's a little slow because it's effectively doing a Python loop,
- but we can use apply to convert these numbers of months into DateOffset objects.
- We get a series of those, which we can then add to a series of date-times in order to produce offset date-times.
- For example, if we've got a column that has the number of months a loan's term is for, and when the loan was issued,
- we can convert the months to date offsets, add them to the issue date, and find when the loan is due.
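A sketch of that loan example (hypothetical column names `term_months` and `issue_date` on a `loans` frame):

```python
import pandas as pd

# apply() builds an object series of DateOffset values (a Python-level loop, so a
# bit slow); adding it to a date-time series shifts each date by its own offset.
offsets = loans['term_months'].apply(lambda m: pd.DateOffset(months=m))
loans['due_date'] = loans['issue_date'] + offsets
```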
- When you're doing arithmetic with dates: if you add a date-time and a Timedelta,
- you're going to get a date-time, and a date-time plus a DateOffset is also a date-time. You can subtract as well as add.
- As I said, if you subtract two date-times, you're going to get a Timedelta.
- You can also multiply a date offset by a number, and it's going to give you another date offset that's multiplied:
- if you've got two months, you can multiply it by five and you'll get ten months.
- You can also compare date-times using comparison operators.
- You do need to have date-times on both sides; if you've got something that's, say, strings, you need to convert it to a date-time object
- so that you can do the comparison. So, in conclusion: dates and times are typically stored internally using offsets from an origin.
- Usually we store them in UTC and then we translate them to local time
- when we go to display them. Pandas provides a number of functions and types for working with dates and times.
- In addition, NumPy provides some of its own. I generally work with the pandas ones, but NumPy does provide
- timedelta and datetime objects that work just a little bit differently, and Python also does in its standard library. For our purposes,
- I recommend generally sticking with the pandas ones.
Links#
Date operations notebook
🚩 Quiz 11#
Quiz 11 is in Canvas.
📩 Assignment 5#
Assignment 5 is due November 6.