# Week 11 — More Modeling (10/31–11/4)

This week, we’re going to learn more about model building, which will be useful in Assignment 5.

## 🧐 Content Overview

This week has **1h33m** of video and **3400 words** of assigned readings. This week’s videos are available in a Panopto folder.

## 🎥 Intro & Context

In this video, I review where we are at conceptually, and recap the ideas of estimating conditional probability and expectation.

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
BUILDING AND EVALUATING MODELS
Learning Outcomes (Week)
Build and refine a predictive model
Construct features for a model
Apply regularization to control features and their interaction
Measure a model’s effectiveness and other behavior
Photo by Hannah Olinger on Unsplash
Where We’re At
Linear regression (continuous prediction)
Logistic regression (binary classification)
Optimizing objective functions
Minimize loss functions (e.g. squared error)
Maximize utility functions (e.g. log likelihood)
Conditional Estimates
Distinction: Model Probability vs. Make Decision
Trick: Probability and Expectation
Wrapping Up
We are building models that estimate conditional probability or expectation and use them to make classifications.
We’re going to see more about their inputs and outputs this week.
Photo by Klim Musalimov on Unsplash

- In this video, I'm going to introduce our week's topic, building and evaluating models, and talk in more detail about how we go about doing that.
- The learning outcomes for the week are for you to be able to build and refine a predictive model,
- construct features for that model, apply regularization to control features and their interaction and to give us models that generalize better,
- and measure a model's effectiveness and its other behavior.
- So where we're at right now: we've seen linear regression, which lets us do continuous prediction,
- where we want to predict a continuous outcome or target variable.
- We've seen logistic regression, which lets us take the concept of linear modeling and move it into the realm of binary classification,
- where rather than having a continuous outcome variable, we have a binary outcome such as defaulted on the loan, is spam, or is fraud.
- We've also seen the idea of optimizing objective functions: we might minimize a loss function such as the squared error.
- We might maximize a utility function such as log likelihood. These are equivalent to each other:
- if you've got a utility function and a minimizer, you can minimize the negative of the utility function.
- We've also seen that we can think about what we're doing with modeling as conditional estimation.
- So in a regression model, we're trying to estimate the conditional expectation: given a particular set of values for my input features X,
- what's the expected value of Y? We might do some transformations to all these variables,
- but we're trying to compute this conditional expectation function: what's the expected value of Y, conditioned on my feature values X?
- In classification, we're trying to solve a conditional probability problem:
- what's the probability of a particular outcome, given that I have some particular feature values X?
- There's another thing in here that's useful to think about:
- logistic regression at its heart is trying to model the probability of your data.
- What we've been doing with statsmodels is we call the model's `predict` and we get some scores, and then we use the scores to make a decision.
- Internally, the logistic regression is mathematically solving the problem of maximizing the log likelihood.
- Mathematically, what the logistic regression is doing is trying to build a probabilistic model of the data, and the
- parameters are estimated based on their ability to accurately model probabilities in your training data.
- We then use these output probabilities to make decisions. So we'll say success if \(\hat{y}\) is greater than 0.5.
- Scikit-learn uses the logistic regression to directly classify by using the threshold of 0.5.
- But you can get the underlying scores out of it with the `decision_function` method (in scikit-learn this returns log-odds; `predict_proba` returns the estimated probabilities).
- This is important to note:
- the log likelihood that you get out of a logistic regression is not based on the actual decisions that it's making.
- It's based on its ability to model probabilistically what the labels look like in your training data:
- it's the probability that it assigns to those labels with the final fitted values of the parameters.
- I want to mention briefly again a trick that I mentioned, I believe, last week:
- expected value and probability are closely related. The expected value is the integral or the sum of values weighted by their probabilities.
- But also, if we have an indicator function I, which is 1 if x is in the set and 0 if it is not
- (basically, given a value, it decides whether or not it's in the set; if that set is an event, it says whether or not the event happened),
- the probability and the expected value of the indicator function are the same thing.
- So we can think about everything as estimating conditional expectation:
- when we're estimating a probability, we're estimating the conditional expectation of the characteristic or indicator function.
- So to wrap up, we're building models that estimate conditional probability and expectation.
- We've been doing this in a variety of ways. We use these models to make decisions.
- We've got the idea of doing the modeling. This week, we're looking more at how we build inputs for these models
- and how we evaluate the outputs that we get out of them.
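
The distinction between modeling a probability and making a decision can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn and NumPy are installed; the data is synthetic (from `make_classification`, not the course data) and all variable names are illustrative. It checks that thresholding the estimated probability at 0.5 reproduces `predict`, and that the mean of an indicator is the empirical probability of the event.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

model = LogisticRegression().fit(X, y)

# Step 1: the model estimates P(Y = 1 | X) for each row.
prob = model.predict_proba(X)[:, 1]

# Step 2: a decision is made by thresholding that probability at 0.5.
decision = (prob > 0.5).astype(int)

# predict() applies the same rule internally.
same_as_predict = bool(np.array_equal(decision, model.predict(X)))

# decision_function() returns log-odds; the logistic function maps them to probabilities.
log_odds = model.decision_function(X)
prob_from_scores = 1.0 / (1.0 + np.exp(-log_odds))

# The indicator trick: the mean of an indicator is the empirical probability of the event.
indicator = (y == 1).astype(float)
empirical_prob = indicator.mean()

print(same_as_predict, empirical_prob)
```

The fitted parameters depend only on the probabilities in step 1; the threshold in step 2 is a separate choice that could be moved without refitting.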

## 🎥 Workflow

How do you do feature engineering and model selection in a machine learning workflow?
What is the iterative process involved?

## 🎥 SciKit Pipelines

In this video, I introduce SciKit *pipelines* that put multiple transformations together.
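
As a rough sketch of what a pipeline looks like (assuming scikit-learn is installed; the synthetic data and the step names `standardize` and `classify` are illustrative choices, not taken from the video), the following chains a standardization step with a logistic regression and cross-validates the two together:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature matrix and labels.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Each step is a (name, estimator) pair; all but the last must be transformers.
pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("classify", LogisticRegression()),
])

# cross_val_score refits the whole pipeline in each fold, so the scaler
# only ever sees that fold's training portion.
scores = cross_val_score(pipe, X, y, cv=5)

# The pipeline behaves like a single model: fit, then predict.
pipe.fit(X, y)
predictions = pipe.predict(X)

print(scores.mean())
```

Treating the transformations and the model as one object is what keeps preprocessing from leaking information across cross-validation folds.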

## 🎥 Regularization

This video introduces regularization: ridge regression, lasso regression, and the elastic net.
Lasso regression can help with (semi-)automatic feature selection.
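
For reference, the three regularized objectives discussed in this video can be written as follows. This is a sketch of one common parameterization, not necessarily the exact form on the slides: \(L(\beta)\) stands for the unregularized loss (squared error or negative log likelihood), \(\lambda \ge 0\) is the regularization strength, and \(\rho \in [0, 1]\) is the balance between the two penalties. The symbols are assumptions chosen to match the spoken description, and individual libraries scale the terms differently.

\[
\begin{aligned}
\text{ridge:} \quad & \min_\beta \; L(\beta) + \lambda \|\beta\|_2^2, & \|\beta\|_2^2 &= \sum_j \beta_j^2 \\
\text{lasso:} \quad & \min_\beta \; L(\beta) + \lambda \|\beta\|_1, & \|\beta\|_1 &= \sum_j |\beta_j| \\
\text{elastic net:} \quad & \min_\beta \; L(\beta) + \lambda \left( \rho \|\beta\|_1 + (1 - \rho) \|\beta\|_2^2 \right)
\end{aligned}
\]

With the inverse parameterization \(C = 1/\lambda\), the loss is scaled instead of the penalty: \(\min_\beta \; C \cdot L(\beta) + \|\beta\|_2^2\) has the same minimizer as the ridge objective above, because multiplying the whole objective by a positive constant does not move its minimum.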

REGULARIZATION
Learning Outcomes
Understand the function of a regularization term in a loss function
Apply regularization to logistic regression models
Tune regularization parameters
Photo by Joshua Hoehne on Unsplash
Multicollinearity
Correlated predictors cause poor model fit
[Diagram: correlated predictors X1 and X2 both influencing Y]
Loss and Regularization
Vector Norms
Understanding Regularization
How do you increase loss?
Increase a coefficient
How does that happen?
Strong relationship
More of the common factor on one than another
Balancing
Regularization Factor
Lasso Regression
The Elastic Net
Applying Regularization
Standardize numeric first
Coefficient strengths are comparable
0 is neutral, coefficient magnitude is strength of relationship
Select hyperparameters based on performance on tuning data
Scikit-Learn CV classes help with this
See notebook for example
Wrapping Up
Regularization penalizes large values of coefficients.
This controls model structure and, combined with standardization, requires “strong” beliefs in relationships to be justified by reducing training error.
Photo by Jasper Garratt on Unsplash
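
The norms on these slides, and the reason the squared \(L_2\) penalty prefers to split a shared effect between correlated features, can be illustrated with plain Python. This is a toy sketch: the two coefficient vectors are made-up allocations of the same total effect across two perfectly correlated features, so they would produce identical predictions and only the penalty distinguishes them.

```python
import math


def l1_norm(v):
    """Sum of absolute values (Manhattan / taxicab norm)."""
    return sum(abs(x) for x in v)


def l2_norm(v):
    """Square root of the sum of squares (Euclidean norm)."""
    return math.sqrt(sum(x * x for x in v))


def squared_l2_norm(v):
    """Sum of squares: the ridge penalty before multiplying by the strength."""
    return sum(x * x for x in v)


# Two ways to allocate a total effect of 2 across two identical features.
concentrated = [2.0, 0.0]  # all of the shared effect on one coefficient
split = [1.0, 1.0]         # shared effect divided evenly

for name, beta in [("concentrated", concentrated), ("split", split)]:
    print(name, l1_norm(beta), squared_l2_norm(beta))
```

The squared \(L_2\) penalty is 4.0 for the concentrated allocation and 2.0 for the even split, so ridge regression favors the split. The \(L_1\) penalty is 2.0 in both cases, so the lasso has no preference between them.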

- Now it's time for a topic that I've mentioned a few times, and we're actually going to learn what it is: regularization.
- The goal here is for you to understand the function of a regularization
- term in a loss function, apply regularization to your logistic regression models, and finally tune regularization parameters.
- I want to start by reviewing multicollinearity. Remember that if we have correlated predictors, that can cause poor model fit.
- So if we've got X1 and X2, and they cause Y, and we've got this correlation between them,
- we don't know where the common effect goes.
- We can factor this out as a shared component X12 plus X1 plus X2,
- except we don't actually have X12. It's hidden behind the wall.
- Where does its value go in the coefficients?
- Does it go on X1, does it go on X2, do you split it between them?
- The linear model itself has no way to determine where the common component should actually be allocated.
- One way we can deal with this and several other problems is by introducing what we call regularization.
- So rather than just solving the problem of minimizing the loss function
- (if this is a linear regression, this might be squared error;
- for a logistic regression, this might be negative log likelihood. Log likelihood is a utility function; the negative log likelihood is a positive value,
- because log likelihoods are negative, and it is a loss function. You want to minimize your negative log likelihood),
- what we do is add to that another term, which we call the regularization term.
- All it is is a parameter, the regularization strength, times
- the magnitude of our parameters. Our loss function now has two terms: the error and the magnitude of the coefficients.
- When we're using the squared magnitude here, we call it ridge regression.
- A quick detour on some of the notation: I'm using a norm as a measure of the magnitude of a vector.
- When we write the norm of x with the subscript 2, that's the L2 norm, also called the Euclidean norm.
- It's the square root of the sum of the squares of the elements of the vector.
- If you take the L2 norm of x minus y, that's the Euclidean distance between x and y.
- If they're two-dimensional vectors, it's the straight-line distance between them.
- And then we can square it: subscript 2 means L2 norm, superscript 2 means squared.
- That's the sum of the squares of the elements: we get rid of the square root and we get the sum of the squares. This is really useful:
- it simplifies the computation just a little bit, and it's how ridge regression regularization is defined.
- The L1 norm, subscript 1, is the sum of the absolute values. We call this the Manhattan or taxicab distance,
- because it's the distance you would have to travel if you could only travel in straight lines along the grid.
- So if you want to go from x to y, it's the total length of that path.
- So L1 is the sum of absolute values; L2 is the square root of the sum of the squares.
- You can generalize to get other norms as well. But this is what this notation means.
- The magnitude of the vector. When we build up our regularized model, think about how we increase this component of the loss.
- Remember, one of the tools we want to use for understanding a metric is asking how you make it change:
- how do you increase it or decrease it? The way you increase this part of the loss function is you
- increase a coefficient. That can happen by having a strong relationship,
- or it can happen by putting more of the common factor on one feature than another. So when you have this multicollinearity,
- one thing ridge regression is going to do is encourage the
- model to distribute the influence of the common factor between the different correlated features,
- because putting it all on one would increase the sum of squares more than dividing it evenly between the two.
- The way you minimize the squares is to divide the common component evenly between the two features.
- It gives us a solution. With multicollinearity, our system is underdetermined:
- we don't have enough information to know where the coefficient goes. By adding regularization to our loss function,
- we introduce an additional loss that tells it where to put it,
- by making the least expensive solution be the one where it's evenly distributed between all of the correlated features.
- So we have this loss function: our error loss plus our coefficient strength.
- We can minimize this in two ways: by decreasing our error, and by having small coefficients.
- Effectively, what that means is that in order for a coefficient value to be large, it has to earn its keep,
- and it earns its keep by decreasing training error.
- If we try to increase a coefficient, that might give us a lower error,
- but we only get a lower total loss if it decreases the error by more than it increases
- the coefficient penalty, after taking into account the square and our regularization strength term.
- This encourages the coefficients to be small values unless a large value contributes significantly
- to decreasing the model's error on the training data: decreasing squared error, or increasing log likelihood
- if we're talking about a logistic regression.
- The regularization parameter lambda is what we call a hyperparameter, because we don't learn lambda from the data
- within a single linear model. It has to come in from outside.
- The exact impact of the value depends somewhat on implementation details, such as whether
- the loss function itself is a mean or a sum; different scikit-learn models actually differ on this.
- You can't just take a regularization strength from one scikit-learn model and use it for another,
- even if they're both doing L2, because other details of the loss function mean the value doesn't transfer.
- If it's using a sum of squared errors,
- then the regularization strength needs to depend on the data size, because for the same amount of average error,
- the sum of squared errors is going to be larger just from having more data.
- If it's a mean, then your regularization term is not going to depend on your data size.
- Some scikit-learn models instead use a regularization parameter C, which is one over lambda,
- and it's multiplied by the error instead of being multiplied by the coefficients.
- So an increased value of lambda, or a decreased value of C, results in stronger regularization:
- a coefficient has to contribute more to the model's performance to earn its keep for a large value than it does with weaker regularization.
- One good way to find a good value for lambda is to optimize with a training and tuning split of the training data.
- Scikit-learn will do this automatically if you use one of the CV classes.
- A lot of the regressors also have a CV class: there's `LogisticRegressionCV`, there's `RidgeCV`,
- and quite a few others have a CV variant.
- What happens with the CV variant is that when you call `fit` with training data,
- it will cross-validate on the training data to learn values for one or more hyperparameters. You can give it
- a range of hyperparameter values to consider, as a list,
- and it will do the cross-validation to automatically learn the best values it can for these regularization parameters.
- There is also a class `GridSearchCV` that allows you to do cross-validation to search for good values
- for any hyperparameter of a scikit-learn model.
- I encourage you to go play with that at some point. But `LogisticRegressionCV` will do that automatically, just in the `fit` call, within
- itself: it'll find a good regularization strength value.
- So, the lasso regression. This looks very similar, except we replace the squared L2 norm, the sum of squares,
- with the L1 norm: we're now looking at the sum of the absolute values. The squared L2 norm encourages values to be small,
- but if a value is close to zero, it's happy to leave it close to zero.
- One of the effects the L1 norm has is that it doesn't like small nonzero values:
- if a coefficient value is small, the L1 norm is going to push it to zero.
- What this does is make the coefficients what we call sparse; sparse data is data with a lot of zeros.
- If a coefficient is not contributing very much to classification, it's going to go to zero.
- You can use that to see which features are actually being used in the classification.
- It effectively becomes an automatic feature selection technique, because it's going to push the
- coefficients for features that don't contribute very much to decreasing your training error to zero.
- You can then put them together in what's called the elastic net, which combines L1 and L2 regularization.
- You have an overall regularization strength lambda that controls your regularization (or C, which is one over lambda,
- which multiplies the loss function instead). And then we have L1 regularization and L2 regularization,
- and they're balanced with a parameter rho.
- You could parameterize it so you just have your L1 strength and your L2 strength,
- but most elastic net implementations have a regularization strength,
- your lambda or your C (some of the scikit-learn docs use alpha for this),
- and then a balance that says how much of the regularization to put on L1
- and how much to put on L2. These parameters both need to be chosen by cross-validation;
- that's really the only way to find good values.
- `LogisticRegression` and `LogisticRegressionCV` can do elastic net.
- There are also `ElasticNet` and `ElasticNetCV` classes. By default, if you use `LogisticRegressionCV`,
- it's going to use L2 regularization and search only for the regularization strength.
- If you want elastic net, you change the `penalty` option.
- You also have to change the solver: logistic regression can use several solvers to learn its parameters,
- and only one of them supports elastic net. Then you're going to need some additional options in order to tell it to also search for the L1 ratio.
- But it can do all of that for you.
- I refer you to the documentation for the `LogisticRegression` and `LogisticRegressionCV` classes to see how to do that.
- You're going to find it useful in Assignment 5. I'm also going to give you an example in the synchronous session that deals with some of this.
- Some notes on applying regularization, though.
- Regularization really works best when your numeric variables are standardized, because
- it's looking at the total magnitude of your coefficient vector.
- If one of your features is in units of millimeters and another is in units of kilograms,
- the coefficient values have nothing to do with each other. Looking at the total magnitude, treating them as elements of a vector,
- becomes really difficult, and it's going to penalize one coefficient just for needing a larger range because of the underlying units.
- If you standardize your numeric variables, then each one is in units of standard deviations.
- The coefficients become a lot more directly comparable with each other, and your regularization is going to be better behaved.
- Then you want to select your hyperparameters based on performance on the tuning data, and the CV classes,
- as I said, help with this. I'm giving you an example
- notebook that uses `LogisticRegressionCV` to do hyperparameter search for L2 regularization,
- so you can see that in action with a simple example.
- So to conclude: regularization imposes costs on the model for large coefficient values, either large squared values or large absolute values.
- Squared costs, which we call ridge regularization, encourage values to be small.
- Absolute value costs, which we call L1 or lasso regularization, encourage small values to be zero.
- If you put those together, it encourages values to be either zero or large enough to be meaningful, but not super large.
- L2 or ridge regularization is useful for controlling the effects of multicollinearity,
- and together they're useful for decreasing your model complexity, making coefficient values earn their keep.
- Another thing that they do: if everything's standardized, or at least mean-centered, then small coefficients result in small effects.
- Effectively, what it means is: assume everything's average unless we have
- enough data to justify beliefs in stronger relationships,
- justified in terms of their ability to reduce our error on the training data.
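
The advice above (standardize first, then let a CV class pick the regularization strength) might look like the following in practice. This is a minimal sketch assuming scikit-learn and NumPy are installed; it uses synthetic data, the default L2 penalty, and an arbitrary illustrative list of candidate `C` values (recall that `C` is the inverse of the strength, so smaller means stronger regularization).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the held-out test split is never used for tuning.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Candidate inverse regularization strengths (illustrative values).
candidate_Cs = [0.01, 0.1, 1.0, 10.0, 100.0]

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("classify", LogisticRegressionCV(Cs=candidate_Cs, cv=5, max_iter=1000)),
])

# fit() cross-validates over candidate_Cs on the training data only.
pipe.fit(X_train, y_train)

# The selected value of C (np.ravel handles array-valued attributes).
chosen_C = float(np.ravel(pipe.named_steps["classify"].C_)[0])
test_accuracy = pipe.score(X_test, y_test)

print(chosen_C, test_accuracy)
```

One caveat of this arrangement is that the scaler is fit once on all of the training data before the internal cross-validation runs. Wrapping the whole pipeline in `GridSearchCV` would refit the scaler inside each fold, at the cost of a little more setup.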

## 📓 Pipeline and Regularization

This notebook demonstrates pipelines and \(L_2\)-regularized regression, and performs a significance test of classifier improvement.

It also shows how to train a decision tree (covered in the next video).

## 🎥 Models and Depth

What does the world look like beyond logistic regression?
Can a model output be a feature?

MODELS AND DEPTH
Learning Outcomes
Introduce models beyond linear and logistic models
Introduce the idea of a model output as a feature
Photo by Tim Mossholder on Unsplash
More Models
Decision Tree
Tree of nodes
Each node is a decision point
Result when you reach the bottom
Can learn complex interaction effects on their own
Random Forest
Decision trees can have high variance – they can memorize a lot of the training data
Random forest:
Take (partial) bootstrap samples of the training data
Fit decision trees
Predict by taking vote from trees
Models as Features
Feature outputs don’t have to come directly from data
Transformation models (e.g. principal component analysis)
Prediction models for other tasks
Example: LinkedIn job ad recommender
Logistic regression
Features from text, job description, user, etc.
One feature: transition probability estimate
Many Models
Linear (GLM, GAM, etc.)
Support vector machines
Naïve Bayes classifiers (we’ll see those later)
Neural nets
Often play the logistic regression game – logistic (inverse logit) function to convert model scores to probabilities
Wrapping Up
There are many models for classification (and regression).
Model outputs can also be features for other (often linear) models.
Photo by Michal Janek on Unsplash

- In this video, I want to move beyond logistic regression to talk about some additional classification
- models, and also introduce the idea of using models as features for other models.
- So the learning outcomes are to do exactly what I just said.
- So far, we've been estimating the probability of Y equals 1 by using a linear model: \(\hat{y}\) is the logistic of the linear score.
- We can use any estimate of this probability, or we can just use models that output decisions;
- these may be based on scores that aren't estimated probabilities.
- For example, a support vector machine uses distance from a hyperplane as its score.
- We're not limited to just using a logistic regression, of course.
- So for one model, a decision tree is a tree of nodes where each node is a decision point.
- I made a little decision tree here for the grad student admissions example.
- At the first node, it's going to check if the GPA is less than or equal to 3.435.
- If it's less, it's going to go to the left-hand side. There are extra nodes here, but it's going to deny admission.
- If it's greater than 3.435, it's then going to look at
- their school rank, and if their school rank is less than 1.5,
- it's going to admit. And if it's greater than 1.5, it's going to deny.
- It's a really simple model. It would be absolutely terrible to actually use this model for admissions decisions.
- But here we aren't trying to build a model that will admit; we're trying to build a model
- that's going to predict whether someone is going to get admitted, and for that it might work. This illustrates how the decision tree actually works.
- They can learn complex interaction effects on their own, because the threshold
- and what happens with the features can change as you go down the nodes. One of the problems, though, is that they have high variance:
- they can effectively memorize all of the training data by building themselves a lookup table that looks up the outcomes for the training data
- by the feature values. You can get extremely good training accuracy.
- I trained one on this data with unlimited depth, and I got training accuracy of over 99 percent,
- and I got test accuracy of 0.52.
- What a random forest does is take bootstrap samples. By default, scikit-learn's random forest
- will take complete bootstrap samples. You can tell it to take smaller ones;
- then it's not actually a bootstrap sample, but it's a subsample of the data set. It fits a decision tree to that sample.
- Then it does that 100 times, or however many times, so you get 100 decision trees.
- Then for a final classification, when you tell it to predict, it asks all of the decision trees to vote.
- It's building up this random forest of happy trees. They're happy because they have a functioning democracy.
- They all get to vote on the final outcome, and the random forest takes the vote and returns the majority classification.
- Or if the individual trees are producing scores, then it might average the scores and use that as an output.
- So you decrease the variance
- that you would get from training a decision tree on one set of data versus training a decision tree on another set of data, by training
- decision trees on a bunch of sets of data, by subsampling your training data, and then averaging over those in order to produce your final output.
- Random forest is one of the classifiers that I want you to use in your assignment.
- Another thing that I want to introduce is that features don't have to come directly from data.
- A lot of our features are going to come from data.
- But sometimes they come from other models. Sometimes that's a transformation model, some kind of what we call unsupervised learning,
- where it's computing things
- but it doesn't have a known output class that it's trying to predict; or it's a prediction model for another task.
- For example, at LinkedIn, their job ad recommender, the last I knew (just a few years ago), was at a high level
- a logistic regression. You go to LinkedIn, and it says, "here's a job ad for you."
- Well, that's coming from a logistic regression.
- But that logistic regression has very complex features, some of which are the outputs of other machine learning models.
- So you're going to get features from the job text, the job description, and features from the user's profile.
- One particularly interesting feature they use is a transition probability estimate.
- They have another statistical model that tries to predict:
- if you are currently working in Boise as a data scientist,
- what's the likelihood that you would transition to a job title of senior data scientist in Salt Lake City?
- It takes into account job transitions: data
- scientists might become senior data scientists; software engineers might become staff software engineers or principal software engineers.
- It takes into account current migration patterns in the industry and various things like that to estimate
- how likely you are to even make that move. Someone at a staff
- software engineering position is unlikely to take a job where the title is junior software engineer.
- The output of this transition probability model is one of the input features to their logistic regression
- that's estimating: would you like to see this job ad for a senior data scientist in Salt Lake City?
- You also get features that might come from some kind of deep learning model:
- a deep learning object detection mechanism, or a deep learning image similarity mechanism.
- Pinterest gets a lot of mileage out of doing nearest-neighbor calculations where the nearest neighbor
- is defined by a deep learning model for assessing whether two images that are being pinned are similar.
- So there are many different models that we can look at:
- linear models with their extensions, such as the generalized linear model (including the logistic regression that we've been seeing) and generalized additive models.
- There's also the support vector machine, which is another linear model, but it's not a regression model.
- There's the naïve Bayes classifier (we're going to see those later), and neural nets,
- whether shallow or deep. A lot of models, like a lot of neural nets,
- do a similar thing to logistic regression: they compute a score, and then you pass it through a logistic function or some other sigmoid in
- order to convert the model scores to probabilities for making your final classification decisions.
- So to wrap up: there are many different models for classification and for regression.
- My goal in this class is to teach you what regression and
- classification are and how to get started with applying them and evaluating them,
- not to teach you a bunch of models in depth.
- The machine learning class is going to go into a lot more about how these different models work and how to get them to work well.
- Model outputs can also be used as input features for other models, often linear,
- though not always. So you can get models that build on top of other models.
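
The variance problem described above, and the way a random forest addresses it, can be reproduced on synthetic data. This is a hedged sketch assuming scikit-learn is installed; it does not use the admissions data from the video, and `flip_y=0.2` deliberately injects label noise so that memorizing the training set does not generalize.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), an assumption for illustration.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7)

# An unlimited-depth tree can memorize the training data.
tree = DecisionTreeClassifier(max_depth=None, random_state=7)
tree.fit(X_train, y_train)

# A forest fits many trees on bootstrap samples and combines their outputs.
forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)
tree_test = tree.score(X_test, y_test)
forest_test = forest.score(X_test, y_test)

print(tree_train, tree_test, forest_test)
```

The single tree reaches perfect training accuracy but drops on the test split. The forest's test accuracy is typically higher than the single tree's on data like this, though the exact numbers depend on the data and the random seed.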

## 🎥 Inference and Ablation

How do we understand, *robustly*, the performance of our system?
What contributes to its performance?

INFERENCE AND ABLATION
Learning Outcomes
Make inferences about model accuracy
Understand interplay of cross-validation and inference
Use ablation studies to make inferences about feature or sub-model importance
Photo by Siora Photography on Unsplash
Train/Test Split
The Data
Training Data
Testing Data
One Way
Train model
Experiment with different model designs
Experiment with different features
Select hyperparameters
Evaluate Effectiveness
Significant?
Testing Effectiveness
For test data, we have:
Individual classifications, right or wrong
Single metric value (accuracy, precision, etc.) for each classifier
Can’t significance test a single value!
Testing Objectives
Does my classifier perform better than benchmark value?
What is the precision of my estimated classifier accuracy?
Confidence interval
Does classifier A perform better than B?
P-value
Confidence interval for the difference
Test Samples: Confidence Intervals
Solution 1: treat each test item as a binary measurement
If metric denominator from test data: Wilson confidence interval
statsmodels: proportion_confint (with method='wilson')
Works for accuracy, FPR, FNR, recall, specificity
Any metric:
Bootstrap the test samples
Compute metric from bootstrap samples
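
The two confidence-interval approaches on this slide can be sketched as follows. This is a minimal illustration using only the standard library, not the course's code: `wilson_interval` writes the Wilson score formula out by hand (the slide's `proportion_confint` with `method='wilson'` is the statsmodels route to the same kind of interval), and `bootstrap_ci` is a simple percentile bootstrap. The function names, the toy data, and the 85% agreement rate are all assumptions made for the example.

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive (0.0 if none)."""
    hits = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(hits) / len(hits) if hits else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI; (label, prediction) pairs are resampled together."""
    rng = random.Random(seed)
    pairs = list(zip(y_true, y_pred))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        ts, ps = zip(*sample)
        stats.append(metric(ts, ps))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

# Toy test set (hypothetical): a classifier that agrees with the label ~85% of the time.
rng = random.Random(42)
y_true = [rng.randint(0, 1) for _ in range(200)]
y_pred = [t if rng.random() < 0.85 else 1 - t for t in y_true]

# Accuracy's denominator is the test set size, so the Wilson interval applies.
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
acc_lo, acc_hi = wilson_interval(n_correct, len(y_true))

# Precision's denominator depends on the classifier output, so bootstrap instead.
prec_lo, prec_hi = bootstrap_ci(y_true, y_pred, precision)

print(f"accuracy  {n_correct / len(y_true):.3f}  Wilson CI    ({acc_lo:.3f}, {acc_hi:.3f})")
print(f"precision {precision(y_true, y_pred):.3f}  bootstrap CI ({prec_lo:.3f}, {prec_hi:.3f})")
```

Resampling the pairs rather than the labels and predictions separately is what keeps each ground-truth label attached to its classifier output.
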
P-Value for Accuracy
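
A sketch of the McNemar test described in the video, under the assumption that the statistic is M = (n_YN - n_NY)^2 / (n_YN + n_NY) with no continuity correction, where n_YN counts test items that classifier 1 got right and classifier 2 got wrong, and n_NY the reverse. The function name and the counts below are made up for illustration. The p-value uses the fact that the upper tail of a chi-squared distribution with one degree of freedom can be written with the complementary error function, so no statistics library is needed here (SciPy's `scipy.stats.chi2` is the more usual route).

```python
import math

def mcnemar_test(correct_1, correct_2):
    """McNemar test from per-item correctness flags of two classifiers.

    Returns (M, p_value) for the null hypothesis of equal accuracy.
    """
    n_yn = sum(a and not b for a, b in zip(correct_1, correct_2))  # 1 right, 2 wrong
    n_ny = sum(b and not a for a, b in zip(correct_1, correct_2))  # 1 wrong, 2 right
    if n_yn + n_ny == 0:
        return 0.0, 1.0  # the classifiers never disagree
    m = (n_yn - n_ny) ** 2 / (n_yn + n_ny)
    # Upper tail of chi-squared with 1 degree of freedom: P(X >= m) = erfc(sqrt(m / 2))
    p_value = math.erfc(math.sqrt(m / 2))
    return m, p_value

# Hypothetical results on 100 test items:
# 70 both right, 15 only classifier 1 right, 5 only classifier 2 right, 10 both wrong.
correct_1 = [True] * 70 + [True] * 15 + [False] * 5 + [False] * 10
correct_2 = [True] * 70 + [False] * 15 + [True] * 5 + [False] * 10

m_stat, p_value = mcnemar_test(correct_1, correct_2)
print(f"M = {m_stat:.2f}, p = {p_value:.4f}")
```

Only the items where the classifiers disagree enter the statistic; the both-right and both-wrong cells do not affect it.
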
Testing Regression
For regression, each sample is a continuous measurement of the model’s prediction error.
Use paired t-test or appropriate bootstrap
Repeated Testing
With repeated cross-validation, we can compute a t-statistic
Run 5 times
Each time, do 2-fold cross-validation
See reading.
Simple cross-validation not great– too much non-independence
Repeated test sampling unreliable – too much non-independence
Cross-Validation and Train/Test Split
Cross-validation sometimes used for final eval
Allows data leakage – what did you do your model & feature selection on?
Good for:
Limited engineering – just see how well the model works
Model and feature design – when done on training data
Understanding Performance & Behavior
Suppose you are detecting spam with:
Text features
Metadata features
URL features
URL reputation model
Sender reputation model
What makes it work?
Ablation Studies
An ablation study examines impact of individual components
Turn each off in turn
Measure classification performance
Lets you see how much each component contributes
Do use results for production decisions, future work
Do not use results to revisit model design (in this trial)
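
The ablation loop can be sketched on a toy problem. Everything here is an assumption made for illustration: the three component names, the synthetic data, and the deliberately tiny "model" (average the active component scores, then pick a cutoff on the training data). It uses accuracy rather than precision for brevity. The point is only the shape of the procedure: fit the full model, then refit with each component turned off and compare on the held-out test data.

```python
import random

COMPONENTS = ["text", "url_reputation", "sender_reputation"]  # hypothetical components

def make_data(n, rng):
    """Synthetic messages: each component emits a noisy score correlated with the label."""
    noise = {"text": 0.5, "url_reputation": 0.8, "sender_reputation": 1.5}
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feats = {c: label + rng.gauss(0, noise[c]) for c in COMPONENTS}
        rows.append((feats, label))
    return rows

def score(feats, active):
    """Model score: mean of the active components' scores."""
    return sum(feats[c] for c in active) / len(active)

def accuracy(rows, active, threshold):
    hits = sum((score(f, active) >= threshold) == bool(y) for f, y in rows)
    return hits / len(rows)

def fit_threshold(train, active):
    """'Training': choose the cutoff that maximizes accuracy on the training data."""
    candidates = sorted(score(f, active) for f, _ in train)
    return max(candidates, key=lambda t: accuracy(train, active, t))

def ablation_study(train, test):
    """Evaluate the full model, then retrain and evaluate with each component off."""
    results = {"full model": accuracy(test, COMPONENTS, fit_threshold(train, COMPONENTS))}
    for off in COMPONENTS:
        active = [c for c in COMPONENTS if c != off]
        threshold = fit_threshold(train, active)  # retrain without the component
        results[f"without {off}"] = accuracy(test, active, threshold)
    return results

rng = random.Random(0)
train, test = make_data(300, rng), make_data(300, rng)
results = ablation_study(train, test)
for name, acc in results.items():
    print(f"{name:28s} accuracy = {acc:.3f}")
```

The gap between the full model's row and each "without" row is the estimate of that component's contribution. In line with the slide, those numbers inform production decisions and future work rather than feeding back into this trial's model design.
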
Wrapping Up
Inference for classifier performance is not immediately straightforward.
Several techniques helpful.
Be careful about data leakage. Sometimes tradeoffs are needed.
Photo by Bianca Ackermann on Unsplash

- In this video, I want to talk with you about inference on model effectiveness and introduce the idea of an ablation study.
- So our goals are for you to be able to make inferences about model accuracy and
- understand a little bit better the interplay of cross-validation and inference,
- remembering that we can't be perfect. The goal is to do a good and a credible job.
- And then also to be able to use an ablation study to make inferences about the particular
- contributions and value of different features or subcomponents of your model.
- So remember, we've got this train/test split.
- We have the training data, where we're doing all of our iterative process.
- It's a big, loopy thing. And then we evaluate our effectiveness.
- One thing we haven't talked about yet is: is the effectiveness significant?
- For our test data, we have a few outputs.
- We have the individual classifications or predictions. For classifications, we have whether they're right or wrong; for predictions,
- we have the error. And then we have a metric value (accuracy, precision, etc.) for each classifier.
- One of the challenges, though, is that for the classifier and the test data, we just have a single number: accuracy is 0.99, or precision is 0.4.
- We can't significance test that single value.
- To set up how we can significance test (or otherwise do inference, I should say, because,
- as we discussed earlier, a lot of times we might actually care about an effect size estimate with confidence intervals
- a lot more than we care about a significance test),
- there are a few questions that we want to answer as the results of an evaluation. Does my classifier perform better than some benchmark value?
- We might have a value we want to beat, say, a value we know is good enough,
- and we want to know if my classifier performs better than that value.
- We might want to get an estimate of our classifier's accuracy or precision or recall (pick our metric) that has a confidence interval on it,
- so we know how precise this estimated performance measure is.
- And then we may also want to answer the question: does classifier A perform better than B?
- Maybe B is our current system, or B is the existing known state of the art.
- And we want to know if A does better. We might want a p-value.
- We might want a confidence interval for the improvement or the difference in performance between A and B.
- So, to get started: one way we can compute a confidence interval
- is to treat each test item as a binary measurement.
- Say you've got a hundred thousand test items because you've got a very large dataset:
- with a 10 percent split,
- you've got a million data points and 100,000 test points. For each of these,
- you have the true value, yes or no, and you have the prediction, yes or no.
- If the metric's denominator comes from the test data, you can use a Wilson confidence interval. For accuracy it definitely does, because accuracy is the number correct over all test items.
- You can also do this for false positive rate, false negative rate,
- recall, specificity: anything where the denominator is completely determined by the test data, not by the classifier results.
- statsmodels does this with proportion_confint; a Wilson confidence interval is a confidence interval for a proportion.
- For any metric, you can bootstrap: you can take your test samples,
- take bootstrap samples of them, and then compute your classifier metric over your bootstrap samples.
- Now, you have to be careful when you're doing your bootstrap samples to make sure that
- you keep the ground truth labels and the classifier outputs together.
- And if you're doing multiple classifiers, you have to keep all the classifier outputs together as you're computing these bootstrap samples.
- That way you can bootstrap from your test data and get a confidence interval for any of your classifier performance metrics.
- You can also compute a p-value for the accuracy metric. This specific technique only works for accuracy.
- It does not work for any of our other classifier metrics. But you can get a p-value for the null hypothesis that the two classifiers have the same
- accuracy by using what's called a contingency table. In a contingency table for this purpose,
- you go from the classifications to whether or not each was right or wrong.
- So here we have the number of times both classifiers were right, and here, the number of times they were both wrong.
- And here we have where classifier one was right and classifier two was wrong.
- How often did that happen? We can do the same the other way around.
- And then we do what's called a McNemar test, and it uses these values: n_NY is the count where classifier one is wrong
- and two is right, and n_YN is the count where one is right and two is wrong.
- So we take the squared difference of their wrongnesses and divide it by the sum of their wrongnesses, and this gives us a statistic, M.
- Under H0, the null hypothesis, M follows what's called
- a chi-squared distribution with one degree of freedom. It's a probability distribution;
- you can get its CDF from statsmodels or from SciPy.
- And you can use that to compute a p-value: what's the probability of having an M statistic at least this large?
- You don't have to deal with absolute values on it because it's a non-negative statistic in a non-negative distribution.
- There is something called a proportion test, but the proportion test is for independent proportions and independent samples.
- But we don't have independent samples. We have one sample of our test data.
- And for each test point, we have two measurements: classifier one and classifier two.
- So we can't use a proportion test.
- But the McNemar test basically does this paired proportion test kind of thing and allows us to get a p-value for
- whether the classifiers have the same accuracy or not.
- In this example, the p-value does not allow us to reject the null hypothesis that they have the same accuracy.
- The p-value is about one.
- We can also test regression. Each sample is a continuous measurement of the model's prediction error.
- So we have y_i minus ŷ_i from model A, and we have y_i minus ŷ_i from model B.
- Those are two different measurements, so
- we can use a paired t-test, or we can use an appropriate bootstrapping mechanism, in order to compare the accuracy of regression models.
- Now, when we do cross-validation: one technique you sometimes see is to do, say, 10-fold cross-validation.
- That gives you 10 accuracies for each classifier, and you can compute a paired t-test.
- Each of the folds in your cross-validation is one data point in your sample, so you've got n equals 10.
- You can do a t-test, but that actually doesn't work very well because your samples are not independent
- if you're doing k-fold cross-validation.
- Also, if you just repeatedly draw a 10 percent sample and draw a 10 percent sample and do that, say, 30 times,
- you have the same problem: the same data points are going to show up in multiple of your samples.
- Also, your classifiers are being trained on the same data too much.
- The ideal is to be able to draw, say, 30 completely independent training and testing sets from your big population.
- But if you can't do that and you're trying to simulate it with cross-validation,
- the non-independence causes the resulting statistical test to not be reliable.
- One thing you can do is repeated cross-validation, where five times you do a two-fold cross-validation.
- I'm going to refer you to one of the readings I put in the notes for a lot more details on this.
- I just wanted to bring it up so that you know it's there. Cross-validation is sometimes used for final evaluation.
- You'll find this in papers sometimes.
- One of the problems, though, is that this allows data leakage, because you're testing on data that was available while you were training and tuning:
- you're testing on all of the data that was available in your training set.
- This can be a significant problem. If we've got a large enough data
- set that we can just use a single test split, or maybe two or three test splits,
- that's going to allow us to avoid leakage and much better simulate what's going to happen
- when we put the model in production. Cross-validation is really useful in a couple of contexts.
- One is where you're not doing much model design or feature engineering. You have data and you just want to
- take a model, apply it, and see how it works. Cross-validation is great for that.
- You don't have the iterative process of "how am I really getting this model to work?"
- You can cross-validate; if you've got a hyperparameter search,
- do the hyperparameter search separately for each fold, making it part of your training process.
- Things like LogisticRegressionCV help with that.
- But if you've got a model and just want to see how well it works on the data, cross-validation can work pretty well.
- Also, when you are doing cross-validation on the training data to iteratively improve your model and feature design,
- that can work really, really well. The problem arises when you're doing a lot of engineering on your model
- and you get access to the test data, which you effectively have in a cross-validation setup.
- Because even if, say, you do 10-fold cross-validation and you pick one of the folds
- to be what you're really doing your development on, all of your other test data is in
- this initial development part.
- So you're effectively using the test data as part of your tuning process for your hyperparameter selection,
- as part of your exploratory data analysis. And that is a cause of leakage.
- Again, though, we can never be perfect, but it's important to be aware of this as a cause of leakage.
- I really recommend having a designated test set that you hold out
- and don't touch. That's the basis of your evaluation, even if it makes the statistical inference a little bit harder.
- Now, another thing I want to talk about:
- let's suppose you've got a complex model. We're detecting spam; we're working for, say, a
- telecommunications company detecting text message spam, or we're detecting email spam for an email company.
- We have text features. We've got metadata features. When they're sending URLs,
- we've got features of the URL itself. Maybe we even hit the server.
- Let's say we've got another couple of sophisticated models that score
- URLs by their reputation and also score senders
- by a reputation score. Large email antispam efforts, such as the one built into Gmail,
- all do this. I'm not just making that up: a part of antispam at scale is building reputations for URLs and senders.
- And let's say our spam detector works well: precision of 99.5 or 99.9 percent,
- recall of 80 percent. But what makes it work?
- Which of these features is contributing how much to its success?
- The answer is to do what's called an ablation study. An ablation study takes our model:
- we take our whole model and see how accurate it is, and then we turn off individual features of it.
- So we might turn off the sender reputation. How exactly you turn it off depends on the model design.
- If it's a neural net, you might just take that part out of your neural net graph. For a logistic model where everything's well standardized,
- you can put in zeroes for the feature and not retrain, or just take that term out of the model,
- retrain on your training data, and try to predict your testing data. You probably want to do that just in case,
- to make sure the parameters are being tuned without the piece. What this lets you see, though, is how much each component contributes.
- You can say, OK, my model gets 99 percent precision on spam, and it gets 98 percent precision if I turn off the sender reputation.
- Well, that lets you see that the sender reputation is responsible for one percentage point of my precision.
- Now, it's important to be careful how you use this, because you can use this for production decisions and for future work.
- You do this ablation study and you discover, OK, the sender reputation's only contributing one percent,
- or maybe it's contributing 0.1 percent, and it's really expensive in terms of compute time and engineer time to maintain. Maybe we stop using it.
- You could also use it for your future research work. What you can't do, particularly within the scope of one study,
- is use the results of your ablation study to go back and revisit your model design. That gets you your leakage again.
- Again, as I said, in the academic setting where
- we're doing multiple studies on the same data, we do get some leakage, and we carry it forward to the next study.
- Again, we can't be perfect, but
- there's a difference between the ablation study and feature engineering. In feature engineering, I'm trying a bunch of things: some things I keep,
- some I'm not going to keep. I'm doing it with the tuning data, and things are going back into the model; I'm not keeping my careful firewalls.
- In the ablation study, I have my top-line performance number: here's my model, I ran it,
- it got 99 percent precision. And then I'm trying to understand:
- well, what are the drivers of that? I'm not putting it iteratively back into my loop,
- going back and rerunning my stuff on my training data with it. I'm just using it to get knowledge to carry forward.
- That doesn't cause leakage within the context of the specific study we're talking about.
- And it is an acceptable practice.
- And it's a very, very useful practice for understanding the contributing factors to the performance of a complex model.
- So, to wrap up: inference for classifier performance is not immediately straightforward.
- There are several helpful techniques that I've pointed you to here and in the readings. Be careful about data leakage,
- but again, sometimes tradeoffs are needed.

## 🎥 Dates

This video discusses how to work with dates in Pandas.

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
DATES
Learning Outcomes
Parse and transform dates
Adjust dates using date offsets
Photo by J. on Unsplash
Dates and Representations
Time moves forward at a constant rate (generally…)
How we record it changes
Daylight savings time – the same hour happens twice
Key insight: time is different from representation
Typically: store time in monotonic form, translate for presentation
Numeric Representations
Unix timestamps — time since (or before) midnight Jan 1, 1970
Seconds, milliseconds, nanoseconds
Reference point is UTC
Julian day numbers
Days since January 1, 4713 BC
Floating point stores time as fraction of a day
“1900 system” (Excel): days since Jan. 1, 1900
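
As a small illustration of these numeric representations, here is a sketch using only the Python standard library. The timestamp value is the one from the slides; the helper name `julian_day_to_datetime` and the sample Julian day number are my own, and the conversion relies on the fact that Julian day 2440587.5 corresponds to midnight UTC on January 1, 1970.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Unix timestamp (seconds since the epoch, UTC) to a calendar date and time
ts = 230481083
moment = datetime.fromtimestamp(ts, tz=timezone.utc)
print(moment.isoformat())

# And back again: the numeric form is recoverable from the representation
print(int(moment.timestamp()))

def julian_day_to_datetime(jd):
    """Convert a Julian day number (fractional days) to a UTC datetime."""
    return UNIX_EPOCH + timedelta(days=jd - 2440587.5)

# Julian days start at noon, so a whole number lands on 12:00 UTC
print(julian_day_to_datetime(2459157.0).isoformat())
```
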
String Representations
ISO: 2020-11-03
Alphabetic sorts by date (for AD, until 10000)
Localized numeric
US: 11/03/2020
EU: 03/11/2020
Longer
November 3, 2020
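
A short sketch of the two points on this slide, with made-up example dates: ISO strings sort chronologically when sorted as plain text, while a localized numeric string parses to different dates depending on which convention is assumed.

```python
from datetime import datetime

# ISO strings: plain string sorting is also chronological order
iso_dates = ["2020-11-03", "2019-12-25", "2020-02-14"]
print(sorted(iso_dates))

# US-style strings: string sorting puts December 2019 last
us_dates = ["11/03/2020", "12/25/2019", "02/14/2020"]
print(sorted(us_dates))

# The same localized string read under two conventions
ambiguous = "11/03/2020"
as_us = datetime.strptime(ambiguous, "%m/%d/%Y")  # month first: November 3
as_eu = datetime.strptime(ambiguous, "%d/%m/%Y")  # day first: 11 March
print(as_us.date(), as_eu.date())
```
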
datetime64
The Pandas datetime64 type stores dates (and times).
Construct from:
Number, units, and epoch
pd.to_datetime(230481083, unit='s') — seconds since Unix epoch
pd.to_datetime(3810401, unit='D', origin='julian') — days since Julian epoch
String and format
pd.to_datetime('2020-11-03') — convert from ISO
pd.to_datetime('Nov 3 2020', format='%b %d %Y')
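
The constructions on this slide can be tried out as below. This is a sketch rather than course code. One deliberate difference: the Julian example uses a day number of my own choosing (2459156.5, midnight on November 3, 2020) instead of the slide's value, because nanosecond-resolution timestamps only reach to roughly the year 2262. The variable names are illustrative.

```python
import pandas as pd

# Number + unit (+ origin)
from_seconds = pd.to_datetime(230481083, unit='s')                  # seconds since Unix epoch
from_julian = pd.to_datetime(2459156.5, unit='D', origin='julian')  # days since Julian epoch

# Strings: ISO by default, or an explicit format
from_iso = pd.to_datetime('2020-11-03')
from_custom = pd.to_datetime('Nov 3 2020', format='%b %d %Y')

# A whole column of strings converts in one call
date_strings = pd.Series(['2020-11-03', '2021-01-15'])
dates = pd.to_datetime(date_strings)

print(from_seconds, from_julian, from_iso, from_custom, sep='\n')
print(dates)
```
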
timedelta
The Pandas timedelta type stores time offsets
Create from number + units or string
‘1 day 00:30:22’ – one day, 30 minutes, and 22 seconds
Mark advances in linear time
DateOffset
DateOffset type stores date offsets to adjust calendar days
Create from number + units
pd.DateOffset(months=240)
Correctly offset dates, even with underlying nonlinearities
Months don’t have the same length
DST, leap years, leap seconds
Not directly supported by Series
Can use in ‘apply’: month_series.apply(lambda m: pd.DateOffset(months=m))
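
To make the timedelta versus DateOffset distinction concrete, here is a sketch with a hypothetical loans table (the column names `issue_date`, `term_months`, and `due_date` are assumptions, not names from the assignment). It follows the `apply` approach from the slide. pandas may emit a PerformanceWarning for the addition, since adding an object series of offsets is not vectorized.

```python
import pandas as pd

# timedelta: a fixed span of linear time
delta = pd.Timedelta('1 day 00:30:22')
print(delta.total_seconds())

# 30 days versus one calendar month from the same start date
start = pd.Timestamp('2020-01-31')
plus_30_days = start + pd.Timedelta(days=30)
plus_1_month = start + pd.DateOffset(months=1)
print(plus_30_days.date(), plus_1_month.date())

# Hypothetical loans: issue date plus a term in months gives the due date
loans = pd.DataFrame({
    'issue_date': pd.to_datetime(['2020-01-31', '2020-11-03']),
    'term_months': [1, 36],
})
offsets = loans['term_months'].apply(lambda m: pd.DateOffset(months=m))
# to_datetime makes sure the result is stored as a datetime column
loans['due_date'] = pd.to_datetime(loans['issue_date'] + offsets)
print(loans)
```
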
Date Arithmetic
datetime + timedelta = datetime
datetime + DateOffset = datetime
DateOffset * num = DateOffset
Comparisons
DateTime supports comparison operators (==, <, etc.)
Need to create DateTimes on both sides
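
The arithmetic and comparison rules above can be checked with a few lines. This is a sketch with made-up dates and variable names. The cutoff is converted explicitly with `pd.to_datetime` so that both sides of the comparison are datetimes, as the slide recommends.

```python
import pandas as pd

issued = pd.Timestamp('2020-11-03')

# DateOffset * num = DateOffset: two months, five times over
ten_months = pd.DateOffset(months=2) * 5

# datetime + DateOffset = datetime
due = issued + ten_months

# datetime - datetime = timedelta
gap = due - issued
print(due.date(), gap)

# Comparisons: build a datetime for the cutoff, then compare element-wise
dates = pd.to_datetime(pd.Series(['2020-01-15', '2020-11-03']))
cutoff = pd.to_datetime('2020-06-01')
after_cutoff = dates > cutoff
print(after_cutoff.tolist())
```
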
Wrapping Up
Dates and times are typically stored internally using offsets from an origin.
Pandas provides several date and time features, including datetime, timedelta, and DateOffset.
Photo by Bundo Kim on Unsplash

- In this video, I want to talk with you about dates.
- The learning outcomes are for you to be able to parse and transform dates and adjust dates using date offsets.
- First, I want to talk just briefly about the difference between a date as we say it, like,
- OK, it's November the 3rd, 2020, and the underlying time.
- Dates do all kinds of funny things. When we change to or from daylight saving time, we skip an hour or we repeat an hour.
- But the underlying time stream doesn't repeat.
- It's just that our way of mapping that to the way we write it down repeats.
- So we can think of underlying time as moving forward at a constant rate.
- Generally; there's relativity and all of those things.
- But time is moving forward, and how we record it changes and is complex and subject to a lot of rules.
- The key thing is that, just as text content is different from its encoding,
- time is different from its representation.
- One of the implications of this is that we typically store time in a monotonic form, like seconds since a particular date in UTC,
- and then we translate for presentation. So you'll store the time as an offset in UTC, and then you'll translate
- that to the local time zone, with all of the daylight saving rules and everything,
- when you go to actually display it.
- Internally, there are a few ways we can represent time numerically, and sometimes you'll need to do this yourself.
- One is Unix timestamps, which are time since (or before; they can be negative)
- midnight UTC, January 1st, 1970. Often this is stored in seconds.
- Python (not pandas or NumPy, but the Python standard library) tends to store time in floating-point seconds since that midnight.
- The reference point for this, as I said, is UTC. You can also store milliseconds or nanoseconds since that time.
- If you have a data file that has a column that's labeled as a timestamp and it's an int,
- a number, there's a very good chance it's a Unix timestamp. That's a very common way to store dates and times.
- We can also store Julian day numbers, which are days since January 1st,
- 4713 BC. You can store a time by using a floating-point number;
- it might be twenty two million, three hundred and seventy five point eight days.
- There are also other origins; you can use a lot of different origins. Pandas actually lets you specify arbitrary origins.
- The 1900 system that's used by Excel and other spreadsheets stores days since January 1st,
- 1900. We can also store dates as strings. The ISO format is year, month, day.
- This has the nice advantage that, at least until the year 10,000,
- it sorts by date: if you sort alphanumerically, it's going to sort the resulting dates by date.
- So if you're going to name files after dates, with dates at the beginning of the file name,
- this is super useful.
- There are also localized numeric forms such as 11/3/2020, which is how we write dates in the United States. Europe and the UK
- generally write day, month, year: 3/11/2020.
- So if you see a date that's two digits, two digits, year, that's not enough information to know what we're talking about.
- Are we talking about November 3rd or are we talking about March 11th?
- You need to know the country or locale the date came from to know how to correctly interpret it.
- Sometimes you can infer it by looking for, say, November 28th:
- because 28 isn't a valid month number, that'll let you figure out which one you're dealing with.
- But in this localized form, if you just get a date, it's often ambiguous.
- You could also have longer string forms, like writing out November 3rd,
- 2020. Pandas provides a type called datetime64 that allows you to store dates and times.
- Even if you just have a date, you usually store it as a datetime at midnight.
- At least that's how you work with it in pandas; pandas doesn't have a time-free date type.
- You can create a datetime from a number, units, and an origin.
- So you can say we want 230 million
- seconds since the Unix epoch, or we want 3.8 million days since the Julian origin.
- The to_datetime function also lets the number be a series or an array in addition to a single number,
- so you can create a series or an array of pandas datetime objects.
- You can also convert from a string. This also can be a series or array of strings.
- So in an assignment, if you've got a column that's string dates,
- you can convert that to a column of datetimes by using this function.
- By default, it's going to parse the time from ISO format,
- but you can also tell it to parse other formats by providing a format string that describes how the time is laid out.
- There's a link in the pandas documentation to the way these format strings work; I've also provided
- that link in the notes that go with this video. So we've got datetimes. Pandas also has an object called a timedelta,
- which stores a difference between two times. If you subtract one datetime from another,
- what you're going to get is a timedelta.
- You can create one from a number plus units, or from a string that describes it; for example, I can create the timedelta one day,
- thirty minutes, and twenty-two seconds. The timedelta marks advances in linear time.
- You can't create a timedelta of, for example, one month. The DateOffset is what you use to get one month.
- You can create it from a number and units, and it correctly offsets the dates:
- it knows whether it needs to extend by 30 days or 31 or 28.
- It handles daylight saving time, it handles leap years, it handles leap seconds, and deals with being able to offset dates properly.
- DateOffset is not natively supported by series. Datetimes and timedeltas are both natively supported by pandas in series.
- You can, however, create an object series that contains DateOffsets.
- So if month_series is a numeric series that contains numbers of months, then we can use apply.
- It's a little slow, because it's effectively doing a Python loop,
- but we can use apply to convert these numbers of months into DateOffset objects.
- We get a series of those, which we can then add to a series of datetimes in order to produce offset datetimes.
- For example, if we've got a column that has the term, the number of months a loan is for, and a column for when the loan was issued,
- we can convert the months to a DateOffset, add it to the issue date, and find when the loan is due.
- When you're doing arithmetic with dates, if you add a datetime and a timedelta,
- you're going to get a datetime. A datetime plus a DateOffset is also a datetime. You can subtract as well as add.
- As I said, if you subtract two datetimes, you're going to get a timedelta.
- You can also multiply a DateOffset by a number, and it's going to give you another DateOffset that's multiplied.
- So if you've got two months, you can multiply it by five and you'll get 10 months.
- You can also compare datetimes using comparison operators.
- You do need to create datetimes on both sides. If you've got something that's, say, strings, you need to convert it to a datetime object
- so that you can do the comparison. So in conclusion, dates and times are typically stored internally using offsets from an origin.
- Usually we store them in UTC and then translate them to local time
- when we go to display them. Pandas provides a number of functions and types for working with dates and times.
- In addition, NumPy provides some of its own. I generally work with the pandas ones, but NumPy does provide timedelta
- and datetime objects that work just a little bit differently. Python also does in its standard library. For our purposes,
- I recommend generally sticking with the pandas ones.

## 🚩 Quiz 11

Quiz 11 is in Canvas.