# Week 11 — More Modeling (10/31–11/4)

This week, we’re going to learn more about model building, which will be useful in Assignment 5.

## 🧐 Content Overview

This week has **1h33m** of video and **3400 words** of assigned readings. This week’s videos are available in a Panopto folder.

## 🎥 Intro & Context

In this video, I review where we are at conceptually, and recap the ideas of estimating conditional probability and expectation.

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
BUILDING AND EVALUATING MODELS
Learning Outcomes (Week)
Build and refine a predictive model
Construct features for a model
Apply regularization to control features and their interaction
Measure a model’s effectiveness and other behavior
Where We’re At
Linear regression (continuous prediction)
Logistic regression (binary classification)
Optimizing objective functions
Minimize loss functions (e.g. squared error)
Maximize utility functions (e.g. log likelihood)
Conditional Estimates
Distinction: Model Probability vs. Make Decision
Trick: Probability and Expectation
Wrapping Up
We are building models that estimate conditional probability or expectation and use them to make classifications.
We’re going to see more about their inputs and outputs this week.
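
The slides for "Conditional Estimates" and the probability/expectation trick appear to have carried equations that are not in the slide text. As a sketch of what the lecture describes, with notation that is assumed here ($Y$ the outcome, $X$ the features, $A$ an event, $\mathbb{1}_A$ its indicator function):

$$
\begin{aligned}
\text{regression:} \quad & \hat{y}(x) \approx \mathrm{E}[Y \mid X = x] \\
\text{classification:} \quad & \hat{p}(x) \approx \mathrm{P}(Y = 1 \mid X = x) \\
\text{indicator:} \quad & \mathbb{1}_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases} \\
\text{trick:} \quad & \mathrm{P}(X \in A) = \mathrm{E}[\mathbb{1}_A(X)]
\end{aligned}
$$

Applied to a binary outcome with $A = \{1\}$, this gives $\mathrm{P}(Y = 1 \mid X = x) = \mathrm{E}[\mathbb{1}_{\{1\}}(Y) \mid X = x]$, so estimating a class probability can itself be read as a conditional-expectation problem.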

- In this video, I'm going to introduce our week's topic of building and evaluating models, talking in more detail about how we go about doing that.
- The learning outcomes for the week are for you to be able to build and refine a predictive model,
- construct features for that model, apply regularization to control features and their interactions and to give us models that
- generalize better, and measure a model's effectiveness and its other behavior.
- So where we're at right now: we've seen linear regression, which lets us do continuous prediction,
- where we want to predict a continuous outcome or target variable.
- We've seen logistic regression that lets us take the concept of linear modeling and move it into the realm of binary classification,
- where rather than having a continuous outcome variable, we have a binary outcome such as defaulted on the loan or is spam or fraud.
- We've also seen the idea of optimizing objective functions: we might minimize a loss function such as the squared error,
- or we might maximize a utility function such as the log likelihood. These are equivalent to each other:
- if you've got a utility function and a minimizer, you can minimize the negative of the utility function.
- We've also seen that we can think of what we're doing with modeling as conditional estimation.
- So in a regression model, we're trying to estimate the conditional expectation: given a particular set of values for my input features X,
- what's the expected value of Y? We might do some transformations to all these variables,
- but we're trying to compute this conditional expectation function:
- what's the expected value of Y conditioned on my feature values X? In classification,
- we're trying to solve a conditional probability problem:
- what's the probability of a particular outcome, given that I have some particular feature values x?
- There's another thing in here that's useful to think about,
- which is that logistic regression, at its heart, is trying to model the probability of your data.
- What we've been doing with statsmodels is this:
- we call `model.predict`, we get some scores, and then we use the scores to make a decision. Internally,
- the logistic regression is solving the problem of maximizing the log likelihood.
- Mathematically, what the logistic regression is doing is trying to build a probabilistic model of the data, and the
- parameters are estimated based on their ability to accurately model probabilities in your training data.
- We then use these output probabilities to make decisions. So we'll say success if ŷ is greater than 0.5.
- scikit-learn's logistic regression classifies directly, using a threshold of 0.5,
- but you can still get the scores out of it: `predict_proba` returns the estimated probabilities, and `decision_function` returns the corresponding log odds.
- This is important to note.
- So the log likelihood that you get out of a logistic regression is not based on the actual decisions that it's making.
- It's based on its ability to model probabilistically what the labels look like in your training data.
- It's the probability that the model assigns to those labels with the final fitted values of the parameters.
- I want to briefly mention again a trick that I mentioned, I believe, last week:
- expected value and probability are closely related. The expected value is the integral or the sum of values weighted by their probabilities.
- But also, if we have an indicator function I, which is 1 if
- x is in the set and 0 if it is not,
- then basically, given a value, it decides whether or not it's in the set. If that set is an event, it says whether or not the event happened.
- The probability of the event and the expected value of the indicator function are the same thing.
- So we can think about estimating conditional expectation or probability, but we can also think about everything as estimating conditional expectation:
- when we're estimating a probability, we're estimating the conditional expectation of the characteristic or indicator function.
- So to wrap up, we're building models that estimate conditional probability and expectation.
- We've been doing this in a variety of ways. We use these models to make decisions.
- We've got the idea of doing the modeling; this week, we're looking more at how we build inputs for these models
- and how we evaluate the outputs that we get out of them.
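
The distinction between modeling a probability and making a decision can be sketched in scikit-learn as follows. This is a minimal illustration, not code from the course: the synthetic data, its coefficients, and all variable names are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): one feature, binary outcome.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 1))
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))
y = rng.binomial(1, p_true)

model = LogisticRegression().fit(X, y)

# Model the probability: estimated P(Y = 1 | X = x) for each row.
probs = model.predict_proba(X)[:, 1]
# decision_function returns the linear score (log odds), not a probability.
log_odds = model.decision_function(X)

# Make the decision: threshold the estimated probability at 0.5,
# which is what predict() does for a binary logistic regression.
decisions = (probs > 0.5).astype(int)

# The log likelihood is computed from the probabilities assigned to the
# observed labels, not from the thresholded decisions.
log_lik = np.sum(y * np.log(probs) + (1 - y) * np.log(1 - probs))

# Indicator trick: the mean of a 0/1 indicator is the empirical
# probability of the event it indicates.
event_rate = np.mean(y == 1)
```

Two models with identical decisions can still have different log likelihoods, because one may assign more confident probabilities to the correct labels than the other.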

## 🎥 Workflow

How do you do feature engineering and model selection in a machine learning workflow?
What is the iterative process involved?

WORKFLOW AND ITERATION
Learning Outcomes
Properly split training, tuning, and evaluation data.
Understand what is and is not “cheating” for evaluating a predictive model.
Split the Testing Data
The Data
Training Data
Testing Data
One Way
Train model
Experiment with different model designs
Experiment with different features
Select hyperparameters
Evaluate Effectiveness
Refine
Motivation
Purpose: build models that can process new data
Eval goal: simulate model processing new data
Method: hide some data and pretend it’s new
Violation: allowing “new” data to affect the model design
Iterative Modeling
Work with the train data:
Exploratory analysis
Try different features and transforms
Try different hyperparameters
Try different models (logistic, random forest, etc.)
Test effectiveness with tuning set (another test set held out from the training data)
Or cross-validation (e.g. LogisticRegressionCV)
Applying to Test Data
Apply feature transforms / combos to test data
Otherwise the model won’t work
Apply them, but don’t use test data to assess if they’re useful
If you transform target in train, do that in test too!
Use trained model to predict test data
Measure accuracy / precision / whatever
Outcome of model process: one model (or one from each family) to evaluate for effectiveness.
Dos and Don’ts
Do
Split training data into further subsets (tune data) to test model concepts
Iteratively refine model’s predictive quality w/ tuning data
Explore and test features on training data
Don’t
Go back to fix the model if it performs poorly on test data
Use test data to inform model or feature decisions
Production Systems
Production systems often have new streams of test data.
New data arrives tomorrow!
Knowledge from today’s test data can be used for tomorrow’s modeling.
Carrying Knowledge Forward
Use what you learned on your test data for the next project
May have new data ⇒ no problem
Same data set ⇒ technical violation; less problematic w/ new test sample
Over-reuse of data is a problem, ramifications not fully known.
Wrapping Up
Train/test splits are to help us test the ability of a model to predict future, unseen data.
Using test-data knowledge to inform modeling decisions breaks that down.
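
The workflow on these slides can be sketched as a three-way split: set the test data aside, compare candidate designs on a tuning split of the training data, and touch the test data once at the end. This is only a sketch under stated assumptions: the synthetic data, the split sizes, and the `add_square` candidate feature are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
noise = rng.normal(scale=0.5, size=1000)
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + noise > 0.5).astype(int)

# 1. Set the test data aside and do not look at it while designing the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# 2. Split the training data again into a fitting part and a tuning part.
X_fit, X_tune, y_fit, y_tune = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)


def add_square(X):
    """Hypothetical candidate feature: append the square of column 1."""
    return np.column_stack([X, X[:, 1] ** 2])


# 3. Compare candidate feature sets using the tuning data only.
candidates = {"raw": lambda X: X, "with_square": add_square}
tune_scores = {}
for name, featurize in candidates.items():
    m = LogisticRegression().fit(featurize(X_fit), y_fit)
    tune_scores[name] = accuracy_score(y_tune, m.predict(featurize(X_tune)))

# 4. Pick one design, refit on all training data, and evaluate once on test.
best = max(tune_scores, key=tune_scores.get)
final = LogisticRegression().fit(candidates[best](X_train), y_train)
test_acc = accuracy_score(y_test, final.predict(candidates[best](X_test)))
```

The same feature function is applied to the test data so the model gets the inputs it was trained on, but the choice between candidates is made before the test data is used.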

- In this video, I want to talk more about the workflow and the iterative process of model building and refinement.
- We'll talk about how to properly split training, tuning, and evaluation data,
- and understand better what is and is not cheating when evaluating a predictive model.
- So, our setup: we split off our testing data. We have our main dataset, all the data.
- We split it into training data and testing data. And then on our training data, we train our model.
- We experiment with different model designs and different features. We select hyperparameters.
- We can do this based on the model's internal goodness-of-fit statistics.
- If you're training a linear regression model, you can look at your R-squared,
- your adjusted R-squared, or your AIC. For a logistic regression model,
- you can look at your log likelihood. Or you can do it by testing: running a classifier evaluation metric on some tuning data.
- So you further subdivide your training data into train and tune.
- Or you may do cross-validation, where you split your training data into five or 10 pieces,
- and for each piece you train on the rest of the data, predict that piece, and measure your metric.
- You can do all of these things, basically, so long as you don't touch your
- testing data. You can do whatever you want with your training data to improve and better understand your model.
- Well, not all things are reasonable to do, but you're not cheating with whatever you do there in your training data.
- What you can't do is use knowledge from the testing data to refine
- your modeling process, and this includes exploratory analysis of the testing data.
- The motivation of what we're trying to do with this predictive
- modeling is to build models that are going to be able to process new data.
- So predicting our testing data isn't the point. If you're training something to detect
- fraudulent transactions in your online gaming platform,
- the purpose of your model is never to predict the fraudulent transactions in your historical data.
- The purpose of the model is to be able to run it and, as new transactions happen, categorize them as likely fraud or not.
- And so the goal of our evaluation is to simulate the model's ability to generalize to new data that it hasn't seen yet.
- And the way we do this is we hide some of the data and pretend it's new.
- As soon as you allow this data that's supposed to be new to affect the model,
- when you're simulating what's going to happen if you run this for a week and try to classify the new transactions,
- what you're doing is giving the model, or the modeling process, data that it's not going to be allowed to have in real life.
- We call this leakage: information leaks into the model that, in its actual application,
- it's not going to be able to have. In some ways, it's the opposite of the problem that we have when we give you tests in class. On tests,
- we say you can't have a textbook, you can't have notes, you can't use the Internet; answer these questions.
- But in real life, you can use all of the reference material you want
- any time you have to actually solve a problem in practice. There's still value in internalizing
- a lot of it, because if you haven't internalized a lot
- of the knowledge, it's hard to detect when you're going to need to go look something up.
- If you don't know that overfitting is a problem, then you don't know when you need to go read more about overfitting.
- You just don't even think about it.
- But when it comes to actually doing things, you have all these resources available. In machine learning,
- we have the opposite problem, because in real life
- the model is not going to have access to the test data because you're trying to use it to classify new transactions as they come in.
- You're trying to use it to predict the purchasing behavior of users as they come in.
- You're trying to use it to forecast the load that's going to be on your power grid or on your transportation network for a time in the future.
- And you don't get to look ahead and see any of that information. So in the real world,
- your model does not have access to any information about what it's trying to predict other than what it can learn from historical data.
- And so if you do anything with your test data that leaks information about the unknowns
- it's supposed to be predicting into your model-building process,
- either learning the model itself or the process of figuring out what features, parameters, values, and whatever are going to be useful for your model,
- then you effectively allow the model to cheat.
- It's going to get better performance than it actually will in reality,
- and you reduce the ability of the evaluation process to simulate what you actually care about.
- Can my model effectively predict how much traffic is going to be on the freeway in December,
- based on previous Decembers and on the data from earlier in the year? Let's say we've got 10 years of traffic data.
- Can I accurately predict what the freeway data is going to be this December?
- You don't get to look at this December. If you do, I think the physics department would like to have a word with you.
- So within this setup, we have an iterative modeling process.
- With our training data, we can do exploratory analysis. We can try features and transforms.
- We can try different hyperparameters. A note on what a hyperparameter is:
- the parameters are what we learn in the model. Your logistic regression coefficients, those are parameters;
- we learn them from the data. Hyperparameters are additional values that control how the model learning process works.
- We can try different models, like a logistic regression or a random forest.
- We can test effectiveness with the tuning set so we can take our training set, split it into tuning and real training.
- We can do cross-validation, as I talked about, where we can split into many separate things.
- Some of the scikit-learn models have built-in selection for some of their hyperparameters using cross-validation.
- Once you've seen regularization,
- you can tell `LogisticRegressionCV` to automatically find the regularization strength using cross-validation on your training data.
- Then, though, you need to apply it to your test data. And a couple of things here.
- First, you need to apply your feature transformations and combinations to your test data.
- You have to apply them to the test data because your model, your linear model or whatever, is built on
- these transformed features, these combined features, all of your feature engineering.
- The results of that are what the model is trained on. If you just try to apply the model to the raw data, it's not going to work.
- It's not going to have the features it needs. But the difference is that you apply the feature transformations to the test data,
- but you don't use the test data to assess which feature transformations are useful.
- You did all of that in the training data. And you take that as a pattern or a recipe.
- I'm going to show you in a future video how you can chain things together with scikit-learn pipelines to automate some of this.
- If you aren't using scikit-learn, you might write a function that, given raw data,
- will return the transformed features, or will return data with the final set of features.
- That's a very good design as well. You just take this as a pre-canned recipe and you apply it to your test data.
- Then you run the model, predict the test outcome data, and measure accuracy, precision, area under the ROC curve,
- or whatever measure of your model's effectiveness you're going to use on those results.
- The outcome of the iterative modeling process in the preceding slide, though,
- is one model, or possibly one model from each of three or so different families,
- that you want to finally evaluate for effectiveness using the test data.
- So, a few dos and don'ts to synthesize what I've been talking about here.
- It's fine to split the training data into smaller, additional subsets.
- You can do train-test splits within your training process, as an iterative process, to figure out:
- does this feature, or this feature transformation, give me a more accurate classifier?
- We'll do a train-tune split, add the feature, and measure the accuracy on the tuning data.
- Does it help? Does it not? Does a square root give me better classifications, or does a log give me better classifications?
- Do that on splits of your training data and leave the test data alone.
- Go put it on the shelf, lock it in the cupboard, whatever you're going to do with it. You can iteratively refine the model's predictive quality,
- and you can explore and test all of your features. As I said, cross-validation or
- tuning splits of your training data
- allow you to use predictive accuracy as part of your decision for what features to include, how to transform them,
- how to construct new combinations, etc. Now the don'ts: once you take your model and run it on your test data,
- you don't get to go back and fix the model if it performs poorly on the test data.
- That's what you need
- the tuning data for, to do all of those fixes, because as soon as you say, "Oh, it didn't perform well on the testing data, let me go back and fix something,"
- then you're giving your model development process access to information it doesn't have in reality.
- And testing on that test data is no longer a reliable test of what's going to happen when your model meets new data in the field.
- You also can't use the test data to inform model or feature decisions, at least within the scope of one project.
- Say you've got your project and your test data, and you've learned things from that.
- You're going to publish a paper
- if you're doing this for graduate research. The results of that learning you're going to carry into the next project.
- Arguably, that can induce a little bit of leakage, because you or someone else is going to use them, and they might work on the same dataset
- or a different dataset. Arguably, it's a little bit of leakage if they read your paper about your test data and say,
- "OK, we know these things from the test data. I'm going to make a new train-test split of the same dataset and do things."
- Arguably, we have some leakage. We can't plug all the leaks. The goal is not to be perfect.
- The goal is to have a good and credible emulation of the actual production environment for what
- we're trying to do, so that we have an effective test of our model's ability to do its job.
- And its job is almost never to classify preexisting data.
- Predicting the test data is how we study the model's effectiveness,
- but that's not how we deploy the model to improve our lives and improve our businesses.
- Production systems often have new streams of test data coming in every day.
- If you're running an online
- shopping center, or if you are monitoring quality control processes in a chip fab, you've got new test data:
- things keep running next week, next month. And so you can use knowledge from today's test data.
- So you run things. You predict the month of October. You predict last December.
- You're doing these tests on your model's effectiveness. You run it.
- You're predicting this coming December. It's fine to use what you learn about predicting this December for next December.
- There's no cheating. You aren't seeing next December.
- You're using the year-over-year, month-over-month, etc. trends to be able to predict what's going to happen next December.
- So in the industrial setting, because we can continually acquire new test data,
- it makes some of the iterative problems that happen in academic research with a static dataset significantly less of a problem.
- OK, so I learned something from this project. What am I going to do in the next
- project? In the next project, you have new test data, because your chip plant has kept running in the meantime.
- You can use some of that data along with what you learned from the data you captured before; that isn't cheating at all.
- It's a problem in academic research where we have a static dataset.
- The MovieLens data, the TREC data, MNIST, pick your dataset:
- we're all working with test sets on the same dataset. I read your paper,
- and that's effectively a form of leakage. That's not a hole, as I said, that we can completely plug.
- So we carry knowledge forward, and we use what we learn from our test data for the next
- project. In production, that comes with new test data from new runs of the system.
- With the same dataset it's, as I said, a technical violation, but if we pick a new test sample, it's less of a problem.
- If we're all using the same test set, then we have a real problem,
- but we don't often have a choice when we're trying to do academic research with these datasets.
- So to wrap up: train-test splits are to help us test the model's ability to predict future, unseen data that it didn't have a chance to learn from.
- What we've been doing here with this predictive modeling is machine learning: the machine, the system, is learning about the data
- so we can generate predictions for future data. And we test it by giving it data it hasn't been able to see. Going back,
- making a loop, and using knowledge from our test data to inform our modeling decisions breaks down that barrier.
- It means testing on our test data is no longer an effective test of the model's ability to generalize to data
- it wasn't able to learn from. This isn't a problem within your training process.
- With the train-tune split, it's OK to try something on the train data, see how it does on the tuning data,
- and go back and keep doing that with the same tuning data, because we're not using the
- tuning data results as the conclusive evidence of our model's effectiveness.
- We're just using them for our own internal debugging, when we want to see whether this model works better.
- Once we've done that and we go to the test data,
- we're not allowed to go back, because otherwise we're optimizing for our ability to predict that specific set of test data.
- What we want is for predicting that test
- data to be representative of predicting data that we haven't been able to see yet.
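
The cross-validation alternative to a single tuning split, and the built-in hyperparameter search in `LogisticRegressionCV`, might look roughly like this. It is a sketch under assumptions: the synthetic data, the choice of 5 folds, and the grid of 10 candidate `C` values are illustrative, not the course's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.7, size=600) > 0).astype(int)

# Hold out the test data first; everything below uses only the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# 5-fold cross-validation: for each fold, train on the other four folds,
# predict the held-out fold, and record the accuracy.
cv_scores = cross_val_score(
    LogisticRegression(), X_train, y_train, cv=5, scoring="accuracy"
)

# LogisticRegressionCV searches a grid of regularization strengths (C values)
# with cross-validation on the training data, then refits with the best one.
cv_model = LogisticRegressionCV(Cs=10, cv=5).fit(X_train, y_train)
chosen_C = float(np.ravel(cv_model.C_)[0])

# Only after the design is settled do we score once on the test data.
test_acc = cv_model.score(X_test, y_test)
```

The cross-validation scores guide design decisions; the test accuracy is reported once and is not used to go back and change the model.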

## 🎥 SciKit Pipelines

In this video, I introduce SciKit *pipelines* that put multiple transformations together.

PIPELINES AND TRANSFORMERS
Learning Outcomes
Use a SciKit-Learn Pipeline to combine feature transforms and prediction
Pipelines
Our data often takes the form of a pipeline:
Transform features
Fit model
Prediction then requires:
Transform features
Generate predictions
SciKit-Learn
Pipeline: create a sequence of ‘models’
Typically one or more transforms followed by regressor or classifier
Fit the pipeline and it will fit its inner models
Transformer
The SciKit-learn use case so far has been:
Train data with fit(X, y)
Generate predictions with predict(X)
Transformers have a third method: transform(X)
Learn transformation parameters with fit(X) (y is ignored)
Transform with transform(X)
Only apply to input features – not targets
ColumnTransformer
Transformers apply to all columns
Good when they’re all numerics
ColumnTransformer transforms different columns differently
List of (name, transformer, column) triples for columns
Some transformers take 1 column, some a list of columns
remainder option for remaining columns (‘drop’, transform, etc.)
Useful Transformers
StandardScaler – standardizes variables
PowerTransformer – applies power transformations
Binarizer – converts numeric to 0/1 with threshold
OneHotEncoder – encodes categorical as dummy
FunctionTransformer – transform with a function you write
Transforming Outcomes
Transformers only apply to features
The TransformedTargetRegressor class transforms target variables
Wraps an underlying predictor
Transforms target before calling inner ‘fit’
Un-transforms the results of ‘predict’
Wrapping Up
Pipelines let us combine multiple data steps into a single operation.
This facilitates applying train data transforms to test data.
Pay very close attention to defaults in SciKit-Learn.
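
A pipeline that combines a `ColumnTransformer` with a classifier could be put together as below. This is a minimal sketch: the data frame, its column names (`income`, `age`, `region`), and the step names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numeric columns and one categorical column.
rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),
    "age": rng.integers(18, 80, size=n),
    "region": rng.choice(["north", "south", "west"], size=n),
})
y = (df["age"] + rng.normal(scale=10, size=n) > 45).astype(int)

train, test = df.iloc[:300], df.iloc[300:]
y_train, y_test = y.iloc[:300], y.iloc[300:]

# (name, transformer, columns) triples: different transforms per column group.
features = ColumnTransformer(
    [
        ("region", OneHotEncoder(handle_unknown="ignore"), ["region"]),
        ("numeric", StandardScaler(), ["income", "age"]),
    ],
    remainder="drop",  # columns not listed above are discarded
)

# The pipeline fits the transforms, transforms the data, then fits the model.
pipe = Pipeline([("features", features), ("classify", LogisticRegression())])
pipe.fit(train, y_train)

# predict() re-applies the transforms using parameters learned from train.
test_preds = pipe.predict(test)

# The fitted scaler holds the training means, not the test means.
scaler = pipe.named_steps["features"].named_transformers_["numeric"]
```

Because the scaler's mean and scale are learned in `fit`, a single new row can be transformed and scored later without needing a batch of test data.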

- In this video, I'm going to introduce scikit-learn transformers and pipelines, which are going to allow
- you to put your feature transformation and your modeling process into one pipeline that's
- reproducible across your training and your test data. The learning objective is for you to
- be able to use a scikit-learn pipeline to combine feature transforms and prediction.
- So our data often takes the form conceptually of a pipeline.
- We're going to transform some features, and then we're going to fit a model.
- And then in prediction, we need to transform the features and generate the predictions.
- In both of these steps you have the transformation, and the transformation may have parameters. For example, with the standardization that we talked about,
- we have to learn the mean and the standard deviation that we're going to subtract and scale by in order to do the transformation.
- And in the test data, we want to transform by the training data properties, not the test data properties, for two reasons.
- One, they might be different. Two, in actual production,
- you don't necessarily have a whole batch of test data.
- If you've got a new user coming to your online shop and you want to
- predict which of the three specials they're most likely to be interested in,
- you just have this one customer coming, and you need to be able to transform their features to put them in the model.
- The way you do that is to use the transformation parameters that you learned from the training data.
- So scikit-learn has an object called a `Pipeline` that allows us to create a sequence of models.
- Typically this is one or more transformer algorithms or models, followed finally by a regressor or a classifier or some other kind of model
- to produce the output.
- There are other things you can put in the middle, like matrix decompositions and other things like that, that we're going to see a little bit of later.
- If you put these into a pipeline and then you fit the pipeline (the pipeline exposes the `fit` method),
- it will fit its inner models and pass the data through them in sequence.
- So if you've got a pipeline
- that has a transform and a classifier,
- when you call `fit`, what it's going to do is: one, fit the transform;
- two, transform the data using the parameters it just fit;
- and three, fit the classifier on the transformed data.
- So it automates this process of managing your data pipelines.
- To talk a little bit about transformers: the scikit-learn use case we've seen so far is that we train something on data with `fit`.
- We give it our input features, we give it our output class, and then we generate predictions with `predict`. Transformers
- add another method to this paradigm:
- `transform`. Some models can do both transformation and prediction, but `transform` returns a copy of your input data with the features adjusted.
- So for the standardization transformer,
- what `fit` does is compute x̄ and s, and then `transform`
- returns (x - x̄) / s.
- It does this separately for each column:
- it's going to learn a separate mean and a separate scale for each of your input features.
- That's what `fit` and `transform` do.
- So if you fit the transformer and then transform the data, you can then use the transformed data as input to the next stage in the pipeline,
- another transformer or your final classification or regression model.
- If you want to transform your columns differently: an ordinary transformer is going to apply to every column in the dataset. `ColumnTransformer`
- allows you to apply different transformations to different columns in your input data.
- It's also one of the few scikit-learn classes that actually knows about Pandas data frames.
- You give it a list of (name, transformer, columns) triples,
- and it will learn this transformer for these columns and that transformer for those columns.
- Then there's a `remainder` option, where you can give either a transformer to apply to all of the remaining columns, or `'drop'`, and there are some other options as well.
- So what you can do, if you've got, say, three different categorical columns you want to do something to,
- and you have a number of numerics, is say:
- OK, here's one transformer for one of the categorical columns, a transformer
- for another one, and then for the remainder, just standardize all my numeric variables.
- It lets you do that conveniently.
- I've got links to the documentation in the notes for this week.
- I'm going to refer you to that to learn more about how to apply column transformers, but they allow you to transform columns differently.
- Some of the useful transformers that scikit-learn gives you: `StandardScaler` standardizes variables.
- `PowerTransformer` does Box-Cox-style power transformations. `Binarizer`
- converts numeric data to 0/1 by applying a threshold: it'll be 1 if the value is greater than the threshold. `OneHotEncoder`
- will take a categorical variable. A transformer is not limited to just returning one output column;
- it can expand a column into multiple columns. So `OneHotEncoder`
- will take your categorical column and return multiple columns
- by dummy-encoding the categorical variable. And then with `FunctionTransformer`,
- you can give it an arbitrary function, and it will use that to transform the data.
- Transformers, though, only apply to features. If you also need to transform your outcome variable,
- the `TransformedTargetRegressor` class is what you need to use. It does not go into a pipeline as a transform step.
- You could use it as the last stage of a pipeline,
- but it wraps an underlying predictor.
- You pass a predictor and a transformer in its constructor parameters, and it transforms the target before calling the inner `fit` method.
- Then when you call `predict`, it un-transforms the results, so you get the results back out in the original scale.
- So to wrap up: pipelines let us combine multiple data steps into a single operation.
- One of the things this is really useful for is being able to apply your training data transforms to your test data.
- You fit the whole pipeline, and the transforms learn their parameters from the training data.
- You then go apply them to the test data and it just does the right thing for you automatically.
- Now, one thing you have to do throughout your work with scikit-learn is pay very close attention to defaults.
- The defaults are not always what you expect.
- You need to pay close attention to them in order to understand that the model is doing exactly what you think that it's doing.
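
The transformer discussion above can be sketched in code. This is a minimal illustration, not the course notebook: the data frame, its column names (`income`, `age`, `region`, `spend`), and the choice of a log transform are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are invented for illustration.
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=n),         # skewed numeric
    "age": rng.normal(40, 10, size=n),                         # numeric
    "region": rng.choice(["north", "south", "west"], size=n),  # categorical
})
# A positive, skewed outcome, so transforming the target is reasonable.
data["spend"] = np.exp(0.5 * np.log(data["income"]) + 0.01 * data["age"]
                       + rng.normal(0, 0.1, size=n))

train, test = data.iloc[:150], data.iloc[150:]
features = ["income", "age", "region"]

columns = ColumnTransformer([
    # Log-transform the skewed column, then standardize it.
    ("log_income", Pipeline([
        ("log", FunctionTransformer(np.log)),
        ("scale", StandardScaler()),
    ]), ["income"]),
    # One categorical column expands into several dummy columns.
    ("region", OneHotEncoder(handle_unknown="ignore"), ["region"]),
], remainder=StandardScaler())  # standardize every remaining column

# The target transform wraps the predictor: log before fit, exp after predict.
model = TransformedTargetRegressor(
    regressor=Pipeline([("columns", columns), ("lm", LinearRegression())]),
    func=np.log, inverse_func=np.exp,
)

# All transform parameters (means, scales, categories) come from the training data only.
model.fit(train[features], train["spend"])
preds = model.predict(test[features])  # predictions in the original scale
print(preds[:3])
```

Calling `fit` on the training rows and `predict` on the test rows is what applies the training-data transforms to the test data. The defaults used here (for example, `OneHotEncoder` keeping every category level) are worth checking against the documentation for your scikit-learn version.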

## 🎥 Regularization

This video introduces regularization: ridge regression, lasso regression, and the elasticnet.
Lasso regression can help with (semi-)automatic feature selection.

CS 533INTRO TO DATA SCIENCE
Michael Ekstrand
REGULARIZATION
Learning Outcomes
Understand the function of a regularization term in a loss function
Apply regularization to logistic regression models
Tune regularization parameters
Photo by Joshua Hoehne on Unsplash
Multicollinearity
Correlated predictors cause poor model fit
X1
X2
Y
Loss and Regularization
Vector Norms
Understanding Regularization
How do you increase loss?
Increase a coefficient
How does that happen?
Strong relationship
More of the common factor on one than another
Balancing
Regularization Factor
Lasso Regression
The Elastic Net
Applying Regularization
Standardize numeric first
Coefficient strengths are comparable
0 is neutral, coefficient magnitude is strength of relationship
Select hyperparameters based on performance on tuning data
Scikit-Learn CV classes help with this
See notebook for example
Wrapping Up
Regularization penalizes large values of coefficients.
This controls model structure and, combined with standardization, requires “strong” beliefs in relationships to be justified by reducing training error.
Photo by Jasper Garratt on Unsplash
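
The equations from the slides titled above did not survive extraction. A sketch of the standard forms follows, assuming \(L(\beta)\) is the error loss (squared error or negative log likelihood), \(\beta\) the coefficient vector, \(\lambda\) the regularization strength, and \(\rho\) the L1/L2 balance; the exact scaling differs between implementations.

\[
\begin{aligned}
\|x\|_2 &= \sqrt{\sum_i x_i^2}, \qquad \|x\|_2^2 = \sum_i x_i^2, \qquad \|x\|_1 = \sum_i |x_i| \\
\text{ridge:}\quad & \min_\beta \; L(\beta) + \lambda \|\beta\|_2^2 \\
\text{lasso:}\quad & \min_\beta \; L(\beta) + \lambda \|\beta\|_1 \\
\text{elastic net:}\quad & \min_\beta \; L(\beta) + \lambda \left( \rho \|\beta\|_1 + (1 - \rho) \|\beta\|_2^2 \right)
\end{aligned}
\]

As a worked example of why the squared penalty spreads a shared effect across correlated features: if two features jointly carry a total coefficient of 1, putting it all on one feature costs \(1^2 + 0^2 = 1\), while splitting it evenly costs \(0.5^2 + 0.5^2 = 0.5\). The L1 penalty is \(1\) in both cases, so it expresses no preference for the even split.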

- But now it's time for a topic that I've mentioned a few times, and we're actually going to learn what it is: regularization.
- So the goal here is for you to understand the function of a regularization
- term in a loss function, apply regularization to your logistic regression models, and then finally tune regularization parameters.
- I want to start by reviewing multicollinearity. So remember that if we have correlated predictors, that can cause poor model fit.
- So if we've got X1 and X2 and they cause Y, and we've got this correlation between them,
- we don't know particularly where the common effect goes.
- So we can factor this out as a common component X12 plus X1 plus X2.
- Except we don't actually have X12. It's hidden behind the wall.
- Where does its value go in the coefficients?
- Does it go on X1, does it go on X2, do you split it between them?
- The linear model itself has no way to determine where the common component should actually be allocated.
- And so one way we can deal with this and several other problems is by introducing what we call regularization.
- So rather than just solving the problem "minimize the loss function",
- where for a linear regression the loss might be squared loss, the sum of squared error,
- and for a logistic regression it might be negative log likelihood (log likelihood is a utility function; the negative log likelihood is a positive value,
- because log likelihoods are negative, so it is a loss function, and you want to minimize your negative log likelihood),
- what we do is add to that another term, which we call the regularization term.
- And all it is, is a parameter, the regularization strength, times
- the magnitude of our parameters. The loss function now has two terms: the error and the magnitude of the coefficients.
- When we're using the squared magnitude here, we call it ridge regression.
- So, a quick detour on some of the notation. I'm using a norm as a measure of the magnitude of a vector.
- So when we say the L2 norm of X, which is indicated with the subscript two, it's called the L2 norm or Euclidean norm.
- What it is, is the square root of the sum of the squares of the elements of the vector.
- If you take the L2 norm of Y minus X, that's the Euclidean distance between Y and X.
- If they're two-dimensional vectors, it's the straight-line distance between them.
- So if you've got Y and you've got X,
- it's the straight-line distance between them.
- And then we can square it. So subscript two means L2 norm; superscript two means square.
- And that's the sum of the squares of the elements. So we get rid of the square root and we get the sum of the squares, which is really useful.
- It simplifies the computation just a little bit, and it's how the ridge regression regularization is defined. The L1 norm,
- subscript one, is the sum of the absolute values, and we call this the Manhattan or taxicab distance
- because it's the distance you would have to travel if you could only travel in straight lines along a grid.
- So if you want to go from X to Y, it's the total length of that path.
- So it's also useful: the sum of absolute values.
- L1 is the sum of absolute values. L2 is the square root of the sum of the squares.
- You can generalize to get other norms as well. But this is what this notation means.
- The magnitude of the vector. And so when we build up our regularized model, think about the way we increase
- this component of the loss.
- Remember, one of the tools we want to use for understanding a metric is: how do you make it change?
- How do you increase it or decrease it? And the way you increase this part of the loss function is
- you increase a coefficient, and that can happen through having a strong relationship, or
- it can happen by putting more of the common factor on one feature than another. So when you have this multicollinearity,
- one thing ridge regression is going to do is encourage the
- model to distribute the influence of the common factor between the different features.
- Because if I put it all on one, that would increase the sum of squares more than if I divide it evenly between the two;
- the way you minimize the squares is to divide the common component evenly between the two features.
- And it gives us a solution. So with multicollinearity, our system is underdetermined:
- we don't have enough information to know where the coefficient goes. By adding regularization to our loss function,
- we introduce this additional loss that
- tells it where to put it, by making the least expensive solution be the one where it's evenly distributed between all of the correlated features.
- So when we have this loss function, we have our error loss plus our coefficient strength.
- We can minimize this in two ways: by decreasing our error and by having small coefficients.
- And effectively what that means is that in order for a coefficient value to be large, it has to earn its keep, and it has to earn its keep
- by decreasing training error. If you've got a particular value
- and you try to increase the coefficient,
- that might give us a lower error. We only get a lower total loss if it decreases the error by more than it increases
- the coefficient penalty, after taking into account the square and our regularization strength term.
- And it encourages the coefficients to be small values unless a large value contributes significantly
- to decreasing the model's error on the training data (squared error), or increasing its log likelihood
- if we're talking about a logistic regression.
- The regularization parameter lambda is what we call a hyperparameter because we don't learn lambda from the data in general;
- within a single linear model, we don't learn lambda from the data.
- It has to come in from outside. The exact impact of the value depends somewhat on implementation details, such as whether
- the loss function itself is a mean or a sum. Different scikit-learn models actually differ on this.
- You can't just take a regularization term for one scikit-learn model and use it for another,
- even if they're both doing L2, because other details of the loss function mean the value doesn't transfer.
- If it's using a sum of squared error,
- then the regularization strength needs to depend on the data size, because for the same amount of average error,
- the sum of squared error is going to be larger just from having more data.
- If it's a mean, then your regularization term is not going to depend on your data size.
- Some scikit-learn models also use a parameter C, which is one over lambda,
- and it's multiplied by the error instead of being multiplied by the coefficients.
- So an increased value of lambda, or a decreased value of C, results in stronger regularization:
- a coefficient has to contribute more to the model performance to earn its keep for a large value than it does with weaker regularization.
- Now, one good way to find a good value for lambda is to optimize it with a training and tuning split of the training data.
- Scikit-learn will do this automatically if you use the CV classes.
- So a lot of the regressors also have a CV class: LogisticRegressionCV, and you're going to have RidgeCV.
- Quite a few others have a CV variant.
- And what happens with the CV variant is it will learn values for one or more hyperparameters: when you call fit with training data,
- it will cross-validate on the training data to learn them, and you can give it
- a range of hyperparameter values to consider, a list of them.
- It will do the cross-validation to automatically learn good values, the best values it can, for these regularization parameters.
- There is also a class GridSearchCV that allows you to do cross-validation to search for good hyperparameter values
- for any hyperparameter of a scikit-learn model.
- I encourage you to go play with that at some point. But LogisticRegressionCV will do that automatically, just in the fit call,
- by itself; it'll find a good regularization strength value.
- So, the lasso regression. This looks very, very similar, except we replace that squared L2 norm, the sum of squares,
- with the L1 norm: we're now looking at the sum of the absolute values. The squared L2 norm encourages values to be small,
- but if a value is close to zero, it's fine with it being close to zero.
- With the L1 norm, one of the effects it has is that it doesn't like small nonzero values.
- If a coefficient value is small, the L1 norm is going to push it to zero.
- And what this does is it makes the coefficients what we call sparse. Sparse data is data with a lot of zeros.
- And so if a coefficient is not contributing very much to classification, it's going to go to zero.
- And you can use that to see which features are actually being used in the classification.
- And it effectively becomes an automatic feature selection technique because it's going to push the
- coefficients for features that don't contribute very much to decreasing your training error to zero.
- You can then put them together in what's called the elastic net, which combines L1 and L2 regularization.
- And you have an overall regularization strength lambda that controls your regularization, or C, which is one over lambda,
- where you multiply the loss function by C. And then we have L1 regularization and L2 regularization,
- and they're balanced with this parameter rho.
- You could parameterize it so you just have your L1 strength and your L2 strength.
- But most elastic net implementations have a regularization strength,
- your lambda or your C (some of the scikit-learn docs use alpha for this),
- and then you have a balance that says how much of the regularization to put on L1
- and how much to put on L2. These parameters both need to be chosen by cross-validation;
- that's really the only way to find good values.
- LogisticRegression and LogisticRegressionCV can do elastic net.
- There are also ElasticNet and ElasticNetCV classes. And by default, if you use LogisticRegressionCV,
- it's going to default to L2 regularization and search for the regularization strength.
- If you want elastic net, you change the penalty option.
- You also have to change the solver, because LogisticRegression can use several solvers to learn the logistic regression parameters,
- and only one of them supports elastic net. And then you're going to need some additional options in order to tell it to also search for that L1 ratio.
- But it can do all of that for you.
- I refer you to the documentation for the LogisticRegression and LogisticRegressionCV classes to see how to do that.
- You're going to find it useful in Assignment 5. I'm also going to be giving you an example in the synchronous session that deals with some of this.
- So, some notes on applying regularization.
- Regularization really works best when your numeric variables are standardized, because
- it's looking at the total magnitude of your coefficient vector.
- And if one of your features is in units of millimeters and one of your features is in units of kilograms,
- the coefficient values have nothing to do with each other. And so looking at the total magnitude, treating them as elements of a vector,
- becomes really difficult, and it's going to penalize one just because of its underlying units.
- If you standardize your numeric variables, then each one is in terms of standard deviations.
- The coefficients become a lot more directly comparable with each other, and your regularization
- is going to be better behaved. Then you want to select your hyperparameters based on performance on the tuning data, and the CV classes,
- as I said, help with this. I'm giving you an example in one of the notebooks
- that uses LogisticRegressionCV to do hyperparameter search for L2 regularization,
- so you can see that in action with a simple example.
- So to conclude: regularization imposes costs on the model for large coefficient values, either large squared values or large absolute values.
- Squared costs, which we call ridge regularization, encourage values to be small.
- Absolute value loss, which we call L1 or lasso regularization, encourages small values to be zero.
- If you put those together, it encourages values to be either zero or large enough to be meaningful, but not super large.
- L2 or ridge regularization is useful for controlling the effects of multicollinearity.
- And together they're useful for decreasing your model complexity, making coefficient values earn their keep.
- Another thing that they do is, if everything's standardized or at least mean-centered, then small coefficients result in small effects.
- And effectively what it means is: assume everything's average unless we have enough evidence,
- enough data, to justify stronger beliefs in stronger relationships that
- are justified in terms of their ability to reduce our error on the training data.
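
The tuning workflow described above can be sketched as follows. This is a minimal illustration on synthetic data, not the course notebook; the candidate values in `Cs` and `l1_ratios` are arbitrary assumptions, and the parameter names follow scikit-learn 1.x (newer releases may emit deprecation warnings for `penalty`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 4 informative features,
# 4 redundant (correlated) ones, and 12 noise features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=4,
                           n_redundant=4, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

Cs = [0.01, 0.1, 1.0, 10.0]  # candidate values of C = 1 / lambda

# The default penalty is L2; fit() cross-validates over Cs on the training data.
ridge = make_pipeline(
    StandardScaler(),  # standardize so coefficient magnitudes are comparable
    LogisticRegressionCV(Cs=Cs, cv=5, max_iter=1000),
).fit(X_train, y_train)

# Elastic net: change the penalty, use the saga solver, and pass l1_ratios
# so the search also covers the L1/L2 balance.
enet = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=Cs, cv=5, penalty="elasticnet", solver="saga",
                         l1_ratios=[0.1, 0.5, 0.9], max_iter=5000),
).fit(X_train, y_train)

for name, model in [("L2", ridge), ("elastic net", enet)]:
    clf = model[-1]  # the fitted LogisticRegressionCV step
    n_zero = int(np.sum(clf.coef_ == 0))
    print(name, "chosen C:", clf.C_, "zero coefficients:", n_zero,
          "test accuracy:", model.score(X_test, y_test))
```

The test set is only touched in the final `score` call; the choice of `C` and the L1 ratio is made inside `fit` by cross-validating on the training data.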

## 📓 Pipeline and Regularization

This notebook demonstrates pipelines and \(L_2\) regression, and performs a significance test of classifier improvement.

It also shows a training of a decision tree (next video).

## 🎥 Models and Depth

What does the world look like beyond logistic regression?
Can a model output be a feature?

CS 533INTRO TO DATA SCIENCE
Michael Ekstrand
MODELS AND DEPTH
Learning Outcomes
Introduce models beyond linear and logistic models
Introduce the idea of a model output as a feature
Photo by Tim Mossholder on Unsplash
More Models
Decision Tree
Tree of nodes
Each node is a decision point
Result when you reach the bottom
Can learn complex interaction effects on their own
Random Forest
Decision trees can have high variance – they can memorize a lot of the model
Random forest:
Take (partial) bootstrap samples of the training data
Fit decision trees
Predict by taking vote from trees
Models as Features
Feature outputs don’t have to come directly from data
Transformation models (e.g. principal component analysis)
Prediction models for other tasks
Example: LinkedIn job ad recommender
Logistic regression
Features from text, job description, user, etc.
One feature: transition probability estimate
Many Models
Linear (GLM, GAM, etc.)
Support vector machines
Naïve Bayes classifiers (we’ll see those later)
Neural nets
Often play the logistic regression game – logit function to convert model scores to probabilities
Wrapping Up
There are many models for classification (and regression).
Model outputs can also be features for other (often linear) models.
Photo by Michal Janek on Unsplash

- In this video, I want to move beyond logistic regression to talk about some additional classification
- models and also introduce the idea of putting models in as features for other models.
- So learning outcomes are to do exactly what I just said.
- So far, we've been estimating the probability of Y equals one by using a linear model: Y hat equals the logistic function
- of a linear combination of the features. And we can use any estimate of this probability, or we can just use models that output decisions;
- these may be based on scores that aren't estimated probabilities.
- For example, a support vector machine uses distance from a hyperplane as its score.
- But we're not limited to just using a logistic regression, of course.
- So for one model, a decision tree is a tree of nodes where each node is a decision point.
- So I made a little decision tree here for the grad student admissions example.
- At the first node, it's going to check if the GPA is less than or equal to 3.435.
- If it is, it's going to go to the left-hand side. There are extra nodes here, but it's going to deny admission.
- And if it's greater than 3.435, it's then going to look at
- their school rank, and if their school rank is less than 1.5,
- it's going to admit. And if it's greater than 1.5, it's going to deny.
- Really simple model. It would be absolutely terrible to actually use this model for admissions decisions.
- But here we aren't trying to build a model that will admit; we're trying to build a model
- that's going to predict whether someone is going to get admitted, and it might work. But this illustrates how the decision tree actually works.
- They can learn complex interaction effects on their own, because the thresholds,
- and what happens with the features, change as you go down the tree. Now, one of the problems, though, is that they have high variance:
- they can effectively memorize all of the training data by building themselves a lookup table that looks up the outcomes for training data
- by the feature values. You can get extremely good training accuracy.
- I trained one on this data with unlimited depth and I got training accuracy of over 99 percent.
- And I got test accuracy of 0.52.
- What a random forest does is take bootstrap samples. By default, scikit-learn's random forest
- will take complete bootstrap samples. You can tell it to take smaller ones;
- then it's not actually a bootstrap sample, but it's a subsample of the dataset. And it fits a decision tree to that sample.
- And then it does that 100 times, or however many times, so you get one hundred decision trees.
- And then for a final classification, when you tell it to predict, what it's going to do is ask all of the decision trees to vote.
- It's building up this random forest of happy trees. They're happy because they have a functioning democracy.
- They all get to vote on the final outcome. And the random forest takes the vote and returns the majority classification.
- Or, if the individual trees are producing scores, then it might average the scores and use that as an output.
- So you decrease the variance
- that you would get from training a decision tree on one set of data versus another set of data, by training
- decision trees on a bunch of sets of data, by subsampling your training data, and then averaging over them in order to produce your final output.
- Random forest is one of the classifiers that I want you to use in your assignment.
- Another thing, though, that I want to introduce is that features don't have to come directly from data.
- So a lot of our features are going to come from data.
- But sometimes they come from other models. Sometimes it's a transformation model, some kind of what we call unsupervised learning,
- where it's computing things
- but it doesn't have a known output class that it's trying to predict. Or they come from prediction models for other tasks.
- For example, at LinkedIn, their job ad recommender, the last I knew, just a few years ago, was at a high level
- a logistic regression. You go to LinkedIn and it says, here's a job ad for you.
- Well, that's coming from a logistic regression.
- But that logistic regression has very complex features, some of which are the outputs of other machine learning models.
- And so you're going to get features from the job text, the job description, features from the user's profile.
- One particularly interesting feature they use is a transition probability estimate.
- So they have a model; this is another statistical model that tries to predict:
- if you are currently working in Boise as a data scientist,
- what's the likelihood that you would transition to a job title of senior data scientist in Salt Lake City?
- And so it takes into account job transitions: data
- scientists might move to senior data scientist, software engineers to staff software engineer or principal software engineer.
- It takes into account current migration patterns in the industry and various things like that to get this:
- how likely are you to even make the move? Someone in a staff
- software engineering position is unlikely to take a job where the title is junior software engineer.
- And the output of this transition probability model is one of the input features to their logistic regression
- that's estimating: would you like to see this job ad for a senior data scientist in Salt Lake City?
- You also get features that might come from some kind of a deep learning model:
- a deep learning object detection mechanism, a deep learning image similarity mechanism.
- So Pinterest gets a lot of mileage out of doing nearest-neighbor calculations where "nearest"
- is defined by a deep learning model for assessing whether two images that are being pinned are similar.
- So there are many different models that we can look at:
- linear models with their extensions, the generalized linear model (like the logistic regression that we've been seeing) and generalized additive models.
- There's also the support vector machine, which is another linear model, but it's not a regression model.
- The naive Bayes classifier; we're going to see those later. A neural net,
- whether shallow or deep. A lot of models, like a lot of neural nets,
- do a similar thing to logistic regression: they compute a score, and then you pass it through a logistic function or some other sigmoid in
- order to convert the model score to probabilities for making your final classification decisions.
- So to wrap up: there are many different models for classification and for regression.
- My goal in this class is to teach you what regression and
- classification are and how to get started with applying them and evaluating them,
- not to teach you a bunch of models in depth.
- The machine learning class is going to go into a lot more about how these different models work and how to get them to work
- well. Model outputs can also be used as input features for other models, often linear,
- though not always. And so you can get models that build on top of other models.
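
The high-variance behavior of an unlimited-depth decision tree, and the random forest's response to it, can be seen in a small sketch. The data here is synthetic with deliberately noisy labels (an assumption for illustration; it is not the admissions data from the video), so the exact accuracies will differ from the numbers quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random, so there is noise to memorize.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# max_depth defaults to None, so the tree grows until it fits the training data.
tree = DecisionTreeClassifier(random_state=1)
# 100 trees, each fit to a bootstrap sample; predictions combine all trees.
forest = RandomForestClassifier(n_estimators=100, random_state=1)

for name, model in [("decision tree", tree), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(name,
          "train accuracy:", model.score(X_train, y_train),
          "test accuracy:", model.score(X_test, y_test))
```

The gap between the tree's training and test accuracy is the memorization effect; limiting `max_depth` or averaging many trees are two ways to reduce it.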

## 🎥 Inference and Ablation

How do we understand, *robustly*, the performance of our system?
What contributes to its performance?

CS 533INTRO TO DATA SCIENCE
Michael Ekstrand
INFERENCE AND ABLATION
Learning Outcomes
Make inferences about model accuracy
Understand interplay of cross-validation and inference
Use ablation studies to make inferences about feature or sub-model importance
Photo by Siora Photography on Unsplash
Train/Test Split
The Data
Training Data
Testing Data
One Way
Train model
Experiment with different model designs
Experiment with different features
Select hyperparameters
Evaluate Effectiveness
Significant?
Testing Effectiveness
For test data, we have:
Individual classifications, right or wrong
Single metric value (accuracy, precision, etc.) for each classifier
Can’t significance test a single value!
Testing Objectives
Does my classifier perform better than benchmark value?
What is the precision of my estimated classifier accuracy?
Confidence interval
Does classifier A perform better than B?
P-value
Confidence interval for the difference
Test Samples: Confidence Intervals
Solution 1: treat each test item as a binary measurement
If metric denominator from test data: Wilson confidence interval
statsmodels: proportion_confint (with method='wilson')
Works for accuracy, FPR, FNR, recall, specificity
Any metric:
Bootstrap the test samples
Compute metric from bootstrap samples
P-Value for Accuracy
Testing Regression
For regression, each sample is a continuous measurement of the model’s prediction error.
Use paired t-test or appropriate bootstrap
Repeated Testing
With repeated cross-validation, we can compute a t-statistic
Run 5 times
Each time, do 2-fold cross-validation
See reading.
Simple cross-validation not great– too much non-independence
Repeated test sampling unreliable – too much non-independence
Cross-Validation and Train/Test Split
Cross-validation sometimes used for final eval
Allows data leakage – what did you do your model & feature selection on?
Good for:
Limited engineering – just see how well the model works
Model and feature design – when done on training data
Understanding Performance & Behavior
Suppose you are detecting spam with:
Text features
Metadata features
URL features
URL reputation model
Sender reputation model
What makes it work?
Ablation Studies
An ablation study examines impact of individual components
Turn each off in turn
Measure classification performance
Lets you see how much each component contributes
Do use results for production decisions, future work
Do not use results to revisit model design (in this trial)
Wrapping Up
Inference for classifier performance is not immediately straightforward.
Several techniques helpful.
Be careful about data leakage. Sometimes tradeoffs are needed.
Photo by Bianca Ackermann on Unsplash
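
The three inference tools named on the slides above (Wilson interval, bootstrap interval, and the McNemar test behind "P-Value for Accuracy") can be sketched with only the standard library. This is an illustrative implementation under stated assumptions: the function names are hypothetical, the classifier outputs are simulated, and in practice `proportion_confint` from statsmodels or the chi-squared distribution from SciPy would do the same jobs.

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_interval(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile bootstrap interval; labels and predictions are resampled together."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

def mcnemar(y_true, pred_a, pred_b):
    """McNemar test of H0: classifiers A and B have the same accuracy."""
    n_yn = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    n_ny = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    if n_yn + n_ny == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness
    stat = (n_yn - n_ny) ** 2 / (n_yn + n_ny)
    # Chi-squared with 1 degree of freedom: P(X >= stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Simulated test data (hypothetical): A is right about 90% of the time, B about 80%.
rng = random.Random(533)
y_true = [rng.randint(0, 1) for _ in range(500)]
pred_a = [t if rng.random() < 0.9 else 1 - t for t in y_true]
pred_b = [t if rng.random() < 0.8 else 1 - t for t in y_true]

n_correct = sum(t == p for t, p in zip(y_true, pred_a))
print("accuracy of A:", accuracy(y_true, pred_a))
print("Wilson 95% CI:", wilson_interval(n_correct, len(y_true)))
print("bootstrap 95% CI:", bootstrap_interval(y_true, pred_a, accuracy))
print("McNemar (statistic, p-value):", mcnemar(y_true, pred_a, pred_b))
```

The Wilson interval only applies when the metric's denominator is fixed by the test data; the bootstrap function accepts any metric with the same `(y_true, y_pred)` signature.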

- In this video, I want to talk with you about inference for model effectiveness and introduce the idea of an ablation study.
- So our goals are for you to be able to make inferences about model accuracy and
- understand a little bit better the interplay of cross-validation and inference,
- remembering that we can't be perfect. The goal is to do a good and a credible job.
- And then also to be able to use an ablation study to make inferences about the particular
- contributions and value of different features or subcomponents of your model.
- So remember, we've got this train/test split.
- We have the training data, where we're doing all of our iterative process.
- It's a big, loopy thing. And then we evaluate our effectiveness.
- One thing we haven't talked about yet is: is the effectiveness significant?
- For our test data, we have a few outputs.
- We have the individual classifications or predictions. For classifications we have whether they're right or wrong; for predictions,
- we have the error. And then we have a metric value (accuracy, precision, etc.) for each classifier.
- One of the challenges, though, is that for the classifier and the test data, we just have "accuracy is 0.99" or "precision is 0.4".
- We can't significance-test that single value.
- But first I want to set up how we can significance-test, or otherwise do inference, I should say, because with
- significance testing, as we discussed earlier, a lot of times we might actually care about an effect size estimate with confidence intervals
- a lot more than we care about a significance test.
- But there are a few questions that we want to answer as the results of an evaluation. Does my classifier perform better than some benchmark value?
- We might have a value we want to beat, say, a value we know is good enough,
- and we want to know if my classifier performs better than that value.
- We might want to get an estimate of our classifier's accuracy or precision or recall, or pick our metric, that has a confidence interval on it,
- so we know how precise this estimated performance measure is.
- And then we may also want to answer the question: does classifier A perform better than B?
- Maybe B is our current system, or B is the existing known state of the art.
- And we want to know if A does better. We might want a p-value.
- We might want a confidence interval for the improvement or the difference in performance between A and B.
- So, to get started: one way we can compute a confidence interval
- is to treat each test item as a binary measurement.
- So say you've got a hundred thousand test items because you've got a very large dataset:
- with a 10 percent split,
- you've got a million data points and 100,000 test points. For each of these,
- you have the true value, yes or no, and you have the prediction, yes or no.
- If the metric denominator comes from the test data, you can use a Wilson confidence interval. For accuracy it definitely does, because accuracy is correct over all test items.
- You can also do this for false positive rate, false negative rate,
- recall, specificity: anything where the denominator is completely determined by the test data, not by the classifier results.
- Statsmodels does this with proportion_confint; a Wilson confidence interval is a confidence interval for a proportion.
- For any metric, you can bootstrap: you can take your test samples,
- take bootstrap samples of them, and then compute your classifier metric over your bootstrap samples.
- Now, you have to be careful when you're doing your bootstrap samples to make sure that
- you keep the ground truth labels and the classifier outputs together.
- And if you're doing multiple classifiers, you have to keep all the classifier outputs together as you're computing these bootstrap samples.
- You can bootstrap from your test data and get a confidence interval for any of your classifier performance metrics.
- You can also compute a p-value for the accuracy metric. This specific technique only works for accuracy.
- It does not work for any of our other classifier metrics. But you can get a p-value for the null hypothesis that the two classifiers have the same
- accuracy by using what's called a contingency table. For a contingency table for this purpose,
- you go from the classifications to whether or not each was right or wrong.
- So here we have the number of times both classifiers were right, and here the number of times they were both wrong.
- And here we have where classifier one was right and classifier two was wrong.
- How often did that happen? We can do the same the other way around.
- And then we do what's called a McNemar test, and it uses these values: n_NY is the count where classifier one is wrong
- and two is right, and n_YN is the count where one is right and two is wrong.
- And so we take the squared difference in their wrongnesses and divide it by the sum of their wrongnesses, and this gives us a statistic,
- our test statistic M. Under H0, the null hypothesis, M follows what's called
- a chi-squared distribution with one degree of freedom, a probability distribution;
- you can get its CDF from statsmodels or from SciPy,
- and you can use that to compute a p-value: what's the probability of having an M statistic at least this large?
- And you don't have to deal with absolute values on it because it's a non-negative statistic with a non-negative distribution.
- There is something called a proportion test, but we can't just use it: the proportion test is for independent proportions from independent samples.
- But we don't have independent samples. We have one sample of our test data.
- And for each test point, we have two measurements: classifier one and classifier two.
- So we can't use a proportion test.
- But the McNemar test basically does this paired proportion test kind of thing, and allows us to get a p-value for
- whether the classifiers have the same accuracy or not.
- In this one, the p-value does not allow us to reject the null hypothesis that they have the same accuracy.
- The p-value is about one.
- We can also test regression. There, each sample is a continuous measurement of the model's prediction error.
- So we have y_i minus y-hat_i from classifier A,
- and we have y_i minus y-hat_i
- from classifier B. Those are two different measurements on the same test point,
- so we can use a paired t-test or an appropriate bootstrapping mechanism in order to assess the accuracy of a regression model.
- Now, cross-validation. One technique you sometimes see is to do cross-validation, say 10-fold cross-validation.
- That gives you 10 accuracies for each classifier, and you can compute a paired t-test.
- Each of the folds in your cross-validation
- is one data point in your sample, so you've got n equals 10.
- You can do a t-test, but that actually doesn't work very well, because your samples are not independent
- if you're doing k-fold cross-validation.
- Also, if you just repeatedly draw a 10 percent sample, and do that, say, 30 times,
- you have the same problem: the same data points are going to show up in more than one of your samples.
- Also, your classifiers are being trained on too much of the same data.
- The ideal is to be able to draw, say, 30 completely independent training and testing sets from your big population.
- But if you can't do that and you're trying to simulate it with cross-validation,
- the non-independence causes the resulting statistical test to not be reliable.
- One thing you can do is you can do repeated cross validation where five times you do a two fold cross validation.
- I'm going to refer you to one of the readings I put in the notes for a lot more details on this.
- Just wanted to bring it up so that, you know, it's there. Cross-validation is sometimes used for final evaluation.
- You'll find this in papers sometimes.
- One of the problems, though, is that this allows data leakage, because you're testing on data that was available when you were training:
- you're testing on all of the data that was available in your training set.
- This can be a significant problem. If we've got a large enough data
- set that we can just use a single test split, or maybe two or three test splits,
- that's going to allow us to avoid leakage and much better simulate what's going to happen
- when we put the model in production. Cross-validation is really useful in a couple of contexts.
- One is where you're not doing much model design or feature engineering. You have data,
- you want to take a model, apply it, and see how it works. Cross-validation is great for that.
- You don't have the iterative process of "how am I really getting this model to work?"
- If you've got a hyperparameter search, you can still cross-validate:
- do the hyperparameter search separately for each fold, making it part of your training process.
- LogisticRegressionCV kinds of things help with that.
- But if you've got a model and just want to see how well it works on the data, cross-validation can work pretty well.
- Also, doing cross-validation on the training data to iteratively improve your model and feature design
- can work really well. The problem arises when you're doing a lot of engineering on your model.
- And you get access to the test data, which you effectively have in a cross-validation setup.
- Because even if, say, you do 10-fold cross-validation and you pick one of the folds to be
- what you're really doing your development on, all of your other test data is in
- this initial development part.
- So you're effectively using the test data as part of your tuning process, for your hyperparameter selection
- and as part of your exploratory data analysis. And that is a cause of leakage.
- Again, we can never be perfect, but it's important to be aware of this as a cause of leakage.
- I really recommend having a designated test set that you hold out
- and don't touch. That's the basis of your evaluation, even if it makes the statistical inference a little bit harder.
- Now, there's another thing I want to talk about.
- Suppose you've got a complex model: we're detecting spam. We're working for, say, a
- telecommunications company and detecting text message spam, or we're detecting email spam for an email company.
- We have text features. We've got metadata features. When they're sending URLs,
- we've got features of the URL itself. Maybe we even hit the server.
- Let's say we've got another couple of sophisticated models that score
- URLs by their reputation, and also score senders
- by a reputation score. Large email antispam efforts, such as the one built into Gmail,
- do this. I'm not just making that up: a part of antispam at scale is building reputations for URLs and senders.
- And let's say our spam detector works well: precision of 99.5 or 99.9 percent,
- recall of 80 percent. But what makes it work?
- Which of these features is contributing, and how much, to its success?
- The answer is to do what's called an ablation study. An ablation study takes our model:
- we take our whole model and see how accurate it is, but then we turn off individual features of it.
- So we might turn off the sender reputation. How exactly you turn a feature off depends on the model design.
- If it's a neural net, you might just take that part out of your network graph. If it's a logistic model where everything's well standardized,
- you can put in zeroes for the feature and not retrain, or you can just take that term out of the model,
- retrain on your training data, and try to predict your testing data. You probably want to retrain,
- just to make sure the parameters are being tuned without that piece. What this lets you see, though, is how much each component contributes.
- You can say, OK, my model gets 99 percent precision on spam, and it gets 98 percent precision if I turn off the sender reputation.
- That lets you see, OK, the sender reputation is responsible for one percentage point of my precision.
- Now, it's important to be careful how you use this. You can use it for production decisions and for future work.
- You do this ablation study and you discover, OK, the sender reputation is only contributing one percent,
- or maybe it's contributing 0.1 percent, and it's really expensive in terms of compute time and engineer time to maintain. Maybe stop using it.
- You could also use it for your future research work. What you can't do, particularly within the scope of one study,
- is use the results of your ablation study to go back and revisit your model design. That gets you your leakage again.
- Again, as I said, in the academic setting,
- when we're doing multiple studies on the same data, we do get some leakage, and we carry it forward to the next study.
- Again, we can't be perfect.
- Still, there's a difference between the ablation study and feature engineering. In feature engineering, I'm trying a bunch of things; some things I keep
- and some I don't. I'm doing it with the tuning data, and things are going back into the design, so I'm not keeping careful firewalls there.
- In the ablation study, I have my top-line performance number: here's my model, I ran it,
- it got 99 percent precision. And then I'm trying to understand,
- well, what are the drivers of that? I'm not putting it iteratively back into my design,
- going back and rerunning my stuff on my training data with it. I'm just using it to get knowledge to carry forward.
- That doesn't cause leakage within the context of the specific study we're talking about.
- It is an acceptable practice,
- and it's a very, very useful practice for understanding the contributing factors to the performance of a complex model.
- So, to wrap up: inference for classifier performance is not immediately straightforward.
- There are several helpful techniques; I've pointed you to some here and to more in the readings. And be careful about data leakage.
- But again, sometimes tradeoffs are necessary.
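
The three inference techniques from this video can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's reference code: the toy labels and predictions are made up, the function names are hypothetical, and the McNemar statistic is the uncorrected form described in the lecture. In practice, statsmodels' `proportion_confint` with `method='wilson'` and SciPy's chi-squared distribution cover the same ground.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95%."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


def mcnemar(y_true, pred_1, pred_2):
    """McNemar statistic (no continuity correction) and its p-value."""
    # n_yn: classifier 1 right, classifier 2 wrong; n_ny: the reverse
    n_yn = sum(p1 == y and p2 != y for y, p1, p2 in zip(y_true, pred_1, pred_2))
    n_ny = sum(p1 != y and p2 == y for y, p1, p2 in zip(y_true, pred_1, pred_2))
    if n_yn + n_ny == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness
    m = (n_yn - n_ny) ** 2 / (n_yn + n_ny)
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(m / 2))
    return m, p_value


def paired_bootstrap_diff(y_true, pred_1, pred_2, reps=2000, seed=0):
    """Approximate 95% percentile interval for accuracy(1) - accuracy(2)."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(reps):
        # Resample item indices so each label stays with both predictions
        idx = [rng.randrange(n) for _ in range(n)]
        acc_1 = sum(pred_1[i] == y_true[i] for i in idx) / n
        acc_2 = sum(pred_2[i] == y_true[i] for i in idx) / n
        diffs.append(acc_1 - acc_2)
    diffs.sort()
    return diffs[int(0.025 * reps)], diffs[int(0.975 * reps) - 1]


# Made-up toy data: 20 test items, two classifiers
y_true = [1, 0] * 10
pred_1 = [y if i < 15 else 1 - y for i, y in enumerate(y_true)]
pred_2 = [y if (i < 11 or 15 <= i < 17) else 1 - y for i, y in enumerate(y_true)]

correct_1 = sum(p == y for p, y in zip(pred_1, y_true))
acc_ci = wilson_interval(correct_1, len(y_true))
m_stat, p_value = mcnemar(y_true, pred_1, pred_2)
diff_ci = paired_bootstrap_diff(y_true, pred_1, pred_2)
print(acc_ci, m_stat, p_value, diff_ci)
```

The bootstrap resamples indices rather than values, which is what keeps the ground truth label and both classifiers' outputs for a test item together.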

## 🎥 Dates

This video discusses how to work with dates in Pandas.

CS 533INTRO TO DATA SCIENCE
Michael Ekstrand
DATES
Learning Outcomes
Parse and transform dates
Adjust dates using date offsets
Photo by J. on Unsplash
Dates and Representations
Time moves forward at a constant rate (generally…)
How we record it changes
Daylight savings time – the same hour happens twice
Key insight: time is different from representation
Typically: store time in monotonic form, translate for presentation
Numeric Representations
Unix timestamps — time since (or before) midnight Jan 1, 1970
Seconds, milliseconds, nanoseconds
Reference point is UTC
Julian day numbers
Days since January 1, 4713 BC
Floating point stores time as fraction of a day
“1900 system” (Excel): days since Jan. 1, 1900
String Representations
ISO: 2020-11-03
Alphabetic sorts by date (for AD, until 10000)
Localized numeric
US: 11/03/2020
EU: 03/11/2020
Longer
November 3, 2020
datetime64
The Pandas datetime64 type stores dates (and times).
Construct from:
Number, units, and epoch
pd.to_datetime(230481083, unit='s') — seconds since Unix epoch
pd.to_datetime(3810401, unit='D', origin='julian') — days since Julian epoch
String and format
pd.to_datetime('2020-11-03') — convert from ISO
pd.to_datetime('Nov 3 2020', format='%b %d %Y')
timedelta
The Pandas timedelta type stores time offsets
Create from number + units or string
‘1 day 00:30:22’ – one day, 30 minutes, and 22 seconds
Mark advances in linear time
DateOffset
DateOffset type stores date offsets to adjust calendar days
Create from number + units
pd.DateOffset(months=240)
Correctly offset dates, even with underlying nonlinearities
Months don’t have the same length
DST, leap years, leap seconds
Not directly supported by Series
Can use in ‘apply’: month_series.apply(lambda m: pd.DateOffset(months=m))
Date Arithmetic
datetime + timedelta = datetime
datetime + DateOffset = datetime
DateOffset * num = DateOffset
Comparisons
DateTime supports comparison operators (==, <, etc.)
Need to create DateTimes on both sides
Wrapping Up
Dates and times are typically stored internally using offsets from an origin.
Pandas provides several date and time features, including datetime, timedelta, and DateOffset.
Photo by Bundo Kim on Unsplash
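
The constructions listed on the slides can be tried out with a short sketch like the one below. The Unix timestamp and the string examples come from the slides; the Julian day number 2459157 (noon UTC on 2020-11-03) and the two-row `us_dates` series are my own illustrative assumptions, chosen so the values stay inside the range that pandas' nanosecond timestamps can represent.

```python
import pandas as pd

# Number + units: seconds since the Unix epoch (value from the slide)
from_unix = pd.to_datetime(230481083, unit='s')

# Number + units + origin: a Julian day number (illustrative value)
from_julian = pd.to_datetime(2459157, unit='D', origin='julian')

# Strings: ISO by default, or an explicit strftime-style format
from_iso = pd.to_datetime('2020-11-03')
from_text = pd.to_datetime('Nov 3 2020', format='%b %d %Y')

# A whole column of US-style strings; the format removes the month/day ambiguity
us_dates = pd.Series(['11/03/2020', '03/11/2020'])
parsed = pd.to_datetime(us_dates, format='%m/%d/%Y')

# Timedelta from a string, or from numbers + units
delta = pd.Timedelta('1 day 00:30:22')
same_delta = pd.to_timedelta(1, unit='D') + pd.to_timedelta(1822, unit='s')

# Comparison with a datetime on both sides
after_summer = parsed > pd.Timestamp('2020-06-01')

print(from_unix, from_julian, from_iso, from_text)
print(parsed.tolist(), delta, after_summer.tolist())
```

Passing an explicit `format` is the safer habit for localized numeric dates, since the same string can otherwise be read as either month/day or day/month.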

- In this video, I want to talk with you about dates.
- The learning outcomes are for you to be able to parse and transform dates and adjust dates using date offsets.
- So first, I want to talk just briefly about the difference between a date as we say it, like,
- OK, it's November the 3rd, 2020, and underlying time.
- Dates do all kinds of funny things. When we change to or from daylight saving time, we skip an hour or we repeat an hour.
- But the underlying time stream doesn't repeat.
- It's just that our way of mapping that to the way we write it down repeats.
- So we can think of underlying time as moving forward at a constant rate.
- Generally, that is; there's relativity and all of those things.
- But time is moving forward, and how we record it changes and is complex and subject to a lot of rules.
- The key thing is that, just as text content is different from its encoding,
- time is different from its representation.
- One of the implications of this is that we typically store time in more of a monotonic form, like seconds since a particular date in UTC,
- and then we translate for presentation. So you'll store the time as an offset in UTC, and then you'll translate
- that to the local time zone, with all of the daylight saving rules and everything,
- when you go to actually display it.
- So internally, there are a few ways we can represent time numerically, and sometimes you'll need to do this yourself.
- One is Unix timestamps, which are time since (or before; they can be negative)
- midnight UTC, January 1st, 1970. Often this is stored in seconds.
- Python (not pandas or NumPy, but the Python standard library) tends to do time in floating-point seconds since that midnight.
- The reference point for this, as I said, is UTC. You can also store it as milliseconds or nanoseconds since that time.
- If you have a data file with a column that's labeled as a timestamp and it's an int,
- a number, there's a very good chance it's a Unix timestamp. That's a very common way to store dates and times.
- We can also store Julian day numbers, which are days since January 1st,
- 4713 BC. And you can store a time by using a floating-point number;
- it might be twenty two million, three hundred and seventy five point eight days.
- There are also other origins. You can use a lot of different origins; pandas actually lets you specify arbitrary origins.
- But the 1900 system that's used by Excel and other spreadsheets stores days since January 1st,
- 1900. We can also store dates as strings. The ISO format is year, month, day.
- This has the nice advantage that, at least until the year 10000,
- it sorts by date. If you sort it alphanumerically, it's going to sort the resulting strings by date.
- So if you're going to name files after dates, with the dates at the beginning of the file name,
- this is super useful.
- There are also localized numeric forms such as 11/3/2020, which is how we write dates in the United States. Europe and the UK
- generally write it day, month, year: 3/11/2020.
- So if you see a date that's two digits, two digits, year, that's not enough information to know what date we're talking about.
- Are we talking about November 3rd or are we talking about March 11th?
- You need to know the country or locale the date came from to know how to correctly interpret it.
- Sometimes you can infer it by looking for, say, November 28th,
- because 28 isn't a valid month number; that will let you figure out which format you're dealing with.
- But in this localized form, if you just get a date, it's often ambiguous.
- You could also have longer string forms, like if you write out November 3rd,
- 2020. Pandas provides a type called datetime64 that allows you to store dates and times.
- Even if you just have a date, you usually store it as a datetime at midnight.
- At least that's how you work with it in pandas; pandas doesn't have a time-free date type.
- You can create a datetime from a number, units, and an origin.
- So you can say we want two hundred and thirty million
- seconds since the Unix epoch, or we want 3.8 million days since the Julian origin.
- The function also supports a series or an array in addition to a single number,
- so you can create a series or an array of pandas datetime objects.
- You can also convert from a string. This can also be a series or array of strings.
- So in an assignment, if you've got a column that's string dates,
- you can convert that to a column of datetimes by using these functions.
- By default, it's going to parse the time from ISO format.
- But you can also tell it to parse other formats by providing a format string that describes how the time is laid out.
- There's a link in the pandas documentation to the way these format strings work; I've also provided
- that link in the notes that go with this video. So we've got datetimes. Pandas also has an object called a timedelta,
- which stores a difference between two times. If you subtract one datetime from another,
- what you're going to get is a timedelta.
- You can create one from a number plus units, or from a string that describes it; for example, I can create the timedelta one day,
- thirty minutes and twenty-two seconds. The timedelta marks advances in linear time.
- You can't create a timedelta of, for example, one month. The DateOffset is what you use to get one month.
- You can create it from a number and units, and it correctly offsets the dates:
- it knows whether it needs to extend by 30 days or 31 or 28.
- It handles daylight saving time, it handles leap years, it handles leap seconds, and it deals with being able to offset dates properly.
- DateOffset is not natively supported in series. Datetimes and timedeltas are; pandas natively supports both of them in series.
- You can, however, create an object series that contains DateOffsets.
- So if month_series is a numeric series that contains numbers of months, then we can use apply.
- It's a little slow, because it's effectively doing a Python loop,
- but we can use apply to convert these numbers of months into DateOffset objects.
- We get a series of those, which we can then add to a series of datetimes in order to produce offset datetimes.
- For example, if we've got a column that has the term (the number of months the loan is for) and a column for when the loan was issued,
- we can convert the months to a DateOffset, add it to the issue date, and find when the loan is due.
- When you're doing arithmetic with dates, if you add a datetime and a timedelta,
- you're going to get a datetime. A datetime plus a DateOffset is also a datetime. You can subtract as well as add.
- As I said, if you subtract two datetimes, you're going to get a timedelta.
- You can also multiply a DateOffset by a number, and it's going to give you another DateOffset that's multiplied.
- So if you've got two months, you can multiply it by five and you'll get 10 months.
- You can also compare datetimes using comparison operators.
- You do need to create datetimes on both sides. If you've got something that's, say, strings, you need to convert it to a datetime object
- so that you can then do the comparison. So in conclusion, dates and times are typically stored internally using offsets from an origin.
- Usually we store them in UTC and then we translate them to local time
- when we go to display them. Pandas provides a number of functions and types for working with dates and times.
- In addition, NumPy provides some of its own. I generally work with the pandas ones, but NumPy does provide
- timedelta and datetime objects that work just a little bit differently. Python also has its own in its standard library. For our purposes,
- I recommend generally sticking with the pandas ones.

## 🚩 Quiz 11

Quiz 11 is in Canvas.