Week 9 — Models & Prediction (Oct. 18–22)
This week covers more about regression and simulation, and introduces the idea of minimizing a loss function.
🧐 Content Overview
This week has 0h59m of video and 4180 words of assigned readings. This week’s videos are available in a Panopto folder and as a podcast.
🎥 Introduction
This video introduces the week.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
MODELING AND PREDICTION
Learning Outcomes (Week)
Simulate to understand the behavior of a system
Understand overfitting and apply appropriate design to mitigate it
Select and engineer features for a model
Photo by Jamie Street on Unsplash
Taking Our Bearings
We’ve seen:
Basic data operations
Visualization
Probability and inference
Correlations and linear models
Prediction and inference
Where we’re going:
Classification
Mostly prediction going forward
Evaluating model accuracy
More data types
Text
Time
Workflows
This Week
More on variance
Simulation methods
Prediction accuracy evaluation
Features
Wrapping Up
We’re going to learn more about building and assessing models.
This will set us up to move from regression to classification.
Photo by vorster vanzyl on Unsplash
🎥 Simulation
This video talks more about simulation as a method for studying statistical techniques, which you are doing in the assignment.
I also describe more of NumPy’s random number generation facilities.
SIMULATION
Learning Outcomes
Simulate data to understand statistical behavior
Understand the difference between simulation and the bootstrap
Photo by Markus Spiske on Unsplash
Simulation
We can make up data (with a random number generator)
We know the population parameters (we set them)
We can take unlimited samples (it’s just an RNG)
Result: we can study the behavior of sampling and statistical methods
Setting Up the RNG
Modern: np.random.Generator
Create with np.random.default_rng()
Can provide a seed (starting point) - integer
Recommended!
Same seed + software versions + platform ⇒ same results
Legacy: np.random functions
Python standard library also has random module
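As a minimal sketch of the seeding advice above (the seed value here is arbitrary and chosen only for illustration), two generators created with the same seed produce the same stream of draws:

```python
import numpy as np

# The seed is an arbitrary integer picked for this example.
rng_a = np.random.default_rng(20211018)
rng_b = np.random.default_rng(20211018)

draws_a = rng_a.random(5)
draws_b = rng_b.random(5)

# Same seed (with the same software versions and platform) gives the same draws.
same_stream = np.array_equal(draws_a, draws_b)
print(same_stream)
```

Calling `np.random.default_rng()` with no seed draws fresh entropy from the operating system, so results will differ from run to run.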
Drawing from Distributions
integers(k, size=n) — n integers in range [0,k)
integers(j, k, size=n) — n integers in range [j,k)
random(n) — n floats uniformly from [0,1)
normal(μ, σ, n) — n normally-distributed floats
exponential(β, n) — n exponentially-distributed floats with scale β = 1/λ
Random and uniform are not the same thing.
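The draws listed above might look like this in practice. This is a sketch with made-up parameters; note that the `Generator` methods are named `normal` and `exponential`, and that `exponential` takes a scale (the mean, which is 1 over the rate):

```python
import numpy as np

rng = np.random.default_rng(42)

die_rolls = rng.integers(1, 7, size=10)   # integers in [1, 7), i.e. 1 through 6
uniforms = rng.random(10)                 # floats uniformly from [0, 1)
heights = rng.normal(100, 15, 10)         # mean 100, standard deviation 15
waits = rng.exponential(2.0, 10)          # scale (mean) 2.0, i.e. rate 0.5

print(die_rolls)
print(waits.mean())
```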
More Randomness
Operating on arrays:
choice(a, n) — n draws from array a w/ replacement
permutation(a) — random reordering of a
More distributions: scipy.stats
Go NUTS: PyMC3, Stan
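A small sketch of the array operations above, using an illustrative array of the integers 0 through 9. Passing `replace=False` to `choice` is the variation that samples without replacement:

```python
import numpy as np

rng = np.random.default_rng(7)
values = np.arange(10)

with_repl = rng.choice(values, 5)                     # values may repeat
without_repl = rng.choice(values, 5, replace=False)   # all values distinct
shuffled = rng.permutation(values)                    # reordered copy; values is unchanged

print(with_repl, without_repl, shuffled)
```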
Example: Mean
Simulating Means (Law of Large Numbers)
Simulating Means — Check for Normality
Exponential
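The kind of experiment these slides describe can be sketched as follows. The population (exponential with mean 5), the sample sizes, and the number of repetitions are assumptions made for this example, not values taken from the video:

```python
import numpy as np

rng = np.random.default_rng(533)
true_mean = 5.0   # we set the population parameter ourselves

def sample_means(n, reps=2000):
    """Draw `reps` samples of size n and return each sample's mean."""
    samples = rng.exponential(true_mean, size=(reps, n))
    return samples.mean(axis=1)

small = sample_means(10)
large = sample_means(1000)

# Sample means cluster more tightly around the true mean as n grows.
spread_small = small.std()
spread_large = large.std()
print(spread_small, spread_large)
```

Plotting a histogram of `small` and `large` is one way to check how close the sampling distribution of the mean is to normal, even though the underlying data are skewed.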
Simulation
We have a proof of the Central Limit Theorem
More complex methods are much harder to characterize.
Simulation lets us explore behavior
Also: useful for learning, gaining intuition
Not the Bootstrap
Bootstrap
Unknown population parameters
Resample actual observations
Estimate sampling distribution
Purpose: estimate population parameters
Simulation
Known population parameters
Sampling from theoretical distributions, making up data
Characterize sampling distributions & other properties
Purpose: understand estimate / parameter relationships
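To make the contrast concrete, here is a sketch that computes a sampling distribution of the mean both ways. The normal population (mean 10, standard deviation 2) and the sizes are invented for the example; in a real bootstrap the `observed` array would be actual data with unknown parameters:

```python
import numpy as np

rng = np.random.default_rng(2021)
n, reps = 50, 1000

# Simulation: we choose the population and draw a fresh sample each time.
sim_means = rng.normal(10, 2, size=(reps, n)).mean(axis=1)

# Bootstrap: we have one observed sample and resample it with replacement.
observed = rng.normal(10, 2, n)   # stand-in for real observations
boot_means = np.array([
    rng.choice(observed, size=n, replace=True).mean()
    for _ in range(reps)
])

# Simulated means center on the known population mean;
# bootstrap means center on the observed sample mean.
print(sim_means.mean(), boot_means.mean(), observed.mean())
```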
Wrapping Up
Simulation lets us study the behavior of statistical methods by drawing many samples with known population parameters.
This is commonly used in statistics research. You do it in A4.
Photo by Aditya Chinchure on Unsplash
Tip
You should set random seeds for all work that will need randomness, including train/test splits for evaluating predictors.
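One way to follow this tip, sketched with plain NumPy. The helper name `train_test_indices`, the 20% test fraction, and the seed are all hypothetical choices for illustration:

```python
import numpy as np

def train_test_indices(n_rows, test_frac=0.2, seed=20211022):
    """Return (train_idx, test_idx) for a reproducible random split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_frac))
    return order[n_test:], order[:n_test]

# Calling the function twice with the same seed gives the same split.
train_a, test_a = train_test_indices(100)
train_b, test_b = train_test_indices(100)
print(np.array_equal(test_a, test_b))
```

Libraries that split data for you generally accept a seed or random-state argument that serves the same purpose.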
🎥 Variance, R², and the Sum of Squares
This video provides more detail on explained variance and what the \(R^2\) means.
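As a rough illustration of how \(R^2\) relates to sums of squares, the sketch below fits a line to synthetic data (the coefficients and noise level are made up) and computes \(R^2\) as one minus the ratio of the residual sum of squares to the total sum of squares:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(0, 1, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 200)   # synthetic data, not course data

# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, 1)
preds = intercept + slope * x

ss_total = np.sum((y - y.mean()) ** 2)   # variation around the mean
ss_resid = np.sum((y - preds) ** 2)      # variation the line does not explain
r_squared = 1 - ss_resid / ss_total
print(r_squared)
```

For a one-variable linear regression like this one, the result equals the squared correlation between `x` and `y`.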
🎥 Overfitting
This video introduces the idea of overfitting: learning too much from the training data so we can’t predict the testing data.
PREDICTION AND OVERFITTING
Learning Outcomes
Understand overfitting
Design an evaluation that reduces the likelihood of overfitting
Understand that we can overfit a process, not just an individual model
Photo by Jennifer Burk on Unsplash
Prediction Goals
Predict future unseen data
To learn about the data we already have, we can just look at it
We want to forecast or predict the future
One way of thinking about generalizability
Overfitting
Train-Test
[Diagram: the data is split into train and test sets; the model is fit on the train set and used to predict the test set.]
Test data is unseen
Can model generalize to it?
Fitting to Data
The model structure:
Choice of model (linear, tree, etc.)
Selection & transformation of features
Hyper-parameter values
The model parameters:
Coefficients learned from data
Both can overfit. Train-test addresses parameters.
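The sketch below illustrates overfitting through model structure, using polynomial degree as the structural choice. The sine-shaped data, noise level, sample sizes, and degrees are assumptions invented for this example:

```python
import numpy as np

rng = np.random.default_rng(104)

def make_data(n):
    """Synthetic data: a sine curve plus normal noise."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(500)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((y - np.polyval(coefs, x)) ** 2)

errors = {}
for degree in (1, 3, 12):
    coefs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    errors[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(degree, errors[degree])
```

A more flexible model can always match the training data at least as well, so a falling training error on its own says little about how well the model predicts unseen data.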
Tuning Data
[Diagram: the data is first split into train and test sets, then into train, tune, and test sets.]
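A three-way split might be sketched like this. The 60/20/20 proportions and the row count are hypothetical and not prescribed by the video:

```python
import numpy as np

rng = np.random.default_rng(137)
n_rows = 1000
order = rng.permutation(n_rows)

# Hypothetical proportions: 20% test, 20% tune, remaining 60% train.
n_test = n_rows // 5
n_tune = n_rows // 5
test_idx = order[:n_test]
tune_idx = order[n_test:n_test + n_tune]
train_idx = order[n_test + n_tune:]

print(len(train_idx), len(tune_idx), len(test_idx))
```

The tuning rows are used to compare model structures and hyper-parameter values, so the test rows stay untouched until the final evaluation.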
Wrapping Up
Overfitting arises when our model learns too much, so it can’t generalize to new data.
Splitting our data into training, tuning, and testing sets helps with this.
Photo by Darren Nunis on Unsplash
🎥 Replication, Bias, and Variance
REPLICATION, BIAS, AND VARIANCE
Learning Outcomes
Reason about repeated training on different samples
Introduce the bias-variance tradeoff
Photo by Nick Reynolds on Unsplash
Replication
Remember bootstrapping and simulation?
Repeated samples
Compute expectation, variance, etc. of parameters
What about prediction?
Test Error
Population
Test Error (Repeated Training)
Population
Bias-Variance Tradeoff
Variance
Bias²
Noise
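Repeated training can be simulated directly. The sketch below retrains polynomial models on fresh samples and estimates squared bias and variance of the prediction at a single point; the true function, noise level, evaluation point `x0`, and degrees are all assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(165)
noise_sd = 0.3
x0 = 0.5   # the point at which we examine predictions

def true_f(x):
    return np.sin(3 * x)

def fit_and_predict(degree, n=30):
    """Train a polynomial on a fresh sample and predict at x0."""
    x = rng.uniform(-1, 1, n)
    y = true_f(x) + rng.normal(0, noise_sd, n)
    return np.polyval(np.polyfit(x, y, degree), x0)

results = {}
for degree in (1, 5):
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared bias at x0
    variance = preds.var()                       # variance across retrainings
    results[degree] = (bias_sq, variance)
    # Estimated expected squared error for a new noisy observation at x0.
    print(degree, bias_sq, variance, bias_sq + variance + noise_sd ** 2)
```

In this setup the straight line cannot follow the curve, so its squared bias is much larger than that of the degree-5 polynomial.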
Wrapping Up
Expected test error is the error we expect from predicting new, unobserved data points.
When training from samples, expected error can be decomposed into model bias, model variance, and noise.
Photo by Mitchell Luo on Unsplash
🎥 Optimizing Loss
LOSS AND OPTIMIZATION
Learning Outcomes
Understand a loss function
Optimize model for a loss function
See attached notebook for code!
Photo by Nick Reynolds on Unsplash
Least Squares Regression
Solve Least Squares
Minimization
Many problems are minimization (or maximization) problems!
scipy.optimize provides a minimize function
Model and Loss
def predict_mass_uv(params, data=penguin_std):
    b0, b1 = params
    return b0 + data['flipper'] * b1

def squared_loss_uv(params, data=penguin_std):
    preds = predict_mass_uv(params, data)
    err = data['mass'] - preds
    return np.sum(np.square(err))
Minimization
init_guess = rng.standard_normal(2)
result = spopt.minimize(squared_loss_uv, init_guess)
# get model parameters
result.x
array([1.12829633e-09, 8.71201757e-01])
Statsmodels OLS model
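For readers without the notebook, here is a self-contained sketch of the same idea. It uses synthetic stand-in data rather than the penguin data, and it compares the optimizer's answer against NumPy's closed-form least squares solver (swapped in here in place of a Statsmodels fit to keep the example small):

```python
import numpy as np
import scipy.optimize as spopt

rng = np.random.default_rng(182)

# Synthetic stand-in for standardized flipper length and body mass.
flipper = rng.standard_normal(300)
mass = 0.87 * flipper + rng.normal(0, 0.5, 300)

def predict(params, x):
    b0, b1 = params
    return b0 + x * b1

def squared_loss(params, x, y):
    err = y - predict(params, x)
    return np.sum(np.square(err))

# Numerical search for the parameters that minimize the loss.
init_guess = rng.standard_normal(2)
result = spopt.minimize(squared_loss, init_guess, args=(flipper, mass))

# Closed-form least squares on the same data for comparison.
X = np.column_stack([np.ones(len(flipper)), flipper])
closed_form, *_ = np.linalg.lstsq(X, mass, rcond=None)

print(result.x, closed_form)
```

The two sets of coefficients should agree closely, since both minimize the same squared-error loss.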
General Optimization
Understand data (features and outcome variables)
Define loss (or gain/utility) function
Define predictive model
Search for parameters that minimize loss function
Train and Test
We optimize the loss function on the train data
Evaluate on the test data
Sometimes the loss function is also the evaluation metric; sometimes not
Mean squared error: often used for both
Augmented Loss
We can add more things to the loss function
Penalize model complexity
Penalize “strong” beliefs
Requires predictive utility to overcome them
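One common way to augment a loss is an L2 (ridge-style) penalty on the coefficients. The slides do not name a specific penalty, so treat this as one illustrative choice; the synthetic data and the penalty strength of 50 are assumptions for the example:

```python
import numpy as np
import scipy.optimize as spopt

rng = np.random.default_rng(205)
x = rng.standard_normal(100)
y = 0.9 * x + rng.normal(0, 0.5, 100)   # synthetic data

def penalized_loss(params, x, y, strength):
    b0, b1 = params
    err = y - (b0 + b1 * x)
    # Squared error plus an L2 penalty on the slope.
    return np.sum(err ** 2) + strength * b1 ** 2

plain = spopt.minimize(penalized_loss, [0.0, 0.0], args=(x, y, 0.0))
penalized = spopt.minimize(penalized_loss, [0.0, 0.0], args=(x, y, 50.0))

# The penalty pulls the slope toward zero.
print(plain.x[1], penalized.x[1])
```

A large slope now has to reduce the squared error by more than the penalty it incurs, which is the sense in which predictive utility must overcome the penalty.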
Wrapping Up
Least squares generalizes into minimizing loss functions.
This is the heart of machine learning, particularly supervised learning.
Photo by Mitchell Luo on Unsplash
🚩 Week 9 Quiz
Take the Week 9 quiz in Blackboard (will be up by end of Saturday).
Since this is the second of two very closely intertwined weeks, there are questions about 📅 Week 8 — Regression (Oct. 11–15) in the quiz as well.
✅ Practice
There are several ways you can practice the material so far:
📩 Assignment 4
Assignment 4 is due October 24, 2021.