Week 9 — Models & Prediction (Oct. 18–22)¶

This week covers more about regression and simulation, and introduces the idea of minimizing a loss function.

🧐 Content Overview¶

Element	Length
🎥 More Regression	3m7s
🎥 Simulation	14m48s
🎥 Variance and Sums of Squares	5m59s
🎥 Overfitting	10m31s
📃 Example of overfitting and underfitting in machine learning	1180 words
🎥 Bias-Variance	9m16s
📃 Understanding the Bias-Variance Tradeoff	3000 words
🎥 Optimizing Loss	15m23s

This week has 0h59m of video and 4180 words of assigned readings. This week’s videos are available in a Panopto folder and as a podcast.

🎥 Introduction¶

This video introduces the week.

Video (3m7s)

Slides

🎥 Simulation¶

This video talks more about simulation as a method for studying statistical techniques, which you are doing in the assignment. I also describe more of NumPy’s random number generation facilities.

Video (14m48s)

Slides

CS 533 Intro to Data Science (Michael Ekstrand): Simulation

Learning Outcomes:
Simulate data to understand statistical behavior
Understand the difference between simulation and the bootstrap

Photo by Markus Spiske on Unsplash

Simulation:
We can make up data (with a random number generator)
We know the population parameters (we set them)
We can take unlimited samples (it’s just an RNG)
Result: we can study the behavior of sampling and statistical methods

Setting Up the RNG:
Modern: np.random.Generator
Create with np.random.default_rng()
Can provide a seed (starting point), which is an integer
Recommended! Same seed + software versions + platform ⇒ same results
Legacy: np.random functions
Python standard library also has random module

Drawing from Distributions:
integers(k, size=n): n integers in range [0,k)
integers(j, k, size=n): n integers in range [j,k)
random(n): n floats uniformly from [0,1)
normal(μ, σ, n): n normally-distributed floats
exponential(β, n): n exponentially-distributed floats, where NumPy's parameter is the scale β = 1/λ rather than the rate λ
Random and uniform are not the same thing.

More Randomness:
Operating on arrays:
choice(a, n): n draws from array a w/ replacement
permutation(a): random reordering of a
More distributions: scipy.stats
Go NUTS: PyMC3, Stan

Example slides (figures only):
Example: Mean
Simulating Means (Law of Large Numbers)
Simulating Means: Check for Normality
Exponential

Simulation:
We have a proof of the Central Limit Theorem
More complex methods are much harder to characterize.
Simulation lets us explore behavior
Also: useful for learning, gaining intuition

Not the Bootstrap:
Bootstrap:
Unknown population parameters
Resample actual observations
Estimate sampling distribution
Purpose: estimate population parameters
Simulation:
Known population parameters
Sampling from theoretical distributions, making up data
Characterize sampling distributions & other properties
Purpose: understand estimate / parameter relationships

Wrapping Up:
Simulation lets us study the behavior of statistical methods by drawing many samples with known population parameters.
This is commonly used in statistics research.
You do it in A4.

Photo by Aditya Chinchure on Unsplash
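
As a minimal sketch of the workflow in the slides, the following draws many samples from a normal distribution with known parameters and checks how the sample means behave. The seed, the parameter values, and the sample sizes are arbitrary choices for illustration, not values taken from the video.

```python
import numpy as np

# Seeded generator so the simulation is repeatable (seed chosen arbitrarily).
rng = np.random.default_rng(20211018)

# Known population parameters: we set them, so we can check our estimates.
mu, sigma = 10.0, 2.0
n_samples, sample_size = 1000, 50

# Draw 1000 samples of 50 observations each, one row per sample.
samples = rng.normal(mu, sigma, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

# The sample means should center on mu, with spread near sigma / sqrt(n).
print("mean of sample means:", sample_means.mean())
print("SD of sample means:  ", sample_means.std(ddof=1))
print("theoretical SE:      ", sigma / np.sqrt(sample_size))

# A few of the other draws listed on the slides.
dice = rng.integers(1, 7, size=10)         # integers in [1, 7)
unif = rng.random(5)                       # uniform floats in [0, 1)
waits = rng.exponential(2.0, 5)            # exponential with scale 2 (mean 2)
shuffled = rng.permutation(np.arange(5))   # random reordering of 0..4
```

Because the population parameters are known, any gap between the simulated spread and the theoretical standard error can be attributed to sampling noise, which is what separates this from a bootstrap over observed data.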

Tip

You should set random seeds for all work that will need randomness, including train/test splits for evaluating predictors.
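
One way to apply this tip to a train/test split is sketched below using only NumPy. The helper name `train_test_split_indices`, the 20% test fraction, and the seed are hypothetical choices for illustration. Library splitters (such as the one in scikit-learn) accept a seed in a similar spirit.

```python
import numpy as np

def train_test_split_indices(n, test_frac, seed):
    """Return (train_idx, test_idx) for n rows; hypothetical helper."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)            # random reordering of 0..n-1
    n_test = int(round(n * test_frac))
    return order[n_test:], order[:n_test]

train_a, test_a = train_test_split_indices(100, 0.2, seed=42)
train_b, test_b = train_test_split_indices(100, 0.2, seed=42)

# With the same seed (and the same software versions and platform),
# the two calls produce the same split.
print(np.array_equal(test_a, test_b))
```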

🎥 Variance, R², and the Sum of Squares¶

This video provides more detail on explained variance and what \(R^2\) means.

Video (5m59s)

Slides
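
For reference, a common way to write the relationship between the sums of squares and \(R^2\) is given below. It assumes a least-squares linear model with an intercept, and the notation is an assumption that may differ from the slides.

\[
SS_{\mathrm{total}} = \sum_i (y_i - \bar{y})^2, \qquad
SS_{\mathrm{resid}} = \sum_i (y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{SS_{\mathrm{resid}}}{SS_{\mathrm{total}}}
\]

Here \(\bar{y}\) is the mean of the observed outcomes and \(\hat{y}_i\) is the model's prediction for observation \(i\), so \(R^2\) is the fraction of the total variation in \(y\) that the model accounts for.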

Resources¶

🎥 Overfitting¶

This video introduces the idea of overfitting: fitting the training data so closely, noise included, that the model predicts the testing data poorly.

Video (10m31s)

Slides
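
The idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not the course notebook: the true function (a sine curve), the noise level, the seed, and the polynomial degrees are all arbitrary choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(533)

def make_data(n):
    """Simulate y = sin(x) + noise; the true function is an arbitrary choice."""
    x = rng.uniform(0, 6, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)
    return x, y

def mse(y, pred):
    """Mean squared error between observations and predictions."""
    return np.mean((y - pred) ** 2)

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in [1, 3, 12]:
    # Least-squares polynomial fit on the training data only.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(degree, results[degree])
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Test error typically falls at first and then rises once the polynomial starts chasing noise in the 30 training points.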

📓 Overfitting Simulation¶

Overfitting Simulation notebook

📃 Overfitting Example¶

Read Example of overfitting and underfitting in machine learning.

🎥 Replication, Bias, and Variance¶

Video (9m16s)

Slides

📃 Bias-Variance Tradeoff¶

Read Understanding the Bias-Variance Tradeoff.

Resources¶

Further reading: Lecture 12: Bias-Variance Tradeoff.

🎥 Optimizing Loss¶

Video (15m23s)

Slides

Links¶

The scipy.optimize.minimize() function.
The minimization regression notebook.
Our introduction to the idea of an objective function.
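
As a rough sketch of how `scipy.optimize.minimize()` can fit a model by minimizing a loss, the example below recovers a regression line by minimizing the sum of squared errors on simulated data. It is not the course notebook: the data, the seed, the starting point, and the function name `squared_error_loss` are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)

# Simulated data with known parameters (intercept 2, slope 3); values are arbitrary.
x = rng.uniform(0, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, 200)

def squared_error_loss(params):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    b0, b1 = params
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Start from an arbitrary guess and let the optimizer search for the minimum.
result = minimize(squared_error_loss, x0=np.zeros(2))
b0_hat, b1_hat = result.x

# Closed-form least-squares fit for comparison (coefficients, lowest degree first).
b0_ls, b1_ls = np.polynomial.polynomial.polyfit(x, y, 1)
print("minimize:     ", b0_hat, b1_hat)
print("least squares:", b0_ls, b1_ls)
```

The two fits should agree closely, since ordinary least squares is exactly the minimizer of this loss. The value of the optimizer approach is that the loss function can be swapped for one without a closed-form solution.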

🚩 Week 9 Quiz¶

Take the Week 9 quiz in Blackboard (will be up by end of Saturday).

Since this is the second of two very closely intertwined weeks, there are questions about 📅 Week 8 — Regression (Oct. 11–15) in the quiz as well.

✅ Practice¶

There are several ways you can practice the material so far:

Practice more regressions with World Bank data
Measure World Bank data predictive accuracy with train-test evaluation and mean squared error

📩 Assignment 4¶

Assignment 4 is due October 24, 2021.

CS 533 Fall 2021
