Week 6 — Two Variables (9/26–30)
Attention
The first midterm exam is released on Wednesday.
This week’s learning outcomes are:
Display two potentially-related numeric variables for exploratory analysis.
Compute correlation coefficients between variables.
Run a linear regression.
Since we have the exam this week, the lecture load is significantly reduced.
🧐 Content Overview
This week has about 30 minutes of video and no assigned readings. This week’s videos are available in a Panopto folder.
🚩 Midterm A
The first midterm is released on Wednesday on Canvas, and is due Saturday at midnight. It is written to take about an hour, and covers material up through and including Week 5. It does not include Week 6. Once you begin the exam, you will have 4 hours to complete it; if you have a technical problem with that (e.g. losing internet connectivity), let me know and I will reset the clock.
Exam Rules
Notes, books, and class materials are allowed
Asking other humans for help on the exam is not allowed
Study Tips
Review the previous quizzes and assignments.
Review lecture slides to see where you are unclear on concepts and need to review.
Skim assigned readings, particularly the section headings to remind yourself what was in them.
Review the course glossary, keeping in mind that it does contain terms we haven’t gotten to yet.
🎥 Introduction
This video introduces the week’s topic.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
TWO VARIABLES
Learning Outcomes (Week)
Explore relationships between two variables
Understand correlation and covariance
Photo by Sonny Ravesteijn on Unsplash
We’ve Seen
Computing within a group
Paired t-tests
Two-sample t-tests
Both relate a numeric outcome to a categorical predictor.
Paired Observations
Observe more than one variable for the same object
Top Critics rating
All Critics rating
# of MovieLens ratings
Relating Observations
Observations happen at different levels
A student graduated, or did not
The students at a school graduated at some rate
The students of a particular race at a school graduated at some rate
Predictions
Often, we divide two (or more) variables into two roles:
Outcome variable describes a result
Dependent
Endogenous
Predictor variable(s) are used to explain or predict outcomes
Explanatory
Independent
Exogenous
The Week
Two Variables
Midterm Exam (Wed. PM through Saturday, 72 hours)
Covers through Week 5
Notebook + Conceptual Questions
Concept Qs more specific than quizzes – apply!
Wrapping Up
A lot of interesting problems will have two (or more) variables per observation.
We’re going to look at how to measure these variables’ relationships.
Often an outcome and 1 or more predictors.
🎥 Displaying Variables
This video discusses how to display related numeric variables.
DISPLAYING VARIABLES
Learning Outcomes
Display relationships between two variables for exploratory analysis.
Extend this to more than two.
Photo by Greg Jeanneau on Unsplash
Scatterplots
Remember them?
Scatter plots show relationships between two variables:
Each point is an observation
One variable on X axis
Other on Y axis
sns.scatterplot
plt.scatter
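A minimal sketch of both calls named above, using synthetic data rather than any course data set; the column names `x` and `y` and the generated values are assumptions made only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (not from the course): two loosely related columns.
rng = np.random.default_rng(42)
x = rng.normal(50, 10, size=200)
y = 0.5 * x + rng.normal(0, 5, size=200)
df = pd.DataFrame({"x": x, "y": y})

# Seaborn: pass the data frame and name the column for each axis.
fig1, ax1 = plt.subplots()
sns.scatterplot(data=df, x="x", y="y", ax=ax1)

# Matplotlib: pass the two columns directly and label the axes yourself.
fig2, ax2 = plt.subplots()
points = ax2.scatter(df["x"], df["y"])
ax2.set_xlabel("x")
ax2.set_ylabel("y")
```

In a notebook the figures render on their own; in a plain script, call `plt.show()` afterwards.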
Including Distributions
Can include marginal distributions with scatter
Histograms on margins (jointplot)
“Rug plots” along axes
Useful for exploratory analysis.
Trend Lines - regplot
Pairwise Correlations - pairplot
Wrapping Up
We want to explore the relationships between variables.
Scatter plots and their variants let us do that, with possible augmentations.
🎥 Correlation
This video discusses how to compute the correlation coefficient between two variables.
CORRELATION
Learning Outcomes
Compute correlation coefficient
Understand the relationship between covariance and correlation
Photo by Scott Webb on Unsplash
Correlation
We can see the relationship.
But how correlated are the two variables?
Variance
Covariance
Correlation
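The equations from these slides did not survive in this text. For reference, the standard sample definitions are given below; the notation (bars for means, $s$ for standard deviations, $r$ for the correlation coefficient) is an assumption and may differ from the video.

```latex
\begin{align*}
s_x^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
  && \text{(variance of } x\text{)} \\
\operatorname{cov}(x, y) &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  && \text{(covariance of } x \text{ and } y\text{)} \\
r &= \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  && \text{(correlation coefficient)}
\end{align*}
```

Dividing by both standard deviations is what makes $r$ unit-free and keeps it between $-1$ and $1$.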
Correlations
Figure from Wikipedia. CC BY-SA 3.0.
Correlation and Independence
Estimating and Testing
Wrapping Up
Correlation measures the degree of linear relationship between two variables.
It is a normalized version of covariance.
Pandas: cov and corr.
Warning
In this video, I list the Pandas correlation function as cor. The correct name is corr.
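A brief sketch of the two Pandas methods on synthetic data (the columns and values are made up for illustration). It also checks that the correlation is the covariance divided by the product of the two standard deviations.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(7)
x = rng.normal(0, 2, size=500)
df = pd.DataFrame({"x": x, "y": 3 * x + rng.normal(0, 4, size=500)})

# Covariance and Pearson correlation between two columns.
cov_xy = df["x"].cov(df["y"])
r = df["x"].corr(df["y"])

# Correlation is covariance normalized by both standard deviations.
r_manual = cov_xy / (df["x"].std() * df["y"].std())

# Calling corr on the whole frame gives the matrix of pairwise correlations.
corr_matrix = df.corr()
print(r, r_manual)
print(corr_matrix)
```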
🎥 Regression
This video discusses how to fit a line between two variables.
REGRESSION
Learning Outcomes
Compute a linear regression with one predictor
Photo by Andre Benz on Unsplash
Trend Line
Where did this come from?
Variables and Model
Regression Model
coef – strength of relationship
R² – fraction of variance explained
Ignore p-values and standard errors for now
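A minimal sketch of fitting a one-predictor regression, assuming the statsmodels formula interface is the tool in use (the text here does not name the library). The data is synthetic, generated with a true slope of 2.0 and intercept of 1.5, so the fitted values can be compared against known ones.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: y = 1.5 + 2.0 * x + noise (values chosen for illustration).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=300)
df = pd.DataFrame({"x": x, "y": 1.5 + 2.0 * x + rng.normal(0, 1, size=300)})

# Fit y as a linear function of x by ordinary least squares.
model = smf.ols("y ~ x", data=df).fit()

slope = model.params["x"]              # coef: change in y per unit change in x
intercept = model.params["Intercept"]
r_squared = model.rsquared             # fraction of variance in y explained by x

print(slope, intercept, r_squared)
```

With a single predictor, R² equals the square of the correlation coefficient between the two variables. `model.summary()` prints the full table, including the p-values and standard errors.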
Wrapping Up
Regression predicts one value with another.
Lets us go beyond same-scale comparisons (paired t-test) to determine the strength of a relationship.
Photo by Lucy-Claire on Unsplash
📓 Correlation Notebook
The correlation notebook shows how to compute the metrics in this week’s videos, and has the code I used to produce the charts in the slides.
🎥 Features
This video introduces the idea of feature engineering.
FEATURES
Learning Outcomes
Compute new features for observations
Think about feature relatedness
Preview: split data
Photo by Logan Clark on Unsplash
Features
Penguins:
Bill length
Bill depth
Flipper length
Body mass
Can compute:
Bill aspect ratio
Augments data with a new feature
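Computing the new feature might look like the sketch below. The rows are made up, the column names mirror those in the seaborn penguins data set (an assumption about which copy of the data is in use), and defining the aspect ratio as length divided by depth is also an assumption.

```python
import pandas as pd

# A few made-up rows; column names follow the seaborn "penguins" data set.
penguins = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0],
    "bill_depth_mm": [18.7, 17.9, 15.2],
})

# New feature: bill aspect ratio, defined here as length / depth.
penguins["bill_aspect_ratio"] = (
    penguins["bill_length_mm"] / penguins["bill_depth_mm"]
)
print(penguins)
```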
Feature Selection
Things we’ll want to do:
Extract features
Select features
Rescale, normalize, and transform features
Combine features into new ones
These can improve performance and interpretability.
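Two common transformations are sketched here on made-up body mass values: standardizing to z-scores (mean 0, standard deviation 1) and taking a log. Which transformation is appropriate depends on the feature and the model; these are examples, not recommendations from the video.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0, 5700.0, 4500.0]})
mass = df["body_mass_g"]

# Standardize: subtract the mean, divide by the standard deviation.
df["mass_z"] = (mass - mass.mean()) / mass.std()

# Log transform: compresses large values, often useful for skewed features.
df["mass_log"] = np.log(mass)
print(df)
```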
Desired Feature Properties
Well-distributed
Normal often easiest to work with
Interpretable, if we’re doing inference
Uncorrelated
Two correlated features may not contribute more value!
Correlated features can cause problems with some models!
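A correlation matrix over the candidate predictors is one way to spot redundant features. In this synthetic sketch (all names and values are hypothetical), `length_in` is just `length` in different units, so it adds nothing new. Only predictors are compared with each other here, not predictors with the outcome.

```python
import numpy as np
import pandas as pd

# Synthetic predictors (an assumption for illustration only).
rng = np.random.default_rng(11)
n = 400
length = rng.normal(45, 5, size=n)
features = pd.DataFrame({
    "length": length,
    "length_in": length / 25.4,           # same measurement in other units
    "depth": rng.normal(17, 2, size=n),   # generated independently of length
})

# Pairwise correlations among the predictors.
corr = features.corr()
print(corr.round(2))
```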
Beware
Don’t look at correlation between outcome and predictors in exploration!
Biases your feature selection
Hold out testing / validation data, then explore your “training data”.
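A minimal sketch of holding out test data before exploring, using only Pandas. The 80/20 split and the seed are arbitrary choices made for illustration, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(5)
df = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})

# Set aside a random 20% of rows as test data; the fixed seed makes it repeatable.
test = df.sample(frac=0.2, random_state=20220926)

# Everything else is training data: explore and fit models on this part only.
train = df.drop(test.index)
print(len(train), len(test))
```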
Wrapping Up
We want to take care in curating and transforming features.
We’re going to need to split data into pieces soon.
Photo by James Fitzgerald on Unsplash
🚩 Week 6 Quiz
Due to the exam, there is no quiz this week. I will make sure this does not negatively impact anyone’s grade.