# Week 6 — Two Variables (Sep. 27–Oct. 1)

Attention

The first midterm exam is on Tuesday.

This week’s learning outcomes are:

Display two potentially-related numeric variables for exploratory analysis.

Compute correlation coefficients between variables.

Run a linear regression.

Since we have the exam this week, the lecture load is significantly reduced.

## 🧐 Content Overview

This week has **0h30m** of video and **0 words** of assigned readings. This week’s videos are available in a Panopto folder and as a podcast.

## 🚩 Midterm A

The first midterm is on Tuesday. It is written to take about an hour, and covers material
up through and including Week 5.

### Exam Rules

You may have **1 note sheet**, letter- or A4-sized, single-sided. (For the final, you will be allowed a two-sided note sheet.)

You should not need a calculator, but may bring one if you wish.

You may answer in either pen or pencil.

### Study Tips

Review the previous quizzes and assignments.

Review the lecture slides to find concepts you are unclear on and need to study further.

Skim the assigned readings, particularly the section headings, to remind yourself what was in them.

Review the course glossary, keeping in mind that it does contain terms we haven’t gotten to yet.

## 🎥 Introduction

This video introduces the week’s topic.

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
TWO VARIABLES
Learning Outcomes (Week)
Explore relationships between two variables
Understand correlation and covariance
We’ve Seen
Computing within a group
Paired t-tests
Two-sample t-tests
Both relate a numeric outcome to a categorical predictor.
Paired Observations
Observe more than one variable for the same object
Top Critics rating
All Critics rating
# of MovieLens ratings
Relating Observations
Observations happen at different levels
A student graduated, or did not
The students at a school graduated at some rate
The students of a particular race at a school graduated at some rate
Predictions
Often, we sort two (or more) variables into two roles:
Outcome variable describes a result
Dependent
Endogenous
Predictor variable(s) are used to explain or predict outcomes
Explanatory
Independent
Exogenous
The Week
Two Variables
Midterm Exam (Wed. PM through Saturday, 72 hours)
Covers through Week 5
Notebook + Conceptual Questions
Concept Qs more specific than quizzes – apply!
Wrapping Up
A lot of interesting problems will have two (or more) variables per observation.
We’re going to look at how to measure these variables’ relationships.
Often an outcome and 1 or more predictors.

## 🎥 Displaying Variables

This video discusses how to display related numeric variables.

DISPLAYING VARIABLES
Learning Outcomes
Display relationships between two variables for exploratory analysis.
Extend this to more than two.
Scatterplots
Remember them?
Scatter plots show relationships between two variables:
Each point is an observation
One variable on X axis
Other on Y axis
sns.scatterplot
plt.scatter
Including Distributions
Can include marginal distributions with scatter
Histograms on margins (jointplot)
“Rug plots” along axes
Useful for exploratory analysis.
Trend Lines - regplot
Pairwise Correlations - pairplot
Wrapping Up
We want to explore the relationships between variables.
Scatter plots and their variants let us do that, with possible augmentations.

## 🎥 Correlation

This video discusses how to compute the *correlation coefficient* between two variables.

CORRELATION
Learning Outcomes
Compute correlation coefficient
Understand the relationship between covariance and correlation
Correlation
We can see the relationship.
But how correlated are the two variables?
Variance
Covariance
Covariance
Correlation
Correlations
Figure from Wikipedia. CC BY-SA 3.0.
Correlation and Independence
Estimating and Testing
Wrapping Up
Correlation measures the degree of linear relationship between two variables.
It is a normalized version of covariance.
Pandas: cov and cor.

Warning

In this video, I list the Pandas correlation function as `cor`. The correct name is `corr`.
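
To make "a normalized version of covariance" concrete, here is a minimal sketch that computes the sample covariance and the correlation coefficient by hand and compares them with the Pandas `cov` and `corr` methods. The data is synthetic and the variable names are made up for the example. The last few lines illustrate the "Correlation and Independence" point: two variables can have a correlation near zero and still be completely dependent.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y is partly driven by x.
rng = np.random.default_rng(533)
n = 200
x = pd.Series(rng.normal(size=n), name="x")
y = pd.Series(0.6 * x + rng.normal(scale=0.8, size=n), name="y")

# Sample covariance: summed products of deviations from the means, over n - 1.
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# Correlation: covariance divided by both sample standard deviations.
r_manual = cov_manual / (x.std() * y.std())

# The Pandas methods compute the same quantities.
cov_pandas = x.cov(y)
r_pandas = x.corr(y)  # note the spelling: corr, not cor

print(f"covariance:  manual={cov_manual:.4f}  pandas={cov_pandas:.4f}")
print(f"correlation: manual={r_manual:.4f}  pandas={r_pandas:.4f}")

# Correlation only measures *linear* relationship: v is fully determined
# by u, yet their correlation is essentially zero.
u = pd.Series(np.linspace(-1, 1, 101))
v = u ** 2
r_uv = u.corr(v)
print(f"corr(u, u^2) = {r_uv:.2e}")
```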

## 🎥 Regression

This video discusses how to fit a line between two variables.

REGRESSION
Learning Outcomes
Compute a linear regression with one predictor
Trend Line
Where did this come from?
Variables and Model
Regression Model
coef – strength of relationship
R² – fraction of variance explained
Ignore p-values and standard errors for now
Wrapping Up
Regression predicts one value with another.
Lets us go beyond same-scale comparisons (paired t-test) to determine the strength of a relationship.
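
This text does not say which library the video uses to produce its regression summary, so the following sketch sticks to NumPy and computes the two quantities the slides highlight, the coefficient and R², directly from their definitions. The data is synthetic, and the "true" intercept and slope (3.0 and 1.5) are arbitrary choices for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a line plus noise.
rng = np.random.default_rng(533)
x = rng.uniform(0, 10, size=150)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=150)

# Least-squares slope for one predictor: cov(x, y) / var(x).
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# The fitted line passes through the point of means.
intercept = y.mean() - slope * x.mean()

# R^2: one minus the residual variation over the total variation.
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# With a single predictor, R^2 equals the squared correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}  intercept={intercept:.3f}")
print(f"R^2={r_squared:.3f}  r^2={r ** 2:.3f}")
```

The slope says how much the outcome changes per unit of the predictor, and R² says how much of the outcome's variance the line accounts for.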

## 📓 Correlation Notebook

The correlation notebook shows how to compute the metrics in this week’s videos, and has the code I used to produce the charts in the slides.

## 🎥 Features

This video introduces the idea of feature engineering.

FEATURES
Learning Outcomes
Compute new features for observations
Think about feature relatedness
Preview: split data
Features
Penguins:
Bill length
Bill depth
Flipper length
Body mass
Can compute:
Bill aspect ratio
Augments data with a new feature
Feature Selection
Things we’ll want to do:
Extract features
Select features
Rescale, normalize, and transform features
Combine features into new ones
These can improve performance and interpretability.
Desired Feature Properties
Well-distributed
Normal often easiest to work with
Interpretable, if we’re doing inference
Uncorrelated
Two correlated features may not contribute more value!
Cause problems with some models!
Beware
Don’t look at correlation between outcome and predictors in exploration!
Biases your feature selection
Hold out testing / validation data, then explore your “training data”.
Wrapping Up
We want to take care in curating and transforming features.
We’re going to need to split data into pieces soon.
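
As a rough sketch of these ideas, the code below holds out test rows first, then computes a bill aspect ratio, standardizes a feature, and checks how related the features are to each other on the training rows only. The measurements are made-up values in the style of the penguins data, and the column names and the 25% test fraction are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up measurements (hypothetical values and column names).
penguins = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.9, 45.2, 49.3, 41.1, 47.6],
    "bill_depth_mm": [18.7, 14.8, 19.5, 17.8, 14.5, 19.9, 18.2, 15.0],
    "flipper_length_mm": [181, 217, 196, 181, 215, 203, 190, 220],
    "body_mass_g": [3750, 5200, 3900, 3625, 5300, 4050, 3800, 5400],
})

# Hold out test rows *before* exploring, then work only with the training rows.
test = penguins.sample(frac=0.25, random_state=533)
train = penguins.drop(test.index)


def add_bill_aspect_ratio(frame):
    """Return a copy of frame with a bill length / bill depth column added."""
    ratio = frame["bill_length_mm"] / frame["bill_depth_mm"]
    return frame.assign(bill_aspect_ratio=ratio)


# The same transformation is applied to both pieces of the data.
train = add_bill_aspect_ratio(train)
test = add_bill_aspect_ratio(test)

# Standardize body mass using statistics from the training rows only,
# so nothing about the test rows leaks into the transformation.
mass_mean = train["body_mass_g"].mean()
mass_std = train["body_mass_g"].std()
train["body_mass_z"] = (train["body_mass_g"] - mass_mean) / mass_std
test["body_mass_z"] = (test["body_mass_g"] - mass_mean) / mass_std

# Check how correlated the candidate features are with one another.
feature_cols = ["bill_length_mm", "bill_depth_mm",
                "flipper_length_mm", "bill_aspect_ratio"]
feature_corr = train[feature_cols].corr()
print(feature_corr.round(2))
```

The correlation table here covers only candidate features, not an outcome variable, in keeping with the caution above.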

## 🚩 Week 6 Quiz

The Week 6 quiz is due before class on Thursday as usual.