Week 10 — Classification

Activities:

The videos are also available as a Panopto playlist.

What is Classification?

In this video, I introduce the week and what classification is.

Log-Odds and Logistics

In this video, I introduce log-odds, along with the logistic function and its inverse, logit.
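
As a small illustration (a sketch of my own, not code from the video), the logistic function and logit can be written with nothing but the standard library. The function names `logistic` and `logit` are chosen here for illustration, and this version is not hardened against overflow for very large negative inputs.

```python
import math


def logistic(x):
    """Map a log-odds value x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Map a probability p in (0, 1) to its log-odds; the inverse of logistic."""
    return math.log(p / (1.0 - p))


p = 0.8
odds = p / (1.0 - p)       # roughly 4-to-1 odds
log_odds = logit(p)        # natural log of the odds
round_trip = logistic(log_odds)  # back to (approximately) 0.8

print(odds, log_odds, round_trip)
```

A log-odds of 0 corresponds to a probability of 0.5; positive log-odds mean the outcome is more likely than not.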

Logistic Regression

We're now ready for our first classification model: logistic regression.

The Confusion Matrix

The confusion matrix describes the outcomes of a classification model and is the basis for computing effectiveness metrics.
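
To make the link between the matrix and its metrics concrete, here is a minimal sketch that counts the four cells by hand and derives a few common metrics. The labels and predictions are made-up toy data, and `confusion_counts` is a hypothetical helper written for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)      # of predicted positives, how many are right
recall = tp / (tp + fn)         # of actual positives, how many we found
specificity = tn / (tn + fp)    # of actual negatives, how many we found

print(tp, fp, fn, tn, accuracy, precision, recall, specificity)
```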

Resources

  • The Wikipedia article has a very good diagram of the confusion matrix and its derived metrics.

Logistic Regression Demo

The demo notebook for the first-half videos.

Week 10 Quiz

The Week 10 quiz will be posted to Blackboard.

Floating Point

This is provided for reference.

StatsModels Documentation

The following StatsModels page documents its logistic regression:

This is not an assigned reading; it is here for your reference.

Log Likelihood

This video describes the log likelihood that is the objective function used by logistic regression.
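
As a rough sketch of the idea (my own illustration, with toy data and a hypothetical `log_likelihood` helper), the log likelihood of a one-feature logistic model sums the log of the probability the model assigns to each observed outcome. Fitting the model means searching for the coefficients that make this sum as large as possible.

```python
import math


def log_likelihood(intercept, slope, xs, ys):
    """Bernoulli log likelihood of 0/1 outcomes ys under a logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        # Predicted probability that y = 1 for this observation.
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Toy data, invented for illustration only.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 1, 1]

ll_flat = log_likelihood(0.0, 0.0, xs, ys)  # model that always predicts 0.5
ll_good = log_likelihood(0.0, 1.0, xs, ys)  # model that tracks the data

print(ll_flat, ll_good)
```

The second model assigns higher probability to what was actually observed, so its log likelihood is closer to zero.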

Scikit-Learn

This video introduces SciKit-Learn and shows how to use it to fit a logistic regression.

SciKit-Learn Logistic Regression

The SciKit Logistic notebook demonstrates training and using a logistic regression classifier with SciKit-Learn.
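
The following is a minimal sketch of that workflow, not the notebook's actual code. It assumes scikit-learn is installed and uses synthetic data from `make_classification` in place of a real data set; the sample size, feature counts, and random seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption, standing in for real data).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the classifier on the training data only.
model = LogisticRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)              # hard 0/1 class predictions
prob = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1
accuracy = model.score(X_test, y_test)    # fraction of test rows classified correctly

print(model.intercept_, model.coef_, accuracy)
```

Note that `LogisticRegression` applies L2 regularization by default, so its coefficients will generally differ somewhat from an unregularized fit of the same data.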

Receiver Operating Characteristic

This video introduces the receiver operating characteristic (ROC) curve, and its use in evaluating classifiers and selecting tradeoffs.
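
To show where the curve comes from, here is a small hand-rolled sketch (toy scores and hypothetical helper names, not the video's code). Each distinct score is tried as a decision threshold, giving one (false positive rate, true positive rate) point per threshold; the area under the resulting curve is then computed with the trapezoid rule. It assumes both classes are present in the data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thresh in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thresh and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thresh and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data, invented for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)

print(points, auc)
```

Lowering the threshold moves up and to the right along the curve: more true positives are caught, at the cost of more false positives.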

Practice

Load the Penguin data, and use a logistic regression to try to classify a penguin as Gentoo or Chinstrap using various measurements. Delete the Adelie penguins first, so you have a binary classification problem.

Biases and Assumptions

This video revisits sources of bias and discusses the assumptions underlying prediction.

Prediction-Based Decisions

Read Sections 1 and 2 of the following paper:

Shira Mitchell, Eric Potash, Solon Barocas, Alexander D'Amour, Kristian Lum. 2018. Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv:1811.07867 [stat.AP].

We'll come back to the ideas in this paper later, but Sections 1 and 2 describe the assumptions underlying most classification problems.

If you would like to learn more, I recommend:

Abolish the #TechToPrison Pipeline

Read Abolish the #TechToPrison Pipeline (the Medium reading time estimate includes the thorough — and valuable — footnotes and list of 2435 signatories). This article probes in more detail the assumptions underlying classes of criminal justice data science applications.

Assignment 5

Assignment 5 is due November 11, 2020.