First, we load our libraries. This makes use of the R plotROC package; it is available in Conda with
conda install -c mdekstrand r-plotroc
library(tidyverse)
library(modelr)
library(plotROC)
library(psych)
options(repr.plot.height=4.5, repr.matrix.max.rows=10)
Next, we load the data. Fun trick: readr can read directly from URLs!
students = read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
students
What does the logistic function (the inverse of the logit) look like?
ggplot(data.frame(x=seq(-10, 10, 0.01))) +
aes(x) +
stat_function(fun=logistic)
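The curve plotted above is the logistic function; the logit is its inverse. Here is a minimal sketch of that relationship in base R. The helper names `logistic_fn` and `logit_fn` are made up for illustration, while `plogis` and `qlogis` are the base R equivalents.

```r
# Hand-written versions (hypothetical helper names, for illustration only).
logistic_fn = function(x) 1 / (1 + exp(-x))   # real number -> probability
logit_fn = function(p) log(p / (1 - p))       # probability -> log-odds

x = seq(-5, 5, 1)
p = logistic_fn(x)

# The logistic squeezes any real number into the open interval (0, 1) ...
range(p)

# ... and the logit maps it back to where it started.
all.equal(logit_fn(p), x)

# Base R already ships both: plogis() is the logistic, qlogis() is the logit.
all.equal(p, plogis(x))
all.equal(logit_fn(p), qlogis(p))
```

This is why a logistic regression can model a linear predictor on the log-odds scale and still report probabilities between 0 and 1.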
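New function: glm (generalized linear model); it works a lot like lm!
family says what kind of generalized regression to fit; binomial (with its default logit link) gives logistic regression, which we use as a binary classifier.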
zscore = function(x) {
(x - mean(x)) / sd(x)
}
full_model = glm(admit ~ gpa + gre + rank, data=students,
family=binomial())
summary(full_model)
std_model = glm(admit ~ zscore(gpa) + zscore(gre) + rank, data=students,
family=binomial())
summary(std_model)
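Standardizing the predictors changes how the coefficients read (per standard deviation instead of per raw unit), but it should not change the fit itself. Here is a small sketch of that claim on simulated data. The `toy` data frame and the coefficients used to generate it are invented for illustration and have nothing to do with the admissions data.

```r
zscore = function(x) {
    (x - mean(x)) / sd(x)
}

# Simulated stand-in data (assumption: made-up distributions and effects).
set.seed(1)
n = 200
toy = data.frame(gpa = rnorm(n, 3.4, 0.4), gre = rnorm(n, 590, 115))
toy$admit = rbinom(n, 1, plogis(-5 + 0.8 * toy$gpa + 0.003 * toy$gre))

raw_fit = glm(admit ~ gpa + gre, data = toy, family = binomial())
std_fit = glm(admit ~ zscore(gpa) + zscore(gre), data = toy, family = binomial())

# Same fitted probabilities either way.
all.equal(fitted(raw_fit), fitted(std_fit), tolerance = 1e-6)

# The standardized slope is the raw slope times the predictor's sd.
raw_slope = unname(coef(raw_fit)["gpa"])
std_slope = unname(coef(std_fit)["zscore(gpa)"])
all.equal(std_slope, raw_slope * sd(toy$gpa), tolerance = 1e-6)
```

The benefit of the standardized model is that its GPA and GRE coefficients are on comparable scales, which makes their relative sizes easier to read.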
Let's do some checks for multicollinearity. The correlation matrix of the predictor variables:
cor(select(students, -admit))
Looks like GRE and GPA might be linked?
ggplot(students) +
aes(x=gpa, y=gre) +
geom_point() + geom_smooth()
First, split the data into training and test sets. Each row needs an id column so we can tell which rows ended up in the test set.
How many rows are we sampling from?
nrow(students)
students_with_ids = students %>%
mutate(id=1:n())
Let's go!
test_students = students_with_ids %>%
sample_frac(0.1)
nrow(test_students)
The train data is everything that isn't test data.
train_students = students_with_ids %>%
anti_join(select(test_students, id), by="id")
nrow(train_students)
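The split above is random, so it changes on every run. Here is a hedged sketch of the same pattern on toy data, with a seed so the result is reproducible and a check that the two sets do not overlap. The seed value, the `toy` tibble, and the `test_rows`/`train_rows` names are assumptions for illustration; the original notebook does not set a seed.

```r
library(dplyr)

set.seed(42)  # arbitrary seed (assumption); any fixed value makes the split repeatable

# Toy stand-in for students_with_ids: 400 rows with an id column.
toy = tibble(id = 1:400, x = rnorm(400))

# Hold out 10% of the rows as test data.
test_rows = toy %>%
    sample_frac(0.1)

# Everything whose id is not in the test set becomes training data.
train_rows = toy %>%
    anti_join(select(test_rows, id), by = "id")

nrow(test_rows)
nrow(train_rows)

# Sanity checks: no id appears in both sets, and no row was lost.
length(intersect(test_rows$id, train_rows$id))
nrow(test_rows) + nrow(train_rows) == nrow(toy)
```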
Now we will train the model on the training data:
train_model_full = glm(admit ~ gpa + gre + rank, train_students, family=binomial())
summary(train_model_full)
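Let's generate some predictions with it. For a glm, the predictions are on the link scale by default, so pred holds log-odds rather than probabilities: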
test_preds = test_students %>%
add_predictions(train_model_full)
test_preds
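Since predictions from a binomial glm come back as log-odds by default, it can help to see how they relate to probabilities. This is a minimal sketch using base R `predict` on simulated data; the `toy` data frame and the `fit` model are made up for illustration.

```r
# Simulated data (assumption: invented for this example).
set.seed(1)
toy = data.frame(x = rnorm(100))
toy$y = rbinom(100, 1, plogis(0.5 * toy$x))

fit = glm(y ~ x, data = toy, family = binomial())

# Default predictions are on the link (log-odds) scale ...
log_odds = predict(fit, newdata = toy)

# ... while type = "response" gives probabilities.
prob = predict(fit, newdata = toy, type = "response")

# The two are connected by the logistic function.
all.equal(plogis(log_odds), prob)

range(log_odds)  # can be any real number
range(prob)      # always between 0 and 1
```

Either scale works for a ROC curve, because the logistic function is monotone and a ROC curve depends only on the ordering of the scores.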
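What threshold should we use?
The choice is a balance between the true positive rate and the false positive rate.
A ROC (Receiver Operating Characteristic) curve shows how those rates change as we shift the threshold. geom_roc uses 2 aesthetics: d is the decision (the observed 0/1 outcome), m is the prediction estimate (numeric).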
full_roc = ggplot(test_preds) +
aes(d=admit, m=pred) +
geom_roc()
full_roc
calc_auc(full_roc)
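To make the ROC curve and its AUC less of a black box, here is a small base R sketch that computes the rates at a single threshold by hand, then the AUC as the probability that a randomly chosen positive case outscores a randomly chosen negative one. The `truth` and `score` vectors and the `rates_at` helper are invented for illustration; this is not how plotROC computes its values internally.

```r
# Made-up outcomes and scores (assumption: illustration only).
truth = c(0, 0, 1, 0, 1, 1, 0, 1)
score = c(0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)

# Hypothetical helper: true and false positive rates at one threshold.
rates_at = function(threshold) {
    predicted = score >= threshold
    c(tpr = sum(predicted & truth == 1) / sum(truth == 1),
      fpr = sum(predicted & truth == 0) / sum(truth == 0))
}

# One point on the ROC curve.
rates_at(0.5)

# Lowering the threshold raises both rates; raising it lowers both.
sapply(c(0.2, 0.5, 0.75), rates_at)

# AUC: share of (positive, negative) pairs where the positive scores higher,
# counting ties as half.
pos = score[truth == 1]
neg = score[truth == 0]
auc = mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation, so values from calc_auc can be read on that scale.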
What if we just use GPA?
train_gpa_model = glm(admit ~ gpa, train_students, family=binomial())
summary(train_gpa_model)
test_gpa_preds = test_students %>%
add_predictions(train_gpa_model)
test_gpa_preds
Stack the data!
test_mm_preds =
bind_rows(Full=test_preds,
GPA=test_gpa_preds,
.id="Model")
test_mm_preds
mm_plot = ggplot(test_mm_preds) +
aes(d=admit, m=pred, color=Model) +
geom_roc()
mm_plot
calc_auc(mm_plot)