Logistic Regression Demo

Init

First, we load our libraries. This makes use of the R plotROC package; it is available in Conda with

conda install -c mdekstrand r-plotroc
In [8]:
library(tidyverse)
library(modelr)
library(plotROC)
library(psych)
Attaching package: 'psych'

The following object is masked from 'package:modelr':

    heights

The following objects are masked from 'package:ggplot2':

    %+%, alpha

In [2]:
options(repr.plot.height=4.5, repr.matrix.max.rows=10)

Next we load the data. Handy trick: readr can read directly from URLs!

In [3]:
students = read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
students
Parsed with column specification:
cols(
  admit = col_integer(),
  gre = col_integer(),
  gpa = col_double(),
  rank = col_integer()
)
admit  gre  gpa  rank
    0  380  3.61    3
    1  660  3.67    3
    1  800  4.00    1
    1  640  3.19    4
    0  520  2.93    4
  ...  ...  ...   ...
    0  620  4.00    2
    0  560  3.04    3
    0  460  2.63    2
    0  700  3.65    2
    0  600  3.89    3

Transformations

What does the logistic (inverse logit) function look like?

In [14]:
ggplot(data.frame(x=seq(-10, 10, 0.01))) +
    aes(x) +
    stat_function(fun=logistic)
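The `logistic` function here comes from the psych package, but it is just the inverse logit, which maps log-odds to probabilities. A quick sketch of the math, using base R's equivalent `plogis` (names other than the built-ins are illustrative):

```r
# logistic(x) = 1 / (1 + exp(-x)) maps log-odds to probabilities
logistic_manual = function(x) 1 / (1 + exp(-x))

logistic_manual(0)     # log-odds of 0 correspond to a 50/50 chance
plogis(c(-2, 0, 2))    # base R's built-in inverse logit gives the same values
```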

Logistic Regression

New function: glm; it works a lot like lm!

The family argument specifies the type of generalized linear model; binomial() gives a binary classifier, i.e. logistic regression.

In [28]:
zscore = function(x) {
    (x - mean(x)) / sd(x)
}
In [16]:
full_model = glm(admit ~ gpa + gre + rank, data=students,
                 family=binomial())
summary(full_model)
Call:
glm(formula = admit ~ gpa + gre + rank, family = binomial(), 
    data = students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5802  -0.8848  -0.6382   1.1575   2.1732  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.449548   1.132846  -3.045  0.00233 ** 
gpa          0.777014   0.327484   2.373  0.01766 *  
gre          0.002294   0.001092   2.101  0.03564 *  
rank        -0.560031   0.127137  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.44

Number of Fisher Scoring iterations: 4
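These coefficients are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to interpret; a sketch, assuming the fitted `full_model` above:

```r
# exp() of a log-odds coefficient is an odds ratio: e.g. the gpa entry
# tells us how the odds of admission multiply per one-point increase
# in GPA, holding the other predictors fixed.
exp(coef(full_model))

# Confidence intervals on the same odds-ratio scale
# (confint profiles the likelihood, so it may print a progress message):
exp(confint(full_model))
```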
In [29]:
std_model = glm(admit ~ zscore(gpa) + zscore(gre) + rank, data=students,
                 family=binomial())
summary(std_model)
Call:
glm(formula = admit ~ zscore(gpa) + zscore(gre) + rank, family = binomial(), 
    data = students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5802  -0.8848  -0.6382   1.1575   2.1732  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.5326     0.3154   1.689   0.0913 .  
zscore(gpa)   0.2957     0.1246   2.373   0.0177 *  
zscore(gre)   0.2650     0.1261   2.101   0.0356 *  
rank         -0.5600     0.1271  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.44

Number of Fisher Scoring iterations: 4

Let's do some checks for multicollinearity. The correlation matrix of the predictors:

In [17]:
cor(select(students, -admit))
            gre         gpa        rank
gre   1.0000000  0.38426588 -0.12344707
gpa   0.3842659  1.00000000 -0.05746077
rank -0.1234471 -0.05746077  1.00000000

Looks like GRE and GPA might be correlated?

In [18]:
ggplot(students) +
    aes(x=gpa, y=gre) +
    geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess'
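A more direct check is the variance inflation factor, which measures how much each coefficient's variance is inflated by correlation among the predictors; values well below 5 suggest multicollinearity is not a problem. A sketch, assuming the car package is installed (it is not loaded above):

```r
# VIFs for the full model; car::vif works on glm fits.
# (Assumes install.packages("car") has been run.)
car::vif(full_model)
```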

Effectiveness of Classifier

First, make the train-test split:

  1. Add an id column
  2. Sample rows for the test set
  3. Anti-join to make the train data
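One caveat: sample_frac draws randomly, so the split changes on every run. The three steps can be sketched in one go with a seed for reproducibility (the seed value is arbitrary):

```r
set.seed(20181015)  # arbitrary seed; fixes which 10% lands in the test set

students_with_ids = students %>% mutate(id = 1:n())
test_students  = students_with_ids %>% sample_frac(0.1)
# Supplying `by` explicitly also silences the "Joining, by" message.
train_students = students_with_ids %>%
    anti_join(select(test_students, id), by = "id")
```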

How many rows are we sampling from?

In [22]:
nrow(students)
400
In [23]:
students_with_ids = students %>%
    mutate(id=1:n())

Let's go!

In [24]:
test_students = students_with_ids %>%
    sample_frac(0.1)
nrow(test_students)
40

The train data is everything that isn't test data.

In [26]:
train_students = students_with_ids %>%
    anti_join(select(test_students, id))
nrow(train_students)
Joining, by = "id"
360

Now we will train the model on the training data:

In [27]:
train_model_full = glm(admit ~ gpa + gre + rank, train_students, family=binomial())
summary(train_model_full)
Call:
glm(formula = admit ~ gpa + gre + rank, family = binomial(), 
    data = train_students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5962  -0.8741  -0.6208   1.1451   2.1991  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.556068   1.205096  -2.951  0.00317 ** 
gpa          0.762570   0.351440   2.170  0.03002 *  
gre          0.002546   0.001170   2.177  0.02946 *  
rank        -0.562584   0.135730  -4.145  3.4e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 446.39  on 359  degrees of freedom
Residual deviance: 408.26  on 356  degrees of freedom
AIC: 416.26

Number of Fisher Scoring iterations: 4

Let's generate some predictions with it.

In [30]:
test_preds = test_students %>%
    add_predictions(train_model_full)
test_preds
admit  gre  gpa  rank  id  pred
0 660 3.07 3 197 -1.22208500
1 760 2.81 1 356 -0.04054171
0 640 3.67 3 63 -0.81547172
0 700 2.88 2 99 -0.70253188
0 700 4.00 2 183 0.15154641
..................
1 740 3.52 4 255 -1.2377980
0 480 3.57 2 98 -0.7365739
0 740 4.00 3 331 -0.3091804
1 540 3.77 2 371 -0.4312739
1 660 3.70 4 288 -1.3042500
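Note that for a glm, add_predictions returns values on the link scale, i.e. log-odds (which is why most are negative: a negative log-odds means a probability below 0.5). They can be converted to probabilities with the inverse logit; a sketch using the `test_preds` frame above:

```r
test_probs = test_preds %>%
    mutate(prob = plogis(pred))  # inverse logit: log-odds -> probability
head(select(test_probs, admit, pred, prob))
```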

What threshold should we use?

We need to balance:

  • catching all the positives (recall)
  • not flagging things that aren't (precision)

Use an ROC (Receiver Operating Characteristic) curve to view how the rates change as we shift the threshold. geom_roc uses 2 aesthetics: d is the true class label (0/1), and m is the numeric prediction score.

In [32]:
full_roc = ggplot(test_preds) +
    aes(d=admit, m=pred) +
    geom_roc()
full_roc
In [33]:
calc_auc(full_roc)
PANEL  group        AUC
    1     -1  0.6026667
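The ROC curve summarizes all thresholds at once; to inspect a single operating point, pick a threshold and tabulate decisions against the truth. A sketch at a probability threshold of 0.5 (log-odds of 0), using the `test_preds` frame above:

```r
test_preds %>%
    mutate(decision = as.integer(pred >= 0)) %>%  # log-odds >= 0 means prob >= 0.5
    count(admit, decision)                        # confusion matrix in long form
```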

Smaller Models

What if we just use GPA?

In [34]:
train_gpa_model = glm(admit ~ gpa, train_students, family=binomial())
summary(train_gpa_model)
Call:
glm(formula = admit ~ gpa, family = binomial(), data = train_students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1096  -0.8767  -0.7529   1.3059   2.0087  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -4.498      1.102  -4.084 4.43e-05 ***
gpa            1.084      0.318   3.409 0.000652 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 446.39  on 359  degrees of freedom
Residual deviance: 434.14  on 358  degrees of freedom
AIC: 438.14

Number of Fisher Scoring iterations: 4
In [35]:
test_gpa_preds = test_students %>%
    add_predictions(train_gpa_model)
test_gpa_preds
admit  gre  gpa  rank  id  pred
0 660 3.07 3 197 -1.1699418
1 760 2.81 1 356 -1.4518377
0 640 3.67 3 63 -0.5194127
0 700 2.88 2 99 -1.3759426
0 700 4.00 2 183 -0.1616217
..................
1 740 3.52 4 255 -0.6820450
0 480 3.57 2 98 -0.6278342
0 740 4.00 3 331 -0.1616217
1 540 3.77 2 371 -0.4109912
1 660 3.70 4 288 -0.4868863

Stack the data!

In [36]:
test_mm_preds =
    bind_rows(Full=test_preds,
              GPA=test_gpa_preds,
              .id="Model")
test_mm_preds
Model  admit  gre  gpa  rank  id  pred
Full 0 660 3.07 3 197 -1.22208500
Full 1 760 2.81 1 356 -0.04054171
Full 0 640 3.67 3 63 -0.81547172
Full 0 700 2.88 2 99 -0.70253188
Full 0 700 4.00 2 183 0.15154641
.....................
GPA 1 740 3.52 4 255 -0.6820450
GPA 0 480 3.57 2 98 -0.6278342
GPA 0 740 4.00 3 331 -0.1616217
GPA 1 540 3.77 2 371 -0.4109912
GPA 1 660 3.70 4 288 -0.4868863
In [44]:
mm_plot = ggplot(test_mm_preds) +
    aes(d=admit, m=pred, color=Model) +
    geom_roc()
mm_plot
In [42]:
calc_auc(mm_plot)
PANEL  group        AUC
    1      1  0.6026667
    1      2  0.5720000
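So the full model edges out the GPA-only model on this test set. As a sanity check on calc_auc: the AUC is the probability that a randomly chosen admitted student scores higher than a randomly chosen non-admitted one. A sketch computing it by hand for the full model's predictions:

```r
pos = test_preds$pred[test_preds$admit == 1]
neg = test_preds$pred[test_preds$admit == 0]
# Wilcoxon formulation: fraction of (positive, negative) pairs ranked
# correctly, counting ties as half a correct ranking.
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
```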