Logistic Regression Demo

Init

First, we load our libraries. This makes use of the R plotROC package; it is available in Conda with

conda install -c mdekstrand r-plotroc
In [8]:
library(tidyverse)
library(modelr)
library(plotROC)
library(psych)
Attaching package: 'psych'

The following object is masked from 'package:modelr':

    heights

The following objects are masked from 'package:ggplot2':

    %+%, alpha

In [2]:
options(repr.plot.height=4.5, repr.matrix.max.rows=10)

Next we load the data. Handy trick: readr can read directly from URLs!

In [3]:
students = read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
students
Parsed with column specification:
cols(
  admit = col_integer(),
  gre = col_integer(),
  gpa = col_double(),
  rank = col_integer()
)
admit  gre  gpa  rank
    0  380  3.61    3
    1  660  3.67    3
    1  800  4.00    1
    1  640  3.19    4
    0  520  2.93    4
  ...  ...  ...   ...
    0  620  4.00    2
    0  560  3.04    3
    0  460  2.63    2
    0  700  3.65    2
    0  600  3.89    3

Transformations

What does the logistic (inverse logit) function look like?

In [14]:
ggplot(data.frame(x=seq(-10, 10, 0.01))) +
    aes(x) +
    stat_function(fun=logistic)
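The `logistic` function here comes from the psych package, but it is just the inverse logit, which maps log-odds to probabilities. A quick sketch of the math, using base R's equivalent `plogis` (names other than the built-ins are illustrative):

```r
# logistic(x) = 1 / (1 + exp(-x)) maps log-odds to probabilities
logistic_manual = function(x) 1 / (1 + exp(-x))

logistic_manual(0)     # log-odds of 0 correspond to a 50/50 chance
plogis(c(-2, 0, 2))    # base R's built-in inverse logit gives the same values
```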

Logistic Regression

New function: glm; it works a lot like lm!

The family argument specifies the type of generalized linear model; binomial() gives a binary classifier, i.e. logistic regression.

In [28]:
zscore = function(x) {
    (x - mean(x)) / sd(x)
}
In [16]:
full_model = glm(admit ~ gpa + gre + rank, data=students,
                 family=binomial())
summary(full_model)
Call:
glm(formula = admit ~ gpa + gre + rank, family = binomial(), 
    data = students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5802  -0.8848  -0.6382   1.1575   2.1732  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.449548   1.132846  -3.045  0.00233 ** 
gpa          0.777014   0.327484   2.373  0.01766 *  
gre          0.002294   0.001092   2.101  0.03564 *  
rank        -0.560031   0.127137  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.44

Number of Fisher Scoring iterations: 4
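These coefficients are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to interpret; a sketch, assuming the fitted `full_model` above:

```r
# exp() of a log-odds coefficient is an odds ratio: e.g. the gpa entry
# tells us how the odds of admission multiply per one-point increase
# in GPA, holding the other predictors fixed.
exp(coef(full_model))

# Confidence intervals on the same odds-ratio scale
# (confint profiles the likelihood, so it may print a progress message):
exp(confint(full_model))
```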
In [29]:
std_model = glm(admit ~ zscore(gpa) + zscore(gre) + rank, data=students,
                 family=binomial())
summary(std_model)
Call:
glm(formula = admit ~ zscore(gpa) + zscore(gre) + rank, family = binomial(), 
    data = students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5802  -0.8848  -0.6382   1.1575   2.1732  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.5326     0.3154   1.689   0.0913 .  
zscore(gpa)   0.2957     0.1246   2.373   0.0177 *  
zscore(gre)   0.2650     0.1261   2.101   0.0356 *  
rank         -0.5600     0.1271  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.44

Number of Fisher Scoring iterations: 4

Let's do some checks for multicollinearity. The correlation matrix of the predictors:

In [17]:
cor(select(students, -admit))
            gre         gpa        rank
gre   1.0000000  0.38426588 -0.12344707
gpa   0.3842659  1.00000000 -0.05746077
rank -0.1234471 -0.05746077  1.00000000

Looks like GRE and GPA might be correlated?

In [18]:
ggplot(students) +
    aes(x=gpa, y=gre) +
    geom_point() + geom_smooth()
`geom_smooth()` using method = 'loess'
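A more direct check is the variance inflation factor, which measures how much each coefficient's variance is inflated by correlation among the predictors; values well below 5 suggest multicollinearity is not a problem. A sketch, assuming the car package is installed (it is not loaded above):

```r
# VIFs for the full model; car::vif works on glm fits.
# (Assumes install.packages("car") has been run.)
car::vif(full_model)
```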

Effectiveness of Classifier

First, make the train-test split:

  1. Add an id column
  2. Sample rows for the test set
  3. Anti-join to make the train data
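One caveat: sample_frac draws randomly, so the split changes on every run. The three steps can be sketched in one go with a seed for reproducibility (the seed value is arbitrary):

```r
set.seed(20181015)  # arbitrary seed; fixes which 10% lands in the test set

students_with_ids = students %>% mutate(id = 1:n())
test_students  = students_with_ids %>% sample_frac(0.1)
# Supplying `by` explicitly also silences the "Joining, by" message.
train_students = students_with_ids %>%
    anti_join(select(test_students, id), by = "id")
```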

How many rows are we sampling from?

In [22]:
nrow(students)
400
In [23]:
students_with_ids = students %>%
    mutate(id=1:n())

Let's go!

In [24]:
test_students = students_with_ids %>%
    sample_frac(0.1)
nrow(test_students)
40

The train data is everything that isn't test data.

In [26]:
train_students = students_with_ids %>%
    anti_join(select(test_students, id))
nrow(train_students)
Joining, by = "id"
360

Now we will train the model on the training data:

In [27]:
train_model_full = glm(admit ~ gpa + gre + rank, train_students, family=binomial())
summary(train_model_full)
Call:
glm(formula = admit ~ gpa + gre + rank, family = binomial(), 
    data = train_students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5962  -0.8741  -0.6208   1.1451   2.1991  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.556068   1.205096  -2.951  0.00317 ** 
gpa          0.762570   0.351440   2.170  0.03002 *  
gre          0.002546   0.001170   2.177  0.02946 *  
rank        -0.562584   0.135730  -4.145  3.4e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 446.39  on 359  degrees of freedom
Residual deviance: 408.26  on 356  degrees of freedom
AIC: 416.26

Number of Fisher Scoring iterations: 4

Let's generate some predictions with it.

In [30]:
test_preds = test_students %>%
    add_predictions(train_model_full)
test_preds
admit  gre  gpa  rank  id  pred
0 660 3.07 3 197 -1.22208500
1 760 2.81 1 356 -0.04054171
0 640 3.67 3 63 -0.81547172
0 700 2.88 2 99 -0.70253188
0 700 4.00 2 183 0.15154641
..................
1 740 3.52 4 255 -1.2377980
0 480 3.57 2 98 -0.7365739
0 740 4.00 3 331 -0.3091804
1 540 3.77 2 371 -0.4312739
1 660 3.70 4 288 -1.3042500
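Note that for a glm, add_predictions returns values on the link scale, i.e. log-odds (which is why most are negative: a negative log-odds means a probability below 0.5). They can be converted to probabilities with the inverse logit; a sketch using the `test_preds` frame above:

```r
test_probs = test_preds %>%
    mutate(prob = plogis(pred))  # inverse logit: log-odds -> probability
head(select(test_probs, admit, pred, prob))
```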

What threshold should we use?

We need to balance:

  • catching all the positives (recall)
  • not flagging things that aren't (precision)

Use an ROC (Receiver Operating Characteristic) curve to view how the rates change as we shift the threshold. geom_roc uses 2 aesthetics: d is the true class label (0/1), and m is the numeric prediction score.

In [32]:
full_roc = ggplot(test_preds) +
    aes(d=admit, m=pred) +
    geom_roc()
full_roc
In [33]:
calc_auc(full_roc)
PANEL  group        AUC
    1     -1  0.6026667
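The ROC curve summarizes all thresholds at once; to inspect a single operating point, pick a threshold and tabulate decisions against the truth. A sketch at a probability threshold of 0.5 (log-odds of 0), using the `test_preds` frame above:

```r
test_preds %>%
    mutate(decision = as.integer(pred >= 0)) %>%  # log-odds >= 0 means prob >= 0.5
    count(admit, decision)                        # confusion matrix in long form
```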

Smaller Models

What if we just use GPA?

In [34]:
train_gpa_model = glm(admit ~ gpa, train_students, family=binomial())
summary(train_gpa_model)
Call:
glm(formula = admit ~ gpa, family = binomial(), data = train_students)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1096  -0.8767  -0.7529   1.3059   2.0087  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -4.498      1.102  -4.084 4.43e-05 ***
gpa            1.084      0.318   3.409 0.000652 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 446.39  on 359  degrees of freedom
Residual deviance: 434.14  on 358  degrees of freedom
AIC: 438.14

Number of Fisher Scoring iterations: 4
In [35]:
test_gpa_preds = test_students %>%
    add_predictions(train_gpa_model)
test_gpa_preds
admit  gre  gpa  rank  id  pred
0 660 3.07 3 197 -1.1699418
1 760 2.81 1 356 -1.4518377
0 640 3.67 3 63 -0.5194127
0 700 2.88 2 99 -1.3759426
0 700 4.00 2 183 -0.1616217
..................
1 740 3.52 4 255 -0.6820450
0 480 3.57 2 98 -0.6278342
0 740 4.00 3 331 -0.1616217
1 540 3.77 2 371 -0.4109912
1 660 3.70 4 288 -0.4868863

Stack the data!

In [36]:
test_mm_preds =
    bind_rows(Full=test_preds,
              GPA=test_gpa_preds,
              .id="Model")
test_mm_preds
Model  admit  gre  gpa  rank  id  pred
Full 0 660 3.07 3 197 -1.22208500
Full 1 760 2.81 1 356 -0.04054171
Full 0 640 3.67 3 63 -0.81547172
Full 0 700 2.88 2 99 -0.70253188
Full 0 700 4.00 2 183 0.15154641
.....................
GPA 1 740 3.52 4 255 -0.6820450
GPA 0 480 3.57 2 98 -0.6278342
GPA 0 740 4.00 3 331 -0.1616217
GPA 1 540 3.77 2 371 -0.4109912
GPA 1 660 3.70 4 288 -0.4868863
In [44]:
mm_plot = ggplot(test_mm_preds) +
    aes(d=admit, m=pred, color=Model) +
    geom_roc()
mm_plot
In [42]:
calc_auc(mm_plot)
PANEL  group        AUC
    1      1  0.6026667
    1      2  0.5720000
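So the full model edges out the GPA-only model on this test set. As a sanity check on calc_auc: the AUC is the probability that a randomly chosen admitted student scores higher than a randomly chosen non-admitted one. A sketch computing it by hand for the full model's predictions:

```r
pos = test_preds$pred[test_preds$admit == 1]
neg = test_preds$pred[test_preds$admit == 0]
# Wilcoxon formulation: fraction of (positive, negative) pairs ranked
# correctly, counting ties as half a correct ranking.
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
```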