Logistic Regression Demo¶
This notebook introduces logistic regression.
Setup¶
Let’s import some packages:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
rng = np.random.RandomState(20201022)
Read Data¶
students = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
students.head()
|  | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.00 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |
Let’s split the data into train and test sets:
test = students.sample(frac=0.2, random_state=rng)
train_mask = pd.Series(True, index=students.index)
train_mask[test.index] = False
train = students[train_mask].copy()
train.head()
|  | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.00 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |
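As a quick sanity check, every row should land in exactly one of the two pieces; a minimal sketch:

# the 80/20 split should cover the full data set exactly once
assert len(train) + len(test) == len(students)
len(train), len(test)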
Train the Model¶
Now we’ll fit a GLM model:
mod = smf.glm('admit ~ gre + gpa + rank', train,
family=sm.families.Binomial())
lg_res = mod.fit()
lg_res.summary()
| Dep. Variable: | admit | No. Observations: | 320 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 316 |
| Model Family: | Binomial | Df Model: | 3 |
| Link Function: | logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -185.53 |
| Date: | Thu, 22 Oct 2020 | Deviance: | 371.06 |
| Time: | 19:51:37 | Pearson chi2: | 321. |
| No. Iterations: | 4 |  |  |
| Covariance Type: | nonrobust |  |  |
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.1575 | 1.231 | -2.566 | 0.010 | -5.570 | -0.745 |
| gre | 0.0025 | 0.001 | 2.092 | 0.036 | 0.000 | 0.005 |
| gpa | 0.6554 | 0.352 | 1.862 | 0.063 | -0.035 | 1.345 |
| rank | -0.5566 | 0.141 | -3.935 | 0.000 | -0.834 | -0.279 |
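The coefficients are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to interpret; a quick sketch using the fitted params:

# odds ratios: multiplicative change in the odds of admission per unit increase in each predictor
# e.g. each step down in rank multiplies the odds by exp(-0.5566), about 0.57
np.exp(lg_res.params)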
Let’s generate scores for the train data:
train['score'] = lg_res.predict(train)
sns.distplot(train['score'])
<matplotlib.axes._subplots.AxesSubplot at 0x13e4287d580>
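Note that sns.distplot is deprecated as of seaborn 0.11; on a newer seaborn, a roughly equivalent sketch is:

# histogram of predicted scores with a KDE overlay
sns.histplot(train['score'], kde=True)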
Let’s compute the likelihood of each observed outcome:
train['like'] = train['score'].pow(train['admit'])
train['like'] *= (1 - train['score']).pow(1 - train['admit'])
train.head()
|  | admit | gre | gpa | rank | score | like |
|---|---|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 | 0.182807 | 0.817193 |
| 1 | 1 | 660 | 3.67 | 3 | 0.321259 | 0.321259 |
| 2 | 1 | 800 | 4.00 | 1 | 0.718391 | 0.718391 |
| 3 | 1 | 640 | 3.19 | 4 | 0.158440 | 0.158440 |
| 4 | 0 | 520 | 2.93 | 4 | 0.104835 | 0.895165 |
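Each row’s likelihood is its score when the student was admitted and 1 − score when they were not. An equivalent sketch selects that directly with np.where instead of powers:

# pick score for admits and 1 - score for rejects in a single step
np.where(train['admit'] == 1, train['score'], 1 - train['score'])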
What is the likelihood of all the data?
train['like'].prod()
2.659838384326594e-81
That’s small! Let’s compute log likelihoods.
train['log_like'] = train['admit'] * np.log(train['score'])
train['log_like'] += (1 - train['admit']) * np.log(1 - train['score'])
train.head()
|  | admit | gre | gpa | rank | score | like | log_like |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 | 0.182807 | 0.817193 | -0.201880 |
| 1 | 1 | 660 | 3.67 | 3 | 0.321259 | 0.321259 | -1.135508 |
| 2 | 1 | 800 | 4.00 | 1 | 0.718391 | 0.718391 | -0.330741 |
| 3 | 1 | 640 | 3.19 | 4 | 0.158440 | 0.158440 | -1.842378 |
| 4 | 0 | 520 | 2.93 | 4 | 0.104835 | 0.895165 | -0.110747 |
Total log likelihood:
train['log_like'].sum()
-185.5311271693419
That matches the log likelihood from the GLM fit process.
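We can check that directly: the fitted results object stores its total log-likelihood as llf.

# total log-likelihood computed by the GLM fit
lg_res.llf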
What is the accuracy on the training data, using 0.5 as our decision threshold?
np.mean(train['admit'] == (train['score'] > 0.5))
0.7
Analysis with Test Data¶
Now we’ll predict our test data:
test['score'] = lg_res.predict(test)
test.head()
|  | admit | gre | gpa | rank | score |
|---|---|---|---|---|---|
| 45 | 1 | 460 | 3.45 | 3 | 0.197909 |
| 377 | 1 | 800 | 4.00 | 2 | 0.593855 |
| 322 | 0 | 500 | 3.01 | 4 | 0.104995 |
| 227 | 0 | 540 | 3.02 | 4 | 0.115585 |
| 391 | 1 | 660 | 3.88 | 2 | 0.486557 |
What are the highest-scored and lowest-scored items?
test.nlargest(5, 'score')
|  | admit | gre | gpa | rank | score |
|---|---|---|---|---|---|
| 12 | 1 | 760 | 4.00 | 1 | 0.697421 |
| 69 | 0 | 800 | 3.73 | 1 | 0.681253 |
| 14 | 1 | 700 | 4.00 | 1 | 0.664381 |
| 206 | 0 | 740 | 3.54 | 1 | 0.618418 |
| 106 | 1 | 700 | 3.56 | 1 | 0.597365 |
test.nsmallest(5, 'score')
|  | admit | gre | gpa | rank | score |
|---|---|---|---|---|---|
| 102 | 0 | 380 | 3.33 | 4 | 0.096431 |
| 166 | 0 | 440 | 3.24 | 4 | 0.104861 |
| 322 | 0 | 500 | 3.01 | 4 | 0.104995 |
| 179 | 0 | 300 | 3.01 | 3 | 0.109722 |
| 227 | 0 | 540 | 3.02 | 4 | 0.115585 |
Let’s make decisions with a threshold of 0.5.
test['decision'] = test['score'] > 0.5
test.head()
|  | admit | gre | gpa | rank | score | decision |
|---|---|---|---|---|---|---|
| 45 | 1 | 460 | 3.45 | 3 | 0.197909 | False |
| 377 | 1 | 800 | 4.00 | 2 | 0.593855 | True |
| 322 | 0 | 500 | 3.01 | 4 | 0.104995 | False |
| 227 | 0 | 540 | 3.02 | 4 | 0.115585 | False |
| 391 | 1 | 660 | 3.88 | 2 | 0.486557 | False |
What is the accuracy of these decisions?
(test['admit'] == test['decision']).mean()
0.7375
Let’s compute the confusion matrix:
out_counts = test.groupby(['decision', 'admit'])['score'].count()
out_counts
decision admit
False 0 52
1 17
True 0 4
1 7
Name: score, dtype: int64
out_counts.unstack()
| decision | admit=0 | admit=1 |
|---|---|---|
| False | 52 | 17 |
| True | 4 | 7 |
What is the recall, TP / (TP + FN)?
out_counts.loc[True, 1] / (out_counts.loc[True, 1] + out_counts.loc[False, 1])
0.2916666666666667
Equivalently, using a column sum:
out_counts.loc[True, 1] / out_counts.loc[:, 1].sum()
0.2916666666666667
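By the same logic we can compute the precision, TP / (TP + FP), the fraction of positive decisions that were correct; a sketch using the same counts:

# of the 4 + 7 = 11 applicants we predicted would be admitted, 7 were
out_counts.loc[True, 1] / out_counts.loc[True].sum()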