SciKit Pipeline and Transform Demo¶

This expands the SciKit-Learn logistic regression demo to incorporate pipelines and regularization.

This uses the grad student admissions data again.

Setup¶

Let’s import some packages:

import pandas as pd
import seaborn as sns
import numpy as np
from scipy.special import logit
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, accuracy_score
rng = np.random.RandomState(20201031)

Read Data¶

Load the UCLA grad admissions data.

students = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
students.head()
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4

Let’s split the data into training and test sets:

test_students = students.sample(frac=0.2, random_state=rng)
train_mask = pd.Series(True, index=students.index)
train_mask[test_students.index] = False
train_students = students[train_mask].copy()
train_students.head()
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4

And now we set up the predictor and outcome variables:

feat_cols = ['gre', 'gpa', 'rank']
out_col = 'admit'
train_x = train_students[feat_cols]
train_y = train_students[out_col]

Transformations¶

We’re going to start by standardizing all the features, so that regularization works better.

SciKit-Learn provides transformers that can do this for us. Transformers have a fit method, like other models, to learn the transformation parameters; they also have a transform method that applies the learned transformation to data.

Let’s fit a standard scaler:

scale = StandardScaler()
scale.fit(train_x)
StandardScaler()

The scaler learned the mean of each feature:

scale.mean_
array([584.6875    ,   3.38034375,   2.559375  ])

And the standard deviation (square root of variance):

scale.scale_
array([113.31053501,   0.38513091,   0.95667372])
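
The scaler also stores the per-feature variance as var_, and scale_ is its square root. A quick check (a small sketch using the scaler we just fit; this should print True):

np.allclose(scale.scale_, np.sqrt(scale.var_))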

And it can transform data:

scale.transform(train_x)[:5, :]
array([[-1.80642956,  0.59630698,  0.46058023],
       [ 0.66465576,  0.75209816,  0.46058023],
       [ 1.90019842,  1.60894966, -1.62999669],
       [ 0.48814967, -0.4942313 ,  1.50586869],
       [-0.5708869 , -1.16932642,  1.50586869]])

SciKit-Learn doesn’t know about Pandas, so it returns a plain NumPy array. The indexing gives us the first 5 rows.

Verify that these are the z-standardized values of the first 5 rows of the training data.
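
One way to check (a small sketch; StandardScaler standardizes with the population standard deviation, so we pass ddof=0, and this should print True):

# hand-compute z-scores and compare with the scaler's output
manual_z = (train_x - train_x.mean()) / train_x.std(ddof=0)
np.allclose(manual_z.iloc[:5], scale.transform(train_x)[:5, :])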

If we apply the transformer to the test data, it uses the parameters learned from the training data to do the job:

scale.transform(test_students[feat_cols])[:5,:]
array([[ 1.01766795,  0.88192414, -0.58470823],
       [ 0.84116186,  0.75209816, -0.58470823],
       [ 0.13513748,  0.51841139, -1.62999669],
       [-0.92389909,  0.18086383, -0.58470823],
       [-0.04136862, -0.23457933,  1.50586869]])

Set Up the Pipeline¶

Now we’re going to use a standard scaler in a pipeline with a logistic regression:

pure_pipe = Pipeline([
    ('standardize', StandardScaler()),
    ('classify', LogisticRegression(penalty='none'))
])

Now we can fit the pipeline:

pure_pipe.fit(train_x, train_y)
Pipeline(steps=[('standardize', StandardScaler()),
                ('classify', LogisticRegression(penalty='none'))])

The named_steps attribute exposes the individual steps by name, so we can look up our coefficients:

pure_pipe.named_steps['classify'].coef_
array([[ 0.24559163,  0.3354789 , -0.52467376]])

What’s the training accuracy of this model?

accuracy_score(train_y, pure_pipe.predict(train_x))
0.709375

Regularized Model¶

Let’s do a regularized model now! We’re going to use \(L_2\) (ridge) regularization, which is the default penalty for logistic regression.

We will use LogisticRegressionCV to select the regularization strength using cross-validation on the training data. It will automatically try 10 different regularization strengths.
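
For reference, \(L_2\)-regularized logistic regression minimizes the negative log likelihood plus a penalty on the squared coefficient magnitudes (a sketch of the general form, ignoring constant-factor conventions; SciKit-Learn expresses the strength as \(C = 1/\lambda\)):

\[\hat{\beta} = \operatorname{argmin}_\beta \left[ -\sum_i \log \operatorname{P}(y_i \mid x_i; \beta) + \lambda \sum_j \beta_j^2 \right]\]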

reg_pipe = Pipeline([
    ('standardize', StandardScaler()),
    ('classify', LogisticRegressionCV())
])

Now train it:

reg_pipe.fit(train_x, train_y)
Pipeline(steps=[('standardize', StandardScaler()),
                ('classify', LogisticRegressionCV())])

Now let’s grab the logistic regression out of it and start looking at some of its parameters:

reg_model = reg_pipe.named_steps['classify']
reg_model.coef_
array([[ 0.24039303,  0.32109707, -0.50014372]])

You can see that these coefficients are a little smaller than they were in the un-regularized model. Regularization is doing some work!
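
To put them side by side (a small sketch using names already defined in this notebook):

pd.DataFrame({
    'unregularized': pure_pipe.named_steps['classify'].coef_[0],
    'regularized': reg_model.coef_[0]
}, index=feat_cols)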

What regularization strength did it learn?

reg_model.C_
array([0.35938137])

Remember \(C = 1/\lambda\), to connect this value to the videos.
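
Since \(\lambda\) is just the reciprocal, we can compute it directly (about 2.78 here):

1 / reg_model.C_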

What’s our training accuracy?

accuracy_score(train_y, reg_pipe.predict(train_x))
0.709375

A quick note: the logistic model is trained on transformed variables. If we try to apply it to un-transformed variables, it will have no idea what to do:

accuracy_score(train_y, reg_model.predict(train_x))
c:\users\michaelekstrand\.virtualenvs\cs533-web-yawkpmjf\lib\site-packages\sklearn\base.py:438: UserWarning: X has feature names, but LogisticRegressionCV was fitted without feature names
  warnings.warn(
0.321875

The pipeline makes sure that the same transforms are applied at predict time that were learned at model training time.
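
To see what the pipeline does for us, we can apply its own scaler by hand and then call the bare model (a sketch using this pipeline's step names); this should reproduce the training accuracy we saw above:

scaled_x = reg_pipe.named_steps['standardize'].transform(train_x)
accuracy_score(train_y, reg_model.predict(scaled_x))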

Decision Tree¶

Let’s also try a decision tree classifier, limiting its depth to 2:

from sklearn.tree import DecisionTreeClassifier, plot_tree
dtree = DecisionTreeClassifier(max_depth=2)
dtree.fit(train_x, train_y)
DecisionTreeClassifier(max_depth=2)
plot_tree(dtree, feature_names=feat_cols, impurity=False, class_names=['Deny', 'Admit'], label='root')
[Decision tree plot: the root splits on gpa <= 3.435 (320 samples, [217 deny, 103 admit]). The gpa <= 3.435 branch splits on rank <= 2.5 (177 samples), with both leaves predicting Deny; the gpa > 3.435 branch splits on rank <= 1.5 (143 samples), where the rank-1 leaf (23 samples, [5, 18]) predicts Admit and the other leaf (120 samples, [74, 46]) predicts Deny.]
accuracy_score(train_y, dtree.predict(train_x))
0.71875

Now let’s make an unlimited decision tree:

uldtree = DecisionTreeClassifier()
uldtree.fit(train_x, train_y)
DecisionTreeClassifier()
accuracy_score(train_y, uldtree.predict(train_x))
0.99375

Test Data Accuracy¶

Let’s test our two different models on the testing data.

test_x = test_students[feat_cols]
test_y = test_students[out_col]
pure_y = pure_pipe.predict(test_x)
reg_y = reg_pipe.predict(test_x)

What is our pure-model accuracy?

accuracy_score(test_y, pure_y)
0.6875

And our reg-model accuracy?

accuracy_score(test_y, reg_y)
0.7

The two regression models had the same training accuracy, but the regularized model does slightly better on the test data.

We would like to get an estimate of whether this is meaningful or not. We can’t use a proportion test, because the proportions are paired. We can’t use a t-test, because we don’t have multiple samples of continuous values.

One of the most common answers is to use McNemar’s test for the null hypothesis that the two classifiers have the same accuracy. To compute that, we need a contingency table; it’s like a confusion matrix, but it documents when each of the two classifiers gets the result right or wrong.

We’ll start by making a data frame with whether or not each classifier got the result correct (indexed with the test index, just to keep things linkable):

test_correct = pd.DataFrame({
    'Pure': pure_y == test_y,
    'Reg': reg_y == test_y
}, index=test_y.index)
test_correct.head()
      Pure    Reg
117  False   True
272  False  False
139   True   True
109   True   True
230   True   True

Now we can compute the contingency table by cross-tabulating the two columns:

ct = pd.crosstab(test_correct['Pure'], test_correct['Reg'])
ct
Reg    False  True
Pure
False     24     1
True       0    55

Only one test case differed between the two classifiers, so there is not a lot of difference here.
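
If we want to see which case that was (a small sketch using the frame we just built):

# rows where the two classifiers disagree
test_correct[test_correct['Pure'] != test_correct['Reg']]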

We can now compute the McNemar result:

print(sm.stats.mcnemar(ct))
pvalue      1.0
statistic   0.0

Ok, that is not significant.

Let’s look at Wilson confidence intervals. First the pure model:

sm.stats.proportion_confint(np.sum(pure_y == test_y), len(test_students), method='wilson')
(0.5793222425531255, 0.7784959588710518)

And the regularized model:

sm.stats.proportion_confint(np.sum(reg_y == test_y), len(test_students), method='wilson')
(0.5923184607866026, 0.7893542873991866)

We could also bootstrap our test data to estimate confidence intervals:

def boot_sample(test_x, test_y):
    n = len(test_x)
    # get positions
    sample = rng.randint(n, size=n)

    # get data
    bs_x = test_x.iloc[sample, :]
    bs_y = test_y.iloc[sample]
    return bs_x, bs_y

def boot_class_metric(test_x, test_y, model, metric=accuracy_score, nboot=10000):
    # compute the metric on the full test set
    test_d = model.predict(test_x)
    val = metric(test_y, test_d)

    # re-compute the metric on each bootstrap sample of the test set
    boot_vals = np.empty(nboot)
    for i in range(nboot):
        bs_x, bs_y = boot_sample(test_x, test_y)
        bs_d = model.predict(bs_x)
        boot_vals[i] = metric(bs_y, bs_d)

    # return the observed value and a 95% bootstrap confidence interval
    return val, np.quantile(boot_vals, [0.025, 0.975])

And bootstrap accuracy on our test data:

boot_class_metric(test_x, test_y, pure_pipe)
(0.6875, array([0.5875, 0.7875]))

And with the other model:

boot_class_metric(test_x, test_y, reg_pipe)
(0.7, array([0.6, 0.8]))

We can’t claim any conclusive evidence of improvement.

These metrics are for the accuracy of these specific models. They are not testing the general accuracy of training this model structure on data like this. To test that, we could use a careful cross-validation design; just computing 10-fold CV and doing a t-test is unlikely to be correct, but repeated cross-validation can be effective. See the statistical testing reading.
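
As a rough sketch of what a repeated cross-validation estimate might look like (not run here; it re-fits the whole pipeline on re-splits of the full data set, with fold counts chosen just for illustration):

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 10-fold CV repeated 5 times over the full data set
rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=20201031)
cv_scores = cross_val_score(reg_pipe, students[feat_cols], students[out_col],
                            scoring='accuracy', cv=rkf)
cv_scores.mean(), cv_scores.std()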

Decision Tree Stats¶

Let’s look at the accuracy of our trees.

Compute the predictions:

dt_y = dtree.predict(test_x)
uldt_y = uldtree.predict(test_x)

Short decision tree:

accuracy_score(test_y, dt_y)
0.7375

And the unlimited tree?

accuracy_score(test_y, uldt_y)
0.55

Exercise: is this significant?

Observe the overfitting of the unlimited decision tree!