# SciKit Pipeline and Transform Demo

## Contents

# SciKit Pipeline and Transform Demo#

This expands the SciKit Learn logistic regression demo to incorporate pipelines and regularization.

This uses the grad student admissions data again.

## Setup#

Letâ€™s import some packages:

```
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.special import logit
import matplotlib.pyplot as plt
import statsmodels.api as sm
```

```
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, accuracy_score
```

```
rng = np.random.RandomState(20201031)
```

## Read Data#

Load the UCLA grad admissions data.

```
students = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
students.head()
```

admit | gre | gpa | rank | |
---|---|---|---|---|

0 | 0 | 380 | 3.61 | 3 |

1 | 1 | 660 | 3.67 | 3 |

2 | 1 | 800 | 4.00 | 1 |

3 | 1 | 640 | 3.19 | 4 |

4 | 0 | 520 | 2.93 | 4 |

Letâ€™s train and test:

```
test_students = students.sample(frac=0.2, random_state=rng)
```

```
train_mask = pd.Series(True, index=students.index)
train_mask[test_students.index] = False
train_students = students[train_mask].copy()
train_students.head()
```

admit | gre | gpa | rank | |
---|---|---|---|---|

0 | 0 | 380 | 3.61 | 3 |

1 | 1 | 660 | 3.67 | 3 |

2 | 1 | 800 | 4.00 | 1 |

3 | 1 | 640 | 3.19 | 4 |

4 | 0 | 520 | 2.93 | 4 |

And now we set up the predictor and outcome variables:

```
feat_cols = ['gre', 'gpa', 'rank']
out_col = 'admit'
train_x = train_students[feat_cols]
train_y = train_students[out_col]
```

## Transformations#

Weâ€™re going to start by standardizing all the features, so that regularization works better.

SciKit-Learn provides *transformers* that can do this for us. Transformers have a `fit`

method, like other models, to learn the transformation parameters; they also have a `transform`

that modifies data.

Letâ€™s fit a standard scalar:

```
scale = StandardScaler()
scale.fit(train_x)
```

```
StandardScaler()
```

The scalar learned the mean of each feature:

```
scale.mean_
```

```
array([584.6875 , 3.38034375, 2.559375 ])
```

And the standard deviation (square root of variance):

```
scale.scale_
```

```
array([113.31053501, 0.38513091, 0.95667372])
```

And it can transform data:

```
scale.transform(train_x)[:5, :]
```

```
array([[-1.80642956, 0.59630698, 0.46058023],
[ 0.66465576, 0.75209816, 0.46058023],
[ 1.90019842, 1.60894966, -1.62999669],
[ 0.48814967, -0.4942313 , 1.50586869],
[-0.5708869 , -1.16932642, 1.50586869]])
```

SciKit-Learn doesnâ€™t know about Pandas, so it returns a NumPy matrix. The indexing gives us the first 5 rows.

**Verify that these are the z-standardized values of the first 5 rows of the training data.**

If we apply the transform to the test data, it uses the transforms learned from the training data to do the job:

```
scale.transform(test_students[feat_cols])[:5,:]
```

```
array([[ 1.01766795, 0.88192414, -0.58470823],
[ 0.84116186, 0.75209816, -0.58470823],
[ 0.13513748, 0.51841139, -1.62999669],
[-0.92389909, 0.18086383, -0.58470823],
[-0.04136862, -0.23457933, 1.50586869]])
```

## Set Up the Pipeline#

Now weâ€™re going to use a standard scaler in a pipeline with a logistic regression:

```
pure_pipe = Pipeline([
('standardize', StandardScaler()),
('classify', LogisticRegression(penalty='none'))
])
```

Now we can fit the pipeline:

```
pure_pipe.fit(train_x, train_y)
```

```
Pipeline(steps=[('standardize', StandardScaler()),
('classify', LogisticRegression(penalty='none'))])
```

The `named_steps`

exposes the individual steps by name, so we can look up our coefficients:

```
pure_pipe.named_steps['classify'].coef_
```

```
array([[ 0.24559163, 0.3354789 , -0.52467376]])
```

Whatâ€™s the training accuracy of this model?

```
accuracy_score(train_y, pure_pipe.predict(train_x))
```

```
0.709375
```

## Regularized Model#

Letâ€™s do a regularized model now! Weâ€™re going to use \(L_2\) (Ridge) regression, which is the default for logistic regression.

We will use `LogisticRegressionCV`

to select regression strength using cross-validation on the training data. It will automatically try 10 different regression strengths.

```
reg_pipe = Pipeline([
('standardize', StandardScaler()),
('classify', LogisticRegressionCV())
])
```

Now train it:

```
reg_pipe.fit(train_x, train_y)
```

```
Pipeline(steps=[('standardize', StandardScaler()),
('classify', LogisticRegressionCV())])
```

Now letâ€™s grab the logistic regression out of it and start looking at some of its parameters:

```
reg_model = reg_pipe.named_steps['classify']
reg_model.coef_
```

```
array([[ 0.24039303, 0.32109707, -0.50014372]])
```

You can see that these coefficients are a little smaller than they were in the un-regularized model. Regularization is doing some work!

What regularization strength did it learn?

```
reg_model.C_
```

```
array([0.35938137])
```

Remember \(C = 1/\lambda\), to connect this value to the videos.

Whatâ€™s our training accuracy?

```
accuracy_score(train_y, reg_pipe.predict(train_x))
```

```
0.709375
```

A quick note: the logistic model is trained on *transformed* variables. If we try to apply it to un-transformed varaibles, it will have no idea what to do:

```
accuracy_score(train_y, reg_model.predict(train_x))
```

```
c:\users\michaelekstrand\.virtualenvs\cs533-web-yawkpmjf\lib\site-packages\sklearn\base.py:438: UserWarning: X has feature names, but LogisticRegressionCV was fitted without feature names
warnings.warn(
```

```
0.321875
```

The pipeline makes sure that the same transforms are applied at predict time that were learned at model training time.

## Decision Tree#

```
from sklearn.tree import DecisionTreeClassifier, plot_tree
```

```
dtree = DecisionTreeClassifier(max_depth=2)
dtree.fit(train_x, train_y)
```

```
DecisionTreeClassifier(max_depth=2)
```

```
plot_tree(dtree, feature_names=feat_cols, impurity=False, class_names=['Deny', 'Admit'], label='root')
```

```
[Text(167.4, 181.2, 'gpa <= 3.435\nsamples = 320\nvalue = [217, 103]\nclass = Deny'),
Text(83.7, 108.72, 'rank <= 2.5\n177\n[138, 39]\nDeny'),
Text(41.85, 36.23999999999998, '89\n[63, 26]\nDeny'),
Text(125.55000000000001, 36.23999999999998, '88\n[75, 13]\nDeny'),
Text(251.10000000000002, 108.72, 'rank <= 1.5\n143\n[79, 64]\nDeny'),
Text(209.25, 36.23999999999998, '23\n[5, 18]\nAdmit'),
Text(292.95, 36.23999999999998, '120\n[74, 46]\nDeny')]
```

```
accuracy_score(train_y, dtree.predict(train_x))
```

```
0.71875
```

Now letâ€™s make an unlimited decision tree:

```
uldtree = DecisionTreeClassifier()
uldtree.fit(train_x, train_y)
```

```
DecisionTreeClassifier()
```

```
accuracy_score(train_y, uldtree.predict(train_x))
```

```
0.99375
```

## Test Data Accuracy#

Letâ€™s test our two different models on the testing data.

```
test_x = test_students[feat_cols]
test_y = test_students[out_col]
```

```
pure_y = pure_pipe.predict(test_x)
```

```
reg_y = reg_pipe.predict(test_x)
```

What is our pure-model accuracy?

```
accuracy_score(test_y, pure_y)
```

```
0.6875
```

And our reg-model accuracy?

```
accuracy_score(test_y, reg_y)
```

```
0.7
```

The two regression models had the same training accuracy, but the regularized model does slightly better on the test data.

We would like to get an estimate of whether this meaningful or not. We canâ€™t use a proportion test, because the proportions are paired. We canâ€™t use a *t*-test, because we donâ€™t have multiple samples of continuous values.

One of the most common answers is to use McNemarâ€™s test for the null hypothesis that the two classifiers have the same accuracy. To compute that, we need a **contingency table** â€” itâ€™s like a confusion matrix, but itâ€™s for documenting when two classifiers get the result right or wrong.

Weâ€™ll start by making a data frame with whether or not each classifier got the result correct (indexed with the test index, just to keep things linkable):

```
test_correct = pd.DataFrame({
'Pure': pure_y == test_y,
'Reg': reg_y == test_y
}, index=test_y.index)
test_correct.head()
```

Pure | Reg | |
---|---|---|

117 | False | True |

272 | False | False |

139 | True | True |

109 | True | True |

230 | True | True |

Now we can compute the *contingency table* by **cross-tabulating** the two columns:

```
ct = pd.crosstab(test_correct['Pure'], test_correct['Reg'])
ct
```

Reg | False | True |
---|---|---|

Pure | ||

False | 24 | 1 |

True | 0 | 55 |

Thereâ€™s only one data point that differed, not a lot of difference here.

We can now compute the McNemar result:

```
print(sm.stats.mcnemar(ct))
```

```
pvalue 1.0
statistic 0.0
```

Ok, that is not significant.

Letâ€™s look at Wilson confidence intervals. First the pure model:

```
sm.stats.proportion_confint(np.sum(pure_y == test_y), len(test_students), method='wilson')
```

```
(0.5793222425531255, 0.7784959588710518)
```

And the regularized model:

```
sm.stats.proportion_confint(np.sum(reg_y == test_y), len(test_students), method='wilson')
```

```
(0.5923184607866026, 0.7893542873991866)
```

We could also bootstrap our test data to estimate confidence intervals:

```
def boot_sample(test_x, test_y):
n = len(test_x)
# get positions
sample = rng.randint(n, size=n)
# get data
bs_x = test_x.iloc[sample, :]
bs_y = test_y.iloc[sample]
return bs_x, bs_y
def boot_class_metric(test_x, test_y, model, metric=accuracy_score, nboot=10000):
test_d = model.predict(test_x)
val = metric(test_y, test_d)
boot_vals = np.empty(nboot)
for i in range(nboot):
bs_x, bs_y = boot_sample(test_x, test_y)
bs_d = model.predict(bs_x)
boot_vals[i] = metric(bs_y, bs_d)
return val, np.quantile(boot_vals, [0.025, 0.975])
```

And bootstrap accuracy on our test data:

```
boot_class_metric(test_x, test_y, pure_pipe)
```

```
(0.6875, array([0.5875, 0.7875]))
```

And with the other model:

```
boot_class_metric(test_x, test_y, reg_pipe)
```

```
(0.7, array([0.6, 0.8]))
```

We canâ€™t claim any any conclusive evidence of improvement.

These metrics are for the accuracy of **these specific models**. They are not testing the general accuracy of training **this model structure** over data **like this**. To test that, we could use a careful cross-validation design; just computing 10-fold CV and doing a *t*-test is unlikely to be correct however, but *repeated cross validation* can be effective. See the statistical testing reading.

## Decision Tree Stats#

Letâ€™s look at the accuracy of our trees.

Compute preds:

```
dt_y = dtree.predict(test_x)
uldt_y = uldtree.predict(test_x)
```

Short decision tree:

```
accuracy_score(test_y, dt_y)
```

```
0.7375
```

And the unlimited tree?

```
accuracy_score(test_y, uldt_y)
```

```
0.55
```

**Exercise:** is this significant?

Observe the overfitting of the unlimited decision tree!