Tuning Example
Tuning Example#
This notebook demonstrates hyperparameter tuning.
It uses the CHI Papers data.
Every search below optimizes for accuracy, the metric returned by the score
function on a classifier.
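To make the "score means accuracy" point concrete, here is a minimal sketch on a tiny synthetic data set (the data, `toy_rng`, and the variable names are made up for illustration and are not the CHI Papers data). It compares a classifier's `score` method with `accuracy_score` applied to its predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 100 points, label depends on the first feature
toy_rng = np.random.RandomState(42)
X = toy_rng.normal(size=(100, 3))
y = X[:, 0] > 0

# Fit on the first 80 points, evaluate on the last 20
clf = KNeighborsClassifier(3).fit(X[:80], y[:80])

score_acc = clf.score(X[80:], y[80:])
manual_acc = accuracy_score(y[80:], clf.predict(X[80:]))
print(score_acc, manual_acc)
```

The two numbers should match, which is why the search classes below end up optimizing accuracy when no other scoring is requested.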
Setup#
scikit-optimize isn’t in Anaconda, so we’ll install it from Pip:
%pip install scikit-optimize
Requirement already satisfied: scikit-optimize in c:\users\michaelekstrand\anaconda3\lib\site-packages (0.8.1)
Requirement already satisfied: pyaml>=16.9 in c:\users\michaelekstrand\anaconda3\lib\site-packages (from scikit-optimize) (20.4.0)
Requirement already satisfied: scikit-learn>=0.20.0 in c:\users\michaelekstrand\anaconda3\lib\site-packages (from scikit-optimize) (0.23.1)
Requirement already satisfied: numpy>=1.13.3 in c:\users\michaelekstrand\anaconda3\lib\site-packages (from scikit-optimize) (1.18.5)
Requirement already satisfied: scipy>=0.19.1 in c:\users\michaelekstrand\anaconda3\lib\site-packages (from scikit-optimize) (1.5.0)
Requirement already satisfied: joblib>=0.11 in c:\users\michaelekstrand\anaconda3\lib\site-packages (from scikit-optimize) (0.16.0)
Requirement already satisfied: PyYAML in c:\users\michaelekstrand\anaconda3\lib\site-packages (from pyaml>=16.9->scikit-optimize) (5.3.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\michaelekstrand\anaconda3\lib\site-packages (from scikit-learn>=0.20.0->scikit-optimize) (2.1.0)
Note: you may need to restart the kernel to use updated packages.
Import our general PyData packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
And some scikit-learn classes:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import MultinomialNB
And finally the Bayesian optimizer:
from skopt import BayesSearchCV
We want predictable randomness:
rng = np.random.RandomState(20201130)
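As a quick sketch of what "predictable" means here (the names `a` and `b` are just for illustration), two generators built from the same seed produce the same draws. The generator is stateful, though, so each draw advances it:

```python
import numpy as np

# Two generators with the same seed
a = np.random.RandomState(20201130)
b = np.random.RandomState(20201130)

draws_a = a.randint(0, 100, size=5)
draws_b = b.randint(0, 100, size=5)
same = bool((draws_a == draws_b).all())
print(same)

# Drawing again from `a` continues its sequence rather than restarting it
next_a = a.randint(0, 100, size=5)
print(draws_a, next_a)
```

Because one `rng` object is shared by the split and the searches in this notebook, results are reproducible only if the cells run in the same order each time.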
In this notebook, I have SciKit-Learn run some tasks in parallel. Let’s configure the (max) number of parallel tasks in one place, so you can easily adjust it based on your computer’s capacity:
NJOBS = 8
Load Data#
We’re going to load the CHI Papers data from the CSV file, output by the other notebook:
papers = pd.read_csv('chi-papers.csv', encoding='utf8')
papers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13422 entries, 0 to 13421
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        13333 non-null  object
 1   title     13416 non-null  object
 2   authors   13416 non-null  object
 3   date      13422 non-null  object
 4   abstract  12926 non-null  object
 5   year      13422 non-null  int64
dtypes: int64(1), object(5)
memory usage: 629.3+ KB
Let’s treat missing abstracts and titles as empty strings:
papers['abstract'] = papers['abstract'].fillna('')
papers['title'] = papers['title'].fillna('')
For our purposes we want all text - the title and the abstract. We will join them with a space, so we don’t fuse the last word of the title to the first word of the abstract:
papers['all_text'] = papers['title'] + ' ' + papers['abstract']
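Filling in the missing values matters for this concatenation. A small sketch with a made-up two-row frame (`toy` and its contents are hypothetical) shows what happens when one side is missing:

```python
import pandas as pd

# Hypothetical frame: the second paper has no abstract
toy = pd.DataFrame({
    'title': ['Touch Input', 'Eye Tracking'],
    'abstract': ['We study taps.', None],
})

# Without filling, a missing abstract makes the whole combined text missing
unfilled = toy['title'] + ' ' + toy['abstract']

# With filling, the title survives
filled = toy['title'] + ' ' + toy['abstract'].fillna('')

print(unfilled)
print(filled)
```

A missing combined text would then break the vectorizer, which expects strings.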
We’re going to classify papers as recent if they’re newer than 2005:
papers['IsRecent'] = papers['year'] > 2005
And make training and test data:
train, test = train_test_split(papers, test_size=0.2, random_state=rng)
Let’s make a function that reports a model’s performance on the test data:
def measure(model, text='all_text'):
    preds = model.predict(test[text])
    print(classification_report(test['IsRecent'], preds))
And look at the class distribution:
sns.countplot(x='IsRecent', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x26bb8bdde80>
Classifying New Papers#
Let’s classify recent papers with k-NN on TF-IDF vectors:
base_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True, max_features=10000)),
    ('class', KNeighborsClassifier(5))
])
base_knn.fit(train['all_text'], train['IsRecent'])
Pipeline(steps=[('vectorize',
TfidfVectorizer(max_features=10000, stop_words='english')),
('class', KNeighborsClassifier())])
And measure it:
measure(base_knn)
              precision    recall  f1-score   support

       False       0.39      1.00      0.56      1034
        True       0.91      0.01      0.02      1651

    accuracy                           0.39      2685
   macro avg       0.65      0.51      0.29      2685
weighted avg       0.71      0.39      0.23      2685
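To put that accuracy in context, here is a small arithmetic sketch using only the numbers printed in the report above (the variable names are mine). Accuracy is the support-weighted mean of per-class recall, and a useful yardstick is the accuracy of always predicting the larger class:

```python
# Values copied from the classification report above
support = {False: 1034, True: 1651}
recall = {False: 1.00, True: 0.01}  # rounded to 2 digits in the report

total = sum(support.values())

# Correct predictions per class = recall * support; approximate due to rounding
approx_correct = sum(recall[c] * support[c] for c in support)
approx_accuracy = approx_correct / total

# Accuracy of a model that always predicts the majority class
majority_baseline = max(support.values()) / total

print(round(approx_accuracy, 2), round(majority_baseline, 3))
```

By this arithmetic the untuned k-NN model scores well below what always answering "recent" would score on this test set.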
Tune the Neighborhood#
Let’s tune the neighborhood with a grid search:
tune_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True, max_features=10000)),
    ('class', GridSearchCV(KNeighborsClassifier(), param_grid={
        'n_neighbors': [1, 2, 3, 5, 7, 10]
    }, n_jobs=NJOBS))
])
tune_knn.fit(train['all_text'], train['IsRecent'])
Pipeline(steps=[('vectorize',
TfidfVectorizer(max_features=10000, stop_words='english')),
('class',
GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=8,
param_grid={'n_neighbors': [1, 2, 3, 5, 7,
10]}))])
What did it pick?
tune_knn.named_steps['class'].best_params_
{'n_neighbors': 10}
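The search picked the largest value in the grid, which hints that the best neighborhood size may lie beyond it. One way to check is to look at the per-candidate scores in `cv_results_`. The sketch below does that on hypothetical synthetic data (not the CHI Papers data), with the same grid:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical noisy toy problem
toy_rng = np.random.RandomState(0)
X = toy_rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * toy_rng.normal(size=200)) > 0

search = GridSearchCV(KNeighborsClassifier(), param_grid={
    'n_neighbors': [1, 2, 3, 5, 7, 10]
})
search.fit(X, y)

# One row per candidate, with mean and spread of the cross-validation accuracy
results = pd.DataFrame(search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

For the pipeline above, the same table would come from `tune_knn.named_steps['class'].cv_results_`. If the mean score is still rising at the edge of the grid, extending the grid is a reasonable next step.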
And measure it:
measure(tune_knn)
              precision    recall  f1-score   support

       False       0.51      0.90      0.65      1034
        True       0.87      0.45      0.60      1651

    accuracy                           0.62      2685
   macro avg       0.69      0.67      0.62      2685
weighted avg       0.73      0.62      0.62      2685
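One design note on the pipeline in this section: the vectorizer sits outside the search, so it is fit once on all of the training text before the grid search splits it into folds. An alternative arrangement, sketched below on a hypothetical toy corpus (the documents, labels, and `cv=2` are made up for illustration), wraps the whole pipeline in the search so the vectorizer is refit inside each fold. Parameters of a pipeline step are then addressed as `step__parameter`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 documents per class
docs = [
    'touch screen input study', 'gesture input on touch devices',
    'mobile touch typing errors', 'touch gestures for phones',
    'tablet touch interaction', 'finger input on touch screens',
    'desktop menu design', 'command line menu usability',
    'keyboard shortcuts and menus', 'menu layout for desktop software',
    'window and menu management', 'mouse pointing in desktop menus',
]
labels = [True] * 6 + [False] * 6

pipe = Pipeline([
    ('vectorize', TfidfVectorizer()),
    ('class', KNeighborsClassifier())
])

# The search now wraps the whole pipeline, vectorizer included
search = GridSearchCV(pipe, param_grid={'class__n_neighbors': [1, 3]}, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

The trade-off is extra vectorization work per fold in exchange for validation folds that never influence the vocabulary or IDF weights. The difference is often small, so this is a judgment call rather than a correction.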
SVD Neighborhood#
Let’s set up an SVD-based neighborhood, and use random search to tune both the latent feature count and the neighborhood size:
svd_knn_inner = Pipeline([
    ('latent', TruncatedSVD(random_state=rng)),
    ('class', KNeighborsClassifier())
])
svd_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('class', RandomizedSearchCV(svd_knn_inner, param_distributions={
        'latent__n_components': stats.randint(1, 50),
        'class__n_neighbors': stats.randint(1, 25)
    }, n_iter=60, n_jobs=NJOBS, random_state=rng))
])
svd_knn.fit(train['all_text'], train['IsRecent'])
Pipeline(steps=[('vectorize', TfidfVectorizer(stop_words='english')),
('class',
RandomizedSearchCV(estimator=Pipeline(steps=[('latent',
TruncatedSVD(random_state=RandomState(MT19937) at 0x26BB86DF140)),
('class',
KNeighborsClassifier())]),
n_iter=60, n_jobs=8,
param_distributions={'class__n_neighbors': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000026BB8B0BEB0>,
'latent__n_components': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000026BB945A070>},
random_state=RandomState(MT19937) at 0x26BB86DF140))])
What parameters did we pick?
svd_knn['class'].best_params_
{'class__n_neighbors': 21, 'latent__n_components': 21}
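A detail worth knowing about the search space: `stats.randint(low, high)` excludes its upper bound, so `stats.randint(1, 50)` can yield at most 49 components. A minimal check (the sample size and seed are arbitrary choices for illustration):

```python
from scipy import stats

dist = stats.randint(1, 50)

# Draw many samples and look at the observed range
samples = dist.rvs(size=10000, random_state=0)
print(samples.min(), samples.max())

# The probability mass confirms the bounds: 49 is possible, 50 is not
print(dist.pmf(49), dist.pmf(50))
```

To include 50 in the search, the distribution would be `stats.randint(1, 51)`.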
And measure it on the test data:
measure(svd_knn)
              precision    recall  f1-score   support

       False       0.75      0.47      0.58      1034
        True       0.73      0.90      0.81      1651

    accuracy                           0.74      2685
   macro avg       0.74      0.69      0.69      2685
weighted avg       0.74      0.74      0.72      2685
SVD with scikit-optimize#
Now let’s search the same parameters with scikit-optimize’s Bayesian search:
svd_knn_inner = Pipeline([
    ('latent', TruncatedSVD()),
    ('class', KNeighborsClassifier())
])
svd_bayes_knn = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('class', BayesSearchCV(svd_knn_inner, {
        'latent__n_components': (1, 50),
        'class__n_neighbors': (1, 25)
    }, n_jobs=NJOBS, random_state=rng))
])
svd_bayes_knn.fit(train['all_text'], train['IsRecent'])
C:\Users\michaelekstrand\Anaconda3\lib\site-packages\skopt\optimizer\optimizer.py:449: UserWarning: The objective has been evaluated at this point before.
warnings.warn("The objective has been evaluated "
Pipeline(steps=[('vectorize', TfidfVectorizer(stop_words='english')),
('class',
BayesSearchCV(estimator=Pipeline(steps=[('latent',
TruncatedSVD()),
('class',
KNeighborsClassifier())]),
n_jobs=8,
random_state=RandomState(MT19937) at 0x26BB86DF140,
search_spaces={'class__n_neighbors': (1, 25),
'latent__n_components': (1,
50)}))])
What parameters did we pick?
svd_bayes_knn['class'].best_params_
OrderedDict([('class__n_neighbors', 25), ('latent__n_components', 20)])
And measure it:
measure(svd_bayes_knn)
              precision    recall  f1-score   support

       False       0.74      0.43      0.55      1034
        True       0.72      0.91      0.80      1651

    accuracy                           0.72      2685
   macro avg       0.73      0.67      0.67      2685
weighted avg       0.73      0.72      0.70      2685
Naive Bayes#
Let’s give the Naive Bayes classifier a whirl:
nb = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', lowercase=True, max_features=10000)),
    ('class', MultinomialNB())
])
nb.fit(train['all_text'], train['IsRecent'])
Pipeline(steps=[('vectorize',
TfidfVectorizer(max_features=10000, stop_words='english')),
('class', MultinomialNB())])
measure(nb)
              precision    recall  f1-score   support

       False       0.89      0.50      0.64      1034
        True       0.75      0.96      0.85      1651

    accuracy                           0.78      2685
   macro avg       0.82      0.73      0.74      2685
weighted avg       0.81      0.78      0.77      2685
Summary Accuracy#
What does our test accuracy look like for our various classifiers?
models = {
    'kNN': base_knn,
    'kNN-CV': tune_knn,
    'kNN-SVD-Rand': svd_knn,
    'kNN-SVD-Bayes': svd_bayes_knn,
    'NB': nb
}
all_preds = pd.DataFrame()
for name, model in models.items():
    all_preds[name] = model.predict(test['all_text'])
acc = all_preds.apply(lambda ds: accuracy_score(test['IsRecent'], ds))
acc
kNN 0.391806
kNN-CV 0.623836
kNN-SVD-Rand 0.735940
kNN-SVD-Bayes 0.724022
NB 0.783240
dtype: float64
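To get a feel for how large these differences are, the sketch below converts the accuracies printed above into counts of correctly classified test papers. The test-set size of 2685 comes from the support column of the classification reports; the series is retyped by hand here so the snippet stands alone:

```python
import pandas as pd

n_test = 2685  # test-set size, from the classification reports

# Accuracies copied from the output above
acc = pd.Series({
    'kNN': 0.391806,
    'kNN-CV': 0.623836,
    'kNN-SVD-Rand': 0.735940,
    'kNN-SVD-Bayes': 0.724022,
    'NB': 0.783240,
})

# Accuracy times test-set size gives the number of correct predictions
correct = (acc * n_test).round().astype(int)
summary = pd.DataFrame({'accuracy': acc, 'correct': correct})
summary = summary.sort_values('accuracy', ascending=False)
print(summary)

# Gap between the two SVD searches, in papers
gap = correct['kNN-SVD-Rand'] - correct['kNN-SVD-Bayes']
print(gap)
```

Seen this way, the random and Bayesian searches differ by a few dozen test papers, which is small next to the gap between either of them and the untuned k-NN model.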
acc.plot.bar()
plt.ylabel('Accuracy')
plt.show()