Week 13 Demo¶
This exercise involves paper data taken from the HCI Bibliography; in particular, abstracts for papers at CHI (the human-computer interaction conference).
Setup¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
Load Data¶
papers = pd.read_csv('chi-papers.csv', encoding='utf8')
papers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13403 entries, 0 to 13402
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        13292 non-null  object 
 1   year      13370 non-null  float64
 2   title     13370 non-null  object 
 3   keywords  3504 non-null   object 
 4   abstract  12872 non-null  object 
dtypes: float64(1), object(4)
memory usage: 523.7+ KB
Let's treat missing abstracts and titles as empty strings (assigning the result back, since calling fillna with inplace=True on a single column is deprecated in recent pandas):
papers['abstract'] = papers['abstract'].fillna('')
papers['title'] = papers['title'].fillna('')
For some purposes, we want all of the text. Let's combine the title and abstract into a single field:
papers['all_text'] = papers['title'] + ' ' + papers['abstract']
Counting¶
Now that you have this, let's go! Set up a CountVectorizer to tokenize the words and compute counts:
vec = CountVectorizer(encoding='utf8')
You can use the sum method on a sparse matrix to sum up entries. If you sum the columns (specify axis=0), you will get an array of word counts:
mat = vec.fit_transform(papers['abstract'])
mat
<13403x27315 sparse matrix of type '<class 'numpy.int64'>' with 991576 stored elements in Compressed Sparse Row format>
abs_counts = np.array(mat.sum(axis=0)).flatten()
Plot the distribution of the log of word counts.
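One way to sketch this, using a tiny synthetic corpus in place of the abstracts (the actual counts would come from the abs_counts array above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical stand-in documents for illustration
docs = [
    'touch screen input',
    'input devices for screens',
    'screen readers and input',
]
vec = CountVectorizer()
mat = vec.fit_transform(docs)

# sum the columns of the sparse matrix to get per-word counts
counts = np.array(mat.sum(axis=0)).flatten()

# word-frequency distributions are heavily skewed, so plot the log
log_counts = np.log10(counts)
plt.hist(log_counts, bins=10)
plt.xlabel('log10(word count)')
plt.ylabel('number of words')
```

With the real CHI abstracts you would pass abs_counts instead of this toy counts array; the histogram should show the long-tailed shape typical of word frequencies.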
Classifying¶
Train a classifier to predict whether a paper is "recent", where "recent" means it was published in 2000 or later (year >= 2000). Use either Naive Bayes or k-NN.
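A minimal Naive Bayes sketch, using a small synthetic frame in place of the real papers data (the actual exercise would use papers['abstract'] and a held-out test set rather than scoring on the training data):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical stand-in for the papers frame
papers = pd.DataFrame({
    'abstract': ['command line interfaces', 'hypertext systems design',
                 'mobile touch interaction', 'social media and crowdsourcing'] * 5,
    'year': [1985.0, 1992.0, 2008.0, 2014.0] * 5,
})

# the outcome: was the paper published in 2000 or later?
papers['recent'] = papers['year'] >= 2000

# pipeline: tokenize to word counts, then a multinomial Naive Bayes
clf = Pipeline([
    ('tokenize', CountVectorizer()),
    ('nb', MultinomialNB()),
])
clf.fit(papers['abstract'], papers['recent'])

# training accuracy only -- real evaluation needs train/test splitting
acc = clf.score(papers['abstract'], papers['recent'])
```

MultinomialNB pairs naturally with raw word counts; for k-NN you would swap in KNeighborsClassifier, though distances in a high-dimensional sparse count space tend to be less informative.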
Factorizing¶
Compute a TruncatedSVD with 10 components from the combined all_text field:
svd = Pipeline([
('tokenize', CountVectorizer()),
('svd', TruncatedSVD(10))
])
svd_X = svd.fit_transform(papers['all_text'])
What does the pairplot of these dimensions look like? (Week 13 demo notebook is helpful!)
sns.pairplot(pd.DataFrame(svd_X))
<seaborn.axisgrid.PairGrid at 0x1f6b56206d0>