Clustering Example¶

This example notebook applies k-means clustering to the CHI data from the HCI Bibliography, building on the Week 13 Example.

Setup¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

Load Data¶

papers = pd.read_csv('chi-papers.csv', encoding='utf8')
papers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13403 entries, 0 to 13402
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        13292 non-null  object 
 1   year      13370 non-null  float64
 2   title     13370 non-null  object 
 3   keywords  3504 non-null   object 
 4   abstract  12872 non-null  object 
dtypes: float64(1), object(4)
memory usage: 523.7+ KB

Let’s treat missing abstracts (and titles) as empty strings:

papers['abstract'] = papers['abstract'].fillna('')
papers['title'] = papers['title'].fillna('')

For some purposes, we want all text. Let’s make a field:

papers['all_text'] = papers['title'] + ' ' + papers['abstract']

Raw Clustering¶

Let’s set up a k-means pipeline to make 10 clusters out of our titles and abstracts. We’re also going to limit the term vectors to the 10,000 most common words, to keep the vectors manageable.

cluster_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', max_features=10000)),
    ('cluster', KMeans(10))
])
cluster_pipe.fit(papers['all_text'])
Pipeline(steps=[('vectorize',
                 TfidfVectorizer(max_features=10000, stop_words='english')),
                ('cluster', KMeans(n_clusters=10))])
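
One way to get a rough sense of what each cluster is about is to look at the highest-weight terms in each cluster center. Something like this sketch should work (on older versions of scikit-learn, get_feature_names_out may be spelled get_feature_names):

vectorizer = cluster_pipe.named_steps['vectorize']
kmeans = cluster_pipe.named_steps['cluster']
terms = vectorizer.get_feature_names_out()
for c, center in enumerate(kmeans.cluster_centers_):
    # argsort is ascending, so reverse and take the 10 largest weights
    top = np.argsort(center)[::-1][:10]
    print(c, ', '.join(terms[i] for i in top))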

Now, if we want clusters for all of our papers, we use predict:

paper_clusters = cluster_pipe.predict(papers['all_text'])
sns.countplot(x=paper_clusters)
[bar chart: number of papers assigned to each of the 10 clusters]

We can, for instance, get the titles of papers in cluster 0:

papers.loc[paper_clusters == 0, 'title']
44       The Streamlined Cognitive Walkthrough Method, ...
58          The Social Life of Small Graphical Chat Spaces
200             Interacting with music in a social setting
254      Casablanca: Designing Social Communication Dev...
256                      Social Navigation of Food Recipes
                               ...                        
13194    Making interactions visible: tools for social ...
13216    Trust me, I'm accountable: trust and accountab...
13220                  Counting on community in cyberspace
13221              Social navigation: what is it good for?
13266    Emotional Interfaces for Interactive Aardvarks...
Name: title, Length: 631, dtype: object

This creates a Boolean mask that is True where the cluster number equals 0, and uses it to select those rows and the 'title' column.

I don’t know whether these clusters make much sense, but they are clusters. We aren’t doing anything yet to find the papers most central to each cluster, though.

We can get that with transform, which maps papers into cluster-distance space: each column holds the distance from each paper to that cluster’s center.

paper_cdist = cluster_pipe.transform(papers['all_text'])
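
Each row of paper_cdist is one paper, and each column is its distance to one of the 10 cluster centers. As a sanity check, the nearest center for each paper should line up with the labels we got from predict; a quick sketch:

np.all(np.argmin(paper_cdist, axis=1) == paper_clusters)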

And we can find the papers closest to the center of cluster 0:

closest = np.argsort(paper_cdist[:, 0])[:10]  # argsort is ascending, so the smallest distances come first
papers.iloc[closest]['title']
3082         Anthropomorphism: From Eliza to Terminator 2
5428                                            Shake it!
5374                                        Kick-up menus
2478    Indentation, Documentation and Programmer Comp...
5404                                    eLearning and fun
3189                                         Introduction
2781                             UIMSs: Threat or Menace?
3709                           Pointing without a pointer
5427                    PHOTOVOTE: Olympic judging system
3130                        Toward a More Humane Keyboard
Name: title, dtype: object
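
The same trick works for any cluster, so a short sketch like this would pull the most central titles for every cluster at once:

for c in range(10):
    closest = np.argsort(paper_cdist[:, c])[:5]
    print('Cluster', c)
    print(papers.iloc[closest]['title'])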

We can also look at the clusters in space. t-SNE is a dimensionality-reduction technique designed to produce 2-D layouts that are good for visualization. Let’s compute the t-SNE embedding of our papers:

sne_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', max_features=10000)),
    ('sne', TSNE())
])
paper_sne = sne_pipe.fit_transform(papers['all_text'])
paper_sne
array([[  6.8206224,   0.6565056],
       [ 33.6402   , -15.340711 ],
       [  2.7017617,  13.163247 ],
       ...,
       [-37.581757 ,  -6.7183685],
       [-15.388872 , -37.62566  ],
       [  1.1543125,  -4.1664543]], dtype=float32)
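
t-SNE is stochastic, so the layout will move around from run to run. If we wanted a repeatable picture, we could pin the random seed, something like:

sne_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english', max_features=10000)),
    ('sne', TSNE(random_state=42))  # 42 is an arbitrary seed
])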

Now we can plot:

paper_viz = pd.DataFrame({
    'SNE0': paper_sne[:, 0],
    'SNE1': paper_sne[:, 1],
    'cluster': paper_clusters
})
sns.scatterplot(x='SNE0', y='SNE1', hue='cluster', data=paper_viz)
[scatter plot: t-SNE projection of the papers, colored by k-means cluster]

SVD-based Clusters¶

Let’s cluster in reduced-dimensional space, first compressing the TF-IDF vectors to 25 dimensions with truncated SVD:

svd_cluster_pipe = Pipeline([
    ('vectorize', TfidfVectorizer(stop_words='english')),
    ('svd', TruncatedSVD(25)),
    ('cluster', KMeans(10))
])
paper_svd_clusters = svd_cluster_pipe.fit_predict(papers['all_text'])
sns.countplot(x=paper_svd_clusters)
[bar chart: number of papers assigned to each of the 10 SVD-based clusters]

paper_svd_cdist = svd_cluster_pipe.transform(papers['all_text'])
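
Since we compressed the TF-IDF vectors down to 25 dimensions, it’s worth checking how much of the variance those components actually retain; a quick sketch:

svd_cluster_pipe.named_steps['svd'].explained_variance_ratio_.sum()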

Let’s look at Cluster 0 in this space:

closest = np.argsort(paper_svd_cdist[:, 0])[:10]  # take the ten smallest distances
papers.iloc[closest]['title']
12018                            Visual Interaction Design
6922     Heuristic evaluation for games: usability prin...
6114     Enhancing credibility judgment of web search r...
2770     Video: Data for Studying Human-Computer Intera...
3855     A large scale study of wireless search behavio...
2860                               Search Technology, Inc.
10302    An In-Situ Study of Mobile App &amp; Mobile Se...
12010                               The Design Interaction
2508                Designing the Human-Computer Interface
3705     Older adults and web usability: is web experie...
Name: title, dtype: object

Not sure if that’s better, but it shows the concept.
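
If we want a number for how much the two clusterings agree with each other (which measures agreement, not which is better), we could compute the adjusted Rand index between the two label arrays; a sketch:

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(paper_clusters, paper_svd_clusters)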

Let’s redo the color-coded t-SNE visualization, this time coloring by the SVD-based clusters:

paper_viz = pd.DataFrame({
    'SNE0': paper_sne[:, 0],
    'SNE1': paper_sne[:, 1],
    'cluster': paper_svd_clusters
})
sns.scatterplot(x='SNE0', y='SNE1', hue='cluster', data=paper_viz)
[scatter plot: t-SNE projection of the papers, colored by SVD-based cluster]