Bibliography Fetch¶

A couple of the notebooks use the ‘CHI Papers’ data set. This notebook creates that data set from the original sources.

Setup¶

Let’s import libraries:

import pandas as pd
import re
import html5lib as html
import requests
from tqdm.notebook import tqdm

Fetching the Index¶

The full list of files in the HCI Bibliography is on the index page. We’ll use that to get our file list:

hcibib_root = 'http://hcibib.org'
file_index = requests.get(f'{hcibib_root}/listdir.cgi')
idx_html = html.parse(file_index.text)

Now let's parse the .bib file links out of the HTML:

files = {}
bib_re = re.compile(r'^/bibdata/(.*\.bib)')
for link in idx_html.findall('*//{http://www.w3.org/1999/xhtml}a'):
    href = link.get('href')
    m = bib_re.match(href)
    if m:
        files[m.group(1)] = href
len(files)
1988
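
As a quick sanity check, we can peek at a few of the entries we just collected (a small illustration; this output is not part of the original notebook):

# Show a few (file name, path) pairs to confirm the links parsed correctly
list(files.items())[:3]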

Decoding Data¶

Let’s get an example file to see what we’re dealing with:

ex_path = files['CHI10-1.bib']
ex_file = requests.get(f'{hcibib_root}{ex_path}')
print(ex_file.text[:5000])
%M C.CHI.10.1.1
%T Estimating residual error rate in recognized handwritten documents using
artificial error injection
%S EPIC #FAIL
%A Lank, Edward
%A Stedman, Ryan
%A Terry, Michael
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 1-4
%K artificial error, handwriting recognition, residual error
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753328
%X Both handwriting recognition systems and their users are error prone.
Handwriting recognizers make recognition errors, and users may miss those
errors when verifying output. As a result, it is common for recognized
documents to contain residual errors. Unfortunately, in some application
domains (e.g. health informatics), tolerance for residual errors in recognized
handwriting may be very low, and a desire might exist to maximize user accuracy
during verification. In this paper, we present a technique that allows us to
measure the performance of a user verifying recognizer output. We inject
artificial errors into a set of recognized handwritten forms and show that the
rate of injected errors and recognition errors caught is highly correlated in
real time. Systems supporting user verification can make use of this measure of
user accuracy in a variety of ways. For example, they can force users to slow
down or can highlight injected errors that were missed, thus encouraging users
to take more care.

%M C.CHI.10.1.5
%T Predicting the cost of error correction in character-based text entry
technologies
%S EPIC #FAIL
%A Arif, Ahmed Sabbir
%A Stuerzlinger, Wolfgang
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 5-14
%K cognitive model, error correction, error rate, hand-held devices, mobile
phone, performance metric, prediction, text entry
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753329
%X Researchers have developed many models to predict and understand human
performance in text entry. Most of the models are specific to a technology or
fail to account for human factors and variations in system parameters, and the
relationship between them. Moreover, the process of fixing errors and its
effects on text entry performance has not been studied. Here, we first analyze
real-life text entry error correction behaviors. We then use our findings to
develop a new model to predict the cost of error correction for character-based
text entry technologies. We validate our model against quantities derived from
the literature, as well as with a user study. Our study shows that the
predicted and observed cost of error correction correspond well. At the end, we
discuss potential applications of our new model.

%M C.CHI.10.1.15
%T SHRIMP: solving collision and out of vocabulary problems in mobile
predictive input with motion gesture
%S EPIC #FAIL
%A Wang, Jingtao
%A Zhai, Shumin
%A Canny, John
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 15-24
%K camera phones, dictionary-based disambiguation, gestures, mobile devices,
mobile phones, multitap, predictive input, t9, text input
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753330
%X Dictionary-based disambiguation (DBD) is a very popular solution for text
entry on mobile phone keypads but suffers from two problems: 1. the resolution
of encoding collision (two or more words sharing the same numeric key sequence)
and 2. entering out-of-vocabulary (OOV) words. In this paper, we present
SHRIMP, a system and method that addresses these two problems by integrating
DBD with camera based motion sensing that enables the user to express
preference through a tilting or movement gesture. SHRIMP (Small Handheld Rapid
Input with Motion and Prediction) runs on camera phones equipped with a
standard 12-key keypad. SHRIMP maintains the speed advantage of DBD driven
predictive text input while enabling the user to overcome DBD collision and OOV
problems seamlessly without even a mode switch. An initial empirical study
demonstrates that SHRIMP can be learned very quickly, performed immediately
faster than MultiTap and handled OOV words more efficiently than DBD.

%M C.CHI.10.1.25
%T Reactive information foraging for evolving goals
%S Exploratory search
%A Lawrance, Joseph
%A Burnett, Margaret
%A Bellamy, Rachel
%A Bogart, Christopher
%A Swart, Calvin
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 25-34
%K field study, information foraging theory, programming
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753332
%X Information foraging models have predicted the navigation paths of people
browsing the web and (more recently) of programmers while debugging, but these
models do not explicitly model users' goals evolving over time. We present a
new information foraging model called PFIS2 that does model information seeking
with potentially evolving goals. We then evalua

Each line in these files starts with a prefix code that identifies a field, a field can continue across multiple lines, and a blank line separates records. So we need to parse this format ourselves.

We’re going to write a function that processes a file to do exactly that. It will be a Python generator, so it can be used in a loop without building the whole list in memory.

_c_re = re.compile(r'^%([A-Z*]) (.*)')
_blank_re = re.compile(r'^\s*$')
_bib_codes = {
    'T': 'title',
    'X': 'abstract',
    'A': 'authors',
    'D': 'date',
    'M': 'id'
}
def parse_bib(text):
    bibrec = {}
    last_fld = None
    for line in text.splitlines():
        cm = _c_re.match(line)
        if _blank_re.match(line):
            # end of record, emit
            if bibrec:
                yield bibrec
            bibrec = {}
        elif cm:
            # new field
            code = cm.group(1)
            value = cm.group(2)
            fld = _bib_codes.get(code, None)
            if fld:
                if fld in bibrec:
                    bibrec[fld] += '; ' + value
                else:
                    bibrec[fld] = value
            last_fld = fld
        elif last_fld:
            # text, add to field
            bibrec[last_fld] += ' ' + line
            
    # if we have an in-progress record, emit it
    if bibrec:
        yield bibrec
ex_recs = list(parse_bib(ex_file.text))
ex_recs[0]
{'id': 'C.CHI.10.1.1',
 'title': 'Estimating residual error rate in recognized handwritten documents using artificial error injection',
 'authors': 'Lank, Edward; Stedman, Ryan; Terry, Michael',
 'date': '2010-04-10',
 'abstract': 'Both handwriting recognition systems and their users are error prone. Handwriting recognizers make recognition errors, and users may miss those errors when verifying output. As a result, it is common for recognized documents to contain residual errors. Unfortunately, in some application domains (e.g. health informatics), tolerance for residual errors in recognized handwriting may be very low, and a desire might exist to maximize user accuracy during verification. In this paper, we present a technique that allows us to measure the performance of a user verifying recognizer output. We inject artificial errors into a set of recognized handwritten forms and show that the rate of injected errors and recognition errors caught is highly correlated in real time. Systems supporting user verification can make use of this measure of user accuracy in a variety of ways. For example, they can force users to slow down or can highlight injected errors that were missed, thus encouraging users to take more care.'}

Now we have a file record extracted!
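
To convince ourselves that the continuation-line and blank-line rules behave as described, we can run a tiny hand-written example (not a real bibliography record) through parse_bib:

# A made-up two-record string: the title wraps onto a second line, the two
# %A fields should be joined with '; ', and the blank line ends the record.
_demo = '''%M C.TEST.1
%T A title that wraps
onto a second line
%A Author, First
%A Author, Second
%D 2020-01-01

%M C.TEST.2
%T Another record
'''
list(parse_bib(_demo))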

Extracting CHI Papers¶

Now we want to get all the CHI papers.

Let’s define a regex that matches a CHI paper file name, CHI followed by at least one digit:

chi_re = re.compile(r'^CHI\d')
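
The trailing \d matters: it keeps the CHI proceedings files while skipping any other file whose name merely starts with CHI. A quick illustration with hypothetical file names:

# Hypothetical names -- only those with a digit right after 'CHI' match
for name in ['CHI99-1.bib', 'CHI10-2.bib', 'CHIMIT07.bib', 'CHIletters.bib']:
    print(name, bool(chi_re.match(name)))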

And get the CHI files:

chi_files = [k for k in files.keys() if chi_re.match(k)]
chi_files
['CHI16-2.bib',
 'CHI16-1.bib',
 'CHI15-1.bib',
 'CHI15-2.bib',
 'CHI02-1.bib',
 'CHI14-1.bib',
 'CHI14-2.bib',
 'CHI04-2.bib',
 'CHI05-2.bib',
 'CHI07-2.bib',
 'CHI11-1.bib',
 'CHI13-1.bib',
 'CHI13-2.bib',
 'CHI12-1.bib',
 'CHI10-1.bib',
 'CHI83.bib',
 'CHI12-2.bib',
 'CHI11-2.bib',
 'CHI82.bib',
 'CHI85.bib',
 'CHI86.bib',
 'CHI87.bib',
 'CHI88.bib',
 'CHI89.bib',
 'CHI90.bib',
 'CHI92a.bib',
 'CHI92b.bib',
 'CHI93a.bib',
 'CHI95-2b.bib',
 'CHI95-2c.bib',
 'CHI96-2b.bib',
 'CHI05-1.bib',
 'CHI06-1.bib',
 'CHI07-1.bib',
 'CHI08-1.bib',
 'CHI09-1.bib',
 'CHI99-2.bib',
 'CHI00-1.bib',
 'CHI00-2.bib',
 'CHI01-1.bib',
 'CHI01-2.bib',
 'CHI02-2.bib',
 'CHI03-1.bib',
 'CHI03-2.bib',
 'CHI04-1.bib',
 'CHI06-2.bib',
 'CHI08-2.bib',
 'CHI09-2.bib',
 'CHI10-2.bib',
 'CHI81.bib',
 'CHI91.bib',
 'CHI92X.bib',
 'CHI92Y.bib',
 'CHI93X.bib',
 'CHI93Y.bib',
 'CHI93b.bib',
 'CHI94-1.bib',
 'CHI94-2a.bib',
 'CHI94-2b.bib',
 'CHI94-2c.bib',
 'CHI94-2d.bib',
 'CHI94-2e.bib',
 'CHI95-1.bib',
 'CHI95-2a.bib',
 'CHI96-1.bib',
 'CHI96-2a.bib',
 'CHI96-2c.bib',
 'CHI97-1.bib',
 'CHI97-2a.bib',
 'CHI97-2b.bib',
 'CHI97-2c.bib',
 'CHI98-1.bib',
 'CHI98-2a.bib',
 'CHI98-2b.bib',
 'CHI98-2c.bib',
 'CHI98-2d.bib',
 'CHI99-1.bib']

Now we’re going to build a list of all the records in all the CHI files:

chi_records = []
for fk in tqdm(chi_files):
    path = files[fk]
    data = requests.get(f'{hcibib_root}{path}')
    for rec in parse_bib(data.text):
        chi_records.append(rec)
len(chi_records)

13422
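
If the downloads are slow, one option is to reuse a single HTTP connection with a requests.Session. This is a sketch of an equivalent loop, not what the original run used:

# Alternative download loop that reuses one connection across files
session = requests.Session()
chi_records = []
for fk in tqdm(chi_files):
    data = session.get(f'{hcibib_root}{files[fk]}')
    for rec in parse_bib(data.text):
        chi_records.append(rec)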

Now we’re going to turn all of that into a Pandas data frame:

papers = pd.DataFrame.from_records(chi_records)
papers
id title authors date abstract
0 C.CHI.16.2.1 TactileVR: Integrating Physical Toys into Lear... Amores, Judith; Benavides, Xavier; Shapira, Lior 2016-05-07 We present TactileVR, an immersive presence an...
1 C.CHI.16.2.2 PsychicVR: Increasing mindfulness by using Vir... Amores, Judith; Benavides, Xavier; Maes, Pattie 2016-05-07 We present PsychicVR, a proof-of-concept syste...
2 C.CHI.16.2.3 Haptic Retargeting Video Showcase: Dynamic Rep... Azmandian, Mahdi; Hancock, Mark; Benko, Hrvoje... 2016-05-07 Manipulating a virtual object with appropriate...
3 C.CHI.16.2.4 Reality Editor Heun, Valentin; Stern-Rodriguez, Eva; Teyssier... 2016-05-07 The Reality Editor is a tool for empowering a ...
4 C.CHI.16.2.5 Access: A Mobile Application to Improve Access... Yang, Yi; Hu, Yunqi; Hong, Yidi; Joshi, Varun;... 2016-05-07 This video introduces Access, a mobile applica...
... ... ... ... ... ...
13417 C.CHI.99.1.576 Mutual Disambiguation of Recognition Errors in... Oviatt, Sharon 1999-05-15 As a new generation of multimodal/media system...
13418 C.CHI.99.1.584 Model-Based and Empirical Evaluation of Multim... Suhm, Bernhard; Waibel, Alex; Myers, Brad 1999-05-15 Our research addresses the problem of error co...
13419 C.CHI.99.1.592 Cooperative Inquiry: Developing New Technologi... Druin, Allison 1999-05-15 In today's homes and schools, children are eme...
13420 C.CHI.99.1.600 Projected Realities: Conceptual Design for Cul... Gaver, William; Dunne, Anthony 1999-05-15 As a part of a European Union sponsored projec...
13421 C.CHI.99.1.608 Customer-Focused Design Data in a Large, Multi... Curtis, Paula; Heiserman, Tammy; Jobusch, Davi... 1999-05-15 Qualitative user-centered design processes suc...

13422 rows × 5 columns

Next, we need to extract the year from each date:

papers['year'] = papers['date'].str.replace(r'^(\d{4}).*', r'\1', regex=True).astype('i4')
papers
id title authors date abstract year
0 C.CHI.16.2.1 TactileVR: Integrating Physical Toys into Lear... Amores, Judith; Benavides, Xavier; Shapira, Lior 2016-05-07 We present TactileVR, an immersive presence an... 2016
1 C.CHI.16.2.2 PsychicVR: Increasing mindfulness by using Vir... Amores, Judith; Benavides, Xavier; Maes, Pattie 2016-05-07 We present PsychicVR, a proof-of-concept syste... 2016
2 C.CHI.16.2.3 Haptic Retargeting Video Showcase: Dynamic Rep... Azmandian, Mahdi; Hancock, Mark; Benko, Hrvoje... 2016-05-07 Manipulating a virtual object with appropriate... 2016
3 C.CHI.16.2.4 Reality Editor Heun, Valentin; Stern-Rodriguez, Eva; Teyssier... 2016-05-07 The Reality Editor is a tool for empowering a ... 2016
4 C.CHI.16.2.5 Access: A Mobile Application to Improve Access... Yang, Yi; Hu, Yunqi; Hong, Yidi; Joshi, Varun;... 2016-05-07 This video introduces Access, a mobile applica... 2016
... ... ... ... ... ... ...
13417 C.CHI.99.1.576 Mutual Disambiguation of Recognition Errors in... Oviatt, Sharon 1999-05-15 As a new generation of multimodal/media system... 1999
13418 C.CHI.99.1.584 Model-Based and Empirical Evaluation of Multim... Suhm, Bernhard; Waibel, Alex; Myers, Brad 1999-05-15 Our research addresses the problem of error co... 1999
13419 C.CHI.99.1.592 Cooperative Inquiry: Developing New Technologi... Druin, Allison 1999-05-15 In today's homes and schools, children are eme... 1999
13420 C.CHI.99.1.600 Projected Realities: Conceptual Design for Cul... Gaver, William; Dunne, Anthony 1999-05-15 As a part of a European Union sponsored projec... 1999
13421 C.CHI.99.1.608 Customer-Focused Design Data in a Large, Multi... Curtis, Paula; Heiserman, Tammy; Jobusch, Davi... 1999-05-15 Qualitative user-centered design processes suc... 1999

13422 rows × 6 columns
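
Since the dates are all in YYYY-MM-DD form, an equivalent and arguably simpler approach would be to slice off the first four characters (a sketch; the replace-based version above is what produced the column shown):

# Equivalent year extraction by slicing the first four characters
papers['year'] = papers['date'].str.slice(0, 4).astype('i4')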

And we can save this data:

papers.to_csv('chi-papers.csv', index=False)
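
If we want to make sure the file is intact, we can read it back and compare row counts (an optional sanity check, not part of the original notebook):

# Re-read the CSV and confirm it has the same number of rows
check = pd.read_csv('chi-papers.csv')
assert len(check) == len(papers)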