Bibliography Fetch
A couple of the notebooks use the 'CHI Papers' data set. This notebook creates that data set from the original sources.
Setup
Let's import libraries:
import pandas as pd
import re
import html5lib as html
import requests
from tqdm.notebook import tqdm
Fetching the Index
The full list of files in the HCI Bibliography is on the index page. We'll use that to get our file list:
hcibib_root = 'http://hcibib.org'
file_index = requests.get(f'{hcibib_root}/listdir.cgi')
idx_html = html.parse(file_index.text)
Parse the files out of the HTML content itself:
files = {}
bib_re = re.compile(r'^/bibdata/(.*\.bib)')
for link in idx_html.findall('*//{http://www.w3.org/1999/xhtml}a'):
    href = link.get('href')
    m = bib_re.match(href)
    if m:
        files[m.group(1)] = href
len(files)
1988
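To see what the pattern keeps and what it drops, here is a minimal sketch that applies the same regex to a few link targets. The `sample_hrefs` values are hypothetical stand-ins for what the index page might contain; only the `/bibdata/….bib` shape is taken from the code above.

```python
import re

bib_re = re.compile(r'^/bibdata/(.*\.bib)')

# Hypothetical hrefs, similar in shape to links on the index page
sample_hrefs = ['/bibdata/CHI10-1.bib', '/index.html', '/bibdata/README']

found = {}
for href in sample_hrefs:
    m = bib_re.match(href)
    if m:
        # group(1) is the bare file name; the full href is kept as the value
        found[m.group(1)] = href

print(found)
```

Only the first entry matches, so the dictionary maps the bare file name to the path we later append to the site root.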
Decoding Data
Let's get an example file to see what we're dealing with:
ex_path = files['CHI10-1.bib']
ex_file = requests.get(f'{hcibib_root}{ex_path}')
print(ex_file.text[:5000])
%M C.CHI.10.1.1
%T Estimating residual error rate in recognized handwritten documents using artificial error injection
%S EPIC #FAIL
%A Lank, Edward
%A Stedman, Ryan
%A Terry, Michael
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 1-4
%K artificial error, handwriting recognition, residual error
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753328
%X Both handwriting recognition systems and their users are error prone. Handwriting recognizers make recognition errors, and users may miss those errors when verifying output. As a result, it is common for recognized documents to contain residual errors. Unfortunately, in some application domains (e.g. health informatics), tolerance for residual errors in recognized handwriting may be very low, and a desire might exist to maximize user accuracy during verification. In this paper, we present a technique that allows us to measure the performance of a user verifying recognizer output. We inject artificial errors into a set of recognized handwritten forms and show that the rate of injected errors and recognition errors caught is highly correlated in real time. Systems supporting user verification can make use of this measure of user accuracy in a variety of ways. For example, they can force users to slow down or can highlight injected errors that were missed, thus encouraging users to take more care.

%M C.CHI.10.1.5
%T Predicting the cost of error correction in character-based text entry technologies
%S EPIC #FAIL
%A Arif, Ahmed Sabbir
%A Stuerzlinger, Wolfgang
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 5-14
%K cognitive model, error correction, error rate, hand-held devices, mobile phone, performance metric, prediction, text entry
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753329
%X Researchers have developed many models to predict and understand human performance in text entry. Most of the models are specific to a technology or fail to account for human factors and variations in system parameters, and the relationship between them. Moreover, the process of fixing errors and its effects on text entry performance has not been studied. Here, we first analyze real-life text entry error correction behaviors. We then use our findings to develop a new model to predict the cost of error correction for character-based text entry technologies. We validate our model against quantities derived from the literature, as well as with a user study. Our study shows that the predicted and observed cost of error correction correspond well. At the end, we discuss potential applications of our new model.

%M C.CHI.10.1.15
%T SHRIMP: solving collision and out of vocabulary problems in mobile predictive input with motion gesture
%S EPIC #FAIL
%A Wang, Jingtao
%A Zhai, Shumin
%A Canny, John
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 15-24
%K camera phones, dictionary-based disambiguation, gestures, mobile devices, mobile phones, multitap, predictive input, t9, text input
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753330
%X Dictionary-based disambiguation (DBD) is a very popular solution for text entry on mobile phone keypads but suffers from two problems: 1. the resolution of encoding collision (two or more words sharing the same numeric key sequence) and 2. entering out-of-vocabulary (OOV) words. In this paper, we present SHRIMP, a system and method that addresses these two problems by integrating DBD with camera based motion sensing that enables the user to express preference through a tilting or movement gesture. SHRIMP (Small Handheld Rapid Input with Motion and Prediction) runs on camera phones equipped with a standard 12-key keypad. SHRIMP maintains the speed advantage of DBD driven predictive text input while enabling the user to overcome DBD collision and OOV problems seamlessly without even a mode switch. An initial empirical study demonstrates that SHRIMP can be learned very quickly, performed immediately faster than MultiTap and handled OOV words more efficiently than DBD.

%M C.CHI.10.1.25
%T Reactive information foraging for evolving goals
%S Exploratory search
%A Lawrance, Joseph
%A Burnett, Margaret
%A Bellamy, Rachel
%A Bogart, Christopher
%A Swart, Calvin
%B Proceedings of ACM CHI 2010 Conference on Human Factors in Computing Systems
%D 2010-04-10
%V 1
%P 25-34
%K field study, information foraging theory, programming
%* (c) Copyright 2010 ACM
%W http://doi.acm.org/10.1145/1753326.1753332
%X Information foraging models have predicted the navigation paths of people browsing the web and (more recently) of programmers while debugging, but these models do not explicitly model users' goals evolving over time. We present a new information foraging model called PFIS2 that does model information seeking with potentially evolving goals. We then evalua
These files have line prefixes that indicate different fields. Fields can continue across multiple lines. A blank line separates records. So we need to parse this.
We're going to write a function that processes a file to do exactly that. It is going to be a Python generator, so it can be used in a loop but doesn't build a whole list.
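Before the full parser, here is a minimal sketch of the generator idea on its own: it only splits text into blank-line-separated blocks and yields them one at a time. The name `split_records` and the sample text are made up for illustration; the real parser additionally decodes the field codes.

```python
def split_records(text):
    """Yield each blank-line-separated block of lines as a list of strings."""
    block = []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            # a blank line closes the current block
            yield block
            block = []
    # emit a trailing block that was not followed by a blank line
    if block:
        yield block


# Made-up sample in the same general shape as the bibliography files
sample = "%M X.1\n%T First title\n\n%M X.2\n%T Second title\n  continued\n"
for rec in split_records(sample):
    print(rec)
```

Because the function uses `yield`, the loop receives each block as soon as it is complete, and nothing is accumulated unless the caller wraps the call in `list(...)`.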
_c_re = re.compile(r'^%([A-Z*]) (.*)')
_blank_re = re.compile(r'^\s*$')
_bib_codes = {
    'T': 'title',
    'X': 'abstract',
    'A': 'authors',
    'D': 'date',
    'M': 'id'
}
def parse_bib(text):
    bibrec = {}
    last_fld = None
    for line in text.splitlines():
        cm = _c_re.match(line)
        if _blank_re.match(line):
            # end of record, emit
            if bibrec:
                yield bibrec
            bibrec = {}
        elif cm:
            # new field
            code = cm.group(1)
            value = cm.group(2)
            fld = _bib_codes.get(code, None)
            if fld:
                if fld in bibrec:
                    # repeated field (e.g. several authors): join with '; '
                    bibrec[fld] += '; ' + value
                else:
                    bibrec[fld] = value
            # None for codes we skip, so their continuation lines are dropped too
            last_fld = fld
        elif last_fld:
            # text, add to field
            bibrec[last_fld] += ' ' + line
    # if we have an in-progress record, emit it
    if bibrec:
        yield bibrec
ex_recs = list(parse_bib(ex_file.text))
ex_recs[0]
{'id': 'C.CHI.10.1.1', 'title': 'Estimating residual error rate in recognized handwritten documents using artificial error injection', 'authors': 'Lank, Edward; Stedman, Ryan; Terry, Michael', 'date': '2010-04-10', 'abstract': 'Both handwriting recognition systems and their users are error prone. Handwriting recognizers make recognition errors, and users may miss those errors when verifying output. As a result, it is common for recognized documents to contain residual errors. Unfortunately, in some application domains (e.g. health informatics), tolerance for residual errors in recognized handwriting may be very low, and a desire might exist to maximize user accuracy during verification. In this paper, we present a technique that allows us to measure the performance of a user verifying recognizer output. We inject artificial errors into a set of recognized handwritten forms and show that the rate of injected errors and recognition errors caught is highly correlated in real time. Systems supporting user verification can make use of this measure of user accuracy in a variety of ways. For example, they can force users to slow down or can highlight injected errors that were missed, thus encouraging users to take more care.'}
Now we have a file record extracted!
Extracting CHI Papers
Now we want to get all the CHI papers.
Let's define a regex that matches CHI paper file names, which start with CHI followed by at least one digit:
chi_re = re.compile(r'^CHI\d')
And get the CHI files:
chi_files = [k for k in files.keys() if chi_re.match(k)]
chi_files
['CHI16-2.bib', 'CHI16-1.bib', 'CHI15-1.bib', 'CHI15-2.bib', 'CHI02-1.bib', 'CHI14-1.bib', 'CHI14-2.bib', 'CHI04-2.bib', 'CHI05-2.bib', 'CHI07-2.bib', 'CHI11-1.bib', 'CHI13-1.bib', 'CHI13-2.bib', 'CHI12-1.bib', 'CHI10-1.bib', 'CHI83.bib', 'CHI12-2.bib', 'CHI11-2.bib', 'CHI82.bib', 'CHI85.bib', 'CHI86.bib', 'CHI87.bib', 'CHI88.bib', 'CHI89.bib', 'CHI90.bib', 'CHI92a.bib', 'CHI92b.bib', 'CHI93a.bib', 'CHI95-2b.bib', 'CHI95-2c.bib', 'CHI96-2b.bib', 'CHI05-1.bib', 'CHI06-1.bib', 'CHI07-1.bib', 'CHI08-1.bib', 'CHI09-1.bib', 'CHI99-2.bib', 'CHI00-1.bib', 'CHI00-2.bib', 'CHI01-1.bib', 'CHI01-2.bib', 'CHI02-2.bib', 'CHI03-1.bib', 'CHI03-2.bib', 'CHI04-1.bib', 'CHI06-2.bib', 'CHI08-2.bib', 'CHI09-2.bib', 'CHI10-2.bib', 'CHI81.bib', 'CHI91.bib', 'CHI92X.bib', 'CHI92Y.bib', 'CHI93X.bib', 'CHI93Y.bib', 'CHI93b.bib', 'CHI94-1.bib', 'CHI94-2a.bib', 'CHI94-2b.bib', 'CHI94-2c.bib', 'CHI94-2d.bib', 'CHI94-2e.bib', 'CHI95-1.bib', 'CHI95-2a.bib', 'CHI96-1.bib', 'CHI96-2a.bib', 'CHI96-2c.bib', 'CHI97-1.bib', 'CHI97-2a.bib', 'CHI97-2b.bib', 'CHI97-2c.bib', 'CHI98-1.bib', 'CHI98-2a.bib', 'CHI98-2b.bib', 'CHI98-2c.bib', 'CHI98-2d.bib', 'CHI99-1.bib']
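As a quick check of what the pattern selects, the sketch below runs it over a handful of names. `CHI10-1.bib` and `CHI83.bib` appear in the listing above; `CHIPLAY14.bib` and `UIST10.bib` are hypothetical names, used only to show how requiring a digit right after `CHI` would exclude other files that merely start with the same letters.

```python
import re

chi_re = re.compile(r'^CHI\d')

# The first two names come from the listing; the last two are hypothetical
names = ['CHI10-1.bib', 'CHI83.bib', 'CHIPLAY14.bib', 'UIST10.bib']

# re.match anchors at the start, so 'CHI' must be the first three characters
matches = [n for n in names if chi_re.match(n)]
print(matches)
```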
Now we're going to build a list of all the records in all the CHI files:
chi_records = []
for fk in tqdm(chi_files):
    path = files[fk]
    data = requests.get(f'{hcibib_root}{path}')
    for rec in parse_bib(data.text):
        chi_records.append(rec)
len(chi_records)
13422
Now we're going to turn that all into a Pandas data frame:
papers = pd.DataFrame.from_records(chi_records)
papers
| | id | title | authors | date | abstract |
---|---|---|---|---|---|
0 | C.CHI.16.2.1 | TactileVR: Integrating Physical Toys into Lear... | Amores, Judith; Benavides, Xavier; Shapira, Lior | 2016-05-07 | We present TactileVR, an immersive presence an... |
1 | C.CHI.16.2.2 | PsychicVR: Increasing mindfulness by using Vir... | Amores, Judith; Benavides, Xavier; Maes, Pattie | 2016-05-07 | We present PsychicVR, a proof-of-concept syste... |
2 | C.CHI.16.2.3 | Haptic Retargeting Video Showcase: Dynamic Rep... | Azmandian, Mahdi; Hancock, Mark; Benko, Hrvoje... | 2016-05-07 | Manipulating a virtual object with appropriate... |
3 | C.CHI.16.2.4 | Reality Editor | Heun, Valentin; Stern-Rodriguez, Eva; Teyssier... | 2016-05-07 | The Reality Editor is a tool for empowering a ... |
4 | C.CHI.16.2.5 | Access: A Mobile Application to Improve Access... | Yang, Yi; Hu, Yunqi; Hong, Yidi; Joshi, Varun;... | 2016-05-07 | This video introduces Access, a mobile applica... |
... | ... | ... | ... | ... | ... |
13417 | C.CHI.99.1.576 | Mutual Disambiguation of Recognition Errors in... | Oviatt, Sharon | 1999-05-15 | As a new generation of multimodal/media system... |
13418 | C.CHI.99.1.584 | Model-Based and Empirical Evaluation of Multim... | Suhm, Bernhard; Waibel, Alex; Myers, Brad | 1999-05-15 | Our research addresses the problem of error co... |
13419 | C.CHI.99.1.592 | Cooperative Inquiry: Developing New Technologi... | Druin, Allison | 1999-05-15 | In today's homes and schools, children are eme... |
13420 | C.CHI.99.1.600 | Projected Realities: Conceptual Design for Cul... | Gaver, William; Dunne, Anthony | 1999-05-15 | As a part of a European Union sponsored projec... |
13421 | C.CHI.99.1.608 | Customer-Focused Design Data in a Large, Multi... | Curtis, Paula; Heiserman, Tammy; Jobusch, Davi... | 1999-05-15 | Qualitative user-centered design processes suc... |
13422 rows × 5 columns
Next step, we need to extract the years from the dates.
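# regex=True is required: recent pandas versions treat the pattern as a literal string by default
papers['year'] = papers['date'].str.replace(r'^(\d{4}).*', r'\1', regex=True).astype('i4')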
papers
| | id | title | authors | date | abstract | year |
---|---|---|---|---|---|---|
0 | C.CHI.16.2.1 | TactileVR: Integrating Physical Toys into Lear... | Amores, Judith; Benavides, Xavier; Shapira, Lior | 2016-05-07 | We present TactileVR, an immersive presence an... | 2016 |
1 | C.CHI.16.2.2 | PsychicVR: Increasing mindfulness by using Vir... | Amores, Judith; Benavides, Xavier; Maes, Pattie | 2016-05-07 | We present PsychicVR, a proof-of-concept syste... | 2016 |
2 | C.CHI.16.2.3 | Haptic Retargeting Video Showcase: Dynamic Rep... | Azmandian, Mahdi; Hancock, Mark; Benko, Hrvoje... | 2016-05-07 | Manipulating a virtual object with appropriate... | 2016 |
3 | C.CHI.16.2.4 | Reality Editor | Heun, Valentin; Stern-Rodriguez, Eva; Teyssier... | 2016-05-07 | The Reality Editor is a tool for empowering a ... | 2016 |
4 | C.CHI.16.2.5 | Access: A Mobile Application to Improve Access... | Yang, Yi; Hu, Yunqi; Hong, Yidi; Joshi, Varun;... | 2016-05-07 | This video introduces Access, a mobile applica... | 2016 |
... | ... | ... | ... | ... | ... | ... |
13417 | C.CHI.99.1.576 | Mutual Disambiguation of Recognition Errors in... | Oviatt, Sharon | 1999-05-15 | As a new generation of multimodal/media system... | 1999 |
13418 | C.CHI.99.1.584 | Model-Based and Empirical Evaluation of Multim... | Suhm, Bernhard; Waibel, Alex; Myers, Brad | 1999-05-15 | Our research addresses the problem of error co... | 1999 |
13419 | C.CHI.99.1.592 | Cooperative Inquiry: Developing New Technologi... | Druin, Allison | 1999-05-15 | In today's homes and schools, children are eme... | 1999 |
13420 | C.CHI.99.1.600 | Projected Realities: Conceptual Design for Cul... | Gaver, William; Dunne, Anthony | 1999-05-15 | As a part of a European Union sponsored projec... | 1999 |
13421 | C.CHI.99.1.608 | Customer-Focused Design Data in a Large, Multi... | Curtis, Paula; Heiserman, Tammy; Jobusch, Davi... | 1999-05-15 | Qualitative user-centered design processes suc... | 1999 |
13422 rows × 6 columns
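The substitution pattern can be tried on a few date strings with the standard `re` module. This is only a sketch of the regex logic, not of the pandas call, and it assumes every date begins with a four-digit year, as the rows shown here do.

```python
import re

# Date strings taken from the rows displayed in this notebook
dates = ['2010-04-10', '1999-05-15', '2016-05-07']

# Replace the whole string with its first four digits, then convert to int
years = [int(re.sub(r'^(\d{4}).*', r'\1', d)) for d in dates]
print(years)
```

A date that did not start with four digits would pass through the substitution unchanged and then fail the integer conversion, which is a useful signal that the input needs cleaning.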
And we can save this data:
papers.to_csv('chi-papers.csv', index=False)