Week 7 — Getting Data (10/3–7)
🧐 Content Overview
This week has 1h58m of video and 13,200 words of assigned readings. This week’s videos are available in a Panopto folder.
The long video does not require as careful study as the rest; it is here so you can see a worked, real-world example, and I will not expect you to reconstruct its various details.
🎥 Introduction
What are we talking about this week? I also discuss general principles that will drive the week’s material.
🎥 Finding Data
Where do we go to find data?
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
FINDING DATA
Learning Outcomes
Identify potential sources for data
Think about the data needed to answer a question
Photo by elCarito on Unsplash
Questions
What do you want to learn?
Need for data flows from what you want to learn.
Sometimes in practice we will start with data, but that shouldn’t be our default.
Example
Q: Does poverty correlate with adverse health outcomes?
Need:
Wealth/poverty data
Census
Health data
Public health agencies
Global Health Data Exchange
Locating Existing Data
Lists of data sets
UCI Machine Learning Repository
Various lists on GitHub
Governmental operations
US federal: data.gov
Individual government agencies
Government data portals
Your organization
Searching on the Web
Purchasing
Asking data owners
Scraping from web sites
Large repositories
Common Crawl
Semantic Scholar
Seeing what other papers use
Wrapping Up
We sometimes need to be creative when locating data for a problem.
There is a lot of data available, but our question should drive our data.
Photo by Garrett Sears on Unsplash
🎥 Integrating Data
This video talks about the key ideas of integrating multiple data sources.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
INTEGRATING DATA
Learning Outcomes
Combine two data sources
Identify the kinds of data useful to make a link
Photo by Sandy Millar on Unsplash
Data Integration
Types of Integration
Linking records – matching records in one set with another
Best case: we have a linking identifier shared between data sets.
Pooling records — taking records of the same kind from different sources
Convert each into common format, and stack!
Identifiers?
XKCD 927: Standards. © Randall Munroe. CC-BY-NC 2.5. Modified by me.
Example: Linking US Geopolitical Data
State name (unique, fine)
Postal code (2-character state abbreviation, unique, also fine)
FIPS code (Federal Information Processing Standards)
States and counties!
Withdrawn but still in use
Great when you have them!
ZIP codes
Legislative districts
Census tracts
Geographic position (lat, long) (ugggh)
Linking Challenges
Corrupt identifiers
Clean and correct them
Duplicate identifiers
Measure frequency of occurrence, try to measure impact
Missing identifiers
Find alternate linking strategies
Alternate linking strategies
Names?
Often not unique
Often take different forms
Locations?
Require complex geographic matching
Or address matching / normalization
Linking takes creativity and care
Recommended Approach
Take a conservative, incremental approach
Prefer identifiers
Minimize munging
Look at unlinked records, see what rules need to be added
Measure success of every linking step
Successfully-linked records
Unlinked records
Sometimes: use rate of different link strategies
Document and justify
Technical Pieces (for linking with Pandas)
Cleaning up individual columns
Series operations (esp. string ops!)
Merge data frames
pd.merge or pd.DataFrame.join (for linking records)
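For instance, a left merge with indicator=True both links the records and measures the success of the linking step. A minimal sketch, with hypothetical frames and column names:

```python
import pandas as pd

# Hypothetical data: ratings keyed by ISBN, and a table mapping ISBNs to clusters
ratings = pd.DataFrame({'user': [1, 2], 'isbn': ['9780306406157', '9999999999999']})
clusters = pd.DataFrame({'isbn': ['9780306406157'], 'cluster': [42]})

# A left merge keeps unlinked records; indicator=True records each link outcome
linked = ratings.merge(clusters, on='isbn', how='left', indicator=True)
linked['_merge'].value_counts()   # 'both' = linked, 'left_only' = unlinked
```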
Pooling Records
Convert into common structure
Stack on top of each other
Sometimes: de-duplicate
Usually good to keep a field identifying record source.
pd.concat is your friend
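A minimal pooling sketch, assuming both sources have already been converted into a common (user, item, rating) structure:

```python
import pandas as pd

# Made-up rating frames from two sources, with a field identifying the source
bx = pd.DataFrame({'user': [1], 'item': ['b1'], 'rating': [8]}).assign(source='BX')
az = pd.DataFrame({'user': [9], 'item': ['b1'], 'rating': [4]}).assign(source='AZ')

pooled = pd.concat([bx, az], ignore_index=True)   # stack on top of each other
pooled = pooled.drop_duplicates()                 # sometimes: de-duplicate
```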
Wrapping Up
We often need to combine data from multiple sources; sometimes linking, sometimes pooling.
Linking identifiers make this easy (sometimes).
We don’t always have them.
Photo by Joshua Hoehne on Unsplash
🎥 Values and Types
This video discusses how to deal with and clean up various data types.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
CLEANING VALUES
Learning Outcomes
Clean up individual values or columns
Identify useful Pandas operations for cleaning values
Photo by Daiga Ellaby on Unsplash
Types of Operations
Convert data types
Standardize data codes
Remove or clean corrupt data
Fill missing data (with care)
Basic Data Type Conversion
.astype()
Converts data from one type to another
Parses strings w/ simple rules
df['column'].astype('i4')
Common NumPy data types:
Integer: i1, i2, i4, i8
Unsigned: u1, u2, u4, u8
Float: f4, f8
Also have bit-based sizes:
i4 = int32
f8 = float64 (double-precision)
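For example:

```python
import pandas as pd

s = pd.Series(['3', '14', '15'])
s.astype('i4')                   # parses the strings into 32-bit integers
s.astype('int32')                # equivalent: i4 == int32
pd.Series([2.5]).astype('f8')    # f8 == float64 (double precision)
```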
Parsing Strings
pd.to_datetime — parse strings to date/time objects
Can specify format!
pd.to_timedelta — parse strings to time differences
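For example (Pandas can often infer common formats, but specifying one is faster and stricter):

```python
import pandas as pd

pd.to_datetime(pd.Series(['2022-10-03', '2022-10-07']), format='%Y-%m-%d')
pd.to_timedelta(pd.Series(['1 days', '2 hours']))
```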
Making Categoricals
Categorical variables save memory and can improve presentation.
String: series.astype('category')
Integer: pd.Categorical
After initialization: rename/reorder etc.
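A small sketch of both paths, with illustrative category names:

```python
import pandas as pd

s = pd.Series(['M', 'F', 'M']).astype('category')            # from strings
c = pd.Categorical.from_codes([0, 1, 0], ['M', 'F'])         # from integer codes
s = s.cat.rename_categories({'M': 'Male', 'F': 'Female'})    # after initialization
```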
Standardizing Data
Normalize missing data
String encoding like ‘NA’?
Numeric sentinel values like -999?
Reassign to NA
Unify case (upper, lower, title, casefold) [series.str.upper]
Replace substrings [series.str.replace]
Trim whitespace [strip/rstrip/lstrip]
Rename codes [cat.rename_categories]
Merge codes [reassign, then cat.remove_unused_categories]
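A minimal sketch of some of these operations on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([' boise ', 'NA', 'BOISE'])
s = s.str.strip()                  # trim whitespace
s = s.replace('NA', np.nan)        # string-coded missing value -> NA
s = s.str.title()                  # unify case

nums = pd.Series([12, -999, 7])
nums = nums.replace(-999, np.nan)  # numeric sentinel value -> NA
```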
Cleaning Data
Strings are often corrupt – excess characters, etc.
Drop leading/trailing whitespace [strip and friends]
Match with regular expressions
Expression to match expected data & keep
Expression to match invalid data & delete
series.str.replace(regex, replace)
Extract fixed character positions (columns) [series.str.slice]
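For example, with an expression that matches invalid characters and deletes them (a made-up cleanup, not a full ISBN parser):

```python
import pandas as pd

s = pd.Series(['ISBN 0-8450-3481-6 :', '  0845348205 (pbk.)'])
s = s.str.strip()                                 # drop leading/trailing whitespace
s = s.str.replace(r'[^0-9Xx]', '', regex=True)    # delete anything that is not an ISBN character
```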
Clearing Data
Sometimes values are unrecoverably corrupt
Delete value (replace with NA or INVALID code)
May separate UNKNOWN from INVALID
Or just use one UNKNOWN code
Depends on question – I often separate early, combine later
Delete record (if unusable)
Don’t delete from underlying files – in memory, or in new files
Pandas Features to Study
Series methods
Top-level functions (e.g. pd.to_datetime)
Category methods under ‘.cat’
String methods under ‘.str’
Date-time methods under ‘.dt’
Wrapping Up
Data is messy.
Pandas gives us a number of tools for working with individual values or columns.
Photo by Ashim D’Silva on Unsplash
Resources
In addition to the next reading, you may find these useful:
📃 Pandas Text Operations
Read Working with Text Data.
🎥 Ethical Issues in Data
This video provides a very brief overview of some of the ethical issues in data collection and use.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
HUMAN SUBJECTS ETHICS
Learning Outcomes
Know the pillars of classical human subjects research ethics
Apply these principles to data science applications
Know when to get IRB approval
Understand boundaries and some concerns not covered by classic ethics
Photo by Joel & Jasmin Førestbird on Unsplash
Principles of Human Subjects Ethics
Respect for Persons
Individuals should be treated as autonomous agents
Persons with diminished autonomy are entitled to protection
Beneficence
Avoid harm, provide benefit
Justice
How are costs and benefits of research distributed?
See the Belmont Report
Informed Consent
Research subjects should
Be informed about the research
In terms they can understand
So they can voluntarily choose whether to participate
Photo by Philipp Wüthrich on Unsplash
Compliance
Ethics is about doing the right thing
Compliance is about following rules and regulations
Some rules are to enforce a baseline of ethical behavior, but compliance is not ethics.
Regulation in US Federal Funding
Common Rule regulates federally-funded human subjects research
Organizations must have an Institutional Review Board (IRB)
IRB must approve human subjects research
Uses Belmont principles
Reviews consent procedures, analysis, etc.
Weighs risks and benefits
IRB approval does not mean something is ethical.
Human Subjects
U.S. Common Rule:
Human subject means a living individual about whom an investigator (whether professional or student) conducting research:
(i) Obtains information or biospecimens through intervention or interaction with the individual, and uses, studies, or analyzes the information or biospecimens; or
(ii) Obtains, uses, studies, analyzes, or generates identifiable private information or identifiable biospecimens.
Additional Relevant Concerns
Data protection regulation, law, and rights
EU GDPR
California and many other jurisdictions have similar laws
Principles are broader than implementation
Contextual integrity — people may not expect data to be reused
Anti-discrimination (both law and principle)
Further development of respect and justice concerns
Pushing further on racial and gender justice
Wrapping Up
Our work has ethical implications.
It is our responsibility as scientists and as human beings to work through the ethics of our work and be accountable.
Rules and regulations can provide some guidance.
Photo by Tiago Felipe Ferreira on Unsplash
🚩 Week 7 Quiz
Take the Week 7 quiz in Canvas.
🎥 A Real Example
This video describes the data cleaning and integration in a real example from my own research group.
I am providing it so you can see the principles in this week’s material applied to an actual problem; details of this specific data set will not be on exams.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
A REAL EXAMPLE
Integrating Book Data
Learning Outcomes
See some of the data cleaning and integration decisions in a real data science project
Photo by Sharon McCutcheon on Unsplash
Question
Are book authors of different genders well-represented in user ratings and reviews on online book platforms?
Data Requirements
Books
“Easy” public data:
Library of Congress Open MARC Data (10.5M records)
OpenLibrary (26.7M records)
GoodReads (crawled by UCSD; limited data)
MARC
Standard format for book and author data
Stores data in fields and subfields with numeric tags
Available in XML format
Same structure and tags
Easier to parse w/ standard tools
Still must know MARC fields!
Leader/00-23 *****nam##22*****#a#4500
001 <control number>
003 <control number identifier>
005 19920331092212.7
007/00-01 ta
008/00-39 820305s1991####nyu###########001#0#eng##
020 ##$a0845348116 :$c$29.95 (£19.50 U.K.)
020 ##$a0845348205 (pbk.)
040 ##$a[organization code]$c[organization code]
050 14$aPN1992.8.S4$bT47 1991
082 04$a791.45/75/0973$219
100 1#$aTerrace, Vincent,$d1948-
245 10$aFifty years of television :$ba guide to series and pilots, 1937-1988 /$cVincent Terrace.
246 1#$a50 years of television
260 ##$aNew York :$bCornwall Books,$cc1991.
300 ##$a864 p. ;$c24 cm.
500 ##$aIncludes index.
650 #0$aTelevision pilot programs$zUnited States$vCatalogs.
650 #0$aTelevision serials$zUnited States$vCatalogs.
What is a Book?
LoC catalogs many things:
Books
Periodicals
Government records
Music
Films
A “book” is:
Textual (MARC format ‘t’)
Also included ‘a’ (Map) to catch atlases
Not a government record
International Standard Book Number (ISBN)
Identifies books across publishers
Defined to identify a single edition of a book
Come in 10- and 13-digit formats
13-digit is variant of EAN
Can convert 10 to 13 (see the sketch after this list)
Problems:
Many formats, surrounding text, etc.
Corrupt ISBNs
Extra digits
Missing digits
Incorrect check digit
Non-ISBNs in ISBN field
e.g. “$15”
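As referenced above, converting ISBN-10 to ISBN-13 is mechanical: drop the ISBN-10 check digit, prepend the 978 EAN prefix, and recompute the EAN-13 check digit. A minimal sketch:

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    core = '978' + isbn10[:9]    # drop the ISBN-10 check digit, add the EAN prefix
    # EAN-13 check digit: weight the digits 1, 3, 1, 3, ...
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

isbn10_to_isbn13('0306406152')   # -> '9780306406157'
```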
Other Identifiers
Library of Congress Control Number (LCCN)
OpenLibrary edition & work IDs
GoodReads book & work IDs
Our own table primary keys
Amazon Standard Identification Number (ASIN)
Define unified book code – number identifying a book record
Real “ISBNs”
0836932722
013027190X (v. 2)
8719359022. ISBN 8719359004 (pbk.)
2890401448 : $5.00
$10.00
978-963-87472-2-8
0981482295 (978-0-9814822-9-3 : alk. paper)
0877030863 lib. bdg. (from label mounted on verso of t.p. correcting printed ISBN) (v. 1)
Cleaning ISBNs
Process ISBN fields:
Extract things that look like ISBNs
Digits, with allowable separators and terminating X
Long enough to be an ISBN
Extract “tags” (e.g. ‘pbk’)
Define allowable prefixes, postfixes, and separators
Clean junk and separators from inside ISBNs to standardize
Recursive descent parser driven by regular expressions (CS 361 & 354!)
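The real parser is considerably more involved, but a simplified regex-based sketch of the extract-and-standardize steps looks like this:

```python
import re

# Simplified stand-in for the regex-driven parser: find digit runs with
# allowable separators and a terminating digit or X, then strip separators.
ISBN_RE = re.compile(r'[0-9][0-9 -]{8,15}[0-9Xx]')

def extract_isbns(field: str) -> list:
    isbns = []
    for match in ISBN_RE.findall(field):
        digits = re.sub(r'[ -]', '', match)    # clean separators from inside
        if len(digits) in (10, 13):            # long enough to be an ISBN
            isbns.append(digits.upper())
    return isbns

extract_isbns('2890401448 : $5.00')    # -> ['2890401448']
extract_isbns('978-963-87472-2-8')     # -> ['9789638747228']
```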
Development Process
Start with simple rules
Manually verify some matches
Look for patterns not yet matched
Add to matcher!
Make ignore patterns for known-unparsable patterns
Iterate until everything is matched, ignored, or unusable
OpenLibrary
JSON – much easier to work with
Shares many ISBN problems – used same logic
What is a Book?
Books have multiple editions
Different publishers
Different ISBNs
Ratings / reviews usually attached to a single edition
OpenLibrary, GoodReads group into “work”
Not complete
Some records more complete than others
Best to recommend works
Grouping by Connected Components
[Diagram: works, books, and ISBNs drawn as a graph, with edges between records that share identifiers]
Each connected component gets a cluster identifier
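A minimal union-find sketch of this clustering, with made-up identifiers (the project’s actual implementation differs); each record contributes edges between the identifiers it mentions, and the root of each component serves as the cluster ID:

```python
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# made-up edges: identifiers that co-occur on the same record
edges = [('gr:123', 'isbn:9780306406157'), ('ol:OL45W', 'isbn:9780306406157')]
for a, b in edges:
    union(a, b)

clusters = {x: find(x) for x in parent}   # every member of a component shares a root
```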
Problems with Grouping
Miss groups when we don’t have records to link them
ISBN that never appears in any book data
Over-grouping when ISBNs are mislinked
ISBN erroneously recorded for wrong book
Two books’ ISBNs corrupted to same value
ISBN reuse
Combined volumes
Not yet good solutions – documented as limitation
Integration So Far
Book data from 3 sources
ISBNs (and ASINs) extracted
Book records mapped into clusters
ISBN ⇔ cluster mapping (what ISBNs are in a cluster?)
This is (complex) pooling, but w/o standardization
Next: linking!
Book Ratings and Reviews
Collected from online platforms:
BookCrossing [BX] (collected from an old book rating site, often used for recommender systems research)
Amazon [AZ] (public ratings and reviews, collected by UCSD in 2014)
GoodReads [GR] (ratings, reviews, and add-to-shelf actions by users with public profiles, collected by UCSD in 2017)
BookCrossing and Amazon Linking
BX: user ID, ISBN, rating
ISBNs can be corrupt – remove invalid chars
Link to book clusters
AZ: user ID, ASIN, rating, review text
Treat as ISBNs
Link to book clusters
GoodReads Linking
GoodReads contains:
User adds book to shelf (user ID, book ID, shelf ID, timestamp)
Shelf IDs not meaningful
“Is Read” flag for adding to the Read shelf
5-star rating (accompanying some add-to-shelf actions)
Review text (with user ID, item ID, timestamp, rating)
All have GoodReads book IDs – directly link to cluster
Integration So Far
[Diagram: a book cluster linking LoC book records, OpenLibrary edition and work records, GoodReads book and work records, BookCrossing ratings, Amazon ratings and reviews, and GoodReads ratings, adds, and reviews]
Author Data
Book data includes list of authors
OpenLibrary & GoodReads have author entries
Linked identifiers
But we want author gender — missing!
Let’s not guess from names.
Authority Records
Librarians create two kinds of records:
Bibliographic records about books
Name authority records about authors
More MARC, and… MARC Authority field 375 is “gender”!
Recording Gender
MARC Authority Field 375:
Defined as gender identity
Uses an open vocabulary
Allows for multiple gender records per author with validity dates
Program for Cooperative Cataloging Working Group provides guidance on filling the field
Un-Recording Gender
Library of Congress
PCC coordinates contributing libraries
Some contributors do not record gender
Only covers LoC collections
Virtual International Authority File
Combines LoC with other national libraries
Smashes all gender records into male/female without dates
Linking Authority
MARC records do not have linking identifiers
Bibliographic record stores author name
Authority record has control number, but it isn’t referenced
OpenLibrary records VIAF control numbers for a few authors
Linking by Name
Find VIAF record(s) with names matching first author’s name
Extract gender identity field
Assumption: it is unlikely for two authors to have the same full name and different genders.
Combine all matching gender records into male, female, ambiguous, or unknown
Only reasonable because of goal/question
Names
Authority records list multiple name variants for an author
OpenLibrary also lists multiple names for author
Book records list only one
Inconsistent on “Family, Given” or “Given Family”
Solution:
If authority record has “B, A”, generate additional “A B” name
Accept match on any name
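A sketch of the variant generation, assuming names stay simple:

```python
def name_variants(name: str) -> set:
    variants = {name}
    if ',' in name:                          # authority form "Family, Given"
        family, given = name.split(',', 1)
        variants.add(given.strip() + ' ' + family.strip())   # add "Given Family"
    return variants

name_variants('Terrace, Vincent')   # -> {'Terrace, Vincent', 'Vincent Terrace'}
```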
Integration
[Diagram: the integration graph from before, now extended with first-author gender]
Measuring Representation
Limitations
Over-links authors, which can cause some gender errors
Over-links books, causing some gender and book errors
Only records binary gender
Unknown gender distribution of unknown books
Presence of the label is probably biased, but we don’t know how
LoC has linked data that should help; not yet publicly available
Software
Rust program
Parses raw data files
Loads into SQL database
Cleans ISBNs
Rust:
Compiled language (fast like C)
High-level features (memory management, types, strings, XML parsers, etc.)
PostgreSQL
Stores data
Extracts desired fields from natural objects
Integrates data sources
Python
Clusters books
Computes statistics
DVC controls whole process
Wrapping Up
Many decisions go into a serious data integration project.
Document and justify them all.
Design workflow to make it transparent and reproducible.
Photo by Erol Ahmed on Unsplash
🎥 Workflow Advice
This video talks about general principles for processing and integration workflows.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
WORKFLOW ADVICE
Learning Outcomes
Understand basic principles for data processing workflows
Design a workflow that preserves data as much as possible
Photo by Ivan Bandura on Unsplash
Workflow
Single-Source Cleaning
Standardize representations
Delete corrupt data
De-duplicate if appropriate
Preserve information for downstream use
To evaluate impact of decisions
Keep goals and questions in mind
Integration
Link or pool records
De-duplicate (if appropriate)
Unify representations (if not done in per-source cleaning)
Describe integration results & resulting data set
Preserve information for downstream use
Analysis
Now you get to drop data, etc.
Select & reprocess the subset of data for a particular analysis
Different analyses may have different requirements
Multiple analyses and variants from the same data
Still may need to iterate w/ cleaning & integration
Wrapping Up
Data processing is an iterative, multi-stage process.
Preserving information for downstream use gives flexibility and helps us assess impact of upstream decisions.
Photo by Ravi Pinisetti on Unsplash
📃 Further Reading
These aren’t part of the assigned reading; they are here if you would like to learn more.
📩 Assignment 3
Assignment 3 is due on Sunday, October 9, 2022 at the end of the day.