Week 7 — Getting Data (10/3–7)
🧐 Content Overview
This week has 1h58m of video and 13,200 words of assigned readings. This week’s videos are available in a Panopto folder.
The long video does not require as careful study as the rest; it is here so you can see a worked, real-world example, and I will not expect you to reconstruct its various details.
🎥 Introduction
What are we talking about this week? I also discuss general principles that will drive the week’s material.
🎥 Finding Data
Where do we go to find data?
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
FINDING DATA
Learning Outcomes
Identify potential sources for data
Think about the data needed to answer a question
Photo by elCarito on Unsplash
Questions
What do you want to learn?
Need for data flows from what you want to learn.
Sometimes in practice we will start with data, but that shouldn’t be our default.
Example
Q: Does poverty correlate with adverse health outcomes?
Need:
Wealth/poverty data
Census
Health data
Public health agencies
Global Health Data Exchange
Locating Existing Data
Lists of data sets
UCI Machine Learning Repository
Various lists on GitHub
Governmental operations
US federal: data.gov
Individual government agencies
Government data portals
Your organization
Searching on the Web
Purchasing
Asking data owners
Scraping from web sites
Large repositories
Common Crawl
Semantic Scholar
Seeing what other papers use
Wrapping Up
We sometimes need to be creative when locating data for a problem.
There is a lot of data available, but our question should drive our data.
Photo by Garrett Sears on Unsplash
🎥 Integrating Data
This video talks about the key ideas of integrating multiple data sources.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
INTEGRATING DATA
Learning Outcomes
Combine two data sources
Identify the kinds of data useful to make a link
Photo by Sandy Millar on Unsplash
Data Integration
Types of Integration
Linking records – matching records in one set with another
Best case: we have a linking identifier shared between data sets.
Pooling records — taking records of the same kind from different sources
Convert each into common format, and stack!
Identifiers?
XKCD 927: Standards. © Randall Munroe. CC-BY-NC 2.5. Modified by me.
Example: Linking US Geopolitical Data
State name (unique, fine)
Postal code (2-character state abbreviation, unique, also fine)
FIPS code (Federal Information Processing Standards)
States and counties!
Withdrawn but still in use
Great when you have them!
ZIP codes
Legislative districts
Census tracts
Geographic position (lat, long) (ugggh)
Linking Challenges
Corrupt identifiers
Clean and correct them
Duplicate identifiers
Measure frequency of occurrence, try to measure impact
Missing identifiers
Find alternate linking strategies
Alternate linking strategies
Names?
Often not unique
Often take different forms
Locations?
Require complex geographic matching
Or address matching / normalization
Linking takes creativity and care
Recommended Approach
Take a conservative, incremental approach
Prefer identifiers
Minimize munging
Look at unlinked records, see what rules need to be added
Measure success of every linking step
Successfully-linked records
Unlinked records
Sometimes: use rate of different link strategies
Document and justify
Technical Pieces (for linking with Pandas)
Cleaning up individual columns
Series operations (esp. string ops!)
Merge data frames
pd.merge or pd.DataFrame.join (for linking records)
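For instance, a left merge with indicator=True both links the records and measures the success of the linking step. A minimal sketch, with hypothetical frames and column names:

```python
import pandas as pd

# Hypothetical data: ratings keyed by ISBN, and a table mapping ISBNs to clusters
ratings = pd.DataFrame({'user': [1, 2], 'isbn': ['9780306406157', '9999999999999']})
clusters = pd.DataFrame({'isbn': ['9780306406157'], 'cluster': [42]})

# A left merge keeps unlinked records; indicator=True records each link outcome
linked = ratings.merge(clusters, on='isbn', how='left', indicator=True)
linked['_merge'].value_counts()   # 'both' = linked, 'left_only' = unlinked
```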
Pooling Records
Convert into common structure
Stack on top of each other
Sometimes: de-duplicate
Usually good to keep a field identifying record source.
pd.concat is your friend
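A minimal pooling sketch, assuming both sources have already been converted into a common (user, item, rating) structure:

```python
import pandas as pd

# Made-up rating frames from two sources, with a field identifying the source
bx = pd.DataFrame({'user': [1], 'item': ['b1'], 'rating': [8]}).assign(source='BX')
az = pd.DataFrame({'user': [9], 'item': ['b1'], 'rating': [4]}).assign(source='AZ')

pooled = pd.concat([bx, az], ignore_index=True)   # stack on top of each other
pooled = pooled.drop_duplicates()                 # sometimes: de-duplicate
```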
Wrapping Up
We often need to combine data from multiple sources; sometimes linking, sometimes pooling.
Linking identifiers make this easy (sometimes).
We don’t always have them.
Photo by Joshua Hoehne on Unsplash
🎥 Values and Types
This video discusses how to deal with and clean up various data types.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
CLEANING VALUES
Learning Outcomes
Clean up individual values or columns
Identify useful Pandas operations for cleaning values
Photo by Daiga Ellaby on Unsplash
Types of Operations
Convert data types
Standardize data codes
Remove or clean corrupt data
Fill missing data (with care)
Basic Data Type Conversion
.astype()
Converts data from one type to another
Parses strings w/ simple rules
df['column'].astype('i4')
Common NumPy data types:
Integer: i1, i2, i4, i8
Unsigned: u1, u2, u4, u8
Float: f4, f8
Also have bit-based sizes:
i4 = int32
f8 = float64 (double-precision)
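For example:

```python
import pandas as pd

s = pd.Series(['3', '14', '15'])
s.astype('i4')                   # parses the strings into 32-bit integers
s.astype('int32')                # equivalent: i4 == int32
pd.Series([2.5]).astype('f8')    # f8 == float64 (double precision)
```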
Parsing Strings
pd.to_datetime — parse strings to date/time objects
Can specify format!
pd.to_timedelta — parse strings to time differences
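For example (Pandas can often infer common formats, but specifying one is faster and stricter):

```python
import pandas as pd

pd.to_datetime(pd.Series(['2022-10-03', '2022-10-07']), format='%Y-%m-%d')
pd.to_timedelta(pd.Series(['1 days', '2 hours']))
```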
Making Categoricals
Categorical variables save memory and can improve presentation.
String: series.astype('category')
Integer: pd.Categorical
After initialization: rename/reorder etc.
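A small sketch of both paths, with illustrative category names:

```python
import pandas as pd

s = pd.Series(['M', 'F', 'M']).astype('category')            # from strings
c = pd.Categorical.from_codes([0, 1, 0], ['M', 'F'])         # from integer codes
s = s.cat.rename_categories({'M': 'Male', 'F': 'Female'})    # after initialization
```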
Standardizing Data
Normalize missing data
String encoding like ‘NA’?
Numeric sentinel values like -999?
Reassign to NA
Unify case (upper, lower, title, casefold) [series.str.upper]
Replace substrings [series.str.replace]
Trim whitespace [strip/rstrip/lstrip]
Rename codes [cat.rename_categories]
Merge codes [reassign, then cat.remove_unused_categories]
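A minimal sketch of some of these operations on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([' boise ', 'NA', 'BOISE'])
s = s.str.strip()                  # trim whitespace
s = s.replace('NA', np.nan)        # string-coded missing value -> NA
s = s.str.title()                  # unify case

nums = pd.Series([12, -999, 7])
nums = nums.replace(-999, np.nan)  # numeric sentinel value -> NA
```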
Cleaning Data
Strings are often corrupt – excess characters, etc.
Drop leading/trailing whitespace [strip and friends]
Match with regular expressions
Expression to match expected data & keep
Expression to match invalid data & delete
series.str.replace(regex, replace)
Extract fixed character positions (columns) [series.str.slice]
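For example, with an expression that matches invalid characters and deletes them (a made-up cleanup, not a full ISBN parser):

```python
import pandas as pd

s = pd.Series(['ISBN 0-8450-3481-6 :', '  0845348205 (pbk.)'])
s = s.str.strip()                                 # drop leading/trailing whitespace
s = s.str.replace(r'[^0-9Xx]', '', regex=True)    # delete anything that is not an ISBN character
```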
Clearing Data
Sometimes values are unrecoverably corrupt
Delete value (replace with NA or INVALID code)
May separate UNKNOWN from INVALID
Or just use one UNKNOWN code
Depends on question – I often separate early, combine later
Delete record (if unusable)
Don’t delete from underlying files – in memory, or in new files
Pandas Features to Study
Series methods
Top-level functions (e.g. pd.to_datetime)
Category methods under ‘.cat’
String methods under ‘.str’
Date-time methods under ‘.dt’
Wrapping Up
Data is messy.
Pandas gives us a number of tools for working with individual values or columns.
Photo by Ashim D’Silva on Unsplash
Resources
In addition to the next reading, you may find these useful:
📃 Pandas Text Operations
Read Working with Text Data.
🎥 Ethical Issues in Data
This video provides a very brief overview of some of the ethical issues in data collection and use.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
HUMAN SUBJECTS ETHICS
Learning Outcomes
Know the pillars of classical human subjects research ethics
Apply these principles to data science applications
Know when to get IRB approval
Understand boundaries and some concerns not covered by classic ethics
Photo by Joel & Jasmin Førestbird on Unsplash
Principles of Human Subjects Ethics
Respect for Persons
Individuals should be treated as autonomous agents
Persons with diminished autonomy are entitled to protection
Beneficence
Avoid harm, provide benefit
Justice
How are costs and benefits of research distributed?
See the Belmont Report
Informed Consent
Research subjects should
Be informed about the research
In terms they can understand
So they can voluntarily choose whether to participate
Photo by Philipp Wüthrich on Unsplash
Compliance
Ethics is about doing the right thing
Compliance is about following rules and regulations
Some rules are to enforce a baseline of ethical behavior, but compliance is not ethics.
Regulation in US Federal Funding
Common Rule regulates federally-funded human subjects research
Organizations must have an Institutional Review Board (IRB)
IRB must approve human subjects research
Uses Belmont principles
Reviews consent procedures, analysis, etc.
Weighs risks and benefits
IRB approval does not mean something is ethical.
Human Subjects
U.S. Common Rule:
Human subject means a living individual about whom an investigator (whether professional or student) conducting research:
(i) Obtains information or biospecimens through intervention or interaction with the individual, and uses, studies, or analyzes the information or biospecimens; or
(ii) Obtains, uses, studies, analyzes, or generates identifiable private information or identifiable biospecimens.
Additional Relevant Concerns
Data protection regulation, law, and rights
EU GDPR
California and many other jurisdictions have similar laws
Principles are broader than implementation
Contextual integrity — people may not expect data to be reused
Anti-discrimination (both law and principle)
Further development of respect and justice concerns
Pushing further on racial and gender justice
Wrapping Up
Our work has ethical implications.
It is our responsibility as scientists and as human beings to work through the ethics of our work and be accountable.
Rules and regulations can provide some guidance.
Photo by Tiago Felipe Ferreira on Unsplash
🚩 Week 7 Quiz
Take the Week 7 quiz in Canvas.
🎥 A Real Example
This video describes the data cleaning and integration in a real example from my own research group.
I am providing it so you can see the principles in this week’s material applied to an actual problem; details of this specific data set will not be on exams.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
A REAL EXAMPLE
Integrating Book Data
Learning Outcomes
See some of the data cleaning and integration decisions in a real data science project
Photo by Sharon McCutcheon on Unsplash
Question
Are book authors of different genders well-represented in user ratings and reviews on online book platforms?
Data Requirements
Books
“Easy” public data:
Library of Congress Open MARC Data (10.5M records)
OpenLibrary (26.7M records)
GoodReads (crawled by UCSD; limited data)
MARC
Standard format for book and author data
Stores data in fields and subfields with numeric tags
Available in XML format
Same structure and tags
Easier to parse w/ standard tools
Still must know MARC fields!
Leader/00-23 *****nam##22*****#a#4500
001 <control number>
003 <control number identifier>
005 19920331092212.7
007/00-01 ta
008/00-39 820305s1991####nyu###########001#0#eng##
020 ##$a0845348116 :$c$29.95 (£19.50 U.K.)
020 ##$a0845348205 (pbk.)
040 ##$a[organization code]$c[organization code]
050 14$aPN1992.8.S4$bT47 1991
082 04$a791.45/75/0973$219
100 1#$aTerrace, Vincent,$d1948-
245 10$aFifty years of television :$ba guide to series and pilots, 1937-1988 /$cVincent Terrace.
246 1#$a50 years of television
260 ##$aNew York :$bCornwall Books,$cc1991.
300 ##$a864 p. ;$c24 cm.
500 ##$aIncludes index.
650 #0$aTelevision pilot programs$zUnited States$vCatalogs.
650 #0$aTelevision serials$zUnited States$vCatalogs.
What is a Book?
LoC catalogs many things:
Books
Periodicals
Government records
Music
Films
A “book” is:
Textual (MARC format ‘t’)
Also included ‘a’ (Map) to catch atlases
Not a government record
International Standard Book Number (ISBN)
Identifies books across publishers
Defined to identify a single edition of a book
Come in 10- and 13-digit formats
13-digit is variant of EAN
Can convert 10 to 13 (see the sketch after this list)
Problems:
Many formats, surrounding text, etc.
Corrupt ISBNs
Extra digits
Missing digits
Incorrect check digit
Non-ISBNs in ISBN field
e.g. “$15”
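As referenced above, converting ISBN-10 to ISBN-13 is mechanical: drop the ISBN-10 check digit, prepend the 978 EAN prefix, and recompute the EAN-13 check digit. A minimal sketch:

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    core = '978' + isbn10[:9]    # drop the ISBN-10 check digit, add the EAN prefix
    # EAN-13 check digit: weight the digits 1, 3, 1, 3, ...
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

isbn10_to_isbn13('0306406152')   # -> '9780306406157'
```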
Other Identifiers
Library of Congress Control Number (LCCN)
OpenLibrary edition & work IDs
GoodReads book & work IDs
Our own table primary keys
Amazon Standard Identification Number (ASIN)
Define unified book code – number identifying a book record
Real “ISBNs”
0836932722
013027190X (v. 2)
8719359022. ISBN 8719359004 (pbk.)
2890401448 : $5.00
$10.00
978-963-87472-2-8
0981482295 (978-0-9814822-9-3 : alk. paper)
0877030863 lib. bdg. (from label mounted on verso of t.p. correcting printed ISBN) (v. 1)
Cleaning ISBNs
Process ISBN fields:
Extract things that look like ISBNs
Digits, with allowable separators and terminating X
Long enough to be an ISBN
Extract “tags” (e.g. ‘pbk’)
Define allowable prefixes, postfixes, and separators
Clean junk and separators from inside ISBNs to standardize
Recursive descent parser driven by regular expressions (CS 361 & 354!)
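The real parser is considerably more involved, but a simplified regex-based sketch of the extract-and-standardize steps looks like this:

```python
import re

# Simplified stand-in for the regex-driven parser: find digit runs with
# allowable separators and a terminating digit or X, then strip separators.
ISBN_RE = re.compile(r'[0-9][0-9 -]{8,15}[0-9Xx]')

def extract_isbns(field: str) -> list:
    isbns = []
    for match in ISBN_RE.findall(field):
        digits = re.sub(r'[ -]', '', match)    # clean separators from inside
        if len(digits) in (10, 13):            # long enough to be an ISBN
            isbns.append(digits.upper())
    return isbns

extract_isbns('2890401448 : $5.00')    # -> ['2890401448']
extract_isbns('978-963-87472-2-8')     # -> ['9789638747228']
```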
Development Process
Start with simple rules
Manually verify some matches
Look for patterns not yet matched
Add to matcher!
Make ignore patterns for known-unparsable patterns
Iterate until everything is matched, ignored, or unusable
OpenLibrary
JSON – much easier to work with
Shares many ISBN problems – used same logic
What is a Book?
Books have multiple editions
Different publishers
Different ISBNs
Ratings / reviews usually attached to a single edition
OpenLibrary, GoodReads group into “work”
Not complete
Some records more complete than others
Best to recommend works
Grouping by Connected Components
[Diagram: works, books, and ISBNs drawn as a graph, with edges between records that share identifiers]
Each connected component gets a cluster identifier
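A minimal union-find sketch of this clustering, with made-up identifiers (the project’s actual implementation differs); each record contributes edges between the identifiers it mentions, and the root of each component serves as the cluster ID:

```python
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# made-up edges: identifiers that co-occur on the same record
edges = [('gr:123', 'isbn:9780306406157'), ('ol:OL45W', 'isbn:9780306406157')]
for a, b in edges:
    union(a, b)

clusters = {x: find(x) for x in parent}   # every member of a component shares a root
```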
Problems with Grouping
Miss groups when we don’t have records to link them
ISBN that never appears in any book data
Over-grouping when ISBNs are mislinked
ISBN erroneously recorded for wrong book
Two books’ ISBNs corrupted to same value
ISBN reuse
Combined volumes
Not yet good solutions – documented as limitation
Integration So Far
Book data from 3 sources
ISBNs (and ASINs) extracted
Book records mapped into clusters
ISBN ⇔ cluster mapping (what ISBNs are in a cluster?)
This is (complex) pooling, but w/o standardization
Next: linking!
Book Ratings and Reviews
Collected from online platforms:
BookCrossing [BX] (collected from an old book rating site, often used for recommender systems research)
Amazon [AZ] (public ratings and reviews, collected by UCSD in 2014)
GoodReads [GR] (ratings, reviews, and add-to-shelf actions by users with public profiles, collected by UCSD in 2017)
BookCrossing and Amazon Linking
BX: user ID, ISBN, rating
ISBNs can be corrupt – remove invalid chars
Link to book clusters
AZ: user ID, ASIN, rating, review text
Treat as ISBNs
Link to book clusters
GoodReads Linking
GoodReads contains:
User adds book to shelf (user ID, book ID, shelf ID, timestamp)
Shelf IDs not meaningful
“Is Read” flag for adding to the Read shelf
5-star rating (accompanying some add-to-shelf actions)
Review text (with user ID, item ID, timestamp, rating)
All have GoodReads book IDs – directly link to cluster
Integration So Far
[Diagram: a book cluster linking LoC book records, OpenLibrary edition and work records, GoodReads book and work records, BookCrossing ratings, Amazon ratings and reviews, and GoodReads ratings, adds, and reviews]
Author Data
Book data includes list of authors
OpenLibrary & GoodReads have author entries
Linked identifiers
But we want author gender — missing!
Let’s not guess from names.
Authority Records
Librarians create two kinds of records:
Bibliographic records about books
Name authority records about authors
More MARC, and… MARC Authority field 375 is “gender”!
Recording Gender
MARC Authority Field 375:
Defined as gender identity
Uses an open vocabulary
Allows for multiple gender records per author with validity dates
Program for Cooperative Cataloging Working Group provides guidance on filling the field
Un-Recording Gender
Library of Congress
PCC coordinates contributing libraries
Some contributors do not record gender
Only covers LoC collections
Virtual International Authority File
Combines LoC with other national libraries
Smashes all gender records into male/female without dates
Linking Authority
MARC records do not have linking identifiers
Bibliographic record stores author name
Authority record has control number, but it isn’t referenced
OpenLibrary records VIAF control numbers for a few authors
Linking by Name
Find VIAF record(s) with names matching first author’s name
Extract gender identity field
Assumption: it is unlikely for two authors to have the same full name and different genders.
Combine all matching gender records into male, female, ambiguous, or unknown
Only reasonable because of goal/question
Names
Authority records list multiple name variants for an author
OpenLibrary also lists multiple names for author
Book records list only one
Inconsistent on “Family, Given” or “Given Family”
Solution:
If authority record has “B, A”, generate additional “A B” name
Accept match on any name
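A sketch of the variant generation, assuming names stay simple:

```python
def name_variants(name: str) -> set:
    variants = {name}
    if ',' in name:                          # authority form "Family, Given"
        family, given = name.split(',', 1)
        variants.add(given.strip() + ' ' + family.strip())   # add "Given Family"
    return variants

name_variants('Terrace, Vincent')   # -> {'Terrace, Vincent', 'Vincent Terrace'}
```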
Integration
[Diagram: the integration graph from before, now extended with first-author gender]
Measuring Representation
Limitations
Over-links authors, which can cause some gender errors
Over-links books, causing some gender and book errors
Only records binary gender
Unknown gender distribution of unknown books
Presence of the label is probably biased, but we don’t know how
LoC has linked data that should help; not yet publicly available
Software
Rust program
Parses raw data files
Loads into SQL database
Cleans ISBNs
Rust:
Compiled language (fast like C)
High-level features (memory management, types, strings, XML parsers, etc.)
PostgreSQL
Stores data
Extracts desired fields from natural objects
Integrates data sources
Python
Clusters books
Computes statistics
DVC controls whole process
Wrapping Up
Many decisions go into a serious data integration project.
Document and justify them all.
Design workflow to make it transparent and reproducible.
Photo by Erol Ahmed on Unsplash
🎥 Workflow Advice
This video talks about general principles for processing and integration workflows.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
WORKFLOW ADVICE
Learning Outcomes
Understand basic principles for data processing workflows
Design a workflow that preserves data as much as possible
Photo by Ivan Bandura on Unsplash
Workflow
Single-Source Cleaning
Standardize representations
Delete corrupt data
De-duplicate if appropriate
Preserve information for downstream use
To evaluate impact of decisions
Keep goals and questions in mind
Integration
Link or pool records
De-duplicate (if appropriate)
Unify representations (if not done in per-source cleaning)
Describe integration results & resulting data set
Preserve information for downstream use
Analysis
Now you get to drop data, etc.
Select & reprocess the subset of data for a particular analysis
Different analyses may have different requirements
Multiple analyses and variants from the same data
Still may need to iterate w/ cleaning & integration
Wrapping Up
Data processing is an iterative, multi-stage process.
Preserving information for downstream use gives flexibility and helps us assess impact of upstream decisions.
Photo by Ravi Pinisetti on Unsplash
📃 Further Reading
These aren’t part of the assigned reading; they are here if you would like to learn more.
📩 Assignment 3
Assignment 3 is due on Sunday, October 9, 2022 at the end of the day.