Week 7 — Getting Data
This week has the following learning outcomes:
- Locate data sources
- Integrate data from multiple sources
- Reason about bias and social effects of data
Pursued through the following activities:
This week's videos are also in a Panopto playlist.
Introduction
In this video I introduce this week's topic and discuss the general principles that will drive the week's material.
Finding Data
Where do we go to find data?
Resources
- Data.gov
- UCI Machine Learning Repository — quality of documentation varies widely
- Awesome Public Datasets
Data Formats
In this video I describe different formats in which you may find data.
Resources
- Pandas IO tools describes Pandas support for reading and writing various data formats
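As a rough illustration of reading the same kind of table from two different formats, here is a minimal sketch. The data, column names, and variable names are hypothetical and inlined as strings so the example runs without any files; with real data you would pass a file path or URL instead.

```python
import io

import pandas as pd

# Hypothetical sample data, inlined so nothing needs to be downloaded.
csv_text = "name,score\nalpha,3\nbeta,5\n"
json_text = '[{"name": "alpha", "group": "x"}, {"name": "beta", "group": "y"}]'

# read_csv and read_json accept paths, URLs, or file-like objects.
scores = pd.read_csv(io.StringIO(csv_text))
groups = pd.read_json(io.StringIO(json_text))

# Check what Pandas inferred before trusting it.
print(scores.dtypes)
print(groups)
```

Whatever the format, it is worth looking at the inferred column types right after loading, since each reader guesses them differently.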
Integrating Data
This video talks about the key ideas of integrating multiple data sources.
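One common way to integrate two sources in Pandas is to join them on a shared identifier. The sketch below is only an illustration and is not taken from the video: the tables, the `book_id` key, and the variable names are all made up. It uses `merge` with `indicator=True` so you can see which rows failed to match.

```python
import pandas as pd

# Hypothetical tables from two sources that share an identifier column.
ratings = pd.DataFrame({"book_id": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})
books = pd.DataFrame({"book_id": [1, 2, 4], "title": ["A", "B", "D"]})

# A left join keeps every rating; indicator=True adds a _merge column
# recording whether each row found a partner in the other table.
merged = ratings.merge(books, on="book_id", how="left", indicator=True)

# Rows marked left_only had no match in the second source.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"{len(unmatched)} of {len(merged)} ratings have no matching book")
```

Counting unmatched rows is a cheap way to measure how well two sources actually link up before you rely on the combined data.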
Week 7 Quiz
Take the Week 7 quiz in Blackboard.
Values and Types
This video discusses how to deal with and clean up various data types.
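A minimal sketch of one type-cleaning step, assuming a hypothetical table whose numeric and date columns arrived as strings with a few unparseable entries. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw data: everything was read in as text.
raw = pd.DataFrame({
    "price": ["12.50", "7", "n/a"],
    "sold_on": ["2021-10-01", "2021-10-04", "not recorded"],
})

clean = raw.copy()
# errors="coerce" turns unparseable entries into NaN / NaT instead of raising.
clean["price"] = pd.to_numeric(raw["price"], errors="coerce")
clean["sold_on"] = pd.to_datetime(raw["sold_on"], format="%Y-%m-%d", errors="coerce")

print(clean.dtypes)
# Count how many values were lost to coercion in each column.
print(clean.isna().sum())
```

Coercion hides bad values as missing data, so compare the missing counts before and after to see how much was silently dropped.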
Resources
In addition to the next reading, you may find these useful:
Pandas Text Operations
Read Working with Text Data.
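As a small taste of what that reading covers, here is a sketch of normalizing inconsistent text labels with the `.str` accessor. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical labels that differ only in case and whitespace.
names = pd.Series(["  Boise State ", "boise state", "BOISE  STATE", None])

normalized = (
    names.str.strip()                            # drop leading/trailing spaces
         .str.lower()                            # fold case
         .str.replace(r"\s+", " ", regex=True)   # collapse repeated whitespace
)

print(normalized)
print(normalized.nunique())  # distinct non-missing labels after cleanup
```

The `.str` methods pass missing values through untouched, so the missing entry stays missing rather than causing an error.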
Ethical Issues in Data
This video provides a very brief overview of some of the ethical issues in data collection and use.
The Belmont Report
Read the Belmont Report.
Additional information, including a video, is available at the HHS Office for Human Research Protections.
The ACM Code of Ethics
Read the ACM Code of Ethics.
A Real Example
This video describes the data cleaning and integration in a real example from my own research group. I am providing it so you can see the principles in this week's material applied to an actual problem; details of this specific data set will not be on exams.
Resources
Workflow Advice
This video talks about general principles for processing and integration workflows.
Further Reading
These aren't part of the assigned reading, but they are here if you want to learn more.
- Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries — strongly recommended
- CITI Training on Human-Subjects Research — free for Boise State students, faculty, and staff; this training is required if you are involved in carrying out human-subjects research at Boise State
- Data Cleansing Best Practices & Strategy Plan
- Principles of Data Integration
Assignment 3
Assignment 3 is due on Oct. 11 at the end of the day.