# Week 7 — Getting Data (Oct. 4–8)

This week has the following learning outcomes:

- Locate data sources
- Integrate data from multiple sources
- Reason about bias and social effects of data

## {{moverview}} Content Overview

:::{module} week7
:folder: 3e59e9f-4955-4cea-a709-adb8012572f7
:::

## {{mcal}} Deadlines

- Week 7 Quiz **Thursday Oct. 7 at 8 AM**
- Assignment 3 **Sunday Oct. 10 at 11:59 PM**

## {{mvideo}} Introduction

This video introduces the week's topic and the general principles that drive the rest of the week's material.

:::{video}
:name: 7-1 - Data
:length: 8m22s
:slide-id: 495979F9A431DDB0%2173225
:slide-auth: APts3vR7Lc59NkI
:::

## {{mvideo}} Finding Data

Where do we go to find data?

:::{video}
:name: 7-2 - Finding Data
:length: 7m25s
:slide-id: 495979F9A431DDB0%2173227
:slide-auth: AIXRrNdcFNGB7sE
:::

### Resources

- [Data.gov](https://data.gov)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) — quality of documentation varies widely
- [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)

## {{mvideo}} Data Formats

In this video I describe different formats in which you may find data.

:::{video}
:name: 7-3 - Data Formats
:length: 13m55s
:slide-id: 495979F9A431DDB0%2173229
:slide-auth: AL8FsyUoRQ7_NXs
:::

### Resources

- [Pandas IO tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) describes Pandas support for reading and writing various data formats
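
As a small illustration of what the IO tools do, the sketch below loads the same two records from CSV text and from JSON text. The inline strings and the column names `city` and `visits` are made-up stand-ins for files you would download; this is an assumption-laden example, not material from the video.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for downloaded files.
csv_text = "city,visits\nBoise,12\nNampa,7\n"
json_text = '[{"city": "Boise", "visits": 12}, {"city": "Nampa", "visits": 7}]'

# read_csv and read_json accept file paths, URLs, or file-like objects.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# Both formats should yield the same columns and values.
print(from_csv)
print(from_json)
```

Once the data is in a data frame, the rest of the analysis does not need to care which format it came from.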

## {{mvideo}} Integrating Data

This video talks about the key ideas of integrating multiple data sources.

:::{video}
:name: 7-4 - Integrating Data
:length: 11m40s
:slide-id: 495979F9A431DDB0%2173230
:slide-auth: AKZU3thNfrXQ1v0
:::
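
One common way to integrate two sources in Pandas is to join them on a shared key and then check how well the keys lined up. The sketch below is a minimal example with invented tables (`books`, `ratings`) and an invented `isbn` key; it is not taken from the video.

```python
import pandas as pd

# Hypothetical tables from two different sources, linked by ISBN.
books = pd.DataFrame({"isbn": ["111", "222", "333"], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"isbn": ["111", "111", "333", "444"], "rating": [5, 3, 4, 2]})

# An outer join keeps unmatched rows from both sides; indicator=True adds a
# _merge column saying where each row came from, and validate raises an
# error if an ISBN appears more than once in books.
merged = books.merge(
    ratings, on="isbn", how="outer", indicator=True, validate="one_to_many"
)

n_matched = int((merged["_merge"] == "both").sum())
n_books_only = int((merged["_merge"] == "left_only").sum())
n_ratings_only = int((merged["_merge"] == "right_only").sum())
print(n_matched, n_books_only, n_ratings_only)
```

Counting the unmatched rows on each side is a quick check on how much data a stricter inner join would silently drop.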

## {{mvideo}} Values and Types

This video discusses how to deal with and clean up various data types.

:::{video}
:name: 7-5 - Values and Types
:length: 8m15s
:slide-id: 495979F9A431DDB0%2173233
:slide-auth: AHyCBqpfJRcrq68
:::

### Resources

In addition to the next reading, you may find these useful:

- [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)
- [How to use Regex in Pandas](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/)
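
For instance, numeric and date columns often arrive as strings with stray symbols or placeholder text. The sketch below shows one possible cleanup using a regular expression and Pandas' conversion functions; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical messy input: prices and dates stored as free-form strings.
raw = pd.DataFrame({
    "price": ["$1,200", "950", "n/a", " $75 "],
    "sold_on": ["2021-10-04", "2021-10-05", "not recorded", "2021-10-08"],
})

# Drop everything except digits and the decimal point, then convert.
# errors="coerce" turns anything unparseable into a missing value.
digits_only = raw["price"].str.replace(r"[^\d.]", "", regex=True)
raw["price_num"] = pd.to_numeric(digits_only, errors="coerce")

# Parse dates in a known format; bad entries become NaT (missing).
raw["sold_date"] = pd.to_datetime(raw["sold_on"], format="%Y-%m-%d", errors="coerce")

print(raw[["price_num", "sold_date"]])
```

Coercing bad values to missing is a deliberate choice here: it keeps the rows, so you can count and inspect what failed to parse instead of losing it.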


## {{mdoc}} Pandas Text Operations

:::{reading}
:title: Working with Text Data
:url: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
:length: 4200 words
:::

Read [Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html).
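
As a quick preview of the `.str` accessor that the reading covers, the sketch below splits a title and a publication year out of one text column. The sample strings are made up for illustration.

```python
import pandas as pd

# Hypothetical title strings with inconsistent spacing and capitalization.
titles = pd.Series(["  The Hobbit (1937) ", "DUNE (1965)", "Untitled"])

# Trim surrounding whitespace first.
tidy = titles.str.strip()

# Pull a four-digit year out of trailing parentheses; rows with no match
# get a missing value.
year = tidy.str.extract(r"\((\d{4})\)", expand=False)

# Remove the year suffix and normalize capitalization.
name = tidy.str.replace(r"\s*\(\d{4}\)$", "", regex=True).str.title()

print(pd.DataFrame({"name": name, "year": year}))
```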

## {{mvideo}} Ethical Issues in Data

This video provides a **very brief** overview of some of the ethical issues in data collection and use.

:::{video}
:name: 7-6 - Ethics
:length: 14m10s
:slide-id: 495979F9A431DDB0%2173240
:slide-auth: AHtpuXAGUcIYFNI
:::

## {{mdoc}} The Belmont Report

:::{reading}
:title: The Belmont Report
:length: 5500 words
:url: https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/read-the-belmont-report/index.html
:::

Read [the Belmont Report](https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/read-the-belmont-report/index.html).

Additional information, including a video, is available from the [HHS Office for Human Research Protections](https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html).

## {{mdoc}} The ACM Code of Ethics

:::{reading}
:title: ACM Code of Ethics and Professional Responsibility
:url: https://www.acm.org/code-of-ethics
:length: 3500 words
:::

Read [the ACM code of ethics](https://www.acm.org/code-of-ethics).

## {{mquiz}} Week 7 Quiz

Take the Week 7 quiz in {{LMS}}.

## {{mvideo}} A Real Example

This video describes the data cleaning and integration in a real example from my own research group.
I am providing it so you can see the principles in this week's material applied to an actual problem; details of this specific data set will not be on exams.

:::{video}
:name: 7-7 - Real Example
:length: 48m57s
:slide-id: 495979F9A431DDB0%2173242
:slide-auth: AF1TnxA02bYHuSA
:::

### Resources

- [Data set documentation](https://bookdata.piret.info/)
- [Data integration code](https://github.com/BoiseState/bookdata-tools)
- [Paper using this data](https://md.ekstrandom.net/pubs/bag-extended)

## {{mvideo}} Workflow Advice

This video talks about general principles for processing and integration workflows.

:::{video}
:name: 7-8 - Workflow Advice
:length: 5m24s
:slide-id: 495979F9A431DDB0%2173241
:slide-auth: AMRLqk4xIA_OSL0
:::

## {{mdoc}} Further Reading

These are not assigned readings, but they are useful if you want to learn more.

- [Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2886526) — **strongly recommended**
- [CITI Training on Human-Subjects Research](https://www.boisestate.edu/research-compliance/citi-training/) — free for Boise State students, faculty, and staff; this training is required if you are involved in carrying out human-subjects research at Boise State
- [Data Cleansing Best Practices & Strategy Plan](https://www.dataisbeauty.com/data-cleansing-best-practices-strategy/)
- [<cite>Principles of Data Integration</cite>](https://boisestate.on.worldcat.org/oclc/796466994)

## {{massignment}} Assignment 3

Assignment 3 is due **Sunday, Oct. 10 at 11:59 PM**.
