The purpose of Assignment 3 is to give you experience integrating data from multiple sources to explore a two-variable question.
It is due Sunday, October 9, 2022 at the end of the day (11:59 PM). Submit your
.ipynb and PDF files to Canvas.
This assignment is not structured step-by-step like the first two. I describe the data, and the questions you need to answer, and you need to build a well-structured notebook that answers them.
Read the entire assignment before you begin!
Central Question: Are health outcomes correlated with poverty levels in a community?
That is the key objective of this assignment. To answer this, we are going to obtain poverty data from the U.S. Census Bureau and health data from the Global Health Data Exchange and the US Centers for Disease Control, and look at the relationship between a state’s poverty rate and various health outcomes.
We are going to operationalize these as follows:
Poverty rate: fraction of family households underneath the poverty line in 2014
Mortality rates from infectious diseases in 2014
Infant mortality rate in 2014
Unit of analysis: state (we will discuss implications of this!)
For more background, read Healthy People 2020’s description of social determinants of health.
Correlation and Causation
This analysis is purely correlational, not causal. It will not provide evidence as to whether or not poverty causes particular health outcomes — much more sophisticated analysis techniques would be needed for such conclusions; you can learn such techniques in the Econometrics classes.
Correlational analyses are still useful, however, because they allow us to locate possible effects for further inquiry. If poverty and mortality is correlated, than analysis that seeks to identify the relevant causes (e.g. poverty itself, some specific effect aspect of poverty, or another thing causing both poverty and mortality) can inform potential future interventions. Be careful in your reporting, however, that you do not claim causality.
We’re going to get data from 3 sources.
Income — US Census#
We are going to get income data from the US Census Bureau. See Using the Census for an introduction to working with this data.
The in-class exercise on September 30 will give you practice with census data.
Before using census data, you need to get an API key. Request one here.
You will also need to install the Python package for the Census API, and a package of US state code data. These are not available in the main Conda repository, so use Pip:
pip install census us
The census data comes in a variety of files. These files include:
sf1— Summary File 1, containing complete count information on the decennial census.
acs5— American Community Survey, a supplementary annual survey of a sample of the population carried out by the census bureau, 5-year estimates.
We are going to be using ACS5. It has thousands of variables. The variable list describes them, and the ones of interest are all reported as estimated population counts. That means variable
B06009_003E is an estimate, based on the sample, of the number of people in a geographic region whose highest educational attainment is a high school degree.
B05010_* variables describe population with respect to the poverty line.
B05010_001E is the estimated total number of family households, and
B05010_002E is the estimated number under the poverty line. These variables are described in more detail in the Social Data Explorer.
You can use these two variables to caculate the poverty level as defined above: the fraction of households under the poverty line.
Fetching Census Data
If you have a
Census object defined in variable
c, as we did in the Sep. 30 example, you can fetch these variables for all states with:
state_pop = pd.DataFrame.from_records(c.acs5.state(('NAME', 'B05010_001E', 'B05010_002E'), '*', year=2014))
state column in the resulting table is the state’s FIPS code, as a string.
The Social Explorer has a very good database of census fields. This database describes all the census sub-tables, and lets you browse them by group (e.g. B05010).
Within the page for a group, such as B05010, there
is a list of all the sub-variables, organized hierarchically so you can see how they relate to each other.
The variable names do need a bit of modification; when Social Explorer says
B05010001, you need to request that from the census as
B05010_001E (the Estimated value).
The census API puts an underscore (
_) between th evariable group code (
B05010) and the specific sub-variable identifier (
001E); Social Explorer does not include the underscore.
The explorer also includes excerpts from the full census technical documentation providing more detail about the different variables and their codings.
Infectious Diseases — GHDx#
We will obtain infectious disease mortality rates from the Global Health Data Exchange. These files contain county- and state-level mortality data from the U.S. from 1980 to 2014. The files themselves are on the “Files” tab; download “National: All States and Counties (5-year Intervals)”.
Pandas provides a
read_excel function to read data tables from Excel files.
This file has 6 sheets, one for each of 6 different disease families.
Reading GHDx Data
You can read these data files with
sheets = pd.read_excel('IHME_USA_COUNTY_INFECT_DIS_MORT_1980_2014_NATIONAL_Y2018M03D27.XLSX', sheet_name=None, header=1, skipfooter=2)
sheets variable is a dictionary (remember those?) mapping sheet names (disease families) to data frames.
sheet_name=None option tells it to load all sheets, and
header=1 says to skip the first row and read the column headers from the second.
skipfooter=2 tells it to skip the last two rows as well. Look at the files to see why!
For this function to work, you will need the
conda install openpyxl
You won’t need to use openpyxl directly, but Pandas uses it to read Excel files.
Alternatively, you can open the data file in Excel or LibreOffice and export the parts you need to a CSV file.
This data frame has annoying column names, and also annoying fields: the field values include confidence intervals.
Extracting Column Data
pandas.Series.str.replace() method, combined with a type conversion, lets us extract the actual estimate from the cell that includes the confidence intervals.
We can do this by replacing the parenthsized intervals with nothing. Like this:
df['Mortality'] = df['Mortality Rate, 2014*'].str.replace(r'\s*\(.+\)', '').astype('f8')
This mortality data is recorded as “deaths per 100,000 people”. The rates are age-standardized: look up age-standardized mortality rates to read about what this means.
Infant Mortality — CDC#
The Centers for Disease Control and Prevention provide a table of infant mortality data by state. Click “Download Data (CSV)” at the bottom of the page to download the file.
This file has multiple years; you will need to select the rows with the year you want.
A FIPS code is a numeric code that identifies a U.S. state or county.
State codes are two digits; you can therefore extract the state-level data from an Infectious Disease table by selecting all rows where the FIPS column is less than 100 (the national-level
NaN will also be excluded by such a filter).
Both the census and the infectious disease table use FIPS codes, so you can join or merge (subsets of) the two tables by their FIPS code.
The census will return FIPS data as a string, so you will need to convert it to a number (
.astype('i4')) before joining.
It is fine to join an integer column with a floating-point column.
The infant mortality data does not use FIPS codes — it uses state abbreviations.
To link FIPS codes to state abbreviations, you can use the
us Python package (and create a data table with the appropriate linking identifiers), or download the “National FIPS and GNIS Codes File” from the US Census Bureau.
It is separated by vertical bars (
|), so you can read it with:
You may need to convert the data type of the FIPS code column. Once you have this file, you can merge infant mortality with census data by:
Merge infant mortality with the state codes file by state abbreviation to get FIPS codes.
Merge the resulting table with census data by FIPS code.
Submit a notebook using the data above to do the following (all for year 2014 data):
(20%) Describe, both numerically and graphically, the distribution of state poverty rates, as well as a chart or table showing the poverty rate for each state. A table showing the poverty rate and relevant mortality rates for each state, as an input to the remaining sections, would be very helpful to the reader.
(30%) Show the relationship of poverty rates to mortality rates of two different disease families — Meningitis and Diarrheal diseases — using appropriate plots.
(20%) Quantify the relationship between poverty and each of the 2 mortality rates in (2) by computing correlation coefficients with bootstrapped confidence intervals for each disease.
(20%) Repeat (2) and (3) for infant mortality.
(10%) Write 2–3 paragraphs about what you learn from these data. Discuss also limitations that you see in the data and analysis! This should include insights from the analysis, not just a rehearsal of the tasks you performed.
For each aspect, it is not sufficient to simply present the results of the computations. Your notebook text needs to interpret the numeric results and plots in the context of the problem and research question(s). What do the numbers mean? For example, if the confidence interval of a correlation coefficient includes 0, what does this mean about the correlation? Data science, as we have discussed, is about quantitative insight on questions. Your assignment needs to make the connection between the computation and the goals and questions so that you the data can turn into actual insight.
The Correlation Notebook shows how to bootstrap a confidence interval for a correlation.
There are a couple of different ways you can think about organizing your data and questions:
Load and merge all data sets in a “Data” section, and then proceed with the analysis.
Load and merge data in analysis-specific subsections.
I’m going to try something new this time, and please let me know if it’s helpful or not. I’m going to provide a little info on how long I expect each piece might take, given what we’ve done in the class so far. These numbers are aimed to target about the 80th %ile of how long it would take the students in this class, but I admit it’s just my best guess. My hope is that it helps you calibrate your work and expectations. I would love feedback on this! Should I keep doing it?
Fetching census data & describing poverty rates: 1.5 hours (we’ve done a lot of data descriptions by now)
Reading GHDx data and linking it to census data: 1 hour
Showing infectious disease and mortality: 1 hour
Computing correlations and bootstrapped CIs for infectious diseases: 2 hours
Reading and linking infant mortality data: 1.5 hours
Presenting infant mortality correlations: 1 hour (you should be able to reuse ideas & code from the infectious diseases)
Cleaning up, presenting, and writing your conclusions: 2 hours