This assignment is to give you experience with exploratory analysis and describing a data set.
It is due on September 13, 2020 at the end of the day (11:59 PM).
.ipynb and PDF files to Blackboard.
- September 10, 2020
- Clarify that charts should have one bar or box per characteristic value.
- Clarify that “control” is how the documentation refers to the private/public status.
- Neither of these revisions changes any requirements.
- September 8, 2020
- Rephrased assignment tasks to make it clearer that the paragraph provides detail for the task. This change does not alter any requirements, it simply clarifies them.
- September 7, 2020
- Correct the definitions of high and low selectivity for acceptance rate
- Add link to Building Data tips to acceptance rate
- Add notes about helpful tutorial notebooks
- September 2, 2020
- Added rubric
This assignment is designed to develop and assess your ability to:
- Obtain a data set from a public source, and use its documentation to understand it
- Set up a Jupyter notebook and data set to begin a new analysis
- Carry out an exploratory analysis to understand a data set’s contents and communicate them to others.
Because of LO2, I will not be providing a notebook. You need to create your own notebook file.
To advance LO1, there are details you will need to obtain from the documentation that comes with the data. For example, I discussion “Carnegie classification” in question — how is that stored in the data?
For this assignment, you will need material through weeks 3, and the tutorial notebooks.
Context and Data
For this assignment, you will work with the data set underlying the Department of Education’s College Scorecard.
This is a medium-sized data set. For this assignment, we will only work with the most recent cohort’s data.
On this site, you will find a few things:
- The data itself. We want “Most Recent Institution-Level Data”
- The documentation. Of particular note:
- The Data Dictionary contains concise definitions of key fields
- The Technical Documentation for Institution-Level Data Files contains more thorough discussion of the different variables, and on pages 33–37 describes where the data came from and how was collected.
The lack of direct links is deliberate. I want you to navigate the site yourself to find the relevant files, so you gain experience collecting information about a data set.
We are primarily interested in understanding completion rates and the distribution of various variables that describe the type of school. Consult the dictionary and documentation for how to get this!
Viewing the Data File
Excel or LibreOffice Calc is useful for viewing the CSV data file such that you can scroll back and forth.
You can also view column names with
schools.columns, if you have loaded it into a Pandas data frame stored in the variable
Our analysis is motivated by the following question:
How do completion rates differ for different types of students and different types of schools?
We want to look at this by:
- Student race
- Admission policy and rate
- Carnegie classification
- Public/private status (the documention describes this as “control” — what is the legal or financial control status of the school?)
We do not have the statistical tools yet to assess the significance of any differences we may observe between completion rates. We will learn that in Week 4, and one of the self-study practice exercises that week will be to go back to this assignment and more accurately measure the differences. Our purpose here is just to observe the differences in the data, without testing whether they are “real” effects or not.
Notes on Variables
This section describes considerations for some of the variables you need to work with. You will need to consult the documentation — these provide guidance on how to apply the documentation.
Notes on Multiple Variables
Page 21 and following of the technical documentation describe the completion rate statistics. There are several different variables for recording completion rates, broken down in a few ways:
- Completion within 100% or 150% of expected time (for a 4-year school, this is 4 or 6 years)
- Both pooled and unpooled versions are available
- 4-year and less-than-4-year schools report outcomes with different sets of variables
- Overall completion rate and rates broken down by race
You will need to compare the variable names available in your data with the description in the documentation to correctly understand how the data is laid out.
For the 4/L4 distinction (4-year vs. less-than-4-year), are any schools in both categories? If not, does it make sense to combine two variables into one completion rate variable? The combine_first method will combine two series. To fill in the missing values in
s1 with values from
The Missing Data notebook provides hints for working with missing values in general.
If you know SQL, Pandas
combine_first works like SQL
For the other options (pooling, 100/150%), reading the documentation about completion rates, and pick the variables to use. Justify your decision.
If we want to look at completion rate by race, and want our analysis to be self-consistent, what implications does that have for our choice of completion rate variables?
Notes on Admission
Pages 10–11 describe admissions data. There are two variables that describe the admissions practices. First, whether it has an open admissions policy. Second, schools that do not have an open admissions policy report the admission rate.
Since we don't yet have the tools to examine relationships between numeric variables, we're going to bucketize the admissions policy. Define three categories of schools:
- Open-admission schools
- Low-selectivity schools, where the admission rate is higher than the median admission rate
- High-selectivity schools, where the admission rate is lower than the median admission rate
Note that the definitions of selectivity here are the reverse from their use in databases. In database terminology, a ‘highly selective’ condition is one that selects many rows, but in higher education, ‘highly selective’ means it does not select many of its applicants.
The Building Data notebook provides useful hints for this part of the assignment.
Submit a Jupyter notebook that describes the following:
A basic structural description of the data set:
- How many schools and variables?
- How many schools are there per state?
- How are schools-per-state distributed? Compute a state-level variable '# of schools', and describe its distribution numerically and visually.
The distribution of the overall completion rate:
- Provide choice of completion rate variable with a justification for that choice.
- Describe the distribution of that variable numerically and visually.
- What is the mean? Is the distribution skewed?
The distribution of the admission rate, both numerically and graphically. After describing its distribution, compute the admissions category (open, low-selectivity, or high-selectivity). Do not hard-code the median — compute the median, and use the computed value (stored in a Python variable) to bucketize the admissions rates.
The break down of completion rate by race, by the school characteristics described in “Question”, and by one additional school characteristic you select. Give a justification for your choice of characteristic — why do you think it might be interesting?
You need to show these breakdowns both numerically and graphically. Box plots are useful for this, as are bar charts.
- Race is a per-student characteristic; schools report completion rate separely for each racial category, in addition to the overall completion rate. The resulting chart should have one bar or box for each racial group.
- The other characteristics — selectivity, Carnegie classification, public/private status, and your chosen additional one — are per-school statistics. The resulting chart should have one box or bar for each value of the chosen characteristic (e.g. for selectivity, these are open, low, and high).
Describe differences you see, with references to specific features in the charts. What kinds of schools seem to be doing the best in terms of getting students to completion?
Race is recorded as multiple variables per school, but it is difficult to directly pass data in this form to one of the plotting routines. The DataFrame.melt method is really useful for this — it can take the various race completion rate columns and pivot them into long format, where there's a column for the variable name and another for the value. You can then recode the name variable, and then you can e.g. use the name variable for an axis in your chart, or group by it!
The answers to 5 questions of your choice from sections 3.1, 3.2, and 3.3 of Datasheets for Datasets, based on the documentation for the college scorecard data. Questions should come from at least 2 different sections.
Write 2 paragraphs reflecting on what you learned about this data, higher education, and data science through this assignment.
Look at differences in completion rates by race for schools with different characteristics. Do some kinds of schools seem to do better at racial equality in completion rates than others?
Notebook Notes & Requirements
Structure your notebook well. Use heading levels, in a suitable order.
Read your notebook. Do you find it pleasant to read?
Describe your work. Use Markdown cells to explain why we are doing different analyses, and what we learn from them. When using a data set variable, explain what the variable means, because they have cryptic names.
This assignment will be graded using the general rubric, with the following point values for each component:
A portion of your grade will be based on notebook readability (the difference between Basically Correct and Exemplary). Week 3's material (Visualizing and Presenting Results) will include a Notebook Checklist to help you format your notebook, along with a rubric for the assignment.