CS 533 Assignment 0
This is the 0th assignment for CS 533, Introduction to Data Science. It is due Sunday, Aug. 30, at 11:59 pm.
The purpose of this lab is to make sure that you can run Python notebooks and successfully submit an assignment.
Keep the following in mind for all notebooks you develop:
- Structure your notebook. Use headings with meaningful levels in Markdown cells, and explain the questions each piece of code is to answer or the reason it is there.
- Make sure your notebook can always be rerun from top to bottom.
For this notebook, I am actually giving you the answers - the PDF accompanying this assignment contains the code to enter in each of the cells below. I want you to work through the process of entering, running, and submitting code before you need to worry about how to write it, so we can focus on learning one thing at a time. By the next assignment, when you need to write your own code, you will already have been through the mechanics.
Submission Instructions
To submit this lab, submit your .ipynb file along with a PDF version to Blackboard. Create the PDF version by using your browser's Print feature and printing to a PDF.
Your submitted notebook must include results. I recommend that you submit after a clean run: select the 'Kernel' menu and choose 'Restart and Run All'. This will also help you test the second requirement above: that the notebook can be rerun from top to bottom.
Setup
I usually start my notebooks with a Setup section that loads the relevant Python modules and does any configuration needed for the notebook to work.
Almost all projects will need basic data analysis packages - Pandas and Seaborn:
import pandas as pd
import seaborn as sns
For other projects, you will need some more imports:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
These are not necessary for this lab.
Load the Data
After setting up the Python environment, we load the data.
Obtaining the Data File
Note that I did not provide the data file for you. I want you to get used to downloading data files, so you learn where they come from.
Download the Capital Bikeshare data set from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. Click 'Data Folder' and download the day.csv file.
Reading the Data File
We will use the Pandas read_csv function. Our data file has a header row with the column names, so we do not need to specify them.
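As a minimal sketch of how read_csv treats column names, the following reads a few made-up rows from in-memory strings instead of the downloaded file. The values and the variable names (`with_header`, `no_header`, `demo`, `demo2`) are illustrative assumptions, not real data; with the real file you would pass its path instead of a `StringIO` object.

```python
from io import StringIO

import pandas as pd

# Made-up rows shaped like the bike-sharing file (illustrative values only).
with_header = StringIO(
    "instant,dteday,weekday,cnt\n"
    "1,2011-01-01,6,100\n"
    "2,2011-01-02,0,120\n"
)
# By default, read_csv takes the column names from the first line.
demo = pd.read_csv(with_header)

# For a file with no header row, we supply the names ourselves.
no_header = StringIO(
    "1,2011-01-01,6,100\n"
    "2,2011-01-02,0,120\n"
)
demo2 = pd.read_csv(
    no_header,
    header=None,
    names=["instant", "dteday", "weekday", "cnt"],
)
```

Both calls produce the same data frame; which form you need depends only on whether the file's first line holds names or data.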
Immediately after loading a data frame, I usually include a command to provide a brief preview of the data. There are two good ways to do this. The first is to use head to show the first few rows of the table:
The other good way is to use info to show a description of the columns, along with the shape and memory use of the data frame:
Note: I usually include the .info() or .head() call in the same cell as the data load. I separate them out in this notebook so that I can discuss them in the markdown cells, but future notebooks will include them together.
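To illustrate the two preview calls on something small, here is a sketch using a hypothetical seven-row data frame named `demo` with made-up values; it stands in for the real data frame only for demonstration.

```python
import pandas as pd

# A tiny made-up data frame (illustrative values, not the real data).
demo = pd.DataFrame({
    "weekday": [6, 0, 1, 2, 3, 4, 5],
    "cnt": [100, 120, 90, 110, 130, 95, 105],
})

# head() returns the first 5 rows by default; pass a number for more or fewer.
preview = demo.head()

# info() prints the column names, dtypes, non-null counts, and memory use.
demo.info()
```

In a notebook, leaving `demo.head()` as the last expression in a cell displays the rows as a table.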
Plotting the Data
Let's make a bar plot showing the mean number of riders per weekday:
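As a rough sketch of what such a plot call can look like, the following uses Seaborn's catplot with `kind="bar"` on a hypothetical data frame `demo` holding two made-up observations per weekday. The column names `weekday` and `cnt` follow the data set; the values are invented for illustration.

```python
import pandas as pd
import seaborn as sns

# Two made-up ride counts for each weekday code 0..6 (illustrative only).
demo = pd.DataFrame({
    "weekday": [0, 1, 2, 3, 4, 5, 6] * 2,
    "cnt": [100, 120, 90, 110, 130, 95, 105,
            140, 80, 100, 115, 125, 90, 110],
})

# One bar per weekday; the bar height is the mean of cnt for that weekday.
grid = sns.catplot(data=demo, x="weekday", y="cnt", kind="bar")
```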
Now, the X-axis labels aren't very helpful. Which day is 0?
This is a question about how the data is coded. We'll talk more about data encoding next week. Unfortunately, the data documentation doesn't actually say how weekdays are coded! But we can infer from the data in this case: the first data point is January 1, 2011, which was a Saturday, coded as weekday 6; it then resets to 0 for the next day, and starts counting up.
Always look at your data.
Often, we will not be able to infer the data encoding from the data itself - we need to consult the codebook or data set description. We got lucky this time. But looking at the data can help us make sense of the codebook.
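One way to double-check the inference about the weekday coding is to ask Pandas which day of the week January 1, 2011 was. This sketch also flags a detail worth knowing: Pandas' own `dayofweek` counts Monday as 0, so it does not match this data set's coding directly. The variable names are illustrative.

```python
import pandas as pd

first_day = pd.Timestamp("2011-01-01")

# Name of the weekday for the first date in the data set.
name = first_day.day_name()

# Pandas numbers weekdays Monday=0 ... Sunday=6.
pandas_code = first_day.dayofweek

# Shifting by one and wrapping gives Sunday=0 ... Saturday=6,
# which is the coding we inferred for the data set.
data_code = (pandas_code + 1) % 7

print(name, pandas_code, data_code)
```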
Let's turn these weekday numbers into a categorical variable so Pandas knows how to label them:
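One possible way to do this conversion, sketched here on a hypothetical four-row data frame `demo`, is `pd.Categorical.from_codes`, which maps integer codes onto a list of category names. The `day_names` list and the `day_name` column are assumptions for illustration, based on the coding inferred above (0 is Sunday, 6 is Saturday).

```python
import pandas as pd

# Category names in code order: 0 = Sunday ... 6 = Saturday (inferred coding).
day_names = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

# Made-up weekday codes standing in for the real column.
demo = pd.DataFrame({"weekday": [6, 0, 1, 2]})

# Each integer code becomes the category at that position in day_names.
demo["day_name"] = pd.Categorical.from_codes(
    demo["weekday"].to_numpy(),
    categories=day_names,
    ordered=True,
)
```

Marking the categorical as ordered keeps the days in calendar order, rather than alphabetical order, when they are sorted or plotted.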
And if we plot again, Seaborn will use the names:
Congrats! We have now plotted the average rides per day. When we do not tell catplot what to do with multiple points for the same value (in this case the weekday name), it computes the mean and a bootstrapped 95% confidence interval. We'll learn what those are in a couple weeks.
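For the curious, here is a rough sketch of a percentile bootstrap for a mean using NumPy. It shows the general idea only and is not necessarily Seaborn's exact procedure; the `rides` values, the seed, and the choice of 1000 resamples are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up daily ride counts for a single weekday (illustrative only).
rides = np.array([100, 120, 90, 110, 130, 95, 105])

# Resample with replacement many times, recording the mean of each resample.
boot_means = np.array([
    rng.choice(rides, size=len(rides), replace=True).mean()
    for _ in range(1000)
])

mean = rides.mean()
# The middle 95% of the resampled means forms the interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
```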
Viewing over Time
We can also try to view what happens to the data over time. How did rides-per-day change over the course of the data set?
This kind of data - a sequence of data points associated with times - is called a time series. Helpfully, the data set gives us an instant column that records the day number since the start of the data set:
But we can deal with this as actual times by converting the dteday column, which records the date, to a Pandas datetime object, and using that:
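A minimal sketch of that conversion with `pd.to_datetime`, on a hypothetical two-row data frame `demo` with made-up counts. The new column is named `dt` to match the name used later in this notebook.

```python
import pandas as pd

# Dates stored as plain strings, as they are when first read from a CSV file.
demo = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-02"],
    "cnt": [100, 120],  # made-up values
})

# Parse the strings into datetime values and store them in a new column.
demo["dt"] = pd.to_datetime(demo["dteday"])
```

Once the column holds datetimes, plotting functions can lay the points out on a proper time axis.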
We can also plot the weekly rides by resampling. Right now, our bikes data is indexed by row number in the CSV file. We can change its index to another column, such as our dt column with the date, which then lets us do things like resample by week:
What that code did, in one line, is:
- Set the data frame's index to dt (bikes.set_index('dt')), returning a new DF
- Select the count column (['cnt']), returning a series
- Resample the series by week (.resample('1W'))
- Combine measurements within each sample by summing them (.sum())
- Plot the results using Pandas' defaults (.plot())
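The steps above can be tried out on a small made-up series. This sketch assumes a hypothetical 14-day data frame `demo` with 10 rides every day, and it stops before the final `.plot()` step so the weekly totals can be inspected directly.

```python
import pandas as pd

# Fourteen consecutive days starting Saturday, January 1, 2011,
# with a made-up count of 10 rides per day.
demo = pd.DataFrame({
    "dt": pd.date_range("2011-01-01", periods=14, freq="D"),
    "cnt": [10] * 14,
})

# Index by date, select the counts, group into weeks, and sum each week.
weekly = demo.set_index("dt")["cnt"].resample("1W").sum()

print(weekly)
# Calling weekly.plot() would then draw these totals as a line chart.
```

By default the weekly bins end on Sunday, so the first and last bins here cover only part of a week (2 and 5 days), giving totals of 20, 70, and 50. Partial weeks at the edges of a data set can look like dips in a plot.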
Pandas' default plotting functions are useful for quick plots to see what's in a data frame or series. However, it is often difficult to turn their output into publication-ready charts.
Finished
That's it! Submit your final notebook in Blackboard.