Week 3 — Presentation (9/6–10)
These are the learning outcomes for this week:
Create plots for data
Identify the appropriate type of plot for data in question
Read and interpret a plot
Refine a plot to more clearly show data
Write a well-organized notebook to present data analysis with text and visuals
We will primarily be using Seaborn and Matplotlib for our graphics, because it is easy to
get them fully working for both notebook and document-ready graphics in any Anaconda environment, and
they efficiently handle very large data sets. There are several other packages that are useful for
Python data visualization, and in some cases are easier to use. I personally use plotnine for
most of my graphics, and plotly is a very capable package with particularly strong support for
interactive graphics. The core graphics principles we study in this module will apply to most
packages you may use in the future.
Tip
I do not recommend that you use Plotly for this course. While it is very good for interactive graphics,
its support for static graphics to render in printable documents is rather new.
Seaborn upgrades
Seaborn is undergoing some changes in its syntax. In the old syntax, we pass the x and y
parameters as positional parameters to a plotting function:
sns.lineplot('time', 'price', data=stocks)
In the new syntax, which will be required in a future Seaborn release, we use named parameters
for everything:
sns.lineplot(data=stocks, x='time', y='price')
All new material going forward will use the new syntax, but it takes time to update all of the
slides and videos. You may see the old syntax. It still works, but it issues a warning to let
you know the future syntax is changing.
🧐 Content Overview
This week has 1h33m of video and 8150 words of assigned readings. This week’s videos are available in a Panopto folder and as a podcast.
📅 Deadlines
Finding a plot before class on Thursday
Week 3 quiz at 8am on Thursday
Assignment 1 at midnight on Sunday
🎥 Presentation Goals and Audiences
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
PRESENTING DATA
Learning Outcomes (Week)
Create plots for data
Identify the appropriate type of plot for data in question
Read and interpret a plot
Refine a plot to more clearly show data
Write a well-organized notebook to present data analysis
Photo by Austin Distel on Unsplash
Purposes of Data Presentation
Guide reader attention to important results
Focus
Make it easy to ask the key questions
Substantiate results and conclusions
Do so with integrity
Audiences
You’ll need to present data to several audiences:
Yourself
Your collaborators, supervisors, etc.
Expert readers (know subject, not your work)
Decision-makers
The general public
Guiding Questions
What did you seek to find out?
What did you learn?
Why should the reader trust your conclusions?
Presentation with integrity shows the reader what you learned.
Dishonest presentation manipulates them.
Created by W.E.B. Du Bois for the 1900 Paris Exposition.
From https://www.loc.gov/resource/ppmsca.33892/
Created by W.E.B. Du Bois for the 1900 Paris Exposition.
From https://www.loc.gov/item/2013650445/
Wrapping Up
The goal of good presentation is to guide the reader to what we learned and how we know it.
Effective presentation will highlight the important things without distraction or deception.
Photo by Ben White on Unsplash
📓 Data and Notebook
These resources are used throughout many of the videos in this class:
🎥 Introducing Statistical Graphics
This video introduces basic principles of statistical graphics.
STATISTICAL GRAPHICS
Learning Outcomes
Understand the value of graphics for presenting data
Identify the parts of a statistical image
Understand some pitfalls in graphics
Graphic by W.E.B. Du Bois, for the 1900 World’s Fair in Paris.
Example Chart
Parts to identify:
X-axis
Y-axis
Axis labels
Data
From “Enhancing Classroom Instruction with Online News”, by Michael D. Ekstrand, Katherine Landau Wright, and Maria Soledad Pera (Aslib Journal of Information Management, online June 2020)
What Charts Can Reveal
Patterns (or lack thereof)
Comparisons
Compare two bars
See where points are
Trends
Do lines go up or down? Are they jagged?
Documenting Charts
Clearly state:
What is being presented (what is each point?)
What values are plotted on the axes
Often, this is done in the axis label
Make sure units are clear if relevant
Sometimes this can be implicit, but if in any doubt: be explicit
The chart + caption should be interpretable on their own!
Observations may be saved for referencing text.
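The labeling advice above can be sketched with Matplotlib as follows. The temperature readings are invented for illustration, and the label wording is only one reasonable choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical data: each point is one day's average temperature.
days = [1, 2, 3, 4, 5]
temps_c = [18.5, 20.1, 19.4, 22.0, 21.3]

fig, ax = plt.subplots()
ax.plot(days, temps_c, marker="o")
ax.set_xlabel("Day of study")              # what the x axis measures
ax.set_ylabel("Average temperature (°C)")  # quantity plus explicit unit
ax.set_title("Daily average temperature")  # short label for the whole figure
```

Interpretive observations about the data would go in the caption or the surrounding text rather than in the axis labels.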
Captions, Titles, and Context
In a paper, a figure has a caption
Labels the figure
Can provide interpretive guidance
In other contexts, we often need a title
Shorter, doesn’t explain details
Labels the whole figure
In notebooks, surrounding text may be sufficient, but a title is often a good idea for quick reading.
Pitfalls
Distorting perspective or distances – make sure lengths accurately represent quantities
Bar charts start at 0!
Violating conventions
Users have expectations, e.g. linear, even spacing
Don’t Rely on Graphics
Graphics illustrate an effect
They may help find an effect
They are not conclusive proof of an effect
Wrapping Up
Graphics can make data clearer by leveraging human perception to understand it.
Graphics are not a replacement for numeric analysis, but give it context.
Clearly label and describe graphics.
Photo by Susn Matthiessen on Unsplash
🎥 Manipulating Data
This video goes over the core Pandas data selection and manipulation operations.
It is primarily a tour guide — the technical content is in the notebooks.
MANIPULATING DATA
Learning Outcomes
Know key reshaping operations and corresponding Pandas functions
Think about the process of transforming data in steps
Tour guide to notebook – it has the actual code.
Photo by Mika Baumeister on Unsplash
Data Shapes
Rows - # of them
Columns - # and type
Assumption: each row is another observation of the same kind of thing.
Fixing that will be a topic for later
R calls this tidy data
Each method returns a new frame
Selecting Columns
Have: frame
Want: same frame, fewer columns
Pick one column: frame['column']
Pick multiple columns: frame[['c1', 'c2']]
Remove column(s): frame.drop(columns=['c1', 'c2'])
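A small sketch of these three column operations, using a made-up frame with the placeholder column names from the slide plus a hypothetical third column `c3`:

```python
import pandas as pd

# Hypothetical frame; the column names follow the slide's placeholders.
frame = pd.DataFrame({
    "c1": [1, 2, 3],
    "c2": [4.0, 5.0, 6.0],
    "c3": ["a", "b", "c"],
})

one = frame["c1"]                           # a Series
several = frame[["c1", "c2"]]               # a DataFrame with two columns
dropped = frame.drop(columns=["c1", "c2"])  # a DataFrame with only c3
```

Each result is a new object, and `frame` itself still has all three columns.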
Selecting Rows
Have: frame
Want: same frame, subset of rows
Select by Boolean mask
Good for selecting by column values
Select by position (.iloc)
Select by index key (.loc)
If using RangeIndex, these are the same
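The three row-selection styles can be sketched like this. The frame and its `name` and `score` columns are hypothetical, chosen only to make the behavior visible.

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex (0, 1, 2, 3).
frame = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 8, 5, 9]})

# Boolean mask: keep rows whose score exceeds 4.
high = frame[frame["score"] > 4]

# Position vs. index key: with a RangeIndex both pick the same rows.
first_two = frame.iloc[:2]
by_key = frame.loc[[0, 1]]
same = first_two["name"].tolist() == by_key["name"].tolist()

# With a non-default index, keys and positions are different things.
named = frame.set_index("name")
row_by_key = named.loc["c"]       # look up by index key
row_by_position = named.iloc[2]   # look up by position (third row)
```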
Collapsing Rows
Have: frame w/ column(s) identifying group membership
Want: frame or series w/ row per group
Group-by + aggregate
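A minimal group-by and aggregate sketch, on a made-up frame with a `group` column identifying membership:

```python
import pandas as pd

# Hypothetical data: two groups with a numeric value per row.
frame = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One aggregate yields a Series with one row per group.
means = frame.groupby("group")["value"].mean()

# Several aggregates yield a DataFrame with one row per group.
summary = frame.groupby("group")["value"].agg(["mean", "count"])
```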
Tall and Wide
Wide Data – column per variable
Tall/Long Data – (id, name, value)
Wide is common source format
Tall data is useful for plotting, grouping
Wide to tall: melt
Tall to wide: pivot, pivot_table
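A round trip between the two shapes might look like the following. The `height` and `weight` columns are invented for illustration.

```python
import pandas as pd

# Hypothetical wide data: one column per variable.
wide = pd.DataFrame({
    "id": [1, 2],
    "height": [170, 180],
    "weight": [65, 80],
})

# Wide to tall: one (id, name, value) row per measurement.
tall = wide.melt(id_vars="id", var_name="name", value_name="value")

# Tall to wide: spread the names back out into columns.
back = tall.pivot(index="id", columns="name", values="value")
```

`pivot` requires each (index, column) pair to be unique; `pivot_table` aggregates when there are duplicates.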
Tall from List
Have: data frame where one column contains lists
Want: one row per list element, duplicating other columns
explode
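As a small sketch, assuming a hypothetical frame whose `genres` column holds a list per row:

```python
import pandas as pd

# Hypothetical data: each movie carries a list of genres.
frame = pd.DataFrame({
    "movie": ["m1", "m2"],
    "genres": [["Action", "Comedy"], ["Drama"]],
})

# One row per list element; the other columns are duplicated.
exploded = frame.explode("genres")
```

The original index values are repeated in the result, so follow with `reset_index(drop=True)` if a fresh index is needed.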
Series / Data Frame
Data frame to series: select column
Series to frame:
Create single-column frame w/ to_frame
Reset index with reset_index
Pull out one index level with unstack
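These conversions can be sketched on a grouped result. The frame and the column name `total` are hypothetical.

```python
import pandas as pd

# Hypothetical data with two grouping columns.
frame = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "kind": ["x", "y", "x", "y"],
    "value": [1, 2, 3, 4],
})

# Frame to series: select a column (here after grouping, so the
# series has a two-level index of group and kind).
series = frame.groupby(["group", "kind"])["value"].sum()

as_frame = series.to_frame("total")  # single-column frame, same index
flat = series.reset_index()          # index levels become ordinary columns
wide = series.unstack()              # innermost level (kind) becomes columns
```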
Strategy
Decide what you want the end product to look like
What are your target observations?
What are your target variables?
Plot a path from current data to end product
Wrapping Up
Pandas has many tools for reshaping data.
Start with the end in mind – work from what you have to what you need.
Read tutorial notebook!
Photo by Element5 Digital on Unsplash
📓 Missing Data
Read the 📓 Missing Data tutorial notebook.
I encourage you to read relevant tutorial notebooks throughout the semester, and link to them when
appropriate; I am making this one specifically an assigned reading.
🎥 Types of Charts
In this video, I discuss several common types of charts for statistical graphics, and how to choose an appropriate one.
It complements the “Statistical Data Presentation” reading.
TYPES OF CHARTS
Learning Outcomes
Identify the appropriate type of chart for data and a question
Understand key rules to avoid common errors
Photo by KOBU Agency on Unsplash
Software
Seaborn (sns)
Matplotlib (plt)
Plotnine / ggplot2 (pn)
Chart Types
XKCD #688, ⓒ Randall Munroe. Used under CC-BY-NC
Bar Charts
Show numeric values grouped by a categorical (or ordinal) variable
Best with moderate number of categories
Can have second categorical in bar color
Y often mean, sum, or count within group
Can rotate to horizontal bar
Whiskers: confidence interval
Titanic Passenger Survival Rates by Gender and Passage Class
From Seaborn gallery
Bar Charts
Functions:
sns.countplot (count by category)
sns.catplot (mean by category, with kind='bar')
plt.bar
pn.geom_bar
Bar Chart Rules
Never start y axis at anything but 0 – skews relative sizes
If including whiskers: define how they are computed
If using SNS catplot or countplot without a color group: set color, or they’ll recolor for no reason.
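A sketch of a count plot that follows these rules. The passenger records are made up, and `steelblue` is an arbitrary color choice. Depending on the Seaborn version, a single color may already be the default when no hue is given, so setting `color` explicitly is a harmless safeguard.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical records: one row per passenger.
df = pd.DataFrame({
    "pclass": ["first", "second", "third", "third", "first", "third"],
})

# One fixed color, because color does not encode a variable here.
ax = sns.countplot(data=df, x="pclass", color="steelblue")
ax.set_ylim(bottom=0)  # bars must start at zero
ax.set_xlabel("Passage class")
ax.set_ylabel("Number of passengers")
```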
Histograms
Bar chart where ‘categorical’ is bins of a numeric value.
A bar chart showing the relative frequency of categorical values is also a histogram
Y is either number or fraction of occurrences
Goal is to see relative frequency of different values
One way to graphically describe a distribution.
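A minimal histogram sketch with Matplotlib, using synthetic normally distributed values and an arbitrary choice of 20 bins:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=1000)

fig, ax = plt.subplots()
# hist bins the values and draws one bar per bin; it returns the
# per-bin counts and the bin edges along with the drawn bars.
counts, edges, _ = ax.hist(values, bins=20)
ax.set_xlabel("Value")
ax.set_ylabel("Number of observations")
```

Passing `density=True` to `hist` would put a density on the y axis instead of raw counts.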
Scatter Plots
Shows two numeric values
Observations have two numeric variables
Want to see how they relate
Does one increase with the other?
Do points clump in space?
Are there other patterns? Outliers?
Restaurant Tips and Bills
From Seaborn documentation
Refinements
Color by categorical variable
Plot a trend or context line (not shown)
X can be categorical (point plot or strip plot)
Functions:
plt.scatter
sns.scatterplot
pn.geom_point
Line Plots
Two numeric values
One y per x value
Emphasizes progression (or continuity) from one to the next
Very common for time series
Functions
sns.lineplot
plt.plot
pn.geom_line
From Seaborn tutorial
Box Plots
Show distribution of numeric variable grouped by categorical
Median
Quartiles
Min/max
Outliers (in much software)
Functions:
sns.boxplot
plt.boxplot
pn.geom_boxplot
From Seaborn gallery
More Plot Types
Violin plots (like box, but showing a density estimate)
Swarm plots (categorical scatter plot)
Pie (usually best avoided; use a bar or stacked bar instead)
Donut
Rug (displaying distribution in a margin)
Learning More
Class readings
Textbook
Seaborn and Matplotlib docs
Tutorials
Gallery
Wrapping Up
Many types of charts.
Learning good graphics techniques takes time and practice.
Review plotting library galleries!
Photo by Edgar Chaparro on Unsplash
🎥 Metrics and Differences
We talked about the notion of “relative” differences, but what are they?
STATISTICS AND DIFFERENCES
Learning Outcomes
Review statistics (or metrics)
Compute absolute and relative differences
Interpret a relative difference
Photo by Franck V. on Unsplash
Statistics
We compute various statistics
Mean
Median
Max
Success rate
90th percentile
Count
Sum
Many others…
Often serve as metric for analysis or evaluation
Comparing Values
Take two values: population (2018 est.) of
Boise: 228,790
Salt Lake City: 200,591
Can compare in two ways:
Absolute – Boise has 28K more people than SLC
Relative – Boise has 14.06% more people than SLC
Absolute Comparison
Relative Comparison
Ratio Comparison
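The formulas for these three slides are not reproduced here, so the following is a sketch under the usual definitions (an assumption, not necessarily the slides' exact notation): the absolute difference is a − b, the relative difference is (a − b) / b, and the ratio is a / b. It uses the Boise and Salt Lake City populations from above.

```python
# Population figures (2018 est.) from the comparison above.
boise = 228_790
slc = 200_591

absolute = boise - slc          # difference in people
relative = (boise - slc) / slc  # difference as a fraction of SLC's population
ratio = boise / slc             # Boise's population as a multiple of SLC's

print(f"absolute: {absolute:,} people")
print(f"relative: {relative:.2%}")
print(f"ratio:    {ratio:.3f}")
```

The ratio is always one plus the relative difference, so the two carry the same information in different forms. Which one to report depends on how the result will be read.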
Appropriate Comparison
Depends on context and problem.
Netflix paid $1M to the team that beat its movie recommender by 10% on their chosen metric
Difference-in-difference
Visual Comparison
Bar charts emphasize relative difference
Point plots emphasize absolute difference
Also relative difference-in-difference
Wrapping Up
There are three primary ways to compare statistics: absolute, relative, and ratio.
Be clear and unambiguous when writing the results of a comparison.
Seek to accurately understand others’ writing.
Photo by Nick Fewings on Unsplash
🎥 Charts from the Ground Up
In this video, I discuss how to design a chart from your questions, goals, and data.
CHARTS FROM THE GROUND UP
Learning Outcomes
Design a chart by thinking first of the questions, goals, and data
Photo by Ryan Quintal on Unsplash
What is the Question?
What question do we want to answer?
How, precisely, did we operationalize it?
What data are we using?
Specifically, variables
Example: Titanic Survival
Question: were passengers in higher classes more likely to survive?
Outcome variable: survival
Logical variable, 0/1 encoded
Taking mean yields numeric response!
Mean is probability of survival (1)
Also called response variable or dependent variable
Explanatory variable: passage class
Categorical
Also called independent variable
Bar chart
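The step from a 0/1 outcome to a plottable survival rate can be sketched as below. The records are a tiny made-up sample, not the real Titanic data, and the column names `pclass` and `survived` are assumptions.

```python
import pandas as pd

# Hypothetical toy records; survived is 0/1 encoded.
passengers = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 variable is the fraction of 1s, so the group
# mean is the survival rate within each passage class.
rates = passengers.groupby("pclass")["survived"].mean()
print(rates)
```

The resulting series has one numeric value per class, which is the shape a bar chart needs.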
Showing Relationships
Most plots show how a numeric variable changes between different values of one or more other variables.
What is the variable to show?
What do we want to show about it?
Statistic?
Distribution?
How do we want to compare between EV values?
Even histograms follow this!
Response: frequency (count, proportion, density)
Explanatory: value or bin
Pipeline
Identifying these informs:
Data processing (e.g. binning, group-aggregate, transform)
Choice of plot type
Axes, labels, colors, facets
Variable Types
Response is numeric (or transformed to be)
Categorical → relative frequency
Explanatory can be anything
Numeric → continuous axis
Categorical → discrete axis
Ordinal → discrete axis preserving order; may omit some labels
One Explanatory Variable
If EV is continuous: scatter or line
In scatter plots, we sometimes blur the response/explanatory distinction
If EV is discrete:
Bar chart shows statistic, emphasizing relative difference
Point plot shows statistic, emphasizing absolute difference
Box or violin plot shows distribution
Two Explanatory Variables
Pseudo-3D:
Contour plot: identify peak(s) & shape
Heat map: see where high & low points are
Demonstration
Other aesthetics for secondary variables:
color, shape, size
Occasionally used to indicate second response variable
Titanic by Class and Sex
More than Two EVs
Can’t really have more than 1, maybe 2, numeric explanatory variables
Others can be binned
Facets let us break down the plot by more categorical explanatory variables.
Facet Plot
Pay attention to order. It strongly affects what readers compare.
Stacking
Stacking lets us see differences in composition – how do the parts of a whole change?
Stacked bar charts
Stacked area charts
Can be either raw values or fractions.
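One way to prepare both versions of a stacked bar chart with Pandas, using invented `year` and `kind` columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical records: one row per item, labeled by year and kind.
df = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "kind": ["a", "b", "b", "a", "a", "b", "b"],
})

# Raw values: count of each kind within each year.
counts = pd.crosstab(df["year"], df["kind"])

# Fractions: each row is normalized to sum to 1.
fractions = pd.crosstab(df["year"], df["kind"], normalize="index")

# Stack the parts of each year into a single bar.
ax = fractions.plot.bar(stacked=True)
ax.set_ylabel("Fraction of items")
```

Plotting `counts` instead shows how the totals change. Plotting `fractions` shows only the change in composition.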
Transformations
Sometimes we transform the axis
Log-10 scale – shows order of magnitude
Generally don’t do this to bars
Sometimes we transform the data
Rescale, log, square root
Normalize by a value
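The difference between transforming the axis and transforming the data can be sketched as follows, with made-up values spanning several orders of magnitude:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values spanning several orders of magnitude.
x = np.array([1, 10, 100, 1000])
y = np.array([2, 20, 200, 2000])

# Transform the axis: the data are unchanged, and tick labels stay
# in the original units.
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xscale("log")
ax.set_yscale("log")

# Transform the data: the plotted numbers themselves change, so the
# axis label has to say so (for example "log10 of x").
log_x = np.log10(x)
```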
Be Careful
Avoid excessive complexity
Be careful with color (easy to make indistinguishable)
A good graphic reveals the data, and does not distort or obscure
Wrapping Up
Identify variables and relationships you want to highlight.
Design a plot that illustrates them.
Study plotting library APIs.
Photo by Daniel Cheung on Unsplash
✅ Plots in the Wild
In preparation for Thursday’s class, find a data presentation (plot, table, etc.) in a recent online
publication, and share it with your team through a post on Piazza (in the ‘discuss’ category) with a
link and a copy of the image. This can be from a journal paper, a newspaper article, a blog post, or
another source the class can all access.
In class we will discuss these plots!
Tip
Don’t spend more than 30 minutes on this assignment.
📓 Finishing Touches
The Finishing Touches notebook describes how to apply some finishing touches to your plots and save them to files.
🚩 Week 3 Quiz
The Week 3 quiz will be over all of the assigned material for this week, and is in Canvas.
The sections below this are for your further study and practice.
📖 Textbook
This week primarily uses Chapter 9 of 📖 Python for Data Analysis, with some material from chapters 8 and 10.
📚 Further Reading
For further study on these topics, see:
The Seaborn and Matplotlib galleries
The Visual Display of Quantitative Information by Edward R. Tufte
W. E. B. Du Bois’s Data Portraits: Visualizing Black America, edited by Whitney Battle-Baptiste and Britt Rusert
✅ Practice
Doing this work well takes a lot of practice. Create some notebooks and experiment with drawing interesting charts from some of the data sets we have been exploring, or new data you find!
The HETREC data has a number of variables of different types that are useful for practicing manipulations and visualizations.
📩 Assignment 1
Assignment 1 is due on Sunday, Sep. 12 at the end of the day (11:59 pm).
The tutorial notebooks are going to be very useful for this assignment.