Week 3 — Presentation (9/5–9)
These are the learning outcomes for this week:
Create plots for data
Identify the appropriate type of plot for data in question
Read and interpret a plot
Refine a plot to more clearly show data
Write a well-organized notebook to present data analysis with text and visuals
We will primarily be using Seaborn and Matplotlib for our graphics, because it is easy to
get them fully working for both notebook and document-ready graphics in any Anaconda environment, and
they efficiently handle very large data sets. There are several other packages that are useful for
Python data visualization, and in some cases are easier to use. I personally use plotnine for
most of my graphics, and plotly is a very capable package with particularly strong support for
interactive graphics. The core graphics principles we study in this module will apply to most
packages you may use in the future.
Tip
I do not recommend that you use Plotly for this course. While it is very good for interactive graphics,
its support for static graphics to render in printable documents is rather new.
Seaborn upgrades
Seaborn is undergoing some changes in its syntax. In the old syntax, we pass the x and y
parameters as positional parameters to a plotting function:
sns.lineplot('time', 'price', data=stocks)
In the new syntax, which will be required in a future Seaborn release, we use named parameters
for everything:
sns.lineplot(data=stocks, x='time', y='price')
All new material going forward will use the new syntax, but it takes time to update all of the
slides and videos. You may see the old syntax. It still works, but it issues a warning to let
you know the future syntax is changing.
🧐 Content Overview
This week has 1h33m of video and 12379 words of assigned readings. This week’s videos are available in a Panopto folder.
📅 Deadlines
Finding a plot before class on Thursday
Week 3 quiz at 8am on Thursday
Assignment 1 at midnight on Sunday
🎥 Presentation Goals and Audiences
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
PRESENTING DATA
Learning Outcomes (Week)
Create plots for data
Identify the appropriate type of plot for data in question
Read and interpret a plot
Refine a plot to more clearly show data
Write a well-organized notebook to present data analysis
Photo by Austin Distel on Unsplash
Purposes of Data Presentation
Guide reader attention to important results
Focus
Make it easy to ask the key questions
Substantiate results and conclusions
Do so with integrity
Audiences
You’ll need to present data to several audiences:
Yourself
Your collaborators, supervisors, etc.
Expert readers (know subject, not your work)
Decision-makers
The general public
Guiding Questions
What did you seek to find out?
What did you learn?
Why should the reader trust your conclusions?
Presentation with integrity shows the reader what you learned.
Dishonest presentation manipulates them.
Created by W.E.B. Du Bois for the 1900 Paris Exposition.
From https://www.loc.gov/resource/ppmsca.33892/
Created by W.E.B. Du Bois for the 1900 Paris Exposition.
From https://www.loc.gov/item/2013650445/
Wrapping Up
The goal of good presentation is to guide the reader to what we learned and how we know it.
Effective presentation will highlight the important things without distraction or deception.
Photo by Ben White on Unsplash
- Hello. In this video, I'm going to introduce our week three module on presenting data.
- So our learning outcomes for this week are for you to be able to create plots from data,
- identify the appropriate type of plot for the data in question.
- I want you to be able to read and interpret a plot, refine a plot to more clearly show its data,
- and then also put these plots and our other presentations and discussions of data into a well-organized notebook to present your data analysis.
- Before we dive into how to actually present data, I want to start by talking about some of the purposes of data presentation,
- because these purposes should guide your presentation design decisions.
- They should guide your evaluation of your own presentations and those of others.
- They'll also guide my evaluation of your presentation when you are submitting assignments.
- One of the first things we need to do is guide the reader's attention to important results.
- An effective presentation is not going to have every single thing in it.
- It's going to draw focus and attention to the important results and make it easy for
- your reader to ask the key questions around the data analysis you're presenting.
- But then it also needs to substantiate the results and conclusions.
- So when we're presenting the data, we want to guide the reader to focus on what it is that we want them to learn,
- but also in a context that gives them the information needed to assess the validity of the conclusions that we're presenting.
- And we want to do so with integrity. It's easy to make charts that highlight the result
- you want the user to see, whether or not that result is rigorously defensible from the data.
- And we want to avoid making those kinds of misleading data visualizations and presentations.
- So in doing this, we want to be able to think about the audience and you're gonna be presenting data to several different audiences.
- The first audience is yourself.
- When you're working with a data set, when you're trying to understand what you're learning from it, the results of your inferences,
- you're presenting data to yourself, and you need the presentation to be clear so that you understand what it is that you're learning from the data,
- so that you see the next question to ask and you're not misleading yourself in the data analysis process.
- But those kinds of charts don't necessarily need the same level of polish that a chart for an external audience would.
- Your collaborators, supervisors, et cetera, need to be able to see the data and see what you're learning.
- Maybe it's in the weekly meeting you have with your research advisor.
- You're presenting them with the results that you found.
- They are people who have a lot of knowledge of the project you're working on and the problem space you're working in.
- They're gonna help you guide and refine your questions.
- Again, these charts perhaps don't need as much polish as a final published result, but they're not just the ones that you're creating internally for yourself.
- You're going to be presenting to expert readers.
- If you write a scientific paper, the readers will usually have some level of expertise in the topic that you're talking about.
- They may know. They'll probably know the subject in general, but they may not know your specific work.
- You may be presenting this to decision makers, especially if you're doing a data science project in an industrial or corporate environment.
- You're providing data that's going to inform the decisions of your boss,
- who may or may not have significant statistical or data expertise,
- but who is going to be using the data that you present in order to make decisions.
- And then finally, you may occasionally be producing or presenting data for the general public at large.
- Each of these audiences is going to require different things from your data presentation.
- So you need to understand who it is that you're presenting the data to in order to make appropriate data presentation decisions.
- When we're presenting the data, here are some questions that are going to help us understand what it is that we need to guide the reader towards.
- We need to be clear: the reader needs to come away knowing what we sought to find out.
- This might be just explicitly stating our research questions, but they need to know the purpose that the data we're presenting is supposed to serve.
- What are they supposed to learn from it? They then need to see what we did learn.
- And then they need to see the supporting evidence, the context to trust the conclusions.
- It's not just enough to say here's the results, but in many cases you need to provide enough data,
- enough context that they not only see what you learned, but they see why you believe it is true.
- And presentation with integrity shows the reader and really makes it clear to the reader what we learned.
- The evidentiary support behind it, why it flows from the data.
- Whereas dishonest presentation manipulates them into the conclusion without having the rigorous foundation underneath it.
- So I want to show you an example of a very bad graphic that came out of the state of Georgia earlier this year.
- They presented [unclear] a graph that purports to show COVID cases in various high-population counties over time.
- But if you look closely at the x-axis of this graph, and you'll see this more clearly when you go and look at the slides,
- the axis is not sorted. It starts with the twenty eighth of April and then it goes to the twenty seventh, followed by the twenty ninth.
- Then May 1st. Then April 30th. At the end we have May 2nd.
- May 7th. April 26. May 3rd. It violates the expected convention,
- and what you'd need to do to show a trend over time, that time goes from left to right.
- They're putting things out of order to show the trend
- they want, in a way that's not substantiated by the data. And it takes a lot of work to make a chart
- this bad. I'm not sure how to do it in any of the statistical software I actually use. This is an egregious example,
- but it's an example of one of the things that can happen when we focus on
- the effect we want to demonstrate over the evidentiary support for it.
- I want to contrast that with W.E.B. Du Bois's charts created for the 1900 Paris Exposition.
- And these were a series of charts for an exhibition to show the economic, educational,
- etc. progress of black Americans from emancipation to the turn of the century.
- And he made a series of charts showing economic situations and things.
- And here's a bar chart that clearly shows the result.
- The distribution of economic statuses for farmers after a year of farm labor.
- It shows that we have the first categories of bankrupt and in debt.
- And then we have four or five different levels of non-negative return, up to clearing
- fifty dollars or more. The lengths of the bars are proportional to the data.
- It very clearly highlights these things.
- It then also does a creative thing: it pulls out a separate bar that it indicates is the composite of all of the non-negative bars.
- And we can see that even if you add all of the non-negative bars together,
- it's not as many farmers as the in-debt category.
- That's not a very standard thing. These charts were hand drawn.
- But it's a creative use of the visualization to highlight in a way that's supported by the data,
- the relative distribution of the different return levels for black American farmers at the time.
- Another one that's creative here is this spiral bar chart.
- Again, it's an unusual thing, but we can see the lines getting progressively longer.
- And if this were just a normal horizontal bar chart without the spiral, the smallest line,
- the first line, 1875, would be so small you couldn't even see it.
- This gives more space. And it shows good visualization doesn't just mean following the checklist of rules.
- It means presenting the data in a way that the conclusions and the takeaways are clear and they're rigorously supported by the underlying data.
- There's no visual tricks to make things look larger or smaller than they are.
- It transparently shows the connection between the conclusion and the underlying data that support it.
- So to wrap up, the goal of good presentation is to guide the reader to what we learned and how we know it.
- Effective presentation is going to highlight the important things for the reader to understand without distraction or deception.
- And we're going to see throughout the rest of this week and the rest of the semester how practically to go about doing that.
📓 Data and Notebook
These resources are used throughout many of the videos in this class:
🎥 Introducing Statistical Graphics
This video introduces basic principles of statistical graphics.
STATISTICAL GRAPHICS
Learning Outcomes
Understand the value of graphics for presenting data
Identify the parts of a statistical image
Understand some pitfalls in graphics
Graphic by W.E.B. Du Bois, for the 1900 World’s Fair in Paris.
Example Chart
Parts to identify:
X-axis
Y-axis
Axis labels
Data
From “Enhancing Classroom Instruction with Online News”, by Michael D. Ekstrand, Katherine Landau Wright, and Maria Soledad Pera (Aslib Journal of Information Management, online June 2020)
What Charts Can Reveal
Patterns (or lack thereof)
Comparisons
Compare two bars
See where points are
Trends
Do lines go up or down? Or jaggedy?
Documenting Charts
Clearly state:
What is being presented (what is each point?)
What values are plotted on the axes
Often, this is done in the axis label
Make sure units are clear if relevant
Sometimes this can be implicit, but if in any doubt: be explicit
The chart + caption should be interpretable on their own!
Observations may be saved for referencing text.
Captions, Titles, and Context
In a paper, a figure has a caption
Labels the figure
Can provide interpretive guidance
Other contexts, we often need a title
Shorter, doesn’t explain details
Labels the whole figure
In notebooks, surrounding text may be sufficient, but title often a good idea for quick reading.
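One way these documentation points might look in Matplotlib is sketched below. The data, variable names, and label text are all invented for illustration; the point is only that each axis label states what is plotted and in which units, and the title says what each point is.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Made-up measurements, purely for illustration.
file_size_mb = [1, 5, 10, 50, 100]
load_time_s = [0.2, 0.9, 1.7, 8.5, 17.1]

fig, ax = plt.subplots()
ax.scatter(file_size_mb, load_time_s)

# State what is on each axis, including units.
ax.set_xlabel("File size (MB)")
ax.set_ylabel("Load time (seconds)")

# A short title for quick reading; say what each point represents.
ax.set_title("Load time by file size (each point is one file)")
```

In a paper, the title would usually be dropped in favor of a caption, but the axis labels stay.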
Pitfalls
Distorting perspective or distances – make sure lengths accurately represent quantities
Bar charts start at 0!
Violating conventions
Users have expectations, e.g. linear, even spacing
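To see why bar charts need to start at 0, here is a small sketch with two made-up values that differ by about 2%. The truncated axis limits are chosen deliberately to exaggerate the difference; they are not taken from any real chart.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Made-up values that differ by about 2%.
groups = ["A", "B"]
values = [98, 100]

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated axis: the visible part of bar B is three times as tall as bar A.
ax_bad.bar(groups, values)
ax_bad.set_ylim(97, 100.5)
ax_bad.set_title("Truncated axis (misleading)")

# Axis anchored at zero: bar lengths are proportional to the values.
ax_good.bar(groups, values)
ax_good.set_ylim(bottom=0)
ax_good.set_title("Axis starts at 0")

# How tall each bar appears above the bottom of the truncated axis.
visible_a = values[0] - ax_bad.get_ylim()[0]
visible_b = values[1] - ax_bad.get_ylim()[0]
```

On the truncated panel, B looks three times as large as A even though the underlying values are 98 and 100.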
Don’t Rely on Graphics
Graphics illustrate an effect
They may help find an effect
They are not conclusive proof of an effect
Wrapping Up
Graphics can make data clearer, leverage human perception to understand it.
Graphics are not a replacement for numeric analysis, but give it context.
Clearly label and describe graphics.
Photo by Susn Matthiessen on Unsplash
- Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics.
- I want you to be able to understand the value of graphics for presenting data,
- identify parts of a statistical image, and understand some pitfalls in graphics that we want to try to avoid.
- So here's an example of a chart, and there's a variety of different pieces of this chart.
- We have an x axis. That's the horizontal x axis.
- We have a y axis, the vertical axis. Each of these axes has a label.
- Task 2 and Task 1. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it.
- It says that this graph is showing the number of queries per task, with query count distributions in the margins,
- and that each dot is one participant. So it tells us what a data point is,
- and it tells us what it is that we're charting: number of queries per task.
- We then see the axis labels: we have Task 1 and we have Task 2.
- Those two together give us the context to understand that we have two tasks,
- and each point is one participant.
- They appear at the point given by their Task 1 count and their Task 2 count.
- This allows us to see if there's any relationship between
- how many queries it took to complete the two different tasks. The caption then says we have query count distributions in the margins.
- So this is a compound plot. In the X and Y margins
- we have the distributions: a histogram of the x-axis, Task 1,
- and a histogram of the y-axis, Task 2. These histograms don't have axes themselves because
- we just wanted to show a distribution. For our purposes here,
- the exact number in each bin is not so important. The key thing is just being able to see, relatively,
- where the mass is on the two different task counts.
- And we can see that both of them have a right skew.
- They're bulked up towards the low end of the scale.
- And then we have all of the individual data points scattered on the chart.
- This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify.
- And when you go particularly as you go to refine a chart, what you're going to need to do is specify what's happening on each of these pieces.
- What is your x axis? What is your y axis?
- Before you even start the chart, you need to set up your data so that we have what is the data point that I'm going to be plotting on this chart?
- So charts are really useful for revealing a variety of things. They can reveal patterns, or the lack thereof.
- In this chart, there's really not much of a pattern.
- We can see that it's bulked up, but particularly if we get out to the larger numbers of queries, there's not a lot of pattern.
- The participant with the most queries for Task 1
- has a middling to low number of queries for Task 2.
- And the one who has the most queries for Task 2, while they're in the upper end of the queries on Task 1,
- is not at all the highest. So we can see there's not very much of a relationship here.
- At least it doesn't look like one. Charts can be useful for comparisons. If we've got a bar chart, we can compare two bars.
- We can see where points lie. We can see in the chart that we just saw that the participant with the highest count for one task
- and the participant with the highest count for the other task are different. We can also see trends.
- We can see if a line looks like it goes up or down,
- wiggles around so they can reveal a lot of these kinds of things and they can really leverage our human perception and our human,
- particularly our human visual senses,
- to be able to quickly internalize and understand what is going on in a set of data. When we're creating a chart,
- we need to clearly document a few things. We need to clearly state what is being presented. When someone looks at a chart,
- they need to be able to understand what each point in the chart is going to be.
- They need to understand what values are plotted on the axes.
- Often this is done in an axis label. In the chart I showed you,
- the caption said what the values were, and the axis labels said which version of them they were.
- If there are units, that needs to be clear.
- So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your in your chart,
- either in the Axis label or in the caption, some of these things can sometimes be implicit in the type of chart, such as a histogram.
- And you've got a fraction or a percentage in the left hand side.
- It's standard convention that we're talking about, the fraction of the values that are in each bin,
- at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about
- what an axis label is, or any doubt that the reader will understand what it is,
- be explicit: explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own.
- You can assume a reasonable level. You have to know your audience for this.
- But someone should be able to just look at the chart with its immediately surrounding description,
- the labels, the caption, and have a pretty good idea of what's going on.
- The surrounding text, the text that references the chart if you're writing a document,
- can have your observations and can provide more context and clarity.
- But someone just looking at the chart should be able to figure out basically what's going on and
- not be too far off. This is particularly important because there's a lot of people,
- whether this is a good or a bad practice we can debate, who, when they're reading a paper,
- focus on the charts and look at the key charts first to see what it is that's going on in the paper.
- And if our charts are self-explanatory and clear, it makes it a lot easier for people to glance at our work,
- see what it's doing, and decide whether they are going to pay it further attention.
- So if you're putting a chart in a written document or a paper,
- each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance.
- Like,
- it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology,
- what precisely some of the computations are, etc.
- If we have a caption, we still need to label our axes, but we don't need a title for the chart itself all the time.
- It doesn't hurt, but often it's redundant with the caption.
- In other contexts, though, we often do need a title such as when we have a chart that's going in a presentation.
- We have a chart in one of our notebooks. A title is often helpful in notebooks.
- The surrounding text may be sufficient,
- but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart.
- So there are a few pitfalls to be aware of when we're thinking about statistical graphics.
- One is distorting the distances or the differences that are being shown.
- We need to make sure that
- anything that has a length has a length that accurately represents quantities.
- If you have two dots, their relative position is what's important. But if we have a bar, it has a length.
- It also has an area. We need to make sure those accurately represent quantities.
- One really common way to violate this is having a bar chart whose axis starts at something other than zero.
- The software we're using doesn't do that by default. Excel does.
- But your bar chart always needs to start at zero, because
- people don't look at the relative position of the top of the bar. People see the whole height of the bar.
- And so if it doesn't start at zero, it looks like the difference between bars is much higher relative to the bar size than it actually is.
- There's also ways in which we can violate conventions.
- So in the first video I showed you the chart that violated the convention, that the x axis goes in order.
- If we violate the user's expectations, they'll either be confused by the chart or read it wrong.
- Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate by long patterns of reading,
- like you assimilate how to read written text. And if those expectations are violated, that can
- lead the user to incorrect conclusions from our charts, from our presentation.
- A key thing to remember here, which also applies to all of our presentations:
- research isn't a mystery novel. You don't have to worry about spoiling the surprise ending. The goal here is
- not to subvert tropes or present shocking new presentations.
- We might have shocking new evidence, but from a presentation perspective,
- we want it to fit within conventions and not violate readers expectations unnecessarily so that
- they can read it and be confident that they've correctly understood what it is that you're saying.
- Another thing to be aware of is that graphics can illustrate an effect.
- They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for.
- We have to be careful about that. We'll talk about some of the pitfalls later:
- we can't combine exploratory and what's called confirmatory analysis.
- But visualizing data can help us look for possible effects and get ideas for what to go look for next.
- Graphics are not conclusive proof of an effect. We need the numeric results,
- the raw numbers as well as the results of inferential techniques that let us
- estimate how big an effect is and whether it's significant, in order to come to any conclusions.
- So in this chart I want to show you, for example, if we look at the chart closely,
- we have these two data points with the little blue x's, and the FunkSVD x
- is to the left of the item-item x. For this particular metric,
- lower is a better value, because it's an error metric. Root mean squared error is what RMSE stands for.
- So it looks like, okay, this is a little bit better. But that's not sufficient evidence for us
- to conclude that FunkSVD is better than item-item on the per-user RMSE metric.
- Exactly what all these things are is a topic for another day.
- But the fact that we see the thing to the left means that if the effect is real, this illustrates it.
- Seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy,
- and it's a relatively small difference. So graphics help us see.
- They help us communicate. They're not definitive and conclusive proof.
- A couple of other things I want to highlight that are going on in this graph. I've introduced two different kinds of symbols here.
- In the earlier graph, we just had one kind of symbol: we had dots. Here
- we have two different kinds, with a legend that says red circles are
- a thing called global RMSE and blue x's are a thing called per-user RMSE.
- You don't have to understand what those are. The point is, I'm using different colors and shapes
- to show different versions of a thing in the same chart. Using different shapes
- in addition to different colors is useful because if the chart is printed on a black and white printer,
- or if the reader has some form of color blindness, it helps make the differences clearer.
- In addition, I've got my y-axis, which is indicating the different things that I'm plotting here.
- I also have grouped them just to make it easier for the user to see.
- These first ones are all single algorithms.
- And then we have a blend and a few other things. The details
- aren't important here, but the grouping helps guide the user to understand the structure: we have these group breakdowns.
- It also helps save space in the paper because I can present all of these different things in one place.
- It's easy to compare the different stages, even though I have to split them out in the discussion in the paper.
- It gives you one place to compare them and it concisely shows the key results of the entire paper in one chart.
- So to wrap up graphics can make data clearer and they let us leverage human perception to understand it.
- They don't replace our numerical analysis,
- but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it.
- We do, however, always need to make sure that we clearly label and describe our graphics so that
- readers can understand them and they can draw correct conclusions from them.
🎥 Manipulating Data
This video goes over the core Pandas data selection and manipulation operations.
It is primarily a tour guide — the technical content is in following notebooks.
MANIPULATING DATA
Learning Outcomes
Know key reshaping operations and corresponding Pandas functions
Think about the process of transforming data in steps
Tour guide to notebook – it has the actual code.
Photo by Mika Baumeister on Unsplash
Data Shapes
Rows - # of them
Columns - # and type
Assumption: each row is another observation of the same kind of thing.
Fixing that will be a topic for later
R calls this tidy data
Each method returns a new frame
Selecting Columns
Have: frame
Want: same frame, fewer columns
Pick one column: frame['column']
Pick multiple columns: frame[['c1', 'c2']]
Remove column(s): frame.drop(columns=['c1', 'c2'])
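A small sketch of the three column operations above, using a made-up `movies` frame (the column names are hypothetical and chosen only for illustration):

```python
import pandas as pd

# Made-up example frame: one row per movie.
movies = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "year": [1999, 2005, 2012],
    "rating": [3.5, 4.0, 2.5],
})

one = movies["title"]                      # one column -> Series
several = movies[["title", "year"]]        # list of columns -> DataFrame
dropped = movies.drop(columns=["rating"])  # everything except the named columns
```

None of these modify `movies`; each returns a new object, so the original frame still has all three columns.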
Selecting Rows
Have: frame
Want: same frame, subset of rows
Select by Boolean mask
Good for selecting by column values
Select by position (.iloc)
Select by index key (.loc)
If using RangeIndex, these are the same
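The row-selection options can be sketched as follows, again with a made-up `movies` frame. The last few lines show where `.loc` and `.iloc` stop being interchangeable, by switching to a non-default index.

```python
import pandas as pd

# Made-up example frame with the default RangeIndex (0, 1, 2).
movies = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "year": [1999, 2005, 2012],
    "rating": [3.5, 4.0, 2.5],
})

# Boolean mask: a Series of True/False values, one per row.
mask = movies["year"] >= 2005
recent = movies[mask]

# By position: the first two rows, whatever the index is.
first_two = movies.iloc[:2]

# By index key: with a RangeIndex, key 1 is also position 1.
by_key = movies.loc[1]

# With a different index, keys and positions are no longer the same thing.
by_title = movies.set_index("title")
gamma = by_title.loc["Gamma"]   # lookup by index key
same_row = by_title.iloc[2]     # lookup by integer position
```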
Collapsing Rows
Have: frame w/ column(s) identifying group membership
Want: frame or series w/ row per group
Group-by + aggregate
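A minimal group-by and aggregate sketch, assuming a made-up `ratings` frame in which the `movie` column identifies the group each row belongs to:

```python
import pandas as pd

# Made-up example frame: one row per rating, `movie` identifies the group.
ratings = pd.DataFrame({
    "movie": ["Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "rating": [4.0, 3.0, 5.0, 4.0, 3.0],
})

# One aggregate -> Series with one row per group.
mean_rating = ratings.groupby("movie")["rating"].mean()

# Several aggregates -> DataFrame with one row per group.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])
```

In both results the group key (`movie`) becomes the index, so each observation is now a movie rather than a rating.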
Tall and Wide
Wide Data – column per variable
Tall/Long Data – (id, name, value)
Wide is common source format
Tall data is useful for plotting, grouping
Wide to tall: melt
Tall to wide: pivot, pivot_table
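The round trip between wide and tall layouts might look like this. The frame and its column names are invented for illustration; the `name` and `value` column names follow the (id, name, value) layout described above.

```python
import pandas as pd

# Made-up wide frame: one column per variable.
wide = pd.DataFrame({
    "id": [1, 2],
    "height": [170, 182],
    "weight": [65, 80],
})

# Wide to tall: one (id, name, value) row per measurement.
tall = wide.melt(id_vars="id", var_name="name", value_name="value")

# Tall to wide: one row per id, one column per variable name.
back = tall.pivot(index="id", columns="name", values="value")
```

`pivot` requires each (index, column) pair to appear at most once; when there are duplicates that need to be combined, `pivot_table` aggregates them instead.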
Tall from List
Have: data frame where one column contains lists
Want: one row per list element, duplicating other columns
explode
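A short sketch of `explode`, assuming a made-up frame whose `genres` column holds a list per movie:

```python
import pandas as pd

# Made-up example frame: the `genres` column holds a list per movie.
movies = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "genres": [["Drama", "Comedy"], ["Action"]],
})

# One row per list element; the other columns are duplicated.
per_genre = movies.explode("genres")
```

The exploded rows keep the index of the row they came from, so index values repeat; call `reset_index(drop=True)` afterwards if a fresh index is needed.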
Series / Data Frame
Data frame to series: select column
Series to frame:
Create single-column frame w/ to_frame
Reset index with reset_index
Pull out one index level with unstack
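The three series-to-frame routes can be compared side by side. This sketch assumes a made-up series with a two-level index (`movie`, `source`); all names are hypothetical.

```python
import pandas as pd

# Made-up example data: one rating per (movie, source) pair.
ratings = pd.DataFrame({
    "movie": ["Alpha", "Alpha", "Beta", "Beta"],
    "source": ["critic", "user", "critic", "user"],
    "rating": [4.0, 3.0, 5.0, 4.5],
})

# Data frame to series: select a column (here after setting a two-level index).
series = ratings.set_index(["movie", "source"])["rating"]

as_frame = series.to_frame()   # single-column frame, index unchanged
flat = series.reset_index()    # index levels become ordinary columns
wide = series.unstack()        # innermost index level becomes the columns
```

`to_frame` keeps the shape, `reset_index` gives a tall frame, and `unstack` gives a wide frame with one row per movie and one column per source.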
Strategy
Decide what you want the end product to look like
What are your target observations?
What are your target variables?
Plot a path from current data to end product
Wrapping Up
Pandas has many tools for reshaping data.
Start with the end in mind – work from what you have to what you need.
Read tutorial notebook!
Photo by Element5 Digital on Unsplash
- Hello. In this video, I want to talk with you about basic operations for manipulating data.
- The learning outcomes for this video are for you to know key data reshaping operations and the corresponding Pandas functions,
- and to think about the process of transforming data in steps. This is a tour guide to the corresponding notebook.
- I'm not showing the actual code in the video,
- but you're going to see in the notebook how you actually implement different versions of each of these steps.
- So we think about the shape of our data. We have rows and we have columns.
- We have a certain number of rows, and we have columns, each with a type and a name.
- The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing.
- So our data are well organized. In each row we have the variables,
- and each row is one type of thing. So if you have a data frame of movies, each row represents one movie.
- If we have data that's not in that kind of a format, we're going to talk about that later,
- in sourcing and cleaning data: how do we get data into this kind of a tidy format?
- For now, we're going to assume we have data in this format. The R ecosystem calls this kind of layout tidy data.
- Now, each of these methods returns a new frame. A few of them are going to return a series.
- But in general, we're gonna be transforming data frames to data frames here.
- And so if our input is a data frame, each row is another observation of the same kind of thing.
- The output will be a data frame. Each row is an observation of the same kind of thing.
- It might be an observation, the same kind of thing as the input. It might be an observation of a different kind of thing.
- But these are the different operations that we're going to be talking about here.
- So if we want to select columns: we have a frame and we want the same frame,
- but with fewer columns. We have a few options. We can pick one column by treating the frame as a dictionary.
- We can pick multiple columns by passing in a list of column names, the same way we pick one column.
- One column will yield a series. Multiple columns will yield a frame. If we want to remove a column,
- so we want to keep all of the columns except one or two or however many we name, the drop method
- returns a frame with all the columns of the original frame except the ones you tell it to drop.
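The column operations described above can be sketched as follows. This is a minimal illustration, not the course notebook's code; the `movies` frame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical movie table, invented only for this illustration.
movies = pd.DataFrame({
    "title": ["Alien", "Up", "Heat"],
    "year": [1979, 2009, 1995],
    "genre": ["Horror", "Animation", "Crime"],
})

# One column, dictionary-style: the result is a Series.
titles = movies["title"]

# Several columns, by passing a list of names: the result is a DataFrame.
title_year = movies[["title", "year"]]

# Everything except the named columns; the original frame is left unchanged.
no_genre = movies.drop(columns=["genre"])

print(type(titles).__name__, list(title_year.columns), list(no_genre.columns))
```

Each of these returns a new object, so the results can be chained or assigned without modifying `movies`.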
- We want to select rows. We have a frame. We want the same frame, but a subset of the rows.
- A few common ways to do that: one is to select by a boolean mask. We set up a pandas series that has boolean values
- that are true for all the data positions we want to keep,
- and then we select with it. This is really good if we want to select by column values.
- So we can use a comparison operator to create a mask where all the values for one column are equal to a particular value.
- And then we can select using that boolean mask.
- We can select by position in the frame, starting with zero.
- We can do that no matter what the index is with iloc. So loc is the location
- accessor for pandas data frames; iloc always accesses by integer position, and loc indexes by index keys.
- If we just loaded the data frame from a CSV file
- and we haven't specified any index options, it's using the default range index.
- Then selecting by position and selecting by index key are the same thing.
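As a sketch of the row-selection options just described (boolean mask, `iloc`, and `loc`), assuming a small made-up `movies` frame:

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
movies = pd.DataFrame({
    "title": ["Alien", "Up", "Heat", "Scream"],
    "genre": ["Horror", "Animation", "Crime", "Horror"],
})

# Boolean mask: a Series of True/False values, True for the rows to keep.
mask = movies["genre"] == "Horror"
horror = movies[mask]

# Position-based selection with iloc: the first two rows, whatever the index is.
first_two = movies.iloc[0:2]

# Key-based selection with loc: look a row up by its index key.
by_title = movies.set_index("title")
heat = by_title.loc["Heat"]

# With the default range index, position 2 and key 2 are the same row.
same = movies.iloc[2].equals(movies.loc[2])
print(list(horror["title"]), heat["genre"], same)
```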
- If we have a data frame where we've got our observations and we've got
- a column that identifies some kind of a group that each observation is in (maybe it's ratings
- and it's the movie; maybe it's movies and it's the actor or the genre),
- and what we want is a frame or a series that has one row per group of the original data,
- computing some kind of a statistic from a value in all of the rows within that group,
- then we want a group-by and aggregate, like we saw in the videos last week.
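A minimal group-and-aggregate sketch, assuming a hypothetical `ratings` frame in which the `movie` column identifies the group:

```python
import pandas as pd

# Hypothetical ratings: one row per rating, with the movie as the group column.
ratings = pd.DataFrame({
    "movie": ["Alien", "Alien", "Up", "Up", "Up"],
    "rating": [4.0, 5.0, 3.0, 4.0, 5.0],
})

# One row per movie: a Series of mean ratings indexed by movie.
mean_rating = ratings.groupby("movie")["rating"].mean()

# Several statistics at once: a DataFrame with one column per statistic.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])

print(mean_rating)
print(stats)
```

The input has one row per rating; the output has one row per movie, which is an observation of a different kind of thing.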
- For a couple more transformations, we need to think about tall versus wide data.
- Wide data has a column per variable.
- So in this case, this is data of the average speed for each of four different stages of a cycling race.
- And so we've got a column for each of the four different stages, and our rows are for each cyclist. Tall data,
- in its simplest form (tall data or long data), has three columns.
- We have the identifier, we have the variable name,
- and we have the variable value. Sometimes these will just be called id, variable, and value,
- but it's often useful to give the variable and value columns meaningful names.
- We could also have more than one ID column if we need to.
- But the idea here is that rather than having the stages in different columns, we split them out into different rows.
- So cyclist one has four rows, one for each of the four stages.
- Each row is still an observation of one thing, and every row is the same kind of thing.
- It's just that in the wide data, each of our observations is for a cyclist, and it's an observation of their speed for all four stages,
- whereas in the long data, each observation is for
- one cyclist in one particular stage. So each cyclist will have four observations, one for each stage.
- Tall data is useful for plotting and grouping because a lot of our plotting functions
- and plotting utility functions are going to want to deal with a categorical variable that we use to determine maybe the x axis,
- maybe the color. And so often we're going to need tall data, especially when we're going to be plotting.
- If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, you use the pivot or pivot_table methods in pandas.
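The wide-to-tall and tall-to-wide conversions can be sketched like this. The cyclist speeds are invented, and only two stages are shown to keep the example short.

```python
import pandas as pd

# Hypothetical wide data: one row per cyclist, one column per stage.
wide = pd.DataFrame({
    "cyclist": ["A", "B"],
    "stage1": [40.1, 39.5],
    "stage2": [38.2, 38.9],
})

# Wide -> tall: one row per (cyclist, stage), with meaningful column names.
tall = wide.melt(id_vars=["cyclist"], var_name="stage", value_name="speed")

# Tall -> wide: stages become columns again, indexed by cyclist.
back = tall.pivot(index="cyclist", columns="stage", values="speed")

print(tall)
print(back)
```

`pivot` expects each (index, column) pair to appear once; `pivot_table` is the variant that aggregates when there are duplicates.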
- You can also create tall data from a list. So if we have a data frame where one of the columns actually contains lists
- (we haven't seen any data like this so far, except the genres column in the MovieLens data),
- if we have a data frame where one column contains lists, and what we want is one row per list element,
- so we want to take this list that's in a column and split it out so that each element gets another row, it's going to go ahead and
- duplicate the rest of the columns. So they're going to have their values repeated once for each of the elements
- of the list. The pandas explode method will do that. Then finally, to convert between series and data frame.
- So if we have a data frame and we want to get a series, we just select the column from the data frame.
- We saw that at the beginning. If we have a series and we want to get a data frame,
- we can just create a single-column frame with to_frame, the to_frame method on the series object. It also lets you
- give it a name, so that you have a column name in the resulting data frame.
- You can also, if you want, create a multi-column data frame where you've got a column for the value and
- you have a column for the index of the original series.
- The pandas series reset_index method will pop that index out into a data frame column.
- And then finally, if you have a series with multiple levels to its index (we haven't seen those yet, but we're going to see them from time to time),
- the unstack method will turn the innermost index labels into column labels.
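The list-explosion and series-to-frame conversions described above can be sketched as follows. All data and names here are invented for illustration.

```python
import pandas as pd

# Hypothetical frame where one column holds lists.
movies = pd.DataFrame({
    "title": ["Alien", "Up"],
    "genres": [["Horror", "Sci-Fi"], ["Animation"]],
})

# One row per list element; the other columns are repeated for each element.
movie_genres = movies.explode("genres")

# Series -> single-column frame, optionally naming the column.
speeds = pd.Series([40.1, 39.5], index=["A", "B"], name="speed")
one_col = speeds.to_frame()
renamed = speeds.to_frame(name="stage1_speed")

# Series -> frame with the index popped out into its own column.
with_index = speeds.reset_index()

# Series with a two-level index -> frame with the inner level as columns.
nested = pd.Series(
    [40.1, 38.2, 39.5, 38.9],
    index=pd.MultiIndex.from_product(
        [["A", "B"], ["stage1", "stage2"]], names=["cyclist", "stage"]
    ),
)
unstacked = nested.unstack()

print(movie_genres)
print(with_index)
print(unstacked)
```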
- Unstacking turns the series into a data frame. So, to think about strategy.
- Each of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want.
- And so what I recommend is that you decide what you want the end product to look like.
- If you're going to draw a chart or you're going to do an analysis or an inference,
- what are the observations and the variables that you need for that chart or inference?
- And then once you've figured that out, you can plot a path from your current data to your end product.
- So if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select
- the rows that have only the movies that are in the horror genre.
- You're probably going to have a join as well in order to get the genre table and the movie table connected, depending on how your data's laid out.
- And once you've filtered it down (OK, these are the horror movies), then you need to get the ratings.
- You need to have the average ratings, and you need to combine those with the movies, as we've seen the ability to do in a previous video.
- And then you have the observations that you want. You need to be able to plot this kind of a path from what you have to the end product.
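One possible path for the horror-movie example is sketched below. The three tables, their column names, and their layout are assumptions made for illustration; real data may be organized differently and need different steps.

```python
import pandas as pd

# Hypothetical tables; the layout is an assumption for this sketch.
movies = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Alien", "Up", "Scream"]})
movie_genres = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],
    "genre": ["Horror", "Sci-Fi", "Animation", "Horror"],
})
ratings = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [4.0, 5.0, 4.5, 3.0, 2.0],
})

# Step 1: select the rows for the horror genre, keeping only the movie ID.
horror_ids = movie_genres.loc[movie_genres["genre"] == "Horror", ["movie_id"]]

# Step 2: group and aggregate the ratings to get one mean rating per movie.
mean_ratings = (
    ratings.groupby("movie_id")["rating"].mean().to_frame("mean_rating").reset_index()
)

# Step 3: join the pieces so each row is one horror movie with its mean rating.
horror = horror_ids.merge(movies, on="movie_id").merge(mean_ratings, on="movie_id")

print(horror)
```

The `mean_rating` column of the result is the variable whose distribution would then be plotted.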
- In this example, I've referenced some joins. We saw joins very, very briefly last week.
- We're going to see them again in more detail in the notebooks. So to wrap up, pandas has many tools for reshaping data.
- You want to start with the end in mind, work from what you have to what you need.
- Read the tutorial notebooks for a lot more details.
📓 Selecting Data
Read the 📓 Selecting Data tutorial notebook to learn how to select data from a data frame.
I encourage you to read relevant tutorial notebooks throughout the semester, and I link to them when
appropriate; I am making these ones specifically assigned readings this week.
📓 Reshaping Data
Read the 📓 Reshaping Data tutorial notebook to learn
how to manipulate the shape of data frames in various ways, including merging two data frames into one.
🎥 Types of Charts
In this video, I discuss several common types of charts for statistical graphics, and how to choose an appropriate one.
It complements the “Statistical Data Presentation” reading.
Michael Ekstrand
TYPES OF CHARTS
Learning Outcomes
Identify the appropriate type of chart for data and a question
Understand key rules to avoid common errors
Photo by KOBU Agency on Unsplash
Software
Seaborn (sns)
Matplotlib (plt)
Plotnine / ggplot2 (pn)
Chart Types
XKCD #688, ⓒ Randall Munroe. Used under CC-BY-NC
Bar Charts
Show numeric values grouped by a categorical (or ordinal) variable
Best with moderate number of categories
Can have second categorical in bar color
Y often mean, sum, or count within group
Can rotate to horizontal bar
Whiskers: confidence interval
Titanic Passenger Survival Rates by Gender and Passage Class
From Seaborn gallery
Bar Charts
Functions:
sns.countplot (count by category)
sns.catplot (mean by category, with kind='bar')
plt.bar
pn.geom_bar
Bar Chart Rules
Never start y axis at anything but 0 – skews relative sizes
If including whiskers: define how they are computed
If using sns.catplot or sns.countplot without a color group: set color, or they’ll recolor for no reason.
Histograms
Bar chart where ‘categorical’ is bins of a numeric value.
Bar chart showing relative frequency of categorical values is also a histogram
Y is either number or fraction of occurrences
Goal is to see relative frequency of different values
One way to graphically describe a distribution.
Scatter Plots
Shows two numeric values
Observations have two numeric variables
Want to see how they relate
Does one increase with the other?
Do points clump in space?
Are there other patterns? Outliers?
Restaurant Tips and Bills
From Seaborn documentation
Refinements
Color by categorical variable
Plot a trend or context line (not shown)
X can be categorical (point plot or strip plot)
Functions:
plt.scatter
sns.scatterplot
pn.geom_point
Line Plots
Two numeric values
One y per x value
Emphasizes progression (or continuity) from one to the next
Very common for time series
Functions
sns.lineplot
plt.plot
pn.geom_line
From Seaborn tutorial
Box Plots
Show distribution of numeric variable grouped by categorical
Median
Quartiles
Min/max
Outliers (in much software)
Functions:
sns.boxplot
plt.boxplot
pn.geom_boxplot
From Seaborn gallery
More Plot Types
Violin plots (like box, but mean-based)
Swarm plots (categorical scatter plot)
Pie (usually best avoided; use bar or stacked bar)
Donut
Rug (displaying distribution in a margin)
Learning More
Class readings
Textbook
Seaborn and Matplotlib docs
Tutorials
Gallery
Wrapping Up
Many types of charts.
Learning good graphics techniques takes time and practice.
Review plotting library galleries!
Photo by Edgar Chaparro on Unsplash
- Welcome back. In this video,
- I'm going to walk you through some of the different types of charts that we're going to be learning how to create. The learning outcomes are to
- be able to identify the appropriate type of chart for data and a question, and to understand key rules to avoid common errors.
- I'm not going to be showing the detailed code for these chart types in the video.
- You're going to be able to find that in the documentation linked from here. And also,
- I'm going to be preparing a notebook that demonstrates various of these chart
- types with the actual code to create them using the software we're discussing.
- So common software for this is Seaborn and Matplotlib; those are going to be the primary ones that we're working with this semester.
- When I'm showing the function names, Seaborn is commonly imported as sns,
- so a function starting with sns. is going to be a Seaborn function, and one starting with plt.
- is going to be a Matplotlib function. I'm also showing the function you can use in plotnine or R's ggplot2,
- if you want to use those instead. I often use plotnine for a lot of my graphics.
- That's just for reference though. We're not going to be getting into much detail on plotnine in the course of this course.
- So there's a variety of different types of charts. Some of them are showing relative proportions.
- Some of them are showing how different amounts relate to each other. Some of them are showing positions in an x-y coordinate space.
- A bar chart is a very common type of chart that shows numeric values grouped by a categorical or ordinal variable.
- Sometimes they're grouped by a numeric as well,
- but usually our x axis is a categorical variable of some kind. It's best with a moderate number of categories.
- We can use a second categorical variable to, say, color the bars.
- So this chart shows the survival rates of Titanic passengers, where the x axis is the passage class: first, second, or third class.
- And then the bars are colored based on the gender of the passenger.
- And so we can see the different survival rates.
- The y axis on a bar chart is often a mean or a sum or a count within the group determined by our categorical variables.
- Sometimes these will be horizontal. In the horizontal bar chart, the categorical is on the y and the bars run horizontally.
- This also shows some whiskers that come from a confidence interval.
- It's very easy to generate a default, relatively good confidence interval with Seaborn. So to plot
- these, Seaborn has the countplot function, which does a quick,
- basically categorical histogram: how many observations are in each category.
- The catplot function will plot by default a mean value for each category.
- And if you have it do the mean plotting, it will also compute 95 percent confidence intervals.
- That's what's being shown in this plot here.
- And then you can also use the plt.bar function or the plotnine geom_bar.
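A minimal sketch of the two kinds of bar chart described above, using invented passenger data rather than the real Titanic data set. It uses the axes-level `sns.barplot` so both charts fit in one figure; `sns.catplot` with `kind="bar"` is the figure-level counterpart.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical passengers, invented for illustration only.
passengers = pd.DataFrame({
    "pclass": ["first"] * 4 + ["second"] * 4 + ["third"] * 4,
    "sex": ["female", "male"] * 6,
    "survived": [1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
})

fig, (ax_count, ax_mean) = plt.subplots(1, 2, figsize=(8, 3))

# Count of observations per category, with one explicit color for every bar.
sns.countplot(data=passengers, x="pclass", color="steelblue", ax=ax_count)

# Mean of the 0/1 outcome per category, with a second categorical as bar color.
sns.barplot(data=passengers, x="pclass", y="survived", hue="sex", ax=ax_mean)
ax_mean.set_ylabel("survival rate")
```

With so few rows per group, the whiskers drawn here are not meaningful; they are only there to show where Seaborn puts them.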
- So, a few rules about bar charts. The first is: never start the y axis on a bar chart at
- anything but zero. And the reason for this we can see here.
- So in the top one, these are looking at the mean of average ratings.
- We take each movie's mean rating, and then we compute the mean of the average ratings within a genre.
- If we look here at the difference between horror and IMAX, it's a notable difference, but it's a difference of about 0.5 or so.
- The difference between sci-fi and short is a difference of a little under one, probably.
- But when we start the y axis at 2.5 instead of zero,
- what happens is the differences look much larger than they are, because
- we not only want to see the difference, but it's very natural for us to compare the difference to the bar length. Because these are bars,
- they have length and they have an area; since they're all the same width, the length is proportional to the area.
- Breaking length-area proportionality is a good way to confuse your readers.
- It looks like IMAX movies have twice as high an average rating as horror movies because the bar is twice as high, but they don't.
- It's really a shift from about 2.8 to 3.3 or 3.4.
- And so it creates a distortion: it highlights the differences, but it makes the differences look larger than they are.
- So when I talked about integrity and avoiding deception, when I was introducing statistical graphics,
- this is what I was talking about. The difference is there; it's just not as big as it looks like it is
- when we truncate our bar charts.
- So the general rule here, to generalize beyond bar charts, is: if something has a length that varies based on the data,
- that length needs to actually represent the value, not the value minus something because you started the axis somewhere else.
- If you're including whiskers, like I did in the previous chart,
- define how they're computed. And also, one thing to just be careful of with Seaborn's catplot and countplot:
- if you aren't using the color for a second variable, they will just make every bar a different color for no particular reason,
- which creates something that's different when it doesn't need to be.
- It causes the reader to look for a difference that isn't actually there. It's best avoided.
- You can fix that by just specifying the color.
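The distortion from a truncated axis can be checked with a little arithmetic. The two genre means below are hypothetical values chosen to roughly match the numbers mentioned above, and `bar_length` is a helper invented for this sketch.

```python
# Hypothetical genre means, roughly matching the values mentioned above.
horror, imax = 2.8, 3.3


def bar_length(value, axis_start):
    """Drawn length of a bar for `value` when the axis starts at `axis_start`."""
    return value - axis_start


# With the axis starting at zero, bar lengths are proportional to the values.
true_ratio = bar_length(imax, 0) / bar_length(horror, 0)

# With the axis starting at 2.5, the drawn lengths exaggerate the difference.
truncated_ratio = bar_length(imax, 2.5) / bar_length(horror, 2.5)

print(round(true_ratio, 2), round(truncated_ratio, 2))
```

The IMAX value is about 18 percent higher, but the truncated bars make it look more than two and a half times as large. In Matplotlib, `ax.set_ylim(bottom=0)` is one way to keep the axis anchored at zero.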
- We saw histograms last week. A histogram is a bar chart where the categorical is bins or ranges of a numerical value.
- Also, though, if we have a bar chart that's showing the relative frequency of categorical values, that can also be called a histogram.
- The y axis is either the number or the fraction of occurrences in this case.
- The key thing, though, is that the different heights of the bars let us see visually the relative frequency of different values.
- So it really makes it visually clear how the data is shaped.
- We can see skews and things like that. It is one way to graphically describe a distribution.
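The binning behind a histogram, and the choice between counts and fractions on the y axis, can be sketched with NumPy. The ratings and bin edges are made up for illustration.

```python
import numpy as np

# Hypothetical ratings, invented for illustration.
ratings = np.array([0.5, 1.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.0, 4.5, 5.0])

# Bin the numeric values; each bin plays the role of a category.
counts, edges = np.histogram(ratings, bins=[0, 1, 2, 3, 4, 5])

# The y axis can be the number of occurrences or the fraction of occurrences.
fractions = counts / counts.sum()

print(counts.tolist())
print(fractions.tolist())
```

Plotting functions such as `plt.hist` do this binning internally; computing it by hand makes it clear what the bar heights mean.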
- A scatter plot shows two numeric variables. So each observation is a dot.
- Each observation has two numeric variables. And we put the one variable on the x axis,
- the other variable on the y axis, and put the dot where its variable values would intersect.
- This is really useful for seeing how two variables relate. Does one increase with the other?
- Do points clump or cluster in an interesting way? Are there other interesting patterns?
- It helps us find outliers. So this scatter plot is showing the tip versus the total bill for a bunch of restaurant bills.
- Each observation is a bill.
- The x axis is the total bill, and the y axis is the tip that the customer added to the bill.
- There are a couple of refinements we can do here. We can color or change the point type by a categorical variable.
- So on this one, we've changed it so that the points are different colors:
- the dinners are blue circles and the lunches are orange Xs.
- We could also add a trend line or some other kind of a line to show some context. For example, on this chart,
- we might want to plot a line that shows the 20 percent point, and that would let us easily see where we're going over 20 percent,
- and how the tips are distributed relative to a 20 percent mark.
- X can also be a categorical variable. When that happens, we call this a point plot or a strip plot.
- Functions for doing this are plt.scatter, sns.scatterplot, and then plotnine's geom_point.
- The Seaborn documentation has some examples of more of these. A line plot
- is like a scatter plot in that we have two numeric variables. However, it emphasizes the progression or continuity from one value to the next
- by connecting them with a line. It really works best when we have one y per x value that we want to plot.
- If we've got more than one, it really starts getting very, very jagged. It's very common for time series.
- So this is another example from the Seaborn tutorial, not labeled super well.
- I don't know what the value actually is, but it shows that we have some kind of a value that's changing over time, and it's going negative.
- Zero on the y axis is at the top, and the values otherwise are negative. Functions to create
- these are lineplot from Seaborn, plot from Matplotlib, and geom_line from plotnine.
- A box plot shows the distribution of a numeric variable grouped by a categorical.
- So the bar chart just showed us, say, the average value, maybe with a confidence interval.
- The box plot actually shows us the distribution, and it does so in a way that's based on the median.
- So the horizontal line in the middle of the box is the median value. The top and bottom of the box
- are the first and third quartiles: the bottom is the first quartile and the top is the third quartile.
- And what that means is 25 percent of the values are below the bottom of the box,
- 25 percent are in the bottom half of the box, 25 percent in the top half, and then 25 percent above.
- We then show these whiskers that extend out to the minimum
- and maximum of the data, and a number of plotting packages will do some kind of outlier detection.
- This is using Seaborn's default outlier detection. So if the max is very high, the rule it uses by default is this:
- you've got the IQR, the interquartile range; that's the height of the box. It allows the whisker to be 1.5 times that tall.
- And if you have any data points that are further away than that, it plots them as individual points, which makes it easy to see outliers.
- You can change it so that the whisker goes all the way up to the max. But it lets you quickly see and compare between different groups
- the median, the first and third quartiles, and the min and the max of the data.
- It's very useful for comparing observations of a variable when you're grouping by some categorical. Functions
- for doing this are boxplot from both Seaborn and Matplotlib, and then geom_boxplot from plotnine.
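The quantities a box plot draws can be computed directly, which makes the 1.5 times IQR whisker rule described above concrete. The values are made up, and the quartiles use NumPy's default interpolation, which may differ slightly from what a given plotting library uses.

```python
import numpy as np

# Hypothetical values with one unusually large observation.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box.
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
inside = values[(values >= low_limit) & (values <= high_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the limits is drawn as an individual outlier point.
outliers = values[(values < low_limit) | (values > high_limit)]

print(q1, median, q3, whisker_low, whisker_high, outliers.tolist())
```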
- A few more plots, a violin plot. It's like a box plot, except it's based around the mean and has curved sides.
- The swarm plot is another kind of categorical scatter plot.
- It's usually best to avoid pie charts, especially 3D pie charts. A lot of our software is not going to produce 3D charts very easily.
- Don't try to go make a 3D chart. They're almost always more confusing, especially like the 3D bars that you have from vintage PowerPoint.
- But avoid even a plain pie chart, just because human perception is not super great at accurately comparing angular areas.
- So usually a bar chart
- or stacked bar chart is going to be a better option than a pie chart. A donut chart is sometimes a better option where you've got a circle.
- This is one place where I disagree with the reading. The reading that I gave you recommends pie charts for showing relative proportions.
- I recommend usually avoiding those: use a bar chart, or a stacked bar chart if you
- want to show relative proportions within different categories.
- There's another kind of plot that's not a plot on its own, but it's combined with other kinds of plots.
- That's a rug plot, useful for just displaying distributions at a margin.
- So to learn more: I've taken a whirlwind tour through a number of different plot types. There are the class readings:
- the paper that I assigned you to read talks through the use cases for a number of different plot types.
- I'm going to be providing tutorial notebooks that walk you through different plot types.
- The textbook talks about graph plotting and data visualization.
- The Seaborn and Matplotlib docs are extensive,
- and if you're using another plotting library, its documentation is as well. Most plotting libraries also have a gallery.
- Go through the gallery, look for a plot that has a feature you want in your plot or that you think might be useful for displaying your data.
- Click on it and they'll give you the code to show you how they made that plot.
- You might want to combine pieces from multiple plots. In practice, it takes a lot of trial and error to really get the hang of your plotting library
- and figure out how to make it show you the data in the way you really want it to.
- Learning one plotting library really deeply is useful. A lot of the Python ones,
- especially the ones that are oriented towards static charts, are built on top of Matplotlib.
- So Seaborn is a convenience API on top of Matplotlib. If you're using Seaborn,
- you're also going to need to use Matplotlib calls a lot of the time, when Seaborn gets you 90 percent of the way there
- but not quite all the way. So to wrap up, there are many different types of charts that have different use cases.
- Learning graphics techniques takes time and practice. Take some of the example notebooks that I'm providing.
- Take some of the examples from, say, the Seaborn gallery.
- Play with them with some data that I'm giving you; play with them with some data that you have elsewhere.
- But it takes time and practice. Spend some time with the galleries of the plotting libraries you're using.
🎥 Metrics and Differences
We talked about the notion of “relative” differences, but what are they?
Michael Ekstrand
STATISTICS AND DIFFERENCES
Learning Outcomes
Review statistics (or metrics)
Compute absolute and relative differences
Interpret a relative difference
Photo by Franck V. on Unsplash
Statistics
We compute various statistics
Mean
Median
Max
Success rate
90 %ile
Count
Sum
Many others…
Often serve as metric for analysis or evaluation
Comparing Values
Take two values: population (2018 est.) of
Boise: 228,790
Salt Lake City: 200,591
Can compare in two ways:
Absolute – Boise has 28K more people than SLC
Relative – Boise has 14.05% more people than SLC
Absolute Comparison
Relative Comparison
Ratio Comparison
Appropriate Comparison
Depends on context and problem.
Netflix paid $1M to the team that beat its movie recommender by 10% on their chosen metric
Difference-in-difference
Visual Comparison
Bar charts emphasize relative difference
Point plots emphasize absolute difference
Also relative difference-in-difference
Wrapping Up
There are three primary ways to compare statistics: absolute, relative, and ratio.
Be clear and unambiguous when writing the results of a comparison.
Seek to accurately understand others’ writing.
Photo by Nick Fewings on Unsplash
- Hello. In this video,
- I want to talk with you just a little bit about statistics that we can measure and how we want to think about differences between them.
- In an earlier video, I talked about how bar charts emphasize relative differences.
- So in this one, I'm going to talk just a little bit about what that means.
- So our learning outcomes for this video are to review a little bit of the statistics or the metrics that we've been talking about,
- for you to be able to compute absolute and relative differences, and to interpret a relative difference between two quantities.
- So as we've talked about, we can compute various statistics over our data: means, medians, modes, success rates.
- We can compute percentiles, counts, many others.
- Basically, any statistic you can think of that you can compute over a set of numbers, we can use as some kind of a statistic or a metric.
- And this often serves as the metric for analysis or evaluation. If we're trying to evaluate a program or trying to evaluate a technology,
- we have some metric that is measuring its effectiveness,
- and we want to see whether it's improved or changed somehow.
- So when we compare two values, though, there are a few different ways to do it.
- So let's take the population estimates in 2018 of Boise and Salt Lake City.
- There are a few different ways that we can compare them.
- One of them is the absolute difference: Boise has twenty-eight thousand more people than Salt Lake City.
- The other is the relative difference: Boise has 14.05 percent, or about 14 percent, more people than Salt Lake City.
- So the absolute difference between two values is actually the absolute value of the difference between two values.
- We can also talk about a signed difference or a real difference, where we don't take the absolute value, if we need the direction of the difference.
- That becomes useful. But what we're talking about here is the actual difference in the underlying units,
- so in our example case, number of people. But another way we can do it is to talk about the relative difference.
- So this is the difference normalized by
- the reference quantity. We have to be clear on which one is
- the reference quantity: the reference quantity is the one we're starting from when we're computing the relative difference.
- So, for example, 50 is 25 percent more than 40, because you take 25 percent of 40, which is 10, add that, and you get 50.
- But 40 is only 20 percent less than 50, because you take 50,
- and 10 is 20 percent of 50, whereas it's 25 percent of 40.
- You subtract it, and you get 40.
- So this order difference is really, really important.
- So for the 50, what we've got is 50 minus 40, over 40.
- That's this one. And for the 40, we have 40 minus 50, over 50: negative 10 over 50.
- That's 20 percent. Ten over 40 is 25 percent.
- So we need to be really, really careful about the order. Another way we can compare is with the ratio: we just divide one quantity by the other.
- So this is when we say something like: this year's sales of 20 million are twice as much as last year's 10 million.
- It's an absolute change of 10 million, and it's a relative change of one hundred percent.
- Twenty million is twice as much as 10 million,
- and it is one hundred percent higher than 10 million.
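The three comparisons can be written as small functions and checked against the numbers used above. The function names are invented for this sketch; the population figures are the ones from the slide.

```python
def absolute_difference(value, reference):
    """Difference in the underlying units, ignoring direction."""
    return abs(value - reference)


def relative_difference(value, reference):
    """Signed difference, normalized by the reference quantity."""
    return (value - reference) / reference


def ratio(value, reference):
    """How many times the reference the value is."""
    return value / reference


boise, slc = 228_790, 200_591  # 2018 population estimates from the slide

print(absolute_difference(boise, slc))      # difference in people
print(relative_difference(boise, slc))      # roughly 0.14, with SLC as the reference
print(relative_difference(50, 40))          # 50 compared to a reference of 40
print(relative_difference(40, 50))          # 40 compared to a reference of 50
print(ratio(20_000_000, 10_000_000))        # this year's sales over last year's
```

Swapping the arguments of `relative_difference` changes the answer, which is why the reference quantity has to be stated.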
- Now, one thing to think about: if I say this year's returns are two times larger than last year's,
- what does that mean? Does it mean it's two times as large?
- Does it mean it's 200 percent more, which would be three times as large?
- Is it clear? I would submit that this way of framing it is ambiguous,
- and so we should avoid it. The appropriate comparison really depends on context and problem.
- There's not a hard and fast rule for when you need one or another.
- Relative comparisons are quite common because they can be compared across a variety of contexts.
- But we still also need to pay attention to the underlying absolute difference, to what
- the change being made in this relative change actually is.
- One example of a high-profile relative change
- is the Netflix Prize, which was run by Netflix a number of years ago. They paid a million dollars to the team that was able to beat
- their internal movie recommender on the metric that they chose by 10 percent.
- They wanted a 10 percent improvement. And for this metric, lower is better,
- so they wanted a decrease in the metric by 10 percent.
- We can also talk about a difference between differences, because if we compute a difference, that difference itself is just another value.
- So we could say sales grew 10 percent more this year than last year.
- So if we define growth as one year's sales minus the previous year's sales, then we can look at the growth of this year
- and the growth of last year, and we can compute the difference in difference.
- And so we can have a 10 percent increase in growth. Differences in differences come up a lot in various contexts,
- and so it's important to be able to reason about those as well. And again, be clear both in writing and understanding.
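A difference-in-difference can be worked through with made-up sales figures. The numbers below are hypothetical and chosen so that growth itself increases by 10 percent.

```python
# Hypothetical yearly sales, in millions, invented for illustration.
sales = {2019: 100.0, 2020: 110.0, 2021: 121.0}

# Growth is one year's sales minus the previous year's sales.
growth_last_year = sales[2020] - sales[2019]
growth_this_year = sales[2021] - sales[2020]

# The difference between the two growth values is itself just another value...
diff_in_diff = growth_this_year - growth_last_year

# ...so it can also be expressed relative to last year's growth.
relative_diff_in_diff = diff_in_diff / growth_last_year

print(growth_last_year, growth_this_year, diff_in_diff, relative_diff_in_diff)
```

Here sales grew by 10 million and then by 11 million, so growth increased by 1 million, which is a 10 percent increase in growth.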
- For visual comparison: bar charts emphasize relative difference because the height of the bar is right there,
- and the eye very naturally compares the difference between bars to the height of the bar itself.
- Point plots emphasize absolute difference because you don't have the reference point of the size of the bar.
- Both of them make it pretty clear to also compare the differences,
- to see how different the differences are. Those are evident in both bar charts and point plots.
- So to wrap up, there are three primary ways to compare statistics: absolute, relative, and ratio.
- You need to be very clear and unambiguous when you're writing the results of a
- comparison and also when you're trying to understand what others have written.
- Seek to accurately understand it. And
- if you're in a context where you're providing feedback and it's not clear, that clarity is something you want to ask for in revision.
🎥 Charts from the Ground Up
In this video, I discuss how to design a chart from your questions, goals, and data.
Michael Ekstrand
CHARTS FROM THE GROUND UP
Learning Outcomes
Design a chart by thinking first of the questions, goals, and data
Photo by Ryan Quintal on Unsplash
What is the Question?
What question do we want to answer?
How, precisely, did we operationalize it?
What data are we using?
Specifically, variables
Example: Titanic Survival
Question: were passengers in higher classes more likely to survive?
Outcome variable: survival
Logical variable, 0/1 encoded
Taking mean yields numeric response!
Mean is probability of survival (1)
Also called response variable or dependent variable
Explanatory variable: passage class
Categorical
Also called independent variable
Bar chart
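A minimal sketch of the Titanic example, assuming a tiny hand-made table in place of the real data set (the column names `pclass` and `survived` and all values are hypothetical). It shows that the mean of a 0/1 outcome is the survival rate, and that a Seaborn bar plot of the raw rows draws that same statistic.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical miniature stand-in for the Titanic passenger table.
titanic = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 outcome is the fraction of 1s: the survival rate per class.
rates = titanic.groupby("pclass")["survived"].mean()
print(rates)

# barplot aggregates with the mean by default, so it draws the same rates,
# with the explanatory variable on x and the outcome on y.
ax = sns.barplot(data=titanic, x="pclass", y="survived")
ax.set_xlabel("Passage class")
ax.set_ylabel("Fraction surviving")
```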
Showing Relationships
Most plots show how a numeric variable changes between different values of one or more other variables.
What is the variable to show?
What do we want to show about it?
Statistic?
Distribution?
How do we want to compare between EV values?
Even histograms follow this!
Response: frequency (count, proportion, density)
Explanatory: value or bin
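To make the histogram case concrete, here is a sketch that computes the response (count, proportion, density) for each bin by hand with NumPy and then draws the counts with Matplotlib. The values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

values = np.array([1.0, 1.5, 2.2, 2.8, 3.1, 3.3, 3.9, 4.5])  # hypothetical data

# Explanatory variable: the bin. Response variable: how many values fall in it.
counts, edges = np.histogram(values, bins=4, range=(1.0, 5.0))
widths = np.diff(edges)
proportions = counts / counts.sum()   # response as a proportion
density = proportions / widths        # response as a density (bar areas sum to 1)

fig, ax = plt.subplots()
ax.bar(edges[:-1], counts, width=widths, align="edge", edgecolor="black")
ax.set_xlabel("Value (bin)")
ax.set_ylabel("Count")
```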
Pipeline
Identifying these informs:
Data processing (e.g. binning, group-aggregate, transform)
Choice of plot type
Axes, labels, colors, facets
Variable Types
Response is numeric (or transformed to be)
Categorical → relative frequency
Explanatory can be anything
Numeric → continuous axis
Categorical → discrete axis
Ordinal → discrete axis preserving order; may omit some labels
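For ordinal variables, one way to preserve order is the pandas ordered categorical type. The sketch below uses a hypothetical table to show that grouping on plain strings sorts alphabetically, while an ordered categorical keeps the declared order.

```python
import pandas as pd

# Hypothetical data; "size" is ordinal but arrives in arbitrary order.
orders = pd.DataFrame({
    "size":  ["medium", "small", "large", "small", "medium", "large"],
    "price": [20.0, 10.0, 30.0, 12.0, 22.0, 34.0],
})

# With plain strings, grouping sorts the labels alphabetically.
alphabetical = orders.groupby("size")["price"].mean()

# Declaring an ordered categorical makes pandas keep the meaningful order.
orders["size"] = pd.Categorical(
    orders["size"], categories=["small", "medium", "large"], ordered=True
)
ordered = orders.groupby("size", observed=True)["price"].mean()

print(list(alphabetical.index))  # ['large', 'medium', 'small']
print(list(ordered.index))       # ['small', 'medium', 'large']
```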
One Explanatory Variable
If EV is continuous: scatter or line
Scatter plots we sometimes blur response/explanatory
If EV is discrete:
Bar chart shows statistic, emphasizing relative difference
Point plot shows statistic, emphasizing absolute difference
Box or violin plot shows distribution
Two Explanatory Variables
Pseudo-3D:
Contour plot: identify peak(s) & shape
Heat map: see where high & low points are
Demonstration
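A sketch of the two pseudo-3D displays, assuming randomly generated critic and audience scores rather than the real movie data. The response is the number of movies in each bin of the two explanatory variables; the contour plot and the heat map both draw that same grid of counts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical critic (0-100) and audience (0-5) scores for 500 movies.
rng = np.random.default_rng(20210905)
critic = np.clip(rng.normal(60, 18, size=500), 0, 100)
audience = np.clip(critic / 20 + rng.normal(0, 0.5, size=500), 0, 5)

# Response: how many movies fall into each (critic, audience) bin.
counts, x_edges, y_edges = np.histogram2d(
    critic, audience, bins=[10, 10], range=[[0, 100], [0, 5]]
)
x_mid = (x_edges[:-1] + x_edges[1:]) / 2
y_mid = (y_edges[:-1] + y_edges[1:]) / 2

fig, (ax_contour, ax_heat) = plt.subplots(1, 2, figsize=(9, 4))
# histogram2d returns counts[x, y]; the plotting calls expect rows = y, so transpose.
ax_contour.contour(x_mid, y_mid, counts.T)
mesh = ax_heat.pcolormesh(x_edges, y_edges, counts.T, cmap="inferno")
fig.colorbar(mesh, ax=ax_heat, label="Number of movies")
for ax in (ax_contour, ax_heat):
    ax.set_xlabel("Critic score")
    ax.set_ylabel("Audience score")
```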
Two Explanatory Variables (continued)
Other aesthetics for secondary variables:
color, shape, size
Occasionally used to indicate second response variable
Titanic by Class and Sex
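The sketch below draws the same statistic as a bar chart and as a point plot, with class on the x axis and sex on color. The passenger table is a hypothetical miniature, not the real Titanic data; the computed gap between the sexes is the absolute difference that the point plot makes easy to compare.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical 0/1 survival outcomes for four passengers in each class/sex group.
groups = {
    (1, "female"): [1, 1, 1, 1], (1, "male"): [1, 1, 0, 0],
    (2, "female"): [1, 1, 1, 0], (2, "male"): [1, 0, 0, 0],
    (3, "female"): [1, 1, 0, 0], (3, "male"): [1, 0, 0, 0],
}
titanic = pd.DataFrame(
    [{"pclass": c, "sex": s, "survived": v}
     for (c, s), outcomes in groups.items() for v in outcomes]
)

# Survival rate per class and sex, and the absolute gap between the sexes.
rates = titanic.groupby(["pclass", "sex"])["survived"].mean().unstack("sex")
gap = rates["female"] - rates["male"]
print(gap)

fig, (ax_bar, ax_point) = plt.subplots(1, 2, figsize=(9, 4))
# Bars invite comparing heights, so the axis has to start at zero.
sns.barplot(data=titanic, x="pclass", y="survived", hue="sex", ax=ax_bar)
ax_bar.set_ylim(bottom=0)
# Points have no bar to compare against, so the eye compares the gaps instead.
sns.pointplot(data=titanic, x="pclass", y="survived", hue="sex", ax=ax_point)
```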
More than Two EVs
Can’t really have more than 1, maybe 2 numeric
Others can be binned
Facets let us break down the plot by more categorical explanatory variables.
Facet Plot
More than Two EVs (continued)
Pay attention to order. It strongly affects what readers compare.
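One way to build a faceted plot like this is Seaborn's `catplot`. The sketch below assumes a randomly generated passenger table (all column names and values are hypothetical): age is binned into decades for the x axis, sex goes on color, and each passage class gets its own facet with a shared y axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical passengers: 120 rows cycling through classes and sexes,
# with random ages and outcomes.
rng = np.random.default_rng(3)
n = 120
passengers = pd.DataFrame({
    "pclass": np.tile([1, 2, 3], n // 3),
    "sex": np.tile(["female", "male"], n // 2),
    "age": rng.integers(1, 70, size=n),
    "survived": rng.integers(0, 2, size=n),
})

# Bin the numeric age into decades so there is one point per decade.
passengers["decade"] = (passengers["age"] // 10) * 10

# x = binned age, color = sex, one facet column per passage class.
g = sns.catplot(
    data=passengers, kind="point",
    x="decade", y="survived", hue="sex", col="pclass",
)
g.set_axis_labels("Age (decade)", "Fraction surviving")
```

Swapping which variable goes on `hue` and which on `col` changes which comparisons sit next to each other, so it is worth trying both before settling on a layout.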
Stacking
Stacking lets us see differences in composition – how do the parts of a whole change?
Stacked bar charts
Stacked area charts
Can be either raw values or fractions.
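A sketch of a horizontal stacked bar chart of fractions, built with pandas and raw Matplotlib. The data set names, gender codes, and counts are hypothetical stand-ins; this is not the code linked from the video notes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical book records: source data set and author gender code.
books = pd.DataFrame({
    "dataset": ["A"] * 10 + ["B"] * 10,
    "gender": (["female"] * 3 + ["male"] * 5 + ["ambiguous"] * 1 + ["unknown"] * 1
               + ["female"] * 4 + ["male"] * 2 + ["ambiguous"] * 1 + ["unknown"] * 3),
})

# Rows: data set (explanatory). Columns: fraction in each category (response).
fractions = pd.crosstab(books["dataset"], books["gender"], normalize="index")

# Order the levels deliberately, keeping the "we don't know" categories adjacent.
order = ["female", "male", "ambiguous", "unknown"]
fractions = fractions[order]

fig, ax = plt.subplots()
left = np.zeros(len(fractions))  # running left edge of the next segment
for level in order:
    ax.barh(fractions.index, fractions[level], left=left, label=level)
    left += fractions[level].to_numpy()
ax.set_xlabel("Fraction of books")
ax.legend()
```

Dropping `normalize="index"` would stack raw counts instead, in which case the bars would no longer all have the same length.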
Transformations
Sometimes we transform the axis
Log-10 scale – shows order of magnitude
Generally don’t do this to bars
Sometimes we transform the data
Rescale, log, square root
Normalize by a value
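The difference between transforming the axis and transforming the data can be sketched with Matplotlib as follows. The values are hypothetical, picked to span several orders of magnitude.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values spanning several orders of magnitude.
x = np.array([1, 10, 100, 1000, 10000], dtype=float)
y = np.array([2.0, 3.0, 5.0, 8.0, 13.0])

fig, (ax_axis, ax_data) = plt.subplots(1, 2, figsize=(9, 4))

# Option 1: transform the axis. Tick labels stay in the original units
# and are spaced logarithmically.
ax_axis.plot(x, y, marker="o")
ax_axis.set_xscale("log")
ax_axis.set_xlabel("x (log scale)")

# Option 2: transform the data. The axis is linear but now shows log10(x),
# so the label has to say so.
log_x = np.log10(x)
ax_data.plot(log_x, y, marker="o")
ax_data.set_xlabel("log10(x)")
```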
Be Careful
Avoid excessive complexity
Be careful with color (easy to make indistinguishable)
A good graphic reveals the data, and does not distort or obscure
Wrapping Up
Identify variables and relationships you want to highlight.
Design a plot that illustrates them.
Study plotting library APIs.
Photo by Daniel Cheung on Unsplash
- Hello. In this video, I want to talk to you about how to build up a chart from the ground up as we think
- about the question it's going to try to answer and the pieces that need to go into it.
- So the learning outcome for this video is for us to be able to design a chart by thinking first of the questions,
- the goals and the data that are going to be in and from the chart.
- So a good chart answers a question and the guiding principle for how we design and
- how we lay out our chart is to illuminate the question that we want to answer.
- And this means we need to know what question we want to answer in the first place.
- We also need to know precisely how we operationalized that question, so we can use that to inform how we lay out the chart.
- And we need to know what data we're using, specifically what variables we're using as a part of this chart.
- For example, there's a data set (you'll see it in the notebook that goes with this video) of passengers on the
- Titanic, and suppose we wanted to examine whether passengers in a higher fare class,
- say, first class, were more likely to survive than passengers in lower fare classes.
- In this analysis, we have an outcome variable, encoded zero/one:
- whether or not the passenger survived the Titanic sinking. A lot of charts are going to have an outcome variable.
- We have some outcome variable, and we want to see how it responds to, or how it differs with, some other variable,
- which we call the explanatory variable. In this case, that's the passage class, and the outcome is survival.
- And we want to see how it changes as the passenger's passage class, the explanatory variable, changes.
- The outcome variable is also called the response variable or the dependent variable,
- because it's what we're trying to measure that's responding to the condition we're trying to analyze.
- And the explanatory variable is sometimes called the independent variable because it's changing,
- but it's not changing as a function of the other variables in theory.
- So we can do this with a bar chart. In this bar chart, the x axis is our passage class (first,
- second, and third), and the y axis is the fraction of passengers in that class who survived.
- We also see some error bars. We're going to see later what those mean and how to compute them.
- But this lets us see how the outcome, survival, changes with the different passage classes of the passengers.
- And one of the things to note here is that we have our explanatory variable on the X axis and the outcome variable on the Y axis.
- That is the general convention. There are some cases where we might want to flip it.
- So we might have a horizontal bar chart where the explanatory variable is on the y axis and the outcome is on the x axis,
- particularly if it makes the labels more readable.
- But the standard convention for most types of charts is to put the explanatory variable on the x axis, the horizontal axis, and the outcome variable on the y axis.
- And this chart shows a relationship. Many charts show relationships:
- most of the plots that we're going to be drawing in this class show how some kind of numeric variable, either continuous
- or integer, changes between different values of one or more other variables. In this case, even though our response was a zero/one logical,
- when we converted it into a rate per passage class, it became a continuous variable.
- And so when we do this, we need to identify a few key things to design our plots.
- We need to identify what variable we want to show. In a lot of plots, that'll be on our y axis;
- when it's not, it'll usually be on x. Identifying that variable is,
- if anything, probably the most important thing in designing a plot.
- We then need to identify what we want to show about this variable.
- Do we want to show its value for different data points? Do we want to show a statistic,
- for example, the mean or the rate?
- In the previous Titanic example, we showed a statistic: the rate of survival.
- Do we want to show its distribution?
- And then how do we want to compare that between values of the explanatory variable? In particular, do we want to look at absolute differences?
- Do we want to look at relative or proportional differences? Even histograms follow this kind of design, because they have an outcome,
- which is the frequency: the count, the proportion, or the density, depending on precisely what kind of histogram we're showing.
- And then they have the explanatory variable, which is the value or the bin.
- So we've got a histogram and we've got some bins. The response variable is how many values are in that bin, and the explanatory variable is the bin.
- So identifying these then informs the entire pipeline of producing our chart: the data processing, such as the binning,
- group aggregation, and transformation that gets us to the final values we can actually plot.
- It's going to affect our choice of plot type, and it's going to affect our choice of axes, labels, colors, facets, and the other aspects of the plot.
- So the type of the variable has a significant impact.
- The response is often numeric, or can be transformed to be.
- If we're talking about a categorical value, we usually want the relative frequency of the different values of that.
- So we treat it like a histogram:
- we transform it so that the explanatory variable becomes the value of the categorical,
- and the response is how many, or what fraction, have that value.
- If it's a logical or a two-level categorical, we might turn it into just the fraction that have one of the levels versus the other.
- The explanatory variable can be anything. We're going to see how to use numeric explanatory variables and categorical explanatory variables.
- Ordinal works just like categorical,
- except that it's a discrete axis that preserves order, and we need to make sure that the order in ordinal data is being preserved.
- If you're using the pandas ordered categorical type, the order is automatically preserved for you when you plot.
- So if we just have one explanatory variable, this is the easiest case. If our explanatory variable is continuous,
- we usually want a scatter plot or a line plot for showing individual values.
- Sometimes we'll flip the response and explanatory on a scatter plot, or both might be explanatory:
- we want to show where points lie in a two-dimensional space.
- But generally, if we've got a continuous explanatory variable and we're trying to show values,
- or we're trying to show statistics like a mean at each value of the explanatory variable,
- we're going to use a scatter plot or a line plot. If the explanatory variable is discrete, then we're going to use a bar chart to show a statistic
- if we want to emphasize the relative difference. We can compare values relative to each other,
- because one bar will be twice as high as another.
- And a point plot shows a statistic or an individual value, and it emphasizes absolute difference.
- You don't have a whole bar in order to compare heights.
- You just have the point. And then if we want to show a distribution, we usually use a box or a violin plot with this discrete explanatory variable.
- We don't have great ways to show distributions with continuous explanatory variables.
- You can show a variance with an error bar or a ribbon,
- but that's about it.
- For two explanatory variables, we have a couple of options. One is we can do a pseudo-3D display, where we do a contour plot or a heat map.
- And I'm going to show both of these here.
- The left one is a contour plot, and it reads like a topographical map.
- In this case, we're showing a two-dimensional distribution over our two explanatory variables.
- One explanatory variable is the score given to a movie by its critics, and the other explanatory variable is the score given by its audience.
- And then the response variable is how many movies have that combination.
- And so we can see here, this is the peak. A contour plot is really good for showing us the peak:
- it's going to be that innermost circle. It also shows us the shape, because each of these rings is a level of
- decreasing height in this map. So if we envision that the response variable is the height and we're looking at a two-dimensional map,
- the rings show us the contours around the mountains of that height.
- It's good for showing shape. The other one is the heat map, which uses color.
- It's usually going to go from a cool color, like black here, to a hot orange,
- or sometimes you have a bidirectional one, which goes from blue to red. It lets us see that the highest density is here,
- and as you go out from there, you get lower and lower densities.
- Either one can work for a continuous variable; for a heat map, you often have to bin it in order to draw it.
- This is a discretized heat map where we have binned everything, using bins of half
- a star on the audience score and a different bin width on the critic score, because they're on different scales.
- But heat maps also work well for categorical and ordinal data.
- Another way we can do it is to use other aesthetics for secondary variables, such as color or shape or size.
- Sometimes we'll use that to indicate a second response variable,
- like you might have a scatter plot where the size of the point is a second response variable, but often it's for multiple explanatory variables.
- So this shows us how we can do that. If we wanted to break down Titanic survival rates by both class and sex,
- we keep our class on the x axis like we did before,
- and then we use color for the passenger's sex. We can see significantly higher survival rates for women across all three classes.
- I'm also showing you here the difference between a bar chart and a point plot.
- So the left is the bar chart. The right is the point plot.
- And the bar chart lets us compare the heights of the bars. Note that it starts at zero.
- Bar charts always start at zero. So it lets us compare the heights of the bars, and
- it's easy to see, just using our vision, that the female passenger first class bar is more than twice as tall
- as the male passenger first class bar, and
- the male passenger first class bar is twice as tall as the male passenger second class bar.
- So it lets us make relative comparisons between the different values.
- This is why it always starts at zero, because the natural thing to do with the bar is compare its height.
- If your bar chart does not start at zero, suppose our bar chart started at point one,
- then the comparison of height would exaggerate the difference relative to the value.
- And what looks twice as tall isn't actually twice as tall because we cut off a bunch of the bottom.
- So always start at zero. The point plot makes it hard to compare relative difference.
- It's difficult for us to visually tell
- (we can tell if we look at the numbers)
- that the survival rate of women in first class is twice as high as the survival rate of men.
- But what it does let us see is the absolute difference between these values,
- and it makes it easy to compare the difference in the gaps across the three classes.
- We can see that the survival rate by sex is much closer in the third class than it is in the first or in the second class.
- So your choice of plot really guides the user to see different things. Your choice of plot
- allows you to emphasize different things,
- and you need to choose and design your plot in a way that's going to tell the story that you need to tell from the data.
- We can also have more than two explanatory variables. It's difficult to have more than one that's numeric, or two if we're doing a contour plot.
- We can bin variables, which then lets us use some more techniques, such as faceting.
- So suppose we want to break down by more categorical variables.
- Let's say we also want to look at age. We're going to keep sex on the color,
- and we're going to now use age as the x axis. Since it's numeric, it really works better on an axis.
- I have binned it into bins of ten, so that you only have one point for every decade.
- But then we use a facet, and the facet means we draw a different chart for each of the three classes.
- The charts all share a y axis, so we can directly compare across the row of charts. It lets us see
- in particular how survival as a function of age changes between the different passenger classes,
- for example. And so it lets us start to build up.
- And if we had a fourth, we could use rows and columns in the faceted plot.
- So we have these mechanisms of building up and we have our x axis or y axis.
- We can use aesthetics of the lines or the points, particularly color, size, and shape,
- and then we can use facets to build up even more variables into our plot.
- To do faceting, there are a couple of things you can do. It's built into some of the Seaborn plotting functions:
- the relplot and catplot functions can both do faceting on their own.
- They let you control the statistic. They're very flexible functions for a wide range of plots.
- The general-purpose FacetGrid allows you to facet any kind of plot by writing some more Python code on your own.
- It's very useful if you want to facet something that doesn't support faceting built in.
- And if you're using plotnine or the R ggplot2 package, facet_grid and facet_wrap control faceting.
- When you build that faceted plot, you need to pay attention to what variables go where. Your choice of which variables are going to be on color,
- which variables are going to be facets, and which variables are going to be on your axes really affects how the reader is going to interpret and understand
- your plot, and you need to choose them strategically to tell the story that addresses your question.
- You also need to do it, though, in a way that is honest and does not mislead your readers.
- The chart needs to honestly show the readers what it is that you learned from the data, and show that clearly.
- Another thing we can do to build up a chart, especially if we have more categorical variables:
- if we've got a categorical response variable with more than two levels,
- and we want to show how the proportion in different categories changes in response to another variable,
- a stacked chart can be very good. It lets us see the differences in composition, to see how the parts of a whole change.
- This chart
- is a stacked bar chart, and it's a horizontal bar chart, where I put the explanatory variable on the y axis,
- in part to make the labels easier to read. Our explanatory variable is which data set
- something came from: Locke, M.D. Gry. What those are doesn't matter for our purposes right now.
- The response variable is the distribution of genders. In this case,
- these are data sets of books, and it shows the genders of the authors of those books in the data set.
- And so we have female, we've got male, and we also have codes for when it's ambiguous or unknown or we didn't have data.
- And so we can see, for example, the GYŐRI data set has a higher fraction of women and a significantly lower fraction of men.
- And we can see quite a few more books where we don't know the gender. The order on this chart is very strategic:
- how I ordered these levels is very strategic. I batched all of the various kinds of "we don't know" together, so that
- you can look at that whole block and see the various types of
- "we don't know the gender of the book's author" together, but you can also see how they're broken down into individual things.
- You can see that unlinked is a very large fraction of that increase in books where we don't know the author's gender.
- So you need to think you need to think about all of these different things in order
- to be able to generate a chart that's going to clearly and unambiguously communicate.
- You can show raw values in a stacked bar chart, in which case the bars
- don't all have to be the same height, or you can show fractions, in which case they will be.
- I chose to show fractions in this chart. The code that generates this using raw Matplotlib is linked in the notes for the video.
- Sometimes we're also going to transform our charts.
- We might transform the axis, such as using a log-10 scale. When we transform the axis,
- the labels are still in their original values; they're just spaced out logarithmically.
- We generally won't do this for bars. Reading a bar on a log scale is tricky:
- you can draw it, but you have to be really careful to make sure that your readers are going to accurately interpret it.
- But for line and scatter plots, log transforms are a lot more common.
- Sometimes, though, we're actually going to transform the data itself, and we're going to plot a log or a square root or some other rescaling.
- And another common transformation is to bin the data, to somehow discretize it into fixed bins
- by some mechanism or another. So the key decisions that you need to make when you're making one of these charts
- are these. You need to pick the variables and how you're doing their transformations. You need to pick what are called the aesthetics:
- how you're going to map the different variables you're looking at to chart features, such as your x and y axes,
- your facet rows and columns, your color, and your point marker style.
- If you're doing a point plot, often it's useful to put
- the same variable on both color and style. That way, if you have a reader who's colorblind, they still get different point styles,
- even if they can't tell the colors apart, or if someone's printing it on a black and white printer.
- And then you need the type of the chart: line chart, bar, point, box, et cetera.
- So you have to make all of these decisions when you're drawing this chart, and they're driven by what variables and data you have, what
- question you're trying to answer, and what story you're trying to tell. You do need to be careful to avoid excessive complexity.
- We can put a different variable on every conceivable aesthetic, and it's often going to result in a chart that's very difficult to read.
- We also have to be careful with color because it's easy to make a chart that has differences
- that are difficult for the human eye to distinguish or get obscured by printers,
- low quality displays, etc. It's also important to note a good graphic reveals the data and does not distort or obscure the data.
- It's easy to create a graphic that manipulates the data to tell a story that's not very well supported.
- And we want to avoid that when we're doing data science with honesty and integrity.
- So to wrap up: you need to identify the variables and relationships that you want to highlight in your chart.
- You want to design a plot that illustrates them,
- and you're going to need to spend some time studying your plotting library's APIs and the plotting library's gallery.
- Any plotting library usually has a gallery of a bunch of different plots and the code that was used to generate them.
- Seaborn has this, Matplotlib has this.
- So spend some time looking through that: oh, this looks like the kind of plot that might display my data well.
- And then click on it, see what code they used to generate it, and borrow it.
✅ Plots in the Wild
In preparation for Thursday’s class, find a data presentation (plot, table, etc.) in a recent online
publication, and share it with your team through a post on Piazza (in the ‘discuss’ category) with a
link and a copy of the image. This can be from a journal paper, a newspaper article, a blog post, or
another source the class can all access.
In class we will discuss these plots!
Tip
Don’t spend more than 30 minutes on this assignment.
📓 Finishing Touches
The Finishing Touches notebook describes how to apply some finishing touches to your plots and save them to files.
🚩 Week 3 Quiz
The Week 3 quiz will be over all of the assigned material for this week, and is in Canvas.
The sections below this are for your further study and practice.
📖 Textbook
This week primarily uses Chapter 9 of 📖 Python for Data Analysis, with some material from chapters 8 and 10.
📚 Further Reading
For further study on these topics, see:
The Seaborn and Matplotlib galleries
The Visual Display of Quantitative Information by Edward R. Tufte
W. E. B. Du Bois’s Data Portraits: Visualizing Black America, edited by Whitney Battle-Baptiste and Britt Rusert
✅ Practice
Doing this work well takes a lot of practice. Create some notebooks and experiment with drawing interesting charts from some of the data sets we have been exploring, or new data you find!
The HETREC data has a number of variables of different types that are useful for practicing manipulations and visualizations.
📩 Assignment 1
Assignment 1 is due on Sunday, Sep. 12 at the end of the day (11:59 pm).
The tutorial notebooks are going to be very useful for this assignment.