Week 3 — Presentation (9/5–9)#

These are the learning outcomes for this week:

Create plots for data
Identify the appropriate type of plot for data in question
Read and interpret a plot
Refine a plot to more clearly show data
Write a well-organized notebook to present data analysis with text and visuals

We will primarily be using Seaborn and Matplotlib for our graphics, because they are easy to get fully working for both notebook and document-ready graphics in any Anaconda environment and they efficiently handle very large data sets. There are several other packages that are useful for Python data visualization, and in some cases are easier to use. I personally use plotnine for most of my graphics, and plotly is a very capable package with particularly strong support for interactive graphics. The core graphics principles we study in this module will apply to most packages you may use in the future.

Tip

I do not recommend that you use Plotly for this course. While it is very good for interactive graphics, its support for static graphics to render in printable documents is rather new.

Seaborn upgrades

Seaborn is undergoing some changes in its syntax. In the old syntax, we pass the x and y parameters as positional parameters to a plotting function:

sns.lineplot('time', 'price', data=stocks)

In the new syntax, which will be required in a future Seaborn release, we use named parameters for everything:

sns.lineplot(data=stocks, x='time', y='price')

All new material going forward will use the new syntax, but it takes time to update all of the slides and videos. You may see the old syntax. It still works, but it issues a warning to let you know the future syntax is changing.

🧐 Content Overview#

Element	Length
🎥 Goals and Audiences	9m55s
📃 Statistical Data Presentation	4300 words
🎥 Statistical Graphics	14m15s
🎥 Manipulating Data	9m18s
📃 Selecting Data	1866 words
📃 Reshaping Data	2363 words
📃 Missing Data	3850 words
🎥 Types of Charts	13m30s
🎥 Metrics and Differences	6m50s
🎥 Charts from the Ground Up	22m14s
🎥 Organizing Notebooks	16m50s

This week has 1h33m of video and 12379 words of assigned readings. This week’s videos are available in a Panopto folder.

📅 Deadlines#

Finding a plot before class on Thursday
Week 3 quiz at 8am on Thursday
Assignment 1 at midnight on Sunday

🎥 Presentation Goals and Audiences#

Video (9m55s)

Slides

Hello. In this video, I'm going to introduce our week three module on presenting data.
So our learning outcomes for this week are for you to be able to create plots from data,
and identify the appropriate type of plot for the data in question.
I want you to be able to read and interpret a plot, refine a plot to more clearly show its data,
but then also put these plots and our other presentations and discussions of data into a well-organized notebook to present your data analysis.
Before we dive into how to actually present data, I want to start with talking about some of the purposes of data presentation,
because these purposes should guide your presentation design decisions.
They should guide your evaluation of your own presentations and those of others.
They'll also guide my evaluation of your presentation when you are submitting assignments.
And one of the first things we need to do is guide the reader's attention to important results.
An effective presentation is not going to have every single thing in it.
It's going to draw focus and attention to the important results and make it easy for
your reader to ask the key questions around the data analysis you're presenting.
But then it also needs to substantiate the results and conclusions.
So when we're presenting the data, we want to guide the reader to focus on what it is that we want them to learn,
but also in a context that gives them the information needed to assess the validity of the conclusions that we're presenting.
And we want to do so with integrity. It's easy to make charts that highlight the result
you want the user to see, whether or not that result is rigorously defensible from the data.
And we want to avoid making those kinds of misleading data visualizations and presentations.
So in doing this, we want to be able to think about the audience and you're gonna be presenting data to several different audiences.
The first audience is yourself.
When you're working with a data set, when you're trying to understand what you're learning from it, the results of your inferences,
you're presenting data to yourself, and you need the presentation to be clear so that you understand what it is that you're learning from the data,
so you see the next question to ask, and you're not misleading yourself in the data analysis process.
But those kinds of charts don't necessarily need the same level of polish that a chart for an external audience would.
Your collaborators, supervisors, et cetera, need to be able to see the data, see what you're learning.
Maybe it's in the weekly meeting you have with your research advisor.
You're presenting them with the results that you found.
They are people who have a lot of knowledge of the project you're working on and the problem space you're working in.
They're gonna help you guide and refine your questions.
Again, these charts perhaps don't need as much polish as a final published result, but they're not just the ones that you're creating internally for yourself.
You're going to be presenting to expert readers.
If you write a scientific paper, the readers will usually have some level of expertise in the topic that you're talking about.
They'll probably know the subject in general, but they may not know your specific work.
You may be presenting this to decision makers, especially if you're doing a data science project in an industrial or corporate environment.
You're providing data that's going to inform the decisions of your boss, who may or may not have significant statistical or data expertise,
but they're going to be using the data that you present in order to make decisions.
And then finally, you may occasionally be producing or presenting data for the general public at large.
Each of these audiences is going to require different things from your data presentation.
So you need to understand who it is that you're presenting the data to in order to make appropriate data presentation decisions.
When we're presenting the data, here are some questions that are going to help us understand what it is that we need to guide the reader towards.
So we need to be clear: the reader needs to come away knowing what we sought to find out.
This might be just explicitly stating our research questions, but they need to know the purpose that the data we're presenting is supposed to serve.
What are they supposed to learn from it? They then need to see what we did learn.
And then they need to see the supporting evidence, the context to trust the conclusions.
It's not just enough to say here's the results, but in many cases you need to provide enough data,
enough context that they not only see what you learned, but they see why you believe it is true.
And presentation with integrity shows the reader, and really makes it clear to the reader, what we learned,
the evidentiary support behind it, and why it flows from the data.
Whereas dishonest presentation manipulates them into the conclusion without having the rigorous foundation underneath it.
So I want to show you an example of a very bad graphic that came out of the state of Georgia earlier this year.
They presented, and it made it onto network television, a graph that purports to show COVID cases in various high-population counties over time.
But if you look closely at the x axis of this graph, and you'll see this more clearly when you go and look at the slides,
the axis is not sorted. It starts with the twenty-eighth of April and then it goes to the twenty-seventh, followed by the twenty-ninth.
Then May 1st. Then April 30th. At the end we have May 2nd,
May 7th, April 26th, May 3rd. It violates the expected convention,
and what you'd need to do to show a trend over time: that time goes from left to right.
They're putting things out of order to show the trend
they want, in a way that's not substantiated by the data. And it takes a lot of work to make a chart
this bad. I'm not sure how to do it in any of the statistical software I actually use. This is an egregious example,
but it's an example of one of the things that can happen when we focus on
the effect we want to demonstrate over the evidentiary support for it.
I want to contrast that with W. E. B. Du Bois's charts created for the 1900 Paris Exposition.
And these were a series of charts for an exhibition to show the economic, educational,
etc. progress of black Americans from emancipation to the turn of the century.
And he made a series of charts showing economic situations and things.
And here's a bar chart that clearly shows the result.
The distribution of economic statuses for farmers after a year of farm labor.
And it shows we have the first categories of bankrupt and in debt.
And then we have four or five different levels of non-negative return, up to clearing
fifty dollars or more. And the lengths of the bars are proportional to the data.
It very clearly highlights these things.
It then also does a creative thing: it pulls out a separate bar that it indicates is the composite of all of the non-negative bars.
And we can see that even if you add all of the non-negative bars together,
it's not as many farmers as the in-debt category.
That's not a very standard thing. These charts were hand drawn.
But it's a creative use of the visualization to highlight in a way that's supported by the data,
the relative distributions of the different return levels for black American farmers at the time.
Another one that's creative here is this spiral bar chart.
Again, it's an unusual thing, but the lines, we can see them going progressively longer.
And if this were just a normal horizontal bar chart without the spiral, the smallest line,
the first line, 1875, would be so small you couldn't even see it.
This gives more space. And it shows good visualization doesn't just mean following a checklist of rules.
It means presenting the data in a way that the conclusions and the takeaways are clear and they're rigorously supported by the underlying data.
There's no visual tricks to make things look larger or smaller than they are.
It transparently shows the connection between the conclusion and the underlying data that support it.
So to wrap up, the goal of good presentation is to guide the reader to what we learned and how we know it.
Effective presentation is going to highlight the important things for the reader to understand without distraction or deception.
And we're going to see throughout more of this week and throughout more of the semester how practically to go about doing that.

📓 Data and Notebook#

These resources are used throughout many of the videos in this class:

Movie Scores notebook
The HETREC data

📃 Statistical Data Presentation#

Read Statistical Data Presentation by Junyong In and Sangseok Lee.

🎥 Introducing Statistical Graphics#

This video introduces basic principles of statistical graphics.

Video (14m15s)

Slides

Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics.
I want you to be able to understand the value of graphics for presenting data,
identify parts of a statistical image, and understand some pitfalls in graphics that we want to try to avoid.
So here's an example of a chart, and there's a variety of different pieces of this chart.
We have an x axis. That's the horizontal axis.
We have a y axis, the vertical axis. Each of these axes has a label:
Task 2, Task 1. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it.
It says that this graph is showing the number of queries per task, with query count distributions in the margins,
and each dot is one participant. So it tells us what a data point is,
and it tells us what it is that we're charting: number of queries per task.
We then see in the axis labels that we have Task 1 and we have Task 2.
Those two together give us the context to understand that, oh, we have two tasks,
and each dot is one participant,
and they're appearing at the point given by their Task 1 count and their Task 2 count.
OK, this allows us to see if there's any relationship between
how many queries it took to complete the two different tasks. It then says we have query count distributions in the margins.
So this is a compound plot. And in the margins, the x and y margins,
we have the distributions: a histogram of the x axis, Task 1,
and a histogram of the y axis, Task 2. These histograms don't have axes themselves because,
well, we just wanted to show a distribution. Particularly for our purposes here,
the exact number in each bin is not so important. The key thing is just being able to see, relatively,
where the mass is on the two different task counts.
And we can see that both of them have a right skew.
They're bulked up towards the low end of the scale.
And then we have all of the individual data points scattered on the chart.
This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify.
And particularly as you go to refine a chart, what you're going to need to do is specify what's happening on each of these pieces.
What is your x axis? What is your y axis?
Before you even start the chart, you need to set up your data so that you know: what is the data point that I'm going to be plotting on this chart?
So charts are really useful for revealing a variety of things. They can reveal patterns or lack thereof.
In this chart, there's really not much of a pattern.
We can see that it's bulked up, but particularly if we get out to the larger numbers of queries, there's not a lot of pattern.
The participant with the most queries for Task 1
has a middling to low number of queries for Task 2.
And the one who has the most queries for Task 2, while they're in the upper end of the queries on Task 1,
they're not at all the highest. So we can see there's not very much of a relationship here.
At least it doesn't look like one. Charts can also be useful for comparisons. If we've got a bar chart, we can compare two bars.
We can see where points lie. Like we can see in the chart that we just saw that the highest count for one task and
the highest count for the other task are different. We can also see trends.
We can see if a line looks like it goes up or down,
or wiggles around. So they can reveal a lot of these kinds of things, and they can really leverage our human perception,
particularly our human visual senses,
to be able to quickly internalize and understand what is going on in a set of data. When we're creating a chart,
we need to clearly document a few things. We need to clearly state what is being presented. When someone looks at a chart,
they need to be able to understand what each point in the chart is going to be.
They need to understand what values are plotted on the axes.
Often this is done in an axis label. In the chart I showed you,
the caption said what the values were and the axis labels said which version of them they were.
If there are units, that needs to be clear.
So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your chart,
either in the axis label or in the caption. Some of these things can sometimes be implicit in the type of chart. Take a histogram:
if you've got a fraction or a percentage on the left-hand side,
it's standard convention that we're talking about the fraction of the values that are in each bin,
at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about
what a value or an axis label is, or there's any doubt that the reader will understand what it is,
be explicit. Explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own.
You can assume a reasonable level of background; you have to know your audience for this.
But someone should be able to just look at the chart with its immediately surrounding description,
the labels, the caption, and have a pretty good idea of what's going on.
The surrounding text, the text that references the chart if you're writing a document,
can have your observations and can provide more context and clarity.
But someone just looking at the chart should be able to figure out basically what's going on and
not be too far off. This is particularly important because
(whether this is a good or a bad practice, we can debate) there are a lot of people who, when they're reading a paper,
focus on the charts and look at the key charts first to see what it is that's going on in the paper.
And if our charts are self-explanatory and clear, it makes it a lot easier for people to glance at our work,
see what it's doing, and decide whether they are going to pay it further attention.
So if you're putting a chart in a written document or a paper,
each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance.
It's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology,
what precisely some of the computations are, etc.
So if we have a caption, we still need to label our axes, but we don't need a title for the chart itself all the time.
It doesn't hurt, but often it's redundant with the caption.
In other contexts, though, we often do need a title, such as when we have a chart that's going in a presentation
or a chart in one of our notebooks. A title is often helpful in notebooks.
The surrounding text may be sufficient,
but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart.
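The labeling advice above can be sketched in Matplotlib roughly as follows. This is only an illustration: the query counts are made up, and the label and title wording is an assumption rather than the text of the chart shown in the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data: number of queries each participant issued for two tasks.
task1 = [3, 5, 2, 8, 4, 12, 6]
task2 = [4, 2, 3, 5, 9, 4, 7]

fig, ax = plt.subplots()
ax.scatter(task1, task2)  # each dot is one participant

# Say what is plotted on each axis, including the unit being counted.
ax.set_xlabel("Task 1 (queries)")
ax.set_ylabel("Task 2 (queries)")

# In a notebook there is no figure caption, so a title carries that context.
ax.set_title("Queries per task (each dot is one participant)")
```

In a paper, the title could be dropped in favor of a caption, but the axis labels would stay.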
So, a few pitfalls to be aware of when we're thinking about statistical graphics.
One is distorting the distances or the differences that are happening.
In particular, we need to make sure that if something has a length,
that length accurately represents quantities. With position,
if you have two dots, their relative position is what's important. But if we have a bar, it has a length.
It also has an area. We need to make sure those accurately represent quantities.
One really common way to violate this is having a bar chart whose axis starts at something other than zero.
The software we're using doesn't do that by default. Excel does.
But your bar chart always needs to start at zero, because
people don't look at the relative position of the bar. People see the whole height of the bar.
And so if it doesn't start at zero, it looks like the difference between bars is much higher relative to the bar size than it actually is.
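A small worked example of that distortion, using made-up values of 95 and 100. The helper `drawn_ratio` is a hypothetical name introduced here only for illustration; it compares the drawn bar lengths for a given axis baseline.

```python
def drawn_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values a and b when the axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

values = (95.0, 100.0)  # made-up quantities

true_ratio = drawn_ratio(*values)                # axis starts at zero
truncated = drawn_ratio(*values, baseline=90.0)  # axis starts at 90

print(f"{true_ratio:.2f}")  # 1.05: the second value is about 5% larger
print(f"{truncated:.2f}")   # 2.00: the second bar is drawn twice as long
```

With the axis starting at 90, a difference of about 5% is drawn as a bar twice as long.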
There's also ways in which we can violate conventions.
So in the first video I showed you the chart that violated the convention, that the x axis goes in order.
If we violate the user's expectations, they'll either be confused by the chart or read it wrong.
Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate by long patterns of reading,
like you assimilate how to read written text. And if those expectations are violated, that can
lead the user to incorrect conclusions from our charts, from our presentation.
A key thing to remember here that also applies to all of our presentations.
Research isn't a mystery novel. You don't have to worry about spoiling the surprise ending. The goal here is
not to subvert tropes or present shocking new presentations.
We might have shocking new evidence, but from a presentation perspective,
we want it to fit within conventions and not violate readers' expectations unnecessarily so that
they can read it and be confident that they've correctly understood what it is that you're saying.
Another thing to be aware of is that graphics can illustrate an effect.
They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for.
We have to be careful about that. We'll talk about some of the pitfalls: we have to be careful, because
we can't combine exploratory and what's called confirmatory analysis.
But visualizing data can help us look for possible effects and get ideas for what to go look for next.
They're not conclusive proof of an effect, though. We need the numeric results: the raw numbers,
as well as the results of inferential techniques that let us
estimate how big an effect is and whether it's significant, in order to come to any conclusions.
So in this chart I want to show you, for example, if we look at the chart closely,
we have these two data points, the little blue Xs: the FunkSVD X
is to the left of the item-item X. So it looks like, for this particular metric,
lower is a better value, because it's an error metric; root mean squared error is what RMSE stands for.
So it looks like, okay, this is a little bit better. But that's not sufficient evidence for us
to conclude that FunkSVD is better than item-item on the per-user RMSE metric.
Exactly what all these things are is a topic for another day.
But we see the thing to the left: if the effect is real, this illustrates it.
But seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy.
It's a relatively small difference. So they help us see,
they help us communicate, but they're not definitive and conclusive proof.
A couple of other things I want to highlight that are going on in this graph: I've introduced two different kinds of symbols here.
In the earlier graph, we just had one kind of symbol. We had dots. Here,
we have two different kinds, with a legend that says red circles are
a thing called global RMSE and blue Xs are a thing called per-user RMSE.
You don't have to understand what those are. But the point is, I'm using different colors and shapes in order to communicate,
to show different versions of a thing in the same chart. Using different shapes
in addition to different colors is useful because if it's printed on a black and white printer,
or if the reader has some form of color blindness, it helps make the differences clearer.
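One way to pair shapes with colors in Matplotlib is sketched below. The algorithm names and RMSE values are placeholders invented for this example; they are not the results from the chart discussed in the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

algorithms = ["Algorithm A", "Algorithm B", "Algorithm C"]  # placeholder names
global_rmse = [0.92, 0.90, 0.89]     # made-up values
per_user_rmse = [0.95, 0.93, 0.91]   # made-up values
positions = list(range(len(algorithms)))

fig, ax = plt.subplots()
# Each series gets both its own color and its own marker shape, so the two
# stay distinguishable in grayscale and for readers with color blindness.
ax.scatter(global_rmse, positions, color="red", marker="o", label="Global RMSE")
ax.scatter(per_user_rmse, positions, color="blue", marker="x", label="Per-user RMSE")

ax.set_yticks(positions)
ax.set_yticklabels(algorithms)
ax.set_xlabel("RMSE (lower is better)")
ax.legend()
```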
In addition, I've got my y axis, which is indicating the different things that I'm plotting here.
I also have grouped them just to make it easier for the user to see.
Like, these first ones are all single algorithms.
And then we have a blend and a few other things. The details
aren't important for this illustration, but it helps guide the user to understand the structure, so we have these group breakdowns.
It also helps save space in the paper because I can present all of these different things in one place.
It's easy to compare the different stages, even though I have to split them out in the discussion in the paper.
But it gives you one place to compare them and it concisely shows the key results of the entire paper in one chart.
So to wrap up, graphics can make data clearer and they let us leverage human perception to understand it.
They don't replace our numerical analysis,
but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it.
We do, however, always need to make sure that we clearly label and describe our graphics so that
readers can understand them and they can draw correct conclusions from them.

🎥 Manipulating Data#

This video goes over the core Pandas data selection and manipulation operations. It is primarily a tour guide — the technical content is in following notebooks.

Video (9m18s)

Slides

Hello. In this video, I want to talk with you about basic operations for manipulating data.
The learning outcomes for this video are for you to know key data reshaping operations and the corresponding Pandas functions,
and to think about the process of transforming data in steps. This is a tour guide to the corresponding notebook.
I'm not showing the actual code in the video,
but you're going to see in the notebook how you actually implement different versions of each of these steps.
So we think about the shape of our data. We have rows and we have columns.
We have a certain number of rows and columns, and each column has a type and a name.
The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing.
So our data are well organized. In each row we have the variables,
and that's going to be for one type of thing. So if you have a data frame of movies, each row represents one movie.
If we have data that's not in that kind of a format, we're going to talk about that later,
in sourcing and cleaning data: how do we get data in this kind of a tidy format?
For now, we're going to assume we have data in this format. This kind of layout is what the R ecosystem calls tidy data.
Now, each of these methods returns a new frame. A few of them are going to return a series.
But in general, we're gonna be transforming data frames to data frames here.
And so if our input is a data frame, each row is another observation of the same kind of thing.
The output will be a data frame. Each row is an observation of the same kind of thing.
It might be an observation, the same kind of thing as the input. It might be an observation of a different kind of thing.
But these are the different operations that we're going to be talking about here.
So if we want to select columns, we have a frame and we want the same frame
but with fewer columns, we have a few options. We can pick one column by treating the frame as a dictionary.
We can pick multiple columns by passing in a list of column names, the same way we pick one column.
One column will yield a series. Multiple columns will yield a frame. If we want to remove a column,
so we want to keep all of the columns except one or two or however many that we name, the drop method
returns a frame with all the columns of the original frame except the ones you tell it to drop.
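A minimal sketch of these column operations, using a small made-up `movies` frame. The column names are assumptions for illustration, not the layout of the course data.

```python
import pandas as pd

# Made-up frame: one row per movie.
movies = pd.DataFrame({
    "title": ["Alien", "Up", "Heat"],
    "year": [1979, 2009, 1995],
    "runtime": [117, 96, 170],
})

titles = movies["title"]                       # one column name -> Series
subset = movies[["title", "year"]]             # list of column names -> DataFrame
no_runtime = movies.drop(columns=["runtime"])  # all columns except the named ones

print(type(titles).__name__)     # Series
print(list(no_runtime.columns))  # ['title', 'year']
```

`drop` returns a new frame, so `movies` itself still has all three columns afterwards.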
If we want to select rows, we have a frame and we want the same frame, but with a subset of the rows.
A few common ways to do that: one is to select by a boolean mask. We set up a Pandas series that has boolean values
that are true for all the data positions we want to keep.
This is really good if we want to select by column values.
So we can use a comparison operator to create a mask where all the values for one column are equal to a particular value.
And then we can select using that boolean mask.
We can select by position in the frame, starting with zero.
We can do that no matter what the index is with iloc. So loc is the location
accessor for Pandas data frames; iloc accesses by integer position, always. loc indexes by index keys,
if we have the index keys we want. If we just loaded the data frame from a CSV file
and we haven't specified any index options, it's using the default range index.
Then selecting by position and by index key are the same thing.
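The row-selection options above might look like this on a small made-up frame. The titles and years are illustrative assumptions.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Alien", "Up", "Heat", "Coco"],
    "year": [1979, 2009, 1995, 2017],
})

# Boolean mask: a Series of True/False values, True for the rows to keep.
mask = movies["year"] >= 2000
recent = movies[mask]

# Position-based selection, regardless of what the index is.
first_two = movies.iloc[0:2]

# Key-based selection. With the default RangeIndex, key 2 is also position 2.
by_key = movies.loc[2]

# Once the index holds meaningful keys, loc and iloc differ.
by_title = movies.set_index("title")
up_row = by_title.loc["Up"]    # look up by index key
second_row = by_title.iloc[1]  # look up by integer position
```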
If we have a data frame where we've got our observations and we've got
a column that identifies some kind of a group that each observation is in (maybe it's ratings,
and it's the movie; maybe it's movies, and it's the actor or the genre),
and what we want is a frame or a series that has one row per group of the original data,
computing some kind of a statistic from a value in all of the rows within that group,
then we want a group-by and aggregate like we saw in the videos last week.
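As a rough sketch of group-and-aggregate, assuming a made-up `ratings` frame with one row per rating:

```python
import pandas as pd

ratings = pd.DataFrame({
    "movie": ["Alien", "Alien", "Up", "Up", "Up"],
    "rating": [4.0, 5.0, 3.0, 4.0, 5.0],
})

# One statistic per group -> Series indexed by the group key.
mean_rating = ratings.groupby("movie")["rating"].mean()

# Several statistics per group -> DataFrame with one row per group.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])

print(mean_rating["Alien"])  # 4.5
```

The input has one row per rating; the output has one row per movie, so the kind of observation has changed.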
A couple more transformations are to think about tall versus wide data.
So wide data has a column per variable.
In this case, this is data of the average speed for each of four different stages of a cycling race.
And so we've got a column for each of the four different stages, and our rows are for each cyclist.
Tall data, or long data, in its simplest form has three columns.
We have the identifier, we have the variable name,
and we have the variable value. Sometimes these will just be called id, variable, and value.
But it's often useful to give the variable and value columns meaningful names.
We could also have more than one ID column if we need to.
But the idea here is that rather than having the stages in different columns, we split them out into different rows.
So cyclist one has four rows, one for each of the four stages.
We still call this an observation for one thing and for the same kind of thing.
It's just that in the wide data, each of our observations is for a cyclist and it's an observation of their speed for all four stages,
whereas in the long data, each observation is for
one cyclist in one particular stage. So each cyclist will have four observations, one for each stage.
Tall data is useful for plotting and grouping because a lot of our
plotting utility functions are going to want to deal with a categorical variable that we use to determine maybe the x axis,
maybe the color. And so often we're going to need tall data, especially when we're going to be plotting.
If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, use the pivot or pivot_table methods in Pandas.
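A small sketch of the round trip between wide and tall data. The cyclist labels and speeds are invented, and only two stages are used to keep the example short.

```python
import pandas as pd

# Wide: one row per cyclist, one column per stage.
wide = pd.DataFrame({
    "cyclist": ["A", "B"],
    "stage1": [40.1, 39.5],
    "stage2": [38.7, 41.0],
})

# Wide -> tall: one row per (cyclist, stage) pair, with meaningful column names.
tall = wide.melt(id_vars="cyclist", var_name="stage", value_name="speed")

# Tall -> wide again: stages become columns, cyclists become the index.
back = tall.pivot(index="cyclist", columns="stage", values="speed")

print(tall.shape)  # (4, 3)
```

`pivot` requires each (cyclist, stage) pair to appear only once; `pivot_table` aggregates when there are duplicates.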
You can also create tall data from a list. So if we have a data frame where one of the columns actually contains lists
(we haven't seen any data with this so far except the genres column in the MovieLens data),
and what we want is one row per list element,
so we want to take this list that's in a column and split it out so that each element gets another row, where it's going to go ahead and
duplicate the rest of the columns, so they're going to have their values repeated once for each of the elements
of this list, the Pandas explode method will do that. Then finally, to convert between series and data frame.
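The list-splitting operation described just above can be sketched like this, assuming a made-up frame whose `genres` column already holds Python lists:

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Alien", "Up"],
    "genres": [["Horror", "Sci-Fi"], ["Animation", "Adventure", "Comedy"]],
})

# One row per list element; the other columns are repeated for each element.
movie_genres = movies.explode("genres")

print(len(movie_genres))  # 5
```

The original index labels are repeated too, so a `reset_index(drop=True)` afterwards may be useful if unique row labels are needed.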
So if we have if we have a data frame and we want to get a series, we just select the column from the data.
We saw that the beginning. If we have a series and we want to get a data frame,
we can just create a single column frame with two frame and the two frame method on the serious object also.
But to give it a names that you have a name and the resulting data frame,
you can also if you want to create a multi column data frame where you've got a column for the value end,
you have a column for the index of the original of the original series.
The Pandas, the series Freeze Reset Index Method or pop that index out into a data frame column.
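The list and series conversions described here can be sketched as follows. The movie titles, genres, and ratings are invented for illustration; only `explode`, `to_frame`, and `reset_index` are the pandas methods the video names.

```python
import pandas as pd

# Hypothetical movie table where one column holds lists of genres.
movies = pd.DataFrame({
    "title": ["Alien", "Up"],
    "genres": [["Horror", "Sci-Fi"], ["Animation"]],
})

# One row per list element; the other columns are repeated.
movie_genres = movies.explode("genres")

# A series of (made-up) mean ratings, indexed by title.
ratings = pd.Series([4.0, 3.5], index=pd.Index(["Alien", "Up"], name="title"))

# Series -> single-column data frame, naming the column.
frame = ratings.to_frame(name="mean_rating")

# Series -> two-column data frame: the index becomes an ordinary column.
with_index = ratings.reset_index(name="mean_rating")

print(len(movie_genres))         # 3
print(list(with_index.columns))  # ['title', 'mean_rating']
```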
And then finally, if you have a series with multiple levels to its index (we haven't seen those yet, but we're going to see them from time to time),
the unstack method will turn the innermost index labels into column labels,
turning the series into a data frame. So, to think about strategy.
Each of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want.
And so what I recommend is that you decide what you want the end product to look like.
If you're going to draw a chart or you're going to do an analysis or an inference,
what are the observations and the variables that you need for that chart or inference?
And then once you've figured that out, you can plot a path from your current data to your end product.
So if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select
only the rows for movies that are in the horror genre.
You're probably going to need a join as well in order to get the genre table and the movie table connected, depending on how your data is laid out.
And once you've filtered it down (OK, these are the horror movies), you need to get the ratings.
You need to have the average ratings, and you need to combine those with the movies, as we've seen the ability to do in a previous video.
And then you have the observations that you want. You need to be able to plot this kind of a path from what you have to the end product.
In this example, I've referenced some joins. We saw joins very, very briefly last week.
We're going to see them again in more detail in the notebooks. So to wrap up, pandas has many tools for reshaping data.
You want to start with the end in mind, and work from what you have to what you need.
Read the tutorial notebooks for a lot more details.
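The horror-movie path from the video could look roughly like this. It is a sketch under assumptions: the three tables, their column names, and their contents are hypothetical stand-ins, and your data may be laid out differently.

```python
import pandas as pd

# Hypothetical tables; names and layout are assumptions for illustration.
movies = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
movie_genres = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],
    "genre": ["Horror", "Sci-Fi", "Comedy", "Horror"],
})
ratings = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [4.0, 2.0, 5.0, 3.0, 4.0],
})

# Step 1: select the ids of the movies in the horror genre.
horror_ids = movie_genres.loc[movie_genres["genre"] == "Horror", "movie_id"]

# Step 2: group-aggregate to get one mean rating per movie.
mean_ratings = (
    ratings.groupby("movie_id")["rating"].mean().reset_index(name="mean_rating")
)

# Step 3: join the statistic to the movie table and keep only horror movies.
horror = movies.merge(mean_ratings, on="movie_id")
horror = horror[horror["movie_id"].isin(horror_ids)]

print(horror)
```

The `mean_rating` column of `horror` now holds one observation per horror movie, which is what a histogram of the distribution needs.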

📓 Selecting Data#

Read the 📓 Selecting Data tutorial notebook to learn how to select data from a data frame.

I encourage you to read relevant tutorial notebooks throughout the semester, and will link to them when appropriate; I am making these three specifically assigned readings this week.

📓 Reshaping Data#

Read the 📓 Reshaping Data tutorial notebook to learn how to manipulate the shape of data frames in various ways, including merging two data frames into one.

📓 Missing Data#

Read the 📓 Missing Data tutorial notebook.

🎥 Types of Charts#

In this video, I discuss several common types of charts for statistical graphics, and how to choose an appropriate one. It complements the “Statistical Data Presentation” reading.

Video (13m30s)

Slides

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
TYPES OF CHARTS

Learning Outcomes
Identify the appropriate type of chart for data and a question
Understand key rules to avoid common errors
Photo by KOBU Agency on Unsplash

Software
Seaborn (sns)
Matplotlib (plt)
Plotnine / ggplot2 (pn)

Chart Types
XKCD #688, ⓒ Randall Munroe. Used under CC-BY-NC

Bar Charts
Show numeric values grouped by a categorical (or ordinal) variable
Best with moderate number of categories
Can have second categorical in bar color
Y often mean, sum, or count within group
Can rotate to horizontal bar
Whiskers: confidence interval
Titanic Passenger Survival Rates by Gender and Passage Class
From Seaborn gallery

Bar Charts
Functions:
sns.countplot (count by category)
sns.catplot (mean by category)
plt.bar
pn.geom_bar
Titanic Passenger Survival Rates by Gender and Passage Class
From Seaborn gallery

Bar Chart Rules
Never start y axis at anything but 0 – skews relative sizes
If including whiskers: define how they are computed
If using SNS catplot or countplot without a color group: set color, or they’ll recolor for no reason.

Histograms
Bar chart where ‘categorical’ is bins of a numeric value.
Bar chart showing relative frequency of categorical values also a histogram
Y is either number or fraction of occurrences
Goal is to see relative frequency of different values
One way to graphically describe a distribution.

Scatter Plots
Shows two numeric values
Observations have two numeric variables
Want to see how they relate
Does one increase with the other?
Do points clump in space?
Are there other patterns?
Outliers?
Restaurant Tips and Bills
From Seaborn documentation

Refinements
Color by categorical variable
Plot a trend or context line (not shown)
X can be categorical (point plot or strip plot)
Functions:
plt.scatter
sns.scatterplot
pn.geom_point
Restaurant Tips and Bills
From Seaborn documentation

Line Plots
Two numeric values
One y per x value
Emphasizes progression (or continuity) from one to the next
Very common for time series
Functions:
sns.lineplot
plt.plot
pn.geom_line
From Seaborn tutorial

Box Plots
Show distribution of numeric variable grouped by categorical
Median
Quartiles
Min/max
Outliers (in much software)
Functions:
sns.boxplot
plt.boxplot
pn.geom_boxplot
From Seaborn gallery

More Plot Types
Violin plots (like box, but mean-based)
Swarm plots (categorical scatter plot)
Pie (usually best avoid – bar or stacked bar)
Donut
Rug (displaying distribution in a margin)

Learning More
Class readings
Textbook
Seaborn and Matplotlib docs
Tutorials
Gallery

Wrapping Up
Many types of charts.
Learning good graphics techniques takes time and practice.
Review plotting library galleries!
Photo by Edgar Chaparro on Unsplash

Welcome back. In this video,
I'm going to walk you through some of the different types of charts that we're going to be learning how to create. Our learning outcomes are to
be able to identify the appropriate type of chart for data and a question, and to understand key rules to avoid common errors.
I'm not going to be showing the detailed code for these chart types in the video.
You're going to be able to find that in the documentation linked from here. And also,
I'm going to be preparing a notebook that demonstrates various of these chart
types with the actual code to create them using the software we're discussing.
So the common software for this is Seaborn and Matplotlib; those are going to be the primary ones that we're working with this semester.
When I'm showing the function names, Seaborn is commonly imported as sns,
so an sns function is going to be a Seaborn function, and a plt
function is going to be a Matplotlib function. I'm also showing the function you can use in plotnine, or R's ggplot2,
if you want to use those instead. I often use plotnine for a lot of my graphics.
That's just for reference, though. We're not going to be getting into much detail on plotnine in this course.
So there's a variety of different types of charts. Some of them are showing relative proportions.
Some of them are showing how different amounts relate to each other. Some of them are showing positions in an x-y coordinate space.
A bar chart is a very common type of chart that shows numeric values grouped by a categorical or ordinal variable.
Sometimes they're grouped by a numeric variable as well,
but usually our x axis is a categorical variable of some kind. It's best with a moderate number of categories.
We can use a second categorical variable to, say, color the bars.
So this chart shows the survival rates of Titanic passengers, where the x axis is the passage class: first, second or third class.
And then the bars are colored based on the gender of the passenger.
And so we can see the different survival rates.
The y axis on a bar chart is often a mean or a sum or a count within the group determined by our categorical variables.
Sometimes these will be horizontal. In a horizontal bar chart, the categorical is on the y axis and the bars run horizontally.
This also shows some whiskers that come from a confidence interval.
It's very easy to generate a default, relatively good confidence interval with Seaborn. To plot
these, Seaborn has the countplot function, which does a quick,
basically categorical histogram: how many observations are in each category.
The catplot function (with kind="bar") will plot by default a mean value for each category.
And if you have it do the mean plotting, it will also compute 95 percent confidence intervals.
That's what's being shown in this plot here.
And then you can also use the plt.bar function or the plotnine geom_bar.
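Here is a small sketch of the two Seaborn bar-chart styles just described. The passenger records are made up (they are not the real Titanic data), and it uses `sns.barplot`, the axes-level function that `sns.catplot(kind="bar")` draws with, so that both charts can share one figure.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up passenger records, for illustration only.
passengers = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "sex": ["female", "male", "female", "male",
            "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

fig, (ax_count, ax_mean) = plt.subplots(1, 2, figsize=(9, 3.5))

# Count of observations in each category; one fixed color for all bars.
sns.countplot(data=passengers, x="pclass", color="steelblue", ax=ax_count)

# Mean of the 0/1 outcome per class, colored by a second categorical.
sns.barplot(data=passengers, x="pclass", y="survived", hue="sex", ax=ax_mean)
ax_mean.set_ylabel("Survival rate")

fig.tight_layout()
```

With this little data the whiskers are not meaningful; the point is only which function produces which kind of bar.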
So a few rules about bar charts. First is never start the y axis on a bar chart at
anything but zero. And the reason for this we can see here.
These charts are looking at the mean average ratings:
we take each movie's mean rating, and then we compute the mean of the average ratings within a genre.
If we look here at the top one, the difference between horror and IMAX is a notable difference, but it's a difference of about 0.5 or so.
The difference between sci-fi and short is a difference of a little under one, probably.
But when we start the y axis at 2.5 instead of zero,
what happens is the differences look much larger than they are, because
we not only want to see the difference; it's very natural for us to compare the difference to the bar length. Because these are bars,
they have length and they have an area, and since they're all the same width, the length is proportional to the area.
Breaking length-area proportionality is a good way to confuse your readers.
It looks like IMAX movies have twice as high an average rating as horror movies because the bar is twice as high, but they don't.
It's really a shift from about 2.8 to 3.3 or 3.4.
And so it creates a distortion: it highlights the differences, but it makes the differences look larger than they are.
So when I talked about integrity and avoiding deception, when I was introducing statistical graphics,
this is what I was talking about. The difference is there; it's just not as big as it looks like it is
when we truncate our bar charts.
So the general rule here, to generalize beyond bar charts, is that if something has a length that varies based on the data,
that length needs to actually represent the value, not the value minus something because you started the axis somewhere else.
Also, if you're including whiskers, like I did in the previous chart,
define how they're computed. And one more thing to be careful of with Seaborn's catplot and countplot:
if you aren't using the color for a second variable, they will just make every bar a different color for no particular reason,
which creates something that's different when it doesn't need to be.
It causes the reader to look for a difference that isn't actually there, so it's best avoided.
You can fix that by just specifying the color.
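To see why a truncated axis distorts, it can help to compute how long the bars are actually drawn. This sketch uses the approximate horror and IMAX values mentioned in the video (about 2.8 and 3.3); `apparent_ratio` is a hypothetical helper written for this example.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt

# Approximate values read off the chart in the video; treat as illustrative.
genres = ["Horror", "IMAX"]
mean_rating = [2.8, 3.3]

def apparent_ratio(values, baseline):
    """Ratio of the drawn bar lengths when the axis starts at `baseline`."""
    low, high = values
    return (high - baseline) / (low - baseline)

honest = apparent_ratio(mean_rating, baseline=0.0)     # about 1.18
truncated = apparent_ratio(mean_rating, baseline=2.5)  # about 2.67

fig, (ax_zero, ax_cut) = plt.subplots(1, 2, figsize=(8, 3))
ax_zero.bar(genres, mean_rating, color="steelblue")
ax_zero.set_ylim(bottom=0)    # bar length equals the value
ax_cut.bar(genres, mean_rating, color="steelblue")
ax_cut.set_ylim(bottom=2.5)   # bar length equals the value minus 2.5

print(round(honest, 2), round(truncated, 2))
```

Starting at zero, the IMAX bar is about 18 percent longer than the horror bar, which matches the data. Starting at 2.5, it is drawn more than twice as long.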
We saw histograms last week. A histogram is a bar chart where the categorical is bins or ranges of a numeric value.
Also, though, if we have a bar chart that's showing the relative frequency of categorical values, that can also be called a histogram.
The y axis is either the number or the fraction of occurrences in this case.
The key thing, though, is that the different heights of the bars let us see visually the relative frequency of different values.
So it really makes it visually clear how the data is shaped.
We can see skew and things like that. It's one way to graphically describe a distribution.
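The two choices for a histogram's y axis, number of occurrences or fraction of occurrences, can be sketched with NumPy alone. The values and bin edges below are made up for illustration.

```python
import numpy as np

# Made-up ratings; the bin edges are an arbitrary choice.
values = np.array([1.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.0, 4.5, 5.0])
edges = np.array([1, 2, 3, 4, 5])

counts, _ = np.histogram(values, bins=edges)  # number of occurrences per bin
fractions = counts / counts.sum()             # fraction of occurrences per bin

print(counts)     # [1 2 3 4]
print(fractions)  # [0.1 0.2 0.3 0.4]
```

Either array gives bars with the same relative heights, so the shape of the distribution (here, skewed toward high ratings) reads the same way.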
A scatterplot shows two numeric variables. Each observation is a dot,
and each observation has two numeric variables. We put one variable on the x axis and
the other variable on the y axis, and put the dot where its variable values intersect.
This is really useful for seeing how two variables relate. Does one increase with the other?
Do points clump or cluster in an interesting way? Are there other interesting patterns?
It helps us find outliers. So this scatterplot is showing the tip versus the total bill for a bunch of restaurant bills.
Each observation is a bill.
The x axis is the total bill, and the y axis is the tip that the customer added to the bill.
There are a couple of refinements we can do here. We can color or change the point type by a categorical variable.
On this one, we've changed it so that the points are different colors and shapes:
the dinners are blue circles and the lunches are orange Xs.
We could also add a trend line or some other kind of line to show some context. For example, on this chart,
we might want to plot a line that shows the 20 percent point, and that would let us easily see where we're going over 20 percent and
how the tips are distributed relative to a 20 percent mark.
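A context line like the 20 percent mark can be added with ordinary Matplotlib calls. This is a sketch with invented bills and tips, not the Seaborn tips data shown in the video.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up bills and tips, for illustration only.
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.5, 5.0, 4.5, 9.0, 8.0])

fig, ax = plt.subplots()
ax.scatter(total_bill, tip)

# Context line: the tip that would be exactly 20 percent of each bill.
xs = np.array([0.0, total_bill.max()])
ax.plot(xs, 0.20 * xs, color="gray", linestyle="--", label="20% tip")
ax.set_xlabel("Total bill")
ax.set_ylabel("Tip")
ax.legend()

# The same comparison as numbers: which tips sit above the line?
above_20 = tip > 0.20 * total_bill
print(above_20.tolist())  # [False, True, False, True, False]
```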
X can also be a categorical variable. When that happens, we call this a point plot or a strip plot.
Functions for doing this are plt.scatter, sns.scatterplot, and then plotnine's geom_point.
The Seaborn documentation has some examples of more of these. A line plot
is like a scatterplot in that we have two numeric variables. However, it emphasizes the progression or continuity from one value to the next
by connecting them with a line. It really works best when we have one y per x value that we want to plot.
If we've got more than one, it really starts getting very, very jagged. It's very common for time series.
So this is another example from the Seaborn tutorial. It's not labeled super well;
I don't know what the value actually is, but it shows that we have some kind of a value that's changing over time, and it's going negative.
Zero on the y axis is at the top, and the values otherwise are negative. The functions to create
these are lineplot from Seaborn, plot from Matplotlib, and geom_line from plotnine.
A box plot shows the distribution of a numeric variable grouped by a categorical.
So the bar chart just showed us, say, the average value, maybe with a confidence interval.
The box plot actually shows us the distribution, and it does so in a way that's based on the median.
The horizontal line in the middle of the box is the median value. The top and bottom of the box
are the first and third quartiles: the bottom is the first quartile and the top is the third quartile.
And what that means is 25 percent of the values are below the bottom of the box,
25 percent are in the bottom half of the box, 25 percent are in the top half, and then 25 percent are above.
We then show these whiskers that extend out to the minimum
and maximum of the data, and a number of plotting packages will do some kind of outlier detection.
This is using Seaborn's default outlier detection. The rule it uses by default
starts from the IQR, the interquartile range. That's the height of the box. It allows the whisker to be 1.5 times that tall.
And if you have any data points that are further away than that, it plots them as individual points, which makes it easy to see outliers.
You can change it so that the whisker goes all the way up to the max, but it lets you quickly see and compare between different groups
the median, the first and third quartiles, and the min and the max of the data.
It's very useful for comparing observations of a variable when they're grouped by some categorical. The functions
for doing this are boxplot from both Seaborn and Matplotlib, and then geom_boxplot from plotnine.
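The pieces of a box plot can be computed by hand, which makes the 1.5 times IQR whisker rule concrete. This is a sketch on a made-up sample; plotting libraries may use slightly different quartile conventions, so their numbers can differ a little from `np.percentile`.

```python
import numpy as np

# Made-up sample with one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # the height of the box

# Whiskers may reach at most 1.5 * IQR beyond the box...
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# ...and are drawn at the most extreme data points inside those limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the limits is plotted as an individual point.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1.0 9.0
print(outliers)                   # [30.]
```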
A few more plots: a violin plot is like a box plot, except it's based around the mean and has curved sides.
The swarm plot is another kind of categorical scatterplot.
It's usually best to avoid pie charts, especially 3D pie charts. A lot of our software is not going to produce 3D charts very easily.
Don't try to go make a 3D chart. They're almost always more confusing, especially like the 3D bars that you have from vintage PowerPoint.
But avoid even a plain pie chart, just because human perception is not super great at accurately comparing angular areas.
So usually a bar chart or a
stacked bar chart is going to be a better option than a pie chart. A donut chart is sometimes a better option where you've got a circle.
This is one place where I disagree with the reading. The reading that I gave you recommends pie charts for showing relative proportions.
I recommend usually avoiding those. Use a bar chart, or a stacked bar chart if you
want to show relative proportions within different categories.
There's another kind of plot that's not a plot on its own, but is combined with other kinds of plots.
That's a rug plot, useful for just displaying distributions at a margin.
So to learn more: I've taken a whirlwind tour through a number of different plot types. There are the class readings;
the paper that I assigned you to read talks through the use cases for a number of different plot types.
I'm going to be providing tutorial notebooks that walk you through different plot types.
The textbook talks about plotting and data visualization.
The Seaborn and Matplotlib docs are extensive. And
if you're using another plotting library, read its documentation as well. Most plotting libraries also have a gallery.
Go through the gallery, look for a plot that has a feature you want in your plot or that you think might be useful for displaying your data.
Click on it and they'll give you the code to show you how they made that plot.
You might want to combine pieces from multiple plots. In practice, it takes a lot of trial and error to really get the hang of your plotting library
and figure out how to make it show you the data in the way you really want it to.
Learning one plotting library really deeply is useful. A lot of the Python ones,
especially the ones that are oriented towards static charts, are built on top of Matplotlib.
So Seaborn is a convenience API on top of Matplotlib. If you're using Seaborn,
you're also going to need to use Matplotlib calls a lot of the time, when Seaborn gets you 90 percent of the way there
but not quite all the way. So to wrap up, there are many different types of charts that have different use cases.
Learning graphics techniques takes time and practice. Take some of the example notebooks that I'm providing.
Take some of the examples from, say, the Seaborn gallery.
Play with them with some data that I'm giving you, play with them with some data that you have elsewhere.
But it takes time and practice, so spend some time with the galleries of the plotting libraries you're using.

Resources#

Notebook
Seaborn gallery
Seaborn tutorial — organized topically, very good resource
Matplotlib gallery
Plotnine gallery

🎥 Metrics and Differences#

We talked about the notion of “relative” differences, but what are they?

Video (6m50s)

Slides

Hello. In this video,
I want to talk with you just a little bit about statistics that we can measure and how we want to think about differences between them.
In an earlier video, I talked about how bar charts emphasize relative differences.
So in this one, I'm going to talk just a little bit about what that means.
Our learning outcomes for this video are to review a little bit of the statistics or the metrics that we've been talking about,
for you to be able to compute absolute and relative differences, and to interpret a relative difference between two quantities.
So as we've talked about, we can compute various statistics over our data: means, medians, modes, success rates.
We can compute percentiles, counts, many others.
Basically, any statistic you can think of that you can compute over a set of numbers, we can use as some kind of a statistic or a metric.
And this often serves as the metric for analysis or evaluation. If we're trying to evaluate a program or trying to evaluate a technology,
we have some metric that is measuring its effectiveness,
and we want to see whether it's improved or changed somehow.
So when we compare two values, though, there are a few different ways to do it.
Let's take the 2018 population estimates of Boise and Salt Lake City.
There are a few different ways that we can compare them.
One is the absolute difference: Boise has 28 thousand more people than Salt Lake City.
Another is the relative difference: Boise has 14.5 percent, or about 14 percent, more people than Salt Lake City.
So the absolute difference between two values is actually the absolute value of the difference between the two values.
We can also talk about a signed difference or a real difference, where we don't take the absolute value, if we need the direction of the difference.
That becomes useful. But what we're talking about here is the actual difference in the underlying units.
So in our example case, number of people. Another way we can do it is to talk about the relative difference.
This is the difference normalized by
the reference quantity, and we have to be clear on which one is
the reference quantity. The reference quantity is the one we're starting from, from which we're computing the relative difference.
So, for example, 50 is 25 percent more than 40, because you take 25 percent of 40, which is 10, add that, and you get 50.
But 40 is only 20 percent less than 50, because you take 50;
10 is 20 percent of 50, whereas it's 25 percent of 40.
You subtract it, and you get 40.
So this order difference is really, really important.
So for the 50, what we've got is 50 minus 40, over 40.
That's this one. And for the 40 we have 40 minus 50, over 50: negative 10 over 50.
That's 20 percent. Ten over 40 is 25 percent.
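The 50-versus-40 arithmetic can be written as a tiny function, which makes the role of the reference quantity explicit. `relative_difference` is a hypothetical helper named for this example.

```python
def relative_difference(value, reference):
    """Signed difference of `value` from `reference`, as a fraction of `reference`."""
    return (value - reference) / reference

print(relative_difference(50, 40))  # 0.25 -> 50 is 25% more than 40
print(relative_difference(40, 50))  # -0.2 -> 40 is 20% less than 50
print(abs(50 - 40))                 # 10   -> the absolute difference
```

Swapping the arguments changes the answer because the denominator changes, which is why a written comparison needs to say which quantity is the reference.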
So we need to be really, really careful about the order. Another way we can compare is with the ratio: we just divide one quantity by the other.
So this is when we say something like this year's sales of 20 million are twice as much as last year's 10 million.
It's an absolute change of 10 million and it's a relative change of 100 percent.
Twenty million is twice as much as 10 million,
and it is 100 percent higher than 10 million.
Now, one thing to think about: if I say this year's returns are two times larger than last year's,
what does that mean? Does it mean it's two times as large?
Does it mean it's 200 percent more, which would be three times as large?
Is it clear? I would submit that this way of framing it is ambiguous,
and so we should avoid it. The appropriate comparison really depends on context and problem.
There's not a hard and fast rule for when you need one or another.
Relative comparisons are quite common because they can be compared across a variety of contexts.
But we still also need to pay attention to the underlying absolute difference,
to what the change being described by this relative change actually is.
One example of a high-profile relative change
is the Netflix Prize, which was run by Netflix a number of years ago. They paid a million dollars to the team that was able to beat
their internal movie recommender on the metric that they chose by 10 percent.
They wanted a 10 percent improvement. And for this metric, lower is better,
so they wanted a decrease in the metric of 10 percent.
We can also talk about a difference between differences, because if we compute a difference, that difference itself is just another value.
So we could say sales grew 10 percent more this year than last year.
If we define growth as one year's sales minus the previous year's sales, then we can look at the growth of this year
and the growth of last year, and we can compute the difference in differences.
And so we can have a 10 percent increase in growth. Differences in differences come up a lot in various contexts,
and so it's important to be able to reason about those as well. And again, be clear both in writing and understanding.
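A difference in differences is just the same arithmetic applied twice. In this sketch the sales figures are invented so that growth rises by 10 percent; they do not come from the video.

```python
# Hypothetical yearly sales, in millions; the numbers are invented.
sales = {"two_years_ago": 100.0, "last_year": 120.0, "this_year": 142.0}

# Growth is itself a difference: one year's sales minus the previous year's.
growth_last = sales["last_year"] - sales["two_years_ago"]  # 20.0
growth_this = sales["this_year"] - sales["last_year"]      # 22.0

# Difference in differences, in absolute and relative terms.
diff_in_growth = growth_this - growth_last         # 2.0 (million)
rel_diff_in_growth = diff_in_growth / growth_last  # 0.1, i.e. 10% more growth

print(diff_in_growth, rel_diff_in_growth)
```

Note that "growth was 10 percent higher" here is a statement about the 20 and the 22, not about total sales, which rose by a different percentage each year.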
For visual comparison, bar charts emphasize relative difference because the height of the bar is right there,
and the eye very naturally compares the difference between bars to the height of the bar itself.
Point plots emphasize absolute difference because you don't have the reference point of the size of the bar.
Both of them make it pretty easy to also compare the differences,
to see how different the differences are. Those are evident both in bar charts and point plots.
So to wrap up, there are three primary ways to compare statistics: absolute difference, relative difference, and ratio.
You need to be very clear and unambiguous when you're writing the results of a
comparison, and also when you're trying to understand what others have written.
Seek to accurately understand it. And
if you're in a context where you're providing feedback and the comparison is not clear, that clarity is something you want to ask for in revision.

🎥 Charts from the Ground Up#

In this video, I discuss how to design a chart from your questions, goals, and data.

Video (22m14s)

Slides

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
CHARTS FROM THE GROUND UP

Learning Outcomes
Design a chart by thinking first of the questions, goals, and data
Photo by Ryan Quintal on Unsplash

What is the Question?
What question do we want to answer?
How, precisely, did we operationalize it?
What data are we using?
Specifically, variables

Example: Titanic Survival
Question: were passengers in higher classes more likely to survive?
Outcome variable: survival
Logical variable, 0/1 encoded
Taking mean yields numeric response!
Mean is probability of survival (1)
Also called response variable or dependent variable
Explanatory variable: passage class
Categorical
Also called independent variable
Bar chart

Showing Relationships
Most plots show how a numeric variable changes between different values of one or more other variables.
What is the variable to show?
What do we want to show about it?
Statistic?
Distribution?
How do we want to compare between EV values?
Even histograms follow this!
Response: frequency (count, proportion, density)
Explanatory: value or bin

Pipeline
Identifying these informs:
Data processing (e.g. binning, group-aggregate, transform)
Choice of plot type
Axes, labels, colors, facets

Variable Types
Response is numeric (or transformed to be)
Categorical → relative frequency
Explanatory can be anything
Numeric → continuous axis
Categorical → discrete axis
Ordinal → discrete axis preserving order; may omit some labels

One Explanatory Variable
If EV is continuous: scatter or line
Scatter plots we sometimes blur response/explanatory
If EV is discrete:
Bar chart shows statistic, emphasizing relative difference
Point plot shows statistic, emphasizing absolute difference
Box or violin plot shows distribution

Two Explanatory Variables
Pseudo-3D:
Contour plot: identify peak(s) & shape
Heat map: see where high & low points are

Demonstration

Two Explanatory Variables
Pseudo-3D:
Contour plot: identify peak & shape
Heat map: see where high & low points are
Other aesthetics for secondary variables: color, shape, size
Occasionally used to indicate second response variable
Titanic by Class and Sex

More than Two EVs
Can’t really have more than 1, maybe 2 numeric
Others can be binned
Facets let us break down the plot by more categorical explanatory variables.

Facet Plot

More than Two EVs
Can’t really have more than 1, maybe 2 numeric
Others can be binned
Facets let us break down the plot by more categorical explanatory variables.
Pay attention to order. It strongly affects what readers compare.

Stacking
Stacking lets us see differences in composition – how do the parts of a whole change?
Stacked bar charts
Stacked area charts
Can be either raw values or fractions.

Transformations
Sometimes we transform the axis
Log-10 scale – shows order of magnitude
Generally don’t do this to bars
Sometimes we transform the data
Rescale, log, square root
Normalize by a value

Be Careful
Avoid excessive complexity
Be careful with color (easy to make indistinguishable)
A good graphic reveals the data, and does not distort or obscure

Wrapping Up
Identify variables and relationships you want to highlight.
Design a plot that illustrates them.
Study plotting library APIs.
Photo by Daniel Cheung on Unsplash

Hello. In this video, I want to talk to you about how to build a chart from the ground up as we think
about the question it's going to try to answer and the pieces that need to go into it.
So the learning outcome for this video is for us to be able to design a chart by thinking first of the questions,
the goals, and the data that are going to be in and come from the chart.
A good chart answers a question, and the guiding principle for how we design and
how we lay out our chart is to illuminate the question that we want to answer.
And this depends on a few things. We need to know what question we want to answer in the first place.
We also need to know precisely how we operationalized that question, so we can use that to inform how we go into the chart layout.
And we need to know what data we're using, specifically what variables we're using as a part of this chart.
For example, there's a data set (you'll see it in the notebook that goes with this video) of passengers on the
Titanic, and suppose we wanted to examine whether passengers in a higher fare class,
say, first class, were more likely to survive than passengers in lower fare classes.
In this analysis, we have an outcome variable: zero or one,
whether or not the passenger survived the Titanic sinking. A lot of charts are going to have an outcome variable.
We have some outcome variable and we want to see how it responds to, or how it differs with, some other variable,
which we call the explanatory variable: in this case, the passage class. Our outcome is survival,
and we want to see how it changes as the passenger's passage class, the explanatory variable, changes.
The outcome variable is also called the response variable or the dependent variable,
because it's what we're trying to measure that's responding to the condition we're trying to analyze.
And the explanatory variable is sometimes called the independent variable because it's changing,
but it's not changing as a function of the other variables, in theory.
So we can do this with a bar chart. In this bar chart, the x axis is our passage class, first,
second and third, and the y axis is the fraction of passengers in that class who survived.
We also see some error bars. We're going to see later what those mean and how to compute them.
But this lets us see how the outcome, survival, changes with the different passage classes of the passengers.
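The step from a 0/1 outcome to the bar heights is a group mean. Here is a minimal sketch with made-up passenger records (not the real Titanic data); the column names `pclass` and `survived` are assumptions.

```python
import pandas as pd

# Made-up passenger records, for illustration only.
passengers = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 outcome within each group is the fraction who survived.
survival_rate = passengers.groupby("pclass")["survived"].mean()
print(survival_rate.tolist())  # [0.75, 0.5, 0.25]
```

The index of `survival_rate` (the explanatory variable) goes on the x axis and its values (the response) go on the y axis of the bar chart.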
And one of the things to note here is that we have our explanatory variable on the X axis and the outcome variable on the Y axis.
That is the general convention. There are some cases where we might want to flip it.
So we've we've got a horizontal bar chart where the explanatory is on the Y and the outcome is on the X,
particularly if we if if it makes the labels more readable.
But the standard convention for most types of charts is to put explanatory on x axis, the horizontal axis and the outcome variable on the Y axis.
And this chart shows the relationship, many charts or relationship,
most of the plot that we're going to be drawing in this class show how some kind of a numeric variable either continues or.
Or integer changes between different values of one or more other variables, and in this case, even though our response was zero one logical.
When we convert it into a rate per per passage class, it became a continuous variable.
And so when we do this, we need to identify a few key things to design our plots.
We need to identify what variable we want to show. For a lot of plots, that'll be on our y axis.
When it's not, it'll usually be on x. Identifying that variable is,
if anything, probably the most important thing in designing a plot.
We then need to identify what we want to show about this variable.
Do we want to show its value for different data points? Do we want to show a statistic?
Do we want to show, for example, the mean or the rate?
In the previous Titanic example, we showed a statistic: the rate of survival.
Do we want to show its distribution?
And then how do we want to compare that between values of the explanatory variable, particularly, do we want to look at absolute differences?
Do we want to look at relative or proportional differences? Even histograms follow this kind of design, because they have an outcome,
which is the frequency, the count, the proportion, or the density, depending on precisely what kind of histogram we're showing.
And then they have the explanatory variable, which is the value or the bin.
So we've got a histogram and we've got some bins. The response variable is how many values are in that bin, and the explanatory variable is the bin.
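To make the histogram case concrete, here is a small sketch using NumPy's `histogram` function. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample values to bin.
values = np.array([1, 2, 2, 3, 3, 3, 4, 7, 8, 9])

# Four equal-width bins over [0, 10]. The bin is the explanatory variable;
# the count of values that fall in it is the response.
counts, edges = np.histogram(values, bins=4, range=(0, 10))

# Relative frequency is the same response on a different scale.
fractions = counts / counts.sum()

for left, right, count in zip(edges[:-1], edges[1:], counts):
    # Note: NumPy's last bin also includes its right edge.
    print(f"{left:.1f} to {right:.1f}: {count} values")
```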
So identifying these then informs the entire pipeline of producing our chart. It affects the data processing at the beginning:
the grouping, aggregation, and transformation that gets us to the final values we can actually plot.
It's going to affect our choice of plot type and it's going to affect our choice of axis labels, colors, facets, the other aspects of the plot.
So the type of the variable has a significant impact. The response is often numeric, or can be transformed to be numeric.
If we're talking about a categorical variable, we usually want the relative frequency of its different values.
So we're doing something like a histogram: we transform the data so that the explanatory variable becomes the value of the categorical
and the response is how many, or what fraction, have that value.
If it's a logical or a two-level categorical, we might turn it into just the fraction that have one of the levels versus the other.
The explanatory variable can be anything. We're going to see how to use numeric explanatory variables and categorical explanatory variables.
Ordinal works just like categorical,
except that it's a discrete axis that preserves order, and we need to make sure that the order of the ordinal data is being preserved.
If you're using the pandas ordered categorical type, the order is automatically preserved for you when you draw the plot.
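Here is a short sketch of declaring an ordered categorical in pandas. The `age_group` column and its levels are hypothetical, chosen so that the meaningful order differs from alphabetical order.

```python
import pandas as pd

# Hypothetical ordinal data, deliberately out of order.
toy = pd.DataFrame({
    "age_group": ["senior", "child", "adult", "teen", "child", "adult"],
    "survived":  [0, 1, 0, 1, 1, 1],
})

# Declare the levels and their order explicitly.
group_type = pd.CategoricalDtype(
    ["child", "teen", "adult", "senior"], ordered=True
)
toy["age_group"] = toy["age_group"].astype(group_type)

# Grouping follows the declared category order, not alphabetical order.
rates = toy.groupby("age_group", observed=True)["survived"].mean()
```

Without the ordered type, the same `groupby` would sort the string labels alphabetically (adult, child, senior, teen), which scrambles the axis.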
So if we just have one explanatory variable, this is the easiest case. If our explanatory variable is continuous,
we usually want a scatterplot or a line plot for showing individual values.
Sometimes we'll flip the response and explanatory variables on a scatterplot, or both might be explanatory
and we want to show where points lie in a two-dimensional space.
But generally, if we've got a continuous explanatory variable and we're trying to show values,
or we're trying to show a statistic like a mean at each value of the explanatory variable,
we're going to use a scatterplot or a line plot.
If the explanatory variable is discrete, then we're going to use a bar chart to show a statistic.
A bar chart is a good choice if we want to compare values relative to each other,
because one bar will be twice as high as another.
A point plot shows a statistic or an individual value, and it emphasizes absolute differences.
You don't have a whole bar whose height you can compare.
You just have the point. And then if we want to show a distribution, we usually use a box plot or a violin plot with a discrete explanatory variable.
We don't have great ways to show distributions with continuous explanatory variables.
You can show a variance with an error bar or a ribbon,
but that's about it.
For two explanatory variables, we have a couple of options. One is a pseudo-3D display, where we do a contour plot or a heat map.
And I'm going to show both of these here. So this is a contour plot.
The left one is a contour plot and it reads like a topographical map.
In this case, we're showing a two-dimensional distribution over our two explanatory variables.
So one explanatory variable is the score given to a movie by its critics, and another explanatory variable is the score given by its audience.
And then the response variable is how many movies have that combination?
And so we can see here, this is the peak. A contour plot is really good for showing us the peak:
it's going to be that innermost circle. It also shows us the shape, because each of these rings is a level of
decreasing height in this map. So if we envision that the response variable is the height and we're looking at a two-dimensional map,
the rings show us the contours around the mountains of that height.
That's good for showing shape. The other one is the heat map, which uses color.
It's usually going to go from a cool color, like black here, to a hot orange,
or sometimes you have a bidirectional one, which goes from blue to red. It lets us see that the highest density is here,
and as you go out from there, you get lower and lower densities.
Either one can work. For a continuous variable in a heat map, you often have to bin it.
This is a discretized heat map, where we have binned everything into bins of half
a star on the audience score and a four star on the critic score, because they're on different scales.
But heat maps also work well for categorical and ordinal data.
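The binning step behind a discretized heat map can be sketched with NumPy's `histogram2d` and Seaborn's `heatmap`. The scores here are random stand-ins, and the 0-100 critic scale, 0-5 audience scale, and bin widths are assumptions for illustration rather than the scales used in the video.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(42)
n_movies = 500
critic = rng.uniform(0, 100, size=n_movies)  # assumed 0-100 scale
audience = rng.uniform(0, 5, size=n_movies)  # assumed 0-5 star scale

# Different bin widths because the two scores are on different scales.
critic_edges = np.linspace(0, 100, 11)  # bins 10 points wide
audience_edges = np.linspace(0, 5, 11)  # bins half a star wide

# The response is how many movies fall in each pair of bins.
counts, _, _ = np.histogram2d(
    critic, audience, bins=[critic_edges, audience_edges]
)

fig, ax = plt.subplots()
# Transpose so critic bins run along x and audience bins along y.
sns.heatmap(counts.T, ax=ax, cbar_kws={"label": "Number of movies"})
ax.invert_yaxis()  # put low audience scores at the bottom
ax.set_xlabel("Critic score bin")
ax.set_ylabel("Audience score bin")
```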
So another way we can do it is to use other aesthetics for secondary variables, such as color, shape, or size.
Sometimes we'll use that to indicate a second response variable,
like you might have a scatterplot where the size of the point is a second response variable, but often it's for multiple explanatory variables.
So this shows us how we can do that. If we wanted to break down Titanic survival rates by both class and sex,
we keep our class on the x axis like we did before,
and then we use color for the passenger's sex, so we can see significantly higher survival rates for women across all three classes.
I'm also showing you here the difference between a bar chart and a point plot.
So the left is the bar chart. The right is the point plot.
And the bar chart lets us compare the heights of the bars. Note that it starts at zero.
Bar charts always start at zero, so it lets us compare the heights of the bars.
It's easy to see, just using our vision, that the female first-class bar is more than twice as tall
as the male first-class bar, and
the male first-class bar is twice as tall as the male second-class bar.
So it lets us compare make relative comparisons between the different values.
This is why it always starts at zero, because the natural thing to do with the bar is compare its height.
If your bar chart does not start at zero, suppose our bar chart started at point one,
then the comparison of height would exaggerate the difference relative to the value.
And what looks twice as tall isn't actually twice as tall because we cut off a bunch of the bottom.
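A quick bit of arithmetic shows the size of the distortion. The two rates below are hypothetical; `apparent_ratio` is just a helper name for this sketch.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

high, low = 0.6, 0.3  # hypothetical rates; the true ratio is 2

true_ratio = apparent_ratio(high, low)               # axis starts at zero
cut_ratio = apparent_ratio(high, low, baseline=0.1)  # axis starts at 0.1

print(f"axis from 0.0: bars differ by a factor of {true_ratio:.2f}")
print(f"axis from 0.1: bars differ by a factor of {cut_ratio:.2f}")
```

With the baseline at 0.1, the taller bar is drawn 2.5 times as tall as the shorter one even though the value is only twice as large.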
So always start at zero. The point plot does not; it makes it hard to compare relative differences.
It's difficult for us to visually tell
(we can tell if we look at the numbers,
but it's difficult to visually tell) that the survival rate of women in first class is twice as high as the survival rate of men.
But what it does let us see is the absolute difference between these values,
and it makes it easy to compare the difference in the gaps across the three classes.
We can see that the survival rates by sex are much closer in third class than they are in first or second class.
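The side-by-side comparison can be sketched as follows. The data is a small hypothetical table, not the real Titanic data, so the rates are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: four passengers per class and sex.
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
    "sex":      ["female"] * 12 + ["male"] * 12,
    "survived": [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
                 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

fig, (bar_ax, point_ax) = plt.subplots(1, 2, figsize=(10, 4))

# Same variables on both charts: class on x, sex on color.
sns.barplot(data=toy, x="pclass", y="survived", hue="sex", ax=bar_ax)
sns.pointplot(data=toy, x="pclass", y="survived", hue="sex",
              dodge=True, ax=point_ax)
bar_ax.set_title("Bar chart: relative differences")
point_ax.set_title("Point plot: absolute differences")

# The gap the point plot makes easy to read off, computed directly.
rates = toy.groupby(["sex", "pclass"])["survived"].mean().unstack("sex")
gaps = rates["female"] - rates["male"]
```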
So your choice of plot really guides the reader to see different things, and your choice of plot
allows you to emphasize different things.
You need to choose and design your plot in a way that's going to tell the story that you need to tell from the data.
We can also have more than two explanatory variables. It's difficult to have more than one that's numeric, or two if we're doing a contour plot.
We can bin variables, which then lets us use some more techniques, such as faceting.
So suppose we want to break down by more categorical variables.
Let's say we also want to look at age. We're going to keep sex on the color.
We're going to now use age as the x axis. Since it's numeric, it really works better on an axis.
I have binned it into bins of ten, so that you only have one point for every decade.
But then we use a facet, and the facet means we draw a different chart for each of the three classes.
The charts all share a y axis, so we can directly compare across the row of charts. It lets us see
particularly how survival as a function of age changes between different passenger classes,
for example. And so it lets us start to build up.
And if we had a fourth, we could use rows and columns in the faceted plot.
So we have these mechanisms of building up and we have our x axis or y axis.
We can use esthetics of the lines of the points, particularly color, size, shape,
and then we can use facets to build up even more variables into our plot.
To do faceting, there are a couple of things you can do. It's built into some of the Seaborn plotting functions.
The relplot and catplot functions can both do faceting on their own.
They let you control the statistic. They're very, very flexible functions for a wide range of plots.
The general-purpose FacetGrid allows you to facet any kind of plot by writing some more Python code on your own.
That's very useful if you want to facet something that doesn't support faceting built in.
And if you're using plotnine or the R ggplot2 package, facet_grid and facet_wrap control faceting.
When you build that faceted plot, you need to pay attention to what variables go where. Your choice of which variables are going to be on color,
which variables are going to be facets, and which variables are going to be on your axes really affects how the reader is going to interpret and understand
your plot, and you need to choose them strategically to tell the story that addresses your question.
You also need to do it, though, in a way that is honest and does not mislead your user, your readers.
The chart needs to honestly show the readers what it is that you learned from the data and show that clearly.
Another thing we can do to build up a chart, especially if we have more categorical variables:
if we've got a categorical response variable with more than two levels,
and we want to show how the proportion in different categories changes in response to another variable,
a stacked chart can be very good. It lets us see the differences in composition, to see how the parts of a whole change.
And so this chart
is a stacked bar chart, and it's a horizontal bar chart where I put the explanatory variable on the y axis,
in part just to make the labels easier to read. Our explanatory variable is what data set
something came from: LOC-MDS, GR. What those are doesn't matter for our purposes right now.
The response variable is the distribution of genders. In this case,
these are data sets of books, and it's the genders of the authors of those books in the data set.
And so we have female, we've got male, and we also have codes for when it's ambiguous or unknown or we didn't have data.
And so we can see, for example, that the GR data set has a higher fraction of women and a significantly lower fraction of men.
And we can see quite a few more books where we don't know the gender. The order of the levels on this chart
is very strategic. I batched all of the various kinds of "we don't know" together so that
you can look at that whole block and see the various types of
"we don't know the gender of the book's author" together, but you can also see how they're broken down into individual things.
You can see that unlinked is a very, very large fraction of that increase in books where we don't know the author's gender.
So you need to think you need to think about all of these different things in order
to be able to generate a chart that's going to clearly and unambiguously communicate.
You can show raw values in a stacked bar chart, in which case the bars
don't all have to be the same height, or you can show fractions, in which case they will be.
I chose to show fractions in this chart. The code that generates this using raw matplotlib is linked in the notes for the video.
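This is not the code from the linked notebook, just a minimal sketch of a horizontal stacked bar chart of fractions in plain Matplotlib. The data set names, levels, and counts are all made up.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical counts of books by author gender in two data sets.
# Column order is the stacking order, so the "unknown" level goes last.
counts = pd.DataFrame(
    {"female": [30, 20], "male": [50, 25], "unknown": [20, 55]},
    index=["dataset A", "dataset B"],
)

# Convert each row to fractions so every bar has total length 1.
fracs = counts.div(counts.sum(axis=1), axis=0)

fig, ax = plt.subplots()
left = np.zeros(len(fracs))  # where the next segment starts in each bar
for level in fracs.columns:
    ax.barh(fracs.index, fracs[level], left=left, label=level)
    left += fracs[level].to_numpy()

ax.set_xlabel("Fraction of books")
ax.legend(title="Author gender")
```

Plotting `counts` instead of `fracs` in the loop would give the raw-value version, where the bars have different lengths.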
Sometimes we're also going to transform our charts.
We might transform the axis, such as using a log-10 scale. When we transform the axis,
the labels are still in their original values; they're just spaced out logarithmically.
We generally won't do this for bars. Reading a bar on a log scale is tricky:
you can draw it, but you have to be really, really careful in order to make sure that your readers are going to accurately interpret it.
But for line and scatter plots, log transforms are a lot more common.
Sometimes, though, we're actually going to transform the data itself and we're going to plot a log or a square root or some other rescaling.
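The difference between transforming the axis and transforming the data can be sketched in Matplotlib like this. The `x` and `y` values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values spanning several orders of magnitude.
x = np.array([1, 10, 100, 1000])
y = np.array([2, 3, 5, 8])

fig, (axis_ax, data_ax) = plt.subplots(1, 2, figsize=(9, 3))

# Option 1: transform the axis. Tick labels stay in the original units.
axis_ax.plot(x, y, marker="o")
axis_ax.set_xscale("log")
axis_ax.set_xlabel("x (log scale)")

# Option 2: transform the data. The axis is linear in log10(x),
# so the label has to say that.
log_x = np.log10(x)
data_ax.plot(log_x, y, marker="o")
data_ax.set_xlabel("log10(x)")
```

Both panels place the points identically; only the tick labels and the axis label differ.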
And another common transformation is to bin the data, to somehow discretize it into fixed bins
by some mechanism or another. So the key decisions that you need to make when you're making one of these charts
are: you need to pick the variables and how you're doing their transformations. You need to pick what are called the aesthetics,
how you're going to map the different variables you're looking at to chart features: your x and y axes,
your facet rows and columns, your color, your point marker style.
If you're doing a joint plot, often it's useful to put
the same variable on both color and style. That way, if you have a reader who's colorblind, they still get different point styles,
even if they can't tell the colors apart, or if someone's printing it on a black and white printer.
And then you need the type of the chart: line chart, bar, point, box, et cetera.
So you have to make all of these decisions when you're drawing this chart, and they're driven by what variables and data you have,
what question you're trying to answer, and what story you're trying to tell about that. You do need to be careful to avoid excessive complexity.
We can put a different variable on every conceivable esthetic and it's often going to result in a chart that's very difficult to read.
We also have to be careful with color because it's easy to make a chart that has differences
that are difficult for the human eye to distinguish or get obscured by printers,
low quality displays, etc. It's also important to note a good graphic reveals the data and does not distort or obscure the data.
It's easy to create a graphic that manipulates the data to tell a story that's not very well supported.
And we want to avoid that when we're doing data science with honesty and integrity.
So to wrap up: you need to identify the variables and relationships that you want to highlight in your chart.
You want to design a plot that illustrates them,
and you're going to need to spend some time studying your plotting library's API and the plotting library's gallery.
Any plotting library usually has a gallery of a bunch of different plots and the code that was used to generate them.
Seaborn has this, Matplotlib has this.
So spend some time with that, looking for "oh, this looks like the kind of plot that might display my data well."
And then click on it, see what code they used to generate it, and borrow it.

Resources#

Supporting notebook — code for most of the charts (using Seaborn)
Book data statistics notebook — code for the stacked bar charts (using raw Matplotlib)
My book experiments have many facet examples — see DataSummary and ProfileDataPrep.

✅ Plots in the Wild#

In preparation for Thursday’s class, find a data presentation (plot, table, etc.) in a recent online publication, and share it with your team through a post on Piazza (in the ‘discuss’ category) with a link and a copy of the image. This can be from a journal paper, a newspaper article, a blog post, or another source the class can all access.

In class we will discuss these plots!

Tip

Don’t spend more than 30 minutes on this assignment.

📓 Finishing Touches#

The Finishing Touches notebook describes how to apply some finishing touches to your plots and save them to files.

🎥 Organizing and Formatting Notebooks#

How should you organize your notebook? What makes a good notebook? In this video we talk about that!

Video (16m50s)

Slides

CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
ORGANIZING NOTEBOOKS

Learning Outcomes
Use Markdown document structure to organize a notebook
Use Jupyter and Markdown features to format text in a notebook
Create a notebook that clearly tells the story of a data analysis
Photo by Kelly Sikkema on Unsplash

Notebooks as Documents
The notebook is a document – it is meant to be read
Some structure imposed by code ordering
Factor particularly complex computations out of the notebook!

Notebook Components
Notebooks have two types of cells
Code cells, with Python code and its output
Markdown cells, containing formatted text
Heading cells are just Markdown cells with headings

Code Cells
Keep individual cells relatively short
A few lines
One function
If helpful, show results after cell
Do this a lot in development
Too much can make it hard to read the notebook – clean up to ones that help the reader understand
Good to show after:
Loading data
Complex manipulation

Markdown Cells
Markdown is a text syntax for simple markup
Inline formatting
**bold**
*italics*
`code` (use this one – it’s useful for function names, etc.)
LaTeX math: $y = \beta_0 + \beta_1 x$

Markdown Block Elements
Paragraphs – separated by blank lines
Bulleted lists – lines start with ‘-’ or ‘*’
Numbered lists – lines start with ‘1.’, ‘2.’, etc.
Code blocks – indent 4 spaces, or put between fences (``` lines)
Block mathematics – $$f(x) = x^2$$ (double-$ means block)

Headings
Markdown headings are lines starting with #, ##, etc.
# Heading 1
## Heading 2
### Heading 3
These do not mean “big and bold” – they mean “section heading”
Nest them properly
Start notebook with H1
All other headings 2 or lower
Short – section headers, not sentences
If they wrap, rethink

Writing Text
Use the document to tell a story
What is the goal?
What is the data? What do we know about it going in?
Why are we doing each piece of the analysis?
What is the approach? Don’t repeat the code, but explain the code’s conceptual operation, esp. if not obvious.
What do we learn from it? (when appropriate, which is most of the time)

Around a Result
What are we going to do?
If not immediately clear, how are we going to do it?
<code>
<results>
What do we observe?

Document Structure
Title and Intro
What is the notebook for?
Where does the data come from?
If appropriate, what’s the research question(s)?
Set up environment (python libraries?)
Load data (sometimes merged with setup)
Show data tables!
Perform analysis (more sections as needed)
Summarize / conclude

Document Maintenance
You’ll write a lot of cells & outputs while debugging
Clean up before submitting / sharing
Remove dead ends, extraneous outputs
Consider supplementary notebook with discarded alternatives
Re-run top-to-bottom (Kernel → Restart and Rerun All)
While editing, code can get out-of-order, and may break!

Audience and Purpose
Know your audience and your purpose
Teaching notebooks (for this class) differ from research notebooks!
You, your supervisor, and the public are all different!
Not all audiences are well-served by notebooks
Often need separate final report
In notebook, write copies of plots to files

Wrapping Up
The notebook is a document.
Take advantage of that structure – use it to tell the story of your analysis.
Pay attention to class examples.
Photo by Bookblock on Unsplash

Hello again. In this video, I want to talk about organizing notebooks, as I've promised.
So we've talked about how we make charts. That's been a lot of what we've been talking about here.
But I want to talk about how we actually put together a notebook that presents these charts and our conclusions from them.
So the learning outcomes for this video are for you to be able to use Markdown document structure to organize a notebook, to use Jupyter
and Markdown features to format text in a notebook, and to create a notebook that clearly tells the story of a data analysis.
First thing to understand is that a notebook is a document. It is a convenient way to run Python code and to see the results of it.
But the notebook structure is first and foremost a document. It's meant to be read.
And there's some structure imposed in the document because it has to read in the same order as the code is going to execute.
But. We want to be able to actually read it and understand what's going on as we walk through the notebook.
So we also want to factor particularly complex computations out of the notebook.
So far, nothing we've been doing is super complex.
But if I have a large, complicated data processing operation, or I'm training an extensive set of machine learning models or something,
I'll put those outside of the notebook, in other scripts and modules, and leave the notebook for communicating the results of my data analysis.
So a notebook has two primary types of cells. We have code cells, which you've seen a lot.
They contain Python code and its output. And we have Markdown cells that contain formatted text.
I recommend keeping your code cells relatively short:
a few lines, or one function definition. If you're defining an entire class and it's taking 100 lines within a code cell,
that's a good sign you need to pull that out into a Python module of some kind.
If helpful, show results after the cell. I do this a lot, particularly in development.
But if you have too much of it, it can make it hard to read the final notebook because you have all of these outputs.
And the notebook winds up being a sea of charts and tables,
and it's difficult to find your way through the notebook and find the pieces that you need to look at.
So go ahead, do a lot of them, especially while you're debugging and prototyping. Before you submit,
go through and clean up: remove things that were just there for you to test how something worked, and leave the cells in
your notebook that help the reader understand the results of what you're doing.
Remember, the purpose of the presentation is to show the reader what you learned and how you know it's true.
Cells that don't help you do that, you can consider removing.
At the end of the day, they might have helped you figure out how to do it.
You can save a copy of your notebook before doing the cleanup so you don't lose them.
You can have a supplementary notebook that has maybe the paths you went down
that didn't work out. Another thing you can consider doing is having an appendix.
So you've got all of the main content of the notebook, and then down at the end you have a big heading, "Appendix".
And there you have extra things. You want to make sure you can still run from top to bottom.
But there you have some of the other things that maybe dive into more details about the building blocks of some of your computations.
But it is good to show the results after loading data and after doing a complex manipulation,
especially one that significantly changes the shape of the data that you're working with.
I'm going to talk mostly in this video, though, about Markdown cells,
because Markdown cells are what you use to build up the structure of your document and make it tell a story,
not just be a kind of strange way to present Python code.
So markdown is a text syntax for simple markup. I'm going to provide a link to the markdown documentation in the class notes that go with this video.
But there's several inline formatting things. If you put two stars around some text, that'll make it bold.
One star will make it italics. You can indicate code, something that's going to show up in the fixed-width code layout, using backticks.
This is one that I see ignored very frequently in writing up, but it's really,
really useful for function names, variable names, things like that,
to be able to set it apart: this is a special thing, this is a function name.
You can also use TeX math syntax by putting it between dollar signs in Markdown in the notebook.
Pay attention to the details of what your formatted text looks like after you render it in the notebook.
Make sure it reads well. Make sure it's clear. Ask yourself: if I weren't the one who wrote this,
would I like reading this? Clean it up and pay attention to those details to
make it look good and be effective at communicating, so that the reader
can clearly understand what the different pieces are and what needs to be emphasized.
Markdown also has a number of block elements.
The basic one is a paragraph. Paragraphs are just text separated by blank lines.
You can also have bulleted and numbered lists. You can have code blocks.
These aren't super common in a notebook, because a lot of your code is in the code cells that you execute.
But if you need to have a little code that you don't execute for some reason,
you can put it in a code block in Markdown. You can also have block mathematics: a line on its own that begins and ends with two dollar signs.
It can actually span multiple lines; so long as there aren't any blank lines, it's going to be treated as a piece of block mathematics.
It's not inline in a sentence; it becomes its own block in the rendered cell.
Headings are an important one to pay attention to. Markdown headings are lines that start with one, two,
three, up to six hash marks, and then a space and the heading text: heading one, heading two, heading three.
Something that's important to know is the hashes do not mean big and bold.
That's what they look like. But that's not what they mean. What they mean is heading.
And so you need to have an outline structure to your notebook using the headings.
And you need to nest them properly: within each H1, you have your H2s,
and then you have your H3s. You don't go straight from H1 to H4;
you have H3 in the middle. Start the notebook with an H1 that has the notebook title. In a lot of rendering contexts,
that becomes the title at the top of your notebook. And then all your other headings are level two or lower.
If you have an appendix, you might have the appendix be another H1. Also, the section headers should be short, not sentences.
If you're writing an entire sentence in your section header, you're putting too much there.
The section header should be a short title, and then the section content comes after it.
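These nesting rules are mechanical enough to check with a few lines of Python. The sketch below is a hypothetical helper, not part of Jupyter: it scans Markdown text for heading lines and reports a first heading that is not an H1 or a jump that skips levels. It does not skip fenced code blocks, so a `#` comment inside one would be misread as a heading.

```python
import re

def heading_problems(markdown_text):
    """Return a list of outline problems found in Markdown heading lines."""
    problems = []
    previous = 0  # level of the previous heading; 0 means none seen yet
    for line in markdown_text.splitlines():
        # One to six hash marks, then at least one space, then heading text.
        match = re.match(r"^(#{1,6}) +\S", line)
        if not match:
            continue
        level = len(match.group(1))
        if previous == 0:
            if level != 1:
                problems.append(f"first heading is H{level}, not H1")
        elif level > previous + 1:
            problems.append(f"jump from H{previous} to H{level}: {line.strip()}")
        previous = level
    return problems

good = "# Title\n\n## Setup\n\n### Imports\n\n## Analysis\n"
bad = "# Title\n\n#### Details\n"
print(heading_problems(good))
print(heading_problems(bad))
```

To apply it to a notebook, you would join the text of the Markdown cells in order and pass that in.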
Now, there are a few reasons why it's important to use the heading levels properly.
One is just visual: it helps break up your notebook so we can easily see which component we're at.
Second, there are extensions that will do things like number your headings or give you a browsable table of contents
you can use to navigate the notebook by heading. When I'm rendering notebooks as a part of the course website,
you'll see this over on the right-hand side: you can jump directly to notebook headings.
That only works because I'm consistently using the heading levels to build an outline-based structure for my notebook.
A third reason is accessibility. If someone's reading your notebook with an assistive technology such as a screen reader,
the section headings are very important to help them navigate to the parts of the notebook that are relevant to them at a given time.
So on the section headers, one additional little rule is: if your section header has to wrap onto a second line, really rethink it.
It's almost certainly too long. In particular,
don't put an entire question in the section header.
Occasionally it's OK to put a whole question, but usually put a brief, three-to-five-word summary of the question's topic and
then write the question itself as the first paragraph of the section.
If you pay attention to these different formatting features,
you can build a well-structured notebook that communicates clearly and draws the reader's attention to the places where it needs to go.
Now, writing the text itself. Use the document to tell a story. What's the goal of what you're doing,
either of the whole notebook or of individual pieces of analysis? What's the data that we're using?
What do we know about it going in? Either at the very top of your notebook or where you're loading the data,
it's useful, especially in the notebooks and reports you submit to somebody,
to write: What do you know? Where did you get the data? How was it collected?
Not a full data sheet, but at least some summary information to help the reader understand what it is that we're going to be looking at.
Why are we doing each piece of the analysis? What's the purpose here?
How does it fit into our broader picture, into our broader goals? What approach are we using?
We don't want to just repeat the code. Writing a numbered list of
the steps, where those steps are just a literal translation of the code, doesn't help understanding.
It creates an opportunity for code and documentation to become mismatched.
But if there's anything tricky in the code,
explain why it does the job. Explain the conceptual idea behind why you're approaching things the way you are.
If you're doing a data cleanup, explain what that cleanup is doing and why that's the right cleanup for your data.
And then what do we learn from it? Oftentimes what I do with an individual piece, like a chart, is
write what question the chart is supposed to be answering,
or at least the purpose of the chart. Then we have the code to generate the chart itself.
And then we have a text cell that has observations about what we learn from the chart.
So: what are we doing? How are we going to do it, if that's not immediately clear? Then the code and results.
And then what do we observe from these results?
So the over then the high level document structure that I recommend is to start start with the title and intros.
You've got your title. You're heading one. Then what's the notebook for?
Why does this notebook exist? Are there to include links?
There's hyperlinks and taxes and markdown as well. Read the markdown documentation to see how to use that.
But where does this go? Are there things we need to know?
Background about where why this documents being created?
Where did the data come from? If we have defined research questions, what are those research questions?
You can write those right in the intro of the notebook. Then I have a setup.
I almost always have a setup section that comes next. That has imports: I import my Python libraries.
I maybe define some helper functions that I'm going to be using throughout the
notebook. Helper functions specific to one section I might define in that section.
And then I load the data. Sometimes I load the data as a part of the setup.
So it's: import modules and then load my data. Sometimes, especially if I have more to say about the data, it's its own section.
But then as I load each table, I often just show the first few rows of it so that I can see, OK, I've loaded this data, and then it's right there.
We can see, as we're going through the rest of the notebook, what the data we just loaded looks like.
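As a rough illustration of the setup section described here, the following is a minimal sketch: imports first, then a helper function, then loading a table and showing its first few rows. The table contents, the column names, and the `load_ratings` helper are made up for this example; a real notebook would read its own data file instead of the inline text.

```python
# Setup: import the libraries used throughout the notebook
import io

import pandas as pd

# Hypothetical stand-in for a data file so this sketch runs on its own.
# In a real notebook this would be a path passed to pd.read_csv.
RATINGS_CSV = io.StringIO(
    "user,movie,rating\n"
    "1,10,4.5\n"
    "1,11,3.0\n"
    "2,10,5.0\n"
    "2,12,2.5\n"
    "3,11,4.0\n"
    "3,12,3.5\n"
)


def load_ratings(source):
    """Hypothetical helper used throughout the notebook: read the ratings table."""
    return pd.read_csv(source)


# Load the data
ratings = load_ratings(RATINGS_CSV)

# Show the first few rows right after loading, so the reader can see
# what the table looks like before any analysis happens.
print(ratings.head())
```

In a notebook you would usually end the cell with `ratings.head()` rather than `print(...)`, so that Jupyter renders the table.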
Then we perform our analysis and this might be two sections. It might be five, six, seven, eight sections.
And then finally at the end, we can summarize and conclude. I don't always do this in
my research notebooks, because often that's the material that goes in the paper.
But this is going to be something to do particularly in our assignments, where we're submitting notebooks.
Put that at the end of the notebook. What do we learn from this?
Sometimes I'm going to have specific directions for things
I want you to reflect on there. When, like in Assignment 1, I've broken down the different requirements,
those become good candidates for your H2s, your level-two headings, one for each of those.
So we've got, I think, six different requirements in Assignment 1.
An H2 heading, a primary section of your document, for each of those is a good starting point for your layout.
In addition, you're probably going to have another one up at the top for the setup and maybe another for the data load.
But think about the flow of your document and what it needs to be able to communicate.
What are we doing? What are the prerequisites in terms of code and data?
How are we actually doing it? And then at the end, what do we learn?
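One way to review the heading layout of a notebook is to list its Markdown headings as an outline. The sketch below is only an illustration under some assumptions: it works on the notebook's JSON structure (a `cells` list whose entries have `cell_type` and `source`), the `outline` function and the section titles are made up for this example, and it does not try to skip `#` lines inside fenced code in Markdown cells.

```python
import re


def outline(nb):
    """Collect (level, title) for each Markdown heading in a notebook dict."""
    headings = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "markdown":
            continue
        # "source" may be a string or a list of strings; join handles both
        for line in "".join(cell["source"]).splitlines():
            match = re.match(r"^(#{1,6})\s+(.*)", line)
            if match:
                headings.append((len(match.group(1)), match.group(2).strip()))
    return headings


# Hypothetical notebook contents, written inline so the sketch is self-contained
notebook = {"cells": [
    {"cell_type": "markdown", "source": ["# Assignment 1\n", "What this notebook is for."]},
    {"cell_type": "markdown", "source": ["## Setup"]},
    {"cell_type": "code", "source": ["import pandas as pd"]},
    {"cell_type": "markdown", "source": ["## Data"]},
    {"cell_type": "markdown", "source": ["## Requirement 1\n", "### A sub-question"]},
    {"cell_type": "markdown", "source": ["## Summary"]},
]}

# Print the outline, indented by heading level
for level, title in outline(notebook):
    print("  " * (level - 1) + title)
```

Reading the printed outline is a quick check that there is one title, one primary section per requirement, and a summary at the end.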
So you're going to write a lot of cells and produce a lot of outputs in your notebook while you're debugging.
Before you submit, or before you share in other contexts, spend some time cleaning up your notebook.
Remove dead ends and extraneous outputs that you included for debugging but that don't fit in the flow of the story.
Consider putting them in a supplementary notebook if you want to keep them around. And then make sure you can rerun your notebook from top to bottom.
So in the Jupyter interface there is the Kernel menu. Click that and choose restart and run all.
And it will restart the python kernel that's actually running your code so all your variables disappear.
Your data is no longer loaded. And then it starts running the notebook from top to bottom.
You want that to succeed so that someone else working with the notebook can actually rerun and reproduce your results.
If that doesn't succeed, then that means either you deleted something that's important or the order of
your source in the notebook does not match the order in which it actually has to be executed.
But make sure it succeeds, and also read back through the notebook to make sure that the charts all still look right,
the data and the conclusions are all still correct, etc., before you submit the final notebook.
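A saved notebook records an execution count for each code cell, so you can get a rough after-the-fact signal of whether it was last run cleanly from top to bottom. The sketch below is an assumption-laden illustration, not a replacement for actually restarting and rerunning: the function names and the two sample notebooks are made up, and it only inspects the stored counts.

```python
def execution_counts(nb):
    """Return the execution counts of the non-empty code cells, in document order."""
    return [
        cell.get("execution_count")
        for cell in nb["cells"]
        # skip empty code cells, which are not expected to carry a count
        if cell["cell_type"] == "code" and "".join(cell["source"]).strip()
    ]


def ran_top_to_bottom(nb):
    """True if the stored counts are exactly 1, 2, 3, ... in document order."""
    counts = execution_counts(nb)
    return counts == list(range(1, len(counts) + 1))


# Hypothetical notebooks written inline; a real one could be read from its
# .ipynb file with json.load, since notebooks are stored as JSON.
clean = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["import pandas as pd"], "execution_count": 1},
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 2},
]}
out_of_order = {"cells": [
    {"cell_type": "code", "source": ["print(x)"], "execution_count": 5},
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 3},
]}

print(ran_top_to_bottom(clean))         # True
print(ran_top_to_bottom(out_of_order))  # False
```

The second notebook shows the failure described above: `print(x)` appears before `x = 1` in the document, so it would fail on a fresh kernel even though it worked during interactive editing.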
So when you're writing a notebook, too, you also need to know your audience and your purpose.
For example, take the notebooks I'm writing for you for teaching purposes here.
The things I write in them differ from what I'm going to write in a research notebook that I share with my collaborators
or use for my own purposes, because my purpose in the teaching notebooks is partially to explain how they're working.
So I'm going to say more in these notebooks about what exactly the code is
doing, so that you can learn how the code works, than I would in a research notebook.
But also there's your own personal use, sharing with your adviser or your supervisor,
sharing with the public, either the professional public working on your topic or the general public.
These are all different audiences and they're going to need different levels of explanation and different things highlighted in your notebook.
Also, not all audiences are well served by notebooks.
Notebooks are fantastic for internal reports, collaboration, et cetera, sharing the results of a data analysis with colleagues or with yourself.
But for final publication, you're often going to need a separate final report.
I don't know that it's possible to write a research paper in Jupyter notebooks. Somebody might have tried.
But I'll still have the notebook where I explain the analysis. I often make that notebook available.
So for a lot of my published research papers, you can download a zip file or a Git repository that contains the notebooks, and you
can rerun the experiment and rerun my analysis with the notebooks.
In the notebook, I then also write the files out to disk. And we're not going to see this quite yet.
We're going to see it later when we start talking about workflow, because right now I'm just having you submit notebooks.
But the figures as they show up in the notebook aren't very high resolution.
So we're going to want to render a higher-resolution version of them to a PNG file or a PDF file or a
PostScript file that we can then include in our document in Word or LaTeX or whatever we're writing.
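For reference, writing a figure out to disk with Matplotlib can look like the sketch below. It uses `Figure.savefig` with a higher `dpi` for the PNG and also writes a PDF. The data, the file names, and the temporary output folder are made up for this example, and the `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; leave this line out in a notebook
import matplotlib.pyplot as plt

# Hypothetical output folder for this sketch; a project would use its own
out_dir = Path(tempfile.mkdtemp())

# A small made-up chart
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3, 4], [10, 12, 9, 15])
ax.set_xlabel("time")
ax.set_ylabel("price")

# Raster output: raise the resolution with dpi
png_path = out_dir / "price.png"
fig.savefig(png_path, dpi=300, bbox_inches="tight")

# Vector output: the format is picked from the file extension
pdf_path = out_dir / "price.pdf"
fig.savefig(pdf_path, bbox_inches="tight")

plt.close(fig)
```

A vector format such as PDF scales without losing sharpness, which tends to suit LaTeX documents, while a high-dpi PNG is easier to place in Word.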
So to wrap up, your notebook is first and foremost a document that contains code to generate the results that you're trying to discuss.
Take advantage of the document structure and use it to tell the story of your analysis:
the conclusions you come to and why we should believe them. Pay attention to the examples I'm giving you in class.
I'm also going to be trying to give you some examples of research oriented notebooks that you can look at to see examples of good notebook practice.

Resources#

GitHub Markdown guide — most of this syntax works in Jupyter as well
Jupyter’s Markdown docs

📃 Notebook Formatting Checklist#

The notebook checklist will help you make sure your notebooks are well-organized.

🚩 Week 3 Quiz#

The Week 3 quiz will be over all of the assigned material for this week, and is in Canvas.

The sections below this are for your further study and practice.

📖 Textbook#

This week primarily uses Chapter 9 of 📖 Python for Data Analysis, with some material from chapters 8 and 10.

📚 Further Reading#

For further study on these topics, see:

The Seaborn and Matplotlib galleries
The Visual Display of Quantitative Information by Edward R. Tufte
W. E. B. Du Bois’s Data Portraits: Visualizing Black America, edited by Whitney Battle-Baptiste and Britt Rusert

✅ Practice#

Doing this work well takes a lot of practice. Create some notebooks and experiment with drawing interesting charts from some of the data sets we have been exploring, or new data you find! The HETREC data has a number of variables of different types that are useful for practicing manipulations and visualizations.

📩 Assignment 1#

Assignment 1 is due on Sunday, Sep. 12 at the end of the day (11:59 pm).

The tutorial notebooks are going to be very useful for this assignment.

CS 533 Fall 2022
