Finishing Touches#

This notebook shows how to apply various finishing touches to charts to help them look better.

This notebook uses the “MovieLens + IMDB/RottenTomatoes” data from the HETREC data. It also uses data sets built in to Seaborn.

The code in this notebook will make extensive use of the matplotlib pyplot API directly, often using it to extend Seaborn-generated plots.

I will add to this notebook as more questions come up in class.

Setup#

First we will import our modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then import the HETREC MovieLens data. A few notes:

  • Tab-separated data

  • Not UTF-8 - latin-1 encoding seems to work

  • Missing data encoded as \N (there’s a good chance that what we have is a PostgreSQL data dump!)

Movies#

movies = pd.read_csv('hetrec2011-ml/movies.dat', delimiter='\t', encoding='latin1', na_values=['\\N'])
movies.head()
id title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh ... rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
0 1 Toy story 114709 Toy story (juguetes) http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... 1995 toy_story 9.0 73.0 73.0 ... 100.0 8.5 17.0 17.0 0.0 100.0 3.7 102338.0 81.0 http://content7.flixster.com/movie/10/93/63/10...
1 2 Jumanji 113497 Jumanji http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... 1995 1068044-jumanji 5.6 28.0 13.0 ... 46.0 5.8 5.0 2.0 3.0 40.0 3.2 44587.0 61.0 http://content8.flixster.com/movie/56/79/73/56...
2 3 Grumpy Old Men 107050 Dos viejos gruñones http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... 1993 grumpy_old_men 5.9 36.0 24.0 ... 66.0 7.0 6.0 5.0 1.0 83.0 3.2 10489.0 66.0 http://content6.flixster.com/movie/25/60/25602...
3 4 Waiting to Exhale 114885 Esperando un respiro http://ia.media-imdb.com/images/M/MV5BMTczMTMy... 1995 waiting_to_exhale 5.6 25.0 14.0 ... 56.0 5.5 11.0 5.0 6.0 45.0 3.3 5666.0 79.0 http://content9.flixster.com/movie/10/94/17/10...
4 5 Father of the Bride Part II 113041 Vuelve el padre de la novia (Ahora también abu... http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... 1995 father_of_the_bride_part_ii 5.3 19.0 9.0 ... 47.0 5.4 5.0 1.0 4.0 20.0 3.0 13761.0 64.0 http://content8.flixster.com/movie/25/54/25542...

5 rows Ă— 21 columns

movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB

It’s useful to index movies by ID, so let’s just do that now.

movies = movies.set_index('id')

Movie Info#

movie_genres = pd.read_csv('hetrec2011-ml/movie_genres.dat', delimiter='\t', encoding='latin1')
movie_genres.head()
movieID genre
0 1 Adventure
1 1 Animation
2 1 Children
3 1 Comedy
4 1 Fantasy
movie_tags = pd.read_csv('hetrec2011-ml/movie_tags.dat', delimiter='\t', encoding='latin1')
movie_tags.head()
movieID tagID tagWeight
0 1 7 1
1 1 13 3
2 1 25 3
3 1 55 3
4 1 60 1
tags = pd.read_csv('hetrec2011-ml/tags.dat', delimiter='\t', encoding='latin1')
tags.head()
id value
0 1 earth
1 2 police
2 3 boxing
3 4 painter
4 5 whale

Ratings#

ratings = pd.read_csv('hetrec2011-ml/user_ratedmovies-timestamps.dat', delimiter='\t', encoding='latin1')
ratings.head()
userID movieID rating timestamp
0 75 3 1.0 1162160236000
1 75 32 4.5 1162160624000
2 75 110 4.0 1162161008000
3 75 160 2.0 1162160212000
4 75 163 4.0 1162160970000

We’re going to compute movie statistics too:

movie_stats = ratings.groupby('movieID')['rating'].agg(['count', 'mean']).rename(columns={
    'mean': 'MeanRating',
    'count': 'RatingCount'
})
movie_stats.head()
RatingCount MeanRating
movieID
1 1263 3.735154
2 765 2.976471
3 252 2.873016
4 45 2.577778
5 225 2.753333

Seaborn data#

We’ll also use the Titanic data set from Seaborn:

titanic = sns.load_dataset('titanic')
titanic
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows Ă— 15 columns

And the Tips data:

tips = sns.load_dataset("tips")
tips
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows Ă— 7 columns

Labels and Titles#

The first thing we want to do is to make sure our chart is well-labeled. Seaborn has pretty good defaults — it uses the column names, when possible — but our column names don’t always fully explain what we want to do.

For example, let’s use our basic bar chart:

sns.catplot('class', 'survived', hue='sex', data=titanic, kind='bar')
<seaborn.axisgrid.FacetGrid at 0x17a871ab760>
../../../_images/ChartFinishingTouches_22_1.png

The y axis label on this chart isn’t very informative — what does “survived” mean as a numeric variable? We can relable the y axis with the ylabel function:

sns.catplot('class', 'survived', hue='sex', data=titanic, kind='bar')
plt.ylabel('Survival Rate')
Text(30.160833333333343, 0.5, 'Survival Rate')
../../../_images/ChartFinishingTouches_24_1.png

We can also clean up the x axis just a bit, and give the chart a title:

sns.catplot('class', 'survived', hue='sex', data=titanic, kind='bar')
plt.ylabel('Survival Rate')
plt.xlabel('Passage Class')
plt.title('Titanic Passenger Survival Rates')
Text(0.5, 1.0, 'Titanic Passenger Survival Rates')
../../../_images/ChartFinishingTouches_26_1.png

Reference Lines#

Remember our scatter plot of the tips data?

sns.scatterplot('total_bill', 'tip', hue='time', data=tips)
plt.xlabel("Total Bill")
plt.ylabel("Tip")
Text(0, 0.5, 'Tip')
../../../_images/ChartFinishingTouches_28_1.png

What if we want to easily see how actual tips related to 20% of the bill? 20% would be represented by a line passing through \((0,0)\) with a slope of \(0.2\). The latest matplotlib version has a function to do this automatically, but it hasn’t made its way to Anaconda yet, so we’ll need to get a little manual.

What we need to do is make a line plot of a bunch of bill values - say from 0 to some reasonable upper bound - and each value multipled by 0.2. Matplotlib is a pretty low-level interface. The numpy.linspace function lets us generate a bunch of x values:

tip20_xs = np.linspace(0, 55, 100)

Which we can multiply by 20%:

tip20_ys = tip20_xs * 0.2

We can plot a line:

plt.plot(tip20_xs, tip20_ys)
[<matplotlib.lines.Line2D at 0x17a879d96d0>]
../../../_images/ChartFinishingTouches_34_1.png

OK, that’s good! It matches one of the colors that Seaborn will use; no worries, we can change the color.

We’re going to draw our line first, so it is behind the Seaborn dots. Let’s give it a try:

plt.plot(tip20_xs, tip20_ys, color='grey')
sns.scatterplot('total_bill', 'tip', hue='time', data=tips)
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.show()
../../../_images/ChartFinishingTouches_36_0.png

Let’s also change the markers used for the two times. This will make it easier to read the plot on e.g. a black-and-white printer. In Seaborn, we can do this with style.

Also, maybe we don’t like the colors - let’s change palettes.

plt.plot(tip20_xs, tip20_ys, color='grey')
sns.scatterplot('total_bill', 'tip', hue='time', style='time', data=tips, palette='dark')
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.show()
../../../_images/ChartFinishingTouches_38_0.png

In Matplotlib, the parameters for colors and styles have different names.

Controlling Size#

All these plots are using a default size and aspect ratio. What if we want to change that?

The figure call will reconfigure matplotlib for the duration of drawing the figure (until plt.show() is called):

fig = plt.figure(figsize=(5, 4), dpi=150)
plt.plot(tip20_xs, tip20_ys, color='grey')
sns.scatterplot('total_bill', 'tip', hue='time', style='time', data=tips, palette='dark')
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.show()
../../../_images/ChartFinishingTouches_41_0.png

Some Seaborn functions are figure-level. Such functions set up the Matplotlib figure themselves, but provide a way to control its size through the height and aspect options. height sets the height of the figure, and aspect the multiplier used to compute the width.

sns.relplot('total_bill', 'tip', hue='time', style='time', data=tips, palette='dark', height=4, aspect=1.5)
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.show()
../../../_images/ChartFinishingTouches_43_0.png

In Concluding Remarks, I say more about what “figure-level” means.

Saving to Disk#

We can save a chart to a file with savefig:

plt.plot(tip20_xs, tip20_ys, color='grey')
sns.scatterplot('total_bill', 'tip', hue='time', style='time', data=tips, palette='dark')
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.savefig('tips.png')
../../../_images/ChartFinishingTouches_46_0.png

That creates a file tips.png with the image; let’s load it with an HTML img tag into our document:

Tip Figure

You can also save images to .pdf files, which is very useful for writing papers in LaTeX. The resulting figure will be entirely vector-based, so it will look good at any zoom level in your final PDF. For PNG figures used in e.g. Word, I recommend also passing dpi=300 to savefig, to render them at high resolution so they look good. I haven’t found a good vector format that is supported by Matplotlib and embeddable in Word.

If you have a figure with a very large number of points (e.g. a scaatter plot with thousands to millions of points), the resulting PDF will render each point, which will make your document very slow to render and may crash some printers. In such cases, high-DPI (at least 300) PNG is also a good option for LaTeX.

Don’t use JPEG. It will introduce visual artifacts into your figures.

Concluding Remarks#

Seaborn is basically a set of convenience APIs on top of matplotlib (and statsmodels). It is great for relatively standard chart types, and makes things like faceting far easier than core matplotlib makes them. However, since all it does is call matplotlib in a relatively straightforward fashion, you can (for the most part) freely interchange seaborn and matplotlib calls.

Some Seaborn functions are “figure-level”. Matplotlib has concepts of figures and axes; basically an Axes is one plot with x and y axes. A figure can have multiple axes on it (called subplots). Figure-level functions will take over the Matplotlib figure/axes structure, so they can do things like faceting. When a figure-level function creates multiple axes, touching up the plots with direct matplotlib calls is a little more difficult. Other Seaborn functions can actually be given the Axes to draw themselves on, in case you are managing figures and axes yourself.

A detailed discussion of matplotlib’s API design and capabilities is beyond the scope of this course. We’ll be seeing more examples (and this notebook will be expanded) as the course progresses, and I encourage you to look at the galleries and tutorials provided by seaborn and matplotlib for inspiration and instructions on more sophisticated plots.

plotnine makes assembling nuanced, polished charts quite a bit easier, abstracting over matplotlib’s capabilities with a uniform API called the Grammar of Graphics (pioneered by the R package ggplot2), but it is currently harder to install in Anaconda. I want our notebooks for this class to run out of the box in Anaconda when possible, in order to minimize software difficulties.