Drawing Charts

This notebook presents various options for drawing charts of data, to complement the Week 3 chart types video.

This tutorial uses concepts from both the Selection and Reshaping notebooks.

This notebook uses the “MovieLens + IMDB/RottenTomatoes” data from the HETREC data. It also uses data sets built in to Seaborn.

Setup

First we will import our modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then import the HETREC MovieLens data. A few notes:

  • Tab-separated data

  • Not UTF-8 - latin-1 encoding seems to work

  • Missing data encoded as \N (there’s a good chance that what we have is a PostgreSQL data dump!)

Movies

movies = pd.read_csv('hetrec2011-ml/movies.dat', delimiter='\t', encoding='latin1', na_values=['\\N'])
movies.head()
id title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh ... rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
0 1 Toy story 114709 Toy story (juguetes) http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... 1995 toy_story 9.0 73.0 73.0 ... 100.0 8.5 17.0 17.0 0.0 100.0 3.7 102338.0 81.0 http://content7.flixster.com/movie/10/93/63/10...
1 2 Jumanji 113497 Jumanji http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... 1995 1068044-jumanji 5.6 28.0 13.0 ... 46.0 5.8 5.0 2.0 3.0 40.0 3.2 44587.0 61.0 http://content8.flixster.com/movie/56/79/73/56...
2 3 Grumpy Old Men 107050 Dos viejos gruñones http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... 1993 grumpy_old_men 5.9 36.0 24.0 ... 66.0 7.0 6.0 5.0 1.0 83.0 3.2 10489.0 66.0 http://content6.flixster.com/movie/25/60/25602...
3 4 Waiting to Exhale 114885 Esperando un respiro http://ia.media-imdb.com/images/M/MV5BMTczMTMy... 1995 waiting_to_exhale 5.6 25.0 14.0 ... 56.0 5.5 11.0 5.0 6.0 45.0 3.3 5666.0 79.0 http://content9.flixster.com/movie/10/94/17/10...
4 5 Father of the Bride Part II 113041 Vuelve el padre de la novia (Ahora también abu... http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... 1995 father_of_the_bride_part_ii 5.3 19.0 9.0 ... 47.0 5.4 5.0 1.0 4.0 20.0 3.0 13761.0 64.0 http://content8.flixster.com/movie/25/54/25542...

5 rows × 21 columns

movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB

It’s useful to index movies by ID, so let’s just do that now.

movies = movies.set_index('id')

Movie Info

movie_genres = pd.read_csv('hetrec2011-ml/movie_genres.dat', delimiter='\t', encoding='latin1')
movie_genres.head()
movieID genre
0 1 Adventure
1 1 Animation
2 1 Children
3 1 Comedy
4 1 Fantasy
movie_tags = pd.read_csv('hetrec2011-ml/movie_tags.dat', delimiter='\t', encoding='latin1')
movie_tags.head()
movieID tagID tagWeight
0 1 7 1
1 1 13 3
2 1 25 3
3 1 55 3
4 1 60 1
tags = pd.read_csv('hetrec2011-ml/tags.dat', delimiter='\t', encoding='latin1')
tags.head()
id value
0 1 earth
1 2 police
2 3 boxing
3 4 painter
4 5 whale

Ratings

ratings = pd.read_csv('hetrec2011-ml/user_ratedmovies-timestamps.dat', delimiter='\t', encoding='latin1')
ratings.head()
userID movieID rating timestamp
0 75 3 1.0 1162160236000
1 75 32 4.5 1162160624000
2 75 110 4.0 1162161008000
3 75 160 2.0 1162160212000
4 75 163 4.0 1162160970000

We’re going to compute movie statistics too:

movie_stats = ratings.groupby('movieID')['rating'].agg(['count', 'mean']).rename(columns={
    'mean': 'MeanRating',
    'count': 'RatingCount'
})
movie_stats.head()
RatingCount MeanRating
movieID
1 1263 3.735154
2 765 2.976471
3 252 2.873016
4 45 2.577778
5 225 2.753333

Titanic data

We’ll also use the Titanic data set from Seaborn:

titanic = sns.load_dataset('titanic')
titanic
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

Bar Charts

If we have: a categorical variable and a numeric response variable
And we want: to see how the mean of the numeric varies with the categorical
Then we can: use Seaborn catplot to create a bar chart:

sns.catplot('class', 'survived', data=titanic, kind='bar')
<seaborn.axisgrid.FacetGrid at 0x2c291b61790>
../../../_images/Charting_20_1.png

There are quite a few things going on here:

  • catplot by default computes the mean and 95% bootstrapped confidence intervals.

  • If we have a 0/1 variable or a logical, such as survived, taking its mean is the same as counting the proportion that are 1 or True; this is also the same as computing the probability of a true value. This is a very useful trick.

  • Most (but not all) Seaborn plotting functions natively work with data frames; we give column names for the x and y axes, respectively, and provide the data frame as data=, and it plots.

  • The bars are different colors for no reason. This is annoying.

We can fix that last problem:

sns.catplot('class', 'survived', data=titanic, kind='bar', color='firebrick')
<seaborn.axisgrid.FacetGrid at 0x2c2921d81c0>
../../../_images/Charting_22_1.png

But we can also go further.

If we have: two categorical variables and a numeric response variable
And we want: to see how the mean of the numeric varies with the combination of categorical variables
Then we can: use Seaborn catplot to create a bar chart with color-coded bars by mapping a variable to hue:

sns.catplot('class', 'survived', data=titanic, kind='bar', hue='sex')
<seaborn.axisgrid.FacetGrid at 0x2c29225d6d0>
../../../_images/Charting_24_1.png

Interlude: Labeling Figures

Seaborn calls Matplotlib under the hood, so all of Matplotlib’s functions are available to clean up our plot.

Let’s label our axes and give the chart a title:

sns.catplot('class', 'survived', data=titanic, kind='bar', hue='sex')
plt.ylabel('Survival Probability')
plt.xlabel('Passage Class')
plt.title('Survival by Class and Sex')
plt.show()
../../../_images/Charting_26_0.png

The plt.show() call tells Matplotlib to show the plot, and returns nothing, which cleans up the notebook display a bit (otherwise we also see the return value of the last plotting function, which is annoying and usually meaningless).

Scatter Plots

If we have: two numeric variables
And we want: to plot each point in two-dimensional space based on the variable values
Then we can: use a scatter plot.

We saw this in an earlier notebook - let’s look at the relationship of Rotten Tomatoes critic and audience scores:

movie_scores = movies[['year', 'rtAllCriticsRating', 'rtAudienceRating']].rename(columns={
    'rtAllCriticsRating': 'Critics',
    'rtAudienceRating': 'Audience'
})
movie_scores
year Critics Audience
id
1 1995 9.0 3.7
2 1995 5.6 3.2
3 1993 5.9 3.2
4 1995 5.6 3.3
5 1995 5.3 3.0
... ... ... ...
65088 2008 4.4 3.5
65091 1934 7.0 3.7
65126 2008 5.6 3.3
65130 2008 6.7 3.5
65133 1999 0.0 0.0

10197 rows × 3 columns

sns.scatterplot('Critics', 'Audience', data=movie_scores)
<matplotlib.axes._subplots.AxesSubplot at 0x2c2922dd640>
../../../_images/Charting_30_1.png

We can see more things, including distributions in the margin, with the more sophisticated jointplot:

sns.jointplot('Critics', 'Audience', data=movie_scores)
<seaborn.axisgrid.JointGrid at 0x2c2923c1550>
../../../_images/Charting_32_1.png

Now the top and right margins show the histograms of the those values!

Line Chart

If we have: two numeric variables, with one that reasonably defines a ‘series’ that they progress along
And we want: to show how variables change from one value of the ‘series’ variable to another
Then we can: use a line chart.

We saw this with the average scores by year previously.

year_scores = movie_scores.groupby('year').mean()
ys_tall = year_scores.reset_index().melt(id_vars='year', var_name='Source', value_name='MeanAvgScore')
ys_tall
year Source MeanAvgScore
0 1903 Critics 7.600000
1 1915 Critics 8.000000
2 1916 Critics 7.800000
3 1917 Critics 0.000000
4 1918 Critics 0.000000
... ... ... ...
191 2007 Audience 3.062162
192 2008 Audience 2.853698
193 2009 Audience 3.192308
194 2010 Audience 0.000000
195 2011 Audience 0.000000

196 rows × 3 columns

sns.lineplot('year', 'MeanAvgScore', hue='Source', data=ys_tall)
<matplotlib.axes._subplots.AxesSubplot at 0x2c29250e640>
../../../_images/Charting_36_1.png

*Practice: add the mean average rating from MovieLens users to this chart.

Box Plots

These show median-based distribution statistics for a numeric variable grouped by a categorical.

If we have: a numeric response variable and a categorical variable
And we want: to visualize median-based distribution statistics (min, max, IQR)
Then we can: use a box plot.

Let’s look at the distribution of rating counts by genre. We first need to join genres and movie stats.

mg_stats = movie_genres.join(movie_stats, on='movieID')
mg_stats
movieID genre RatingCount MeanRating
0 1 Adventure 1263.0 3.735154
1 1 Animation 1263.0 3.735154
2 1 Children 1263.0 3.735154
3 1 Comedy 1263.0 3.735154
4 1 Fantasy 1263.0 3.735154
... ... ... ... ...
20804 65126 Comedy 2.0 3.250000
20805 65126 Drama 2.0 3.250000
20806 65130 Drama 1.0 2.500000
20807 65130 Romance 1.0 2.500000
20808 65133 Comedy 3.0 4.000000

20809 rows × 4 columns

sns.boxplot('RatingCount', 'genre', data=mg_stats)
<matplotlib.axes._subplots.AxesSubplot at 0x2c2925842e0>
../../../_images/Charting_40_1.png

Note that this is horizontal rather than vertical - Seaborn automatically figures out which is numeric and which is categorical, and orients the box plot correctly. Horizontal is easier to have good layouts for the y axis labels.

Violin Plots

The violin plot also shows distributions, but does so with a kernel density estimate:

sns.violinplot('MeanRating', 'genre', data=mg_stats)
<matplotlib.axes._subplots.AxesSubplot at 0x2c2923417c0>
../../../_images/Charting_43_1.png

This chart is too crowded to usefully read.

Wrapping Up

The Seaborn functions have a common interface:

  • x and y as first and second parameters, respectively

  • if data= points to a data frame, then x and y are interpreted as column names

  • can change other aesthetics, such as color-coding points with hue='column'

We’ll see more plot capabilities later.