Charting Movie Scores#
This notebook supports several videos in 📅 Week 3 — Presentation (9/5–9).
Setup and Load Data#
Import our standard modules:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
Load the movie data, as we have in the other notebooks:
movies = pd.read_csv('hetrec2011-ml/movies.dat', delimiter='\t', encoding='latin1', na_values=['\\N'])
movies.head()
|  | id | title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | ... | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy story | 114709 | Toy story (juguetes) | http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... | 1995 | toy_story | 9.0 | 73.0 | 73.0 | ... | 100.0 | 8.5 | 17.0 | 17.0 | 0.0 | 100.0 | 3.7 | 102338.0 | 81.0 | http://content7.flixster.com/movie/10/93/63/10... | 
| 1 | 2 | Jumanji | 113497 | Jumanji | http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... | 1995 | 1068044-jumanji | 5.6 | 28.0 | 13.0 | ... | 46.0 | 5.8 | 5.0 | 2.0 | 3.0 | 40.0 | 3.2 | 44587.0 | 61.0 | http://content8.flixster.com/movie/56/79/73/56... | 
| 2 | 3 | Grumpy Old Men | 107050 | Dos viejos gruñones | http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... | 1993 | grumpy_old_men | 5.9 | 36.0 | 24.0 | ... | 66.0 | 7.0 | 6.0 | 5.0 | 1.0 | 83.0 | 3.2 | 10489.0 | 66.0 | http://content6.flixster.com/movie/25/60/25602... | 
| 3 | 4 | Waiting to Exhale | 114885 | Esperando un respiro | http://ia.media-imdb.com/images/M/MV5BMTczMTMy... | 1995 | waiting_to_exhale | 5.6 | 25.0 | 14.0 | ... | 56.0 | 5.5 | 11.0 | 5.0 | 6.0 | 45.0 | 3.3 | 5666.0 | 79.0 | http://content9.flixster.com/movie/10/94/17/10... | 
| 4 | 5 | Father of the Bride Part II | 113041 | Vuelve el padre de la novia (Ahora también abu... | http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... | 1995 | father_of_the_bride_part_ii | 5.3 | 19.0 | 9.0 | ... | 47.0 | 5.4 | 5.0 | 1.0 | 4.0 | 20.0 | 3.0 | 13761.0 | 64.0 | http://content8.flixster.com/movie/25/54/25542... | 
5 rows × 21 columns
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10197 entries, 0 to 10196
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10197 non-null  int64  
 1   title                   10197 non-null  object 
 2   imdbID                  10197 non-null  int64  
 3   spanishTitle            10197 non-null  object 
 4   imdbPictureURL          10016 non-null  object 
 5   year                    10197 non-null  int64  
 6   rtID                    9886 non-null   object 
 7   rtAllCriticsRating      9967 non-null   float64
 8   rtAllCriticsNumReviews  9967 non-null   float64
 9   rtAllCriticsNumFresh    9967 non-null   float64
 10  rtAllCriticsNumRotten   9967 non-null   float64
 11  rtAllCriticsScore       9967 non-null   float64
 12  rtTopCriticsRating      9967 non-null   float64
 13  rtTopCriticsNumReviews  9967 non-null   float64
 14  rtTopCriticsNumFresh    9967 non-null   float64
 15  rtTopCriticsNumRotten   9967 non-null   float64
 16  rtTopCriticsScore       9967 non-null   float64
 17  rtAudienceRating        9967 non-null   float64
 18  rtAudienceNumRatings    9967 non-null   float64
 19  rtAudienceScore         9967 non-null   float64
 20  rtPictureURL            9967 non-null   object 
dtypes: float64(13), int64(3), object(5)
memory usage: 1.6+ MB
Describing Distributions#
Let’s start by describing the distributions of the three score variables:
movies['rtAllCriticsScore'].describe()
count    9967.000000
mean       56.705127
std        32.784319
min         0.000000
25%        30.000000
50%        63.000000
75%        86.000000
max       100.000000
Name: rtAllCriticsScore, dtype: float64
movies['rtAllCriticsScore'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1bb1b387e20>
 
The all-critics score has a left skew (visible in the histogram, and suggested by the mean being less than the median). It also has spikes near 0 and 100, the two ends of the 0–100 scale. We discussed in class some causes of these spikes.
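The mean-versus-median comparison can be checked numerically. The sketch below uses a small made-up series (not the movie data) and also computes the sample skewness with `Series.skew()`. Keep in mind that "mean below median" is a rule of thumb for left skew, not a guarantee.

```python
import pandas as pd

# Toy scores on a 0-100 scale (made-up values, not the movie data):
# most values are high, with a tail reaching down toward 0.
scores = pd.Series([0, 10, 40, 60, 70, 80, 85, 90, 95, 100], name='toy_score')

mean = scores.mean()
median = scores.median()
skewness = scores.skew()  # sample skewness; a negative value suggests a left skew

print(f'mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}')
```

The same three calls should work on `movies['rtAllCriticsScore']` once the data is loaded.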
movies['rtTopCriticsScore'].describe()
count    9967.000000
mean       41.611518
std        38.773000
min         0.000000
25%         0.000000
50%        38.000000
75%        80.000000
max       100.000000
Name: rtTopCriticsScore, dtype: float64
movies['rtTopCriticsScore'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1bb1bb07280>
 
The Top Critics score is similar, but has a much stronger spike at 0 and another around 50. We believe that 0 is probably used for ‘missing’ in a lot of cases.
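One way to probe the "0 means missing" idea is to check how many zero scores have no reviews behind them. This is a sketch on a made-up frame. Only the column names come from the movie data, and treating "zero reviews" as the sign of a missing score is an assumption.

```python
import pandas as pd

# Toy frame with made-up values; the column names match the movie data.
toy = pd.DataFrame({
    'rtTopCriticsNumReviews': [17, 5, 0, 0, 1],
    'rtTopCriticsScore': [100, 40, 0, 0, 0],
})

zero_score = toy['rtTopCriticsScore'] == 0
no_reviews = toy['rtTopCriticsNumReviews'] == 0

# Cross-tabulate zero scores against movies with no reviews at all.
table = pd.crosstab(zero_score, no_reviews)
print(table)

# Fraction of zero scores that have no reviews behind them.
frac_missing = (zero_score & no_reviews).sum() / zero_score.sum()
print(frac_missing)
```

If most zero scores in the real data fall in the "no reviews" cell, that supports reading them as missing values rather than genuinely terrible scores.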
movies['rtAudienceScore'].describe()
count    9967.000000
mean       48.340925
std        32.699404
min         0.000000
25%         0.000000
50%        57.000000
75%        76.000000
max       100.000000
Name: rtAudienceScore, dtype: float64
movies['rtAudienceScore'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1bb1ca41a30>
 
The audience score doesn’t have the spike at 100, but does have a large spike at 0 (again, missing data?).
Relationships#
To look at how one score relates to another, we want a scatterplot, our go-to for two numeric variables:
sns.scatterplot(x='rtAllCriticsScore', y='rtAudienceScore', data=movies)
<matplotlib.axes._subplots.AxesSubplot at 0x1bb1cab35e0>
 
Seaborn’s jointplot is also useful here:
sns.jointplot(x='rtAllCriticsScore', y='rtAudienceScore', data=movies)
<seaborn.axisgrid.JointGrid at 0x18f0abdec70>
 
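A correlation coefficient gives a one-number summary to go alongside the plots. The sketch below computes Pearson's r with `Series.corr()` on only the five rows shown in the `head()` output above, so the value is illustrative and says little about the full data set.

```python
import pandas as pd

# The five rows from movies.head() above; far too few to generalize from.
toy = pd.DataFrame({
    'rtAllCriticsScore': [100.0, 46.0, 66.0, 56.0, 47.0],
    'rtAudienceScore': [81.0, 61.0, 66.0, 79.0, 64.0],
})

# Pearson correlation is the default method for Series.corr().
r = toy['rtAllCriticsScore'].corr(toy['rtAudienceScore'])
print(f'Pearson r = {r:.3f}')
```

On the full `movies` frame the same call skips rows where either score is NaN, but it would still count the suspicious 0s as real scores.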
Underrated#
To look at underrated movies, we’re going to look at the difference between the all-critics score and the audience score. If the critics score a movie lower than the audience does, that indicates the movie might be underrated.
Let’s start by computing the difference:
movies['ac_diff'] = movies['rtAudienceScore'] - movies['rtAllCriticsScore']
Then the 5 movies with the largest difference (audience score much larger than critics score) are underrated:
movies.nlargest(5, 'ac_diff')
|  | id | title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | ... | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | ac_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7299 | 8007 | Pure Country | 105191 | Pure Country | http://ia.media-imdb.com/images/M/MV5BMTI4OTQz... | 1992 | pure_country | 0.0 | 4.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.9 | 3532.0 | 89.0 | http://content8.flixster.com/movie/10/93/85/10... | 89.0 | 
| 4141 | 4442 | The Last Dragon | 89461 | El último dragón | http://ia.media-imdb.com/images/M/MV5BMTU5Nzc5... | 1985 | the-last-dragon | 0.0 | 4.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.9 | 4398.0 | 84.0 | http://content7.flixster.com/movie/11/12/45/11... | 84.0 | 
| 4960 | 5281 | The Wrong Guy | 120536 | The Wrong Guy | http://ia.media-imdb.com/images/M/MV5BMTI4MzM1... | 1997 | wrong_guy | 0.0 | 3.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.6 | 826.0 | 81.0 | http://content7.flixster.com/movie/10/92/04/10... | 81.0 | 
| 5331 | 5662 | The Wrong Guy | 120536 | The Wrong Guy | http://ia.media-imdb.com/images/M/MV5BMTI4MzM1... | 1997 | wrong_guy | 0.0 | 3.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.6 | 826.0 | 81.0 | http://content7.flixster.com/movie/10/92/04/10... | 81.0 | 
| 6799 | 7192 | Only the Strong | 107750 | Sólo el más fuerte | http://ia.media-imdb.com/images/M/MV5BMTY3MDQx... | 1993 | only_the_strong | 2.8 | 7.0 | 0.0 | ... | 2.5 | 5.0 | 0.0 | 5.0 | 0.0 | 3.7 | 1259.0 | 79.0 | http://content9.flixster.com/movie/10/86/99/10... | 79.0 | 
5 rows × 22 columns
The movies with the lowest (most negative) differences are overrated:
movies.nsmallest(5, 'ac_diff')
|  | id | title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | ... | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | ac_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 57 | 59 | Le confessionnal | 112714 | Le confessionnal | http://ia.media-imdb.com/images/M/MV5BMTk2MDE1... | 1995 | confessional | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://images.rottentomatoescdn.com/images/def... | -100.0 | 
| 180 | 189 | The Reckless Moment | 41786 | Almas desnudas | http://ia.media-imdb.com/images/M/MV5BMTUyNzEx... | 1949 | reckless_moment | 0.0 | 3.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content6.flixster.com/movie/55/32/82/55... | -100.0 | 
| 287 | 298 | Tui shou | 105652 | Tui shou | http://ia.media-imdb.com/images/M/MV5BMTIxMzk0... | 1992 | pushing_hands | 0.0 | 2.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content8.flixster.com/movie/10/84/00/10... | -100.0 | 
| 526 | 545 | Harem suaré | 179841 | El último harén | http://ia.media-imdb.com/images/M/MV5BMjcwNjYx... | 1999 | 1121056-harem | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content9.flixster.com/movie/27/77/27770... | -100.0 | 
| 589 | 615 | Pane e cioccolata | 70506 | Aventuras y desventuras de un italiano emigrado | http://ia.media-imdb.com/images/M/MV5BMTQ5NjM5... | 1973 | bread_and_chocolate | 0.0 | 4.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content7.flixster.com/movie/10/89/25/10... | -100.0 | 
5 rows × 22 columns
Our missing data is going to be a problem here, though. We need to clean up a lot of these 0s in order to have reliable results. I’ll leave that to you!
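One possible starting point for that cleanup, sketched on a made-up frame: treat a score as missing when there are no reviews or ratings behind it, then recompute the difference. The titles and values here are invented, and the "no reviews means missing" rule is an assumption rather than something the data set documents.

```python
import pandas as pd

# Toy rows with made-up titles and values; column names follow the movie data.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rtAllCriticsNumReviews': [73, 4, 1, 28],
    'rtAllCriticsScore': [100.0, 0.0, 100.0, 46.0],
    'rtAudienceNumRatings': [102338, 3532, 0, 44587],
    'rtAudienceScore': [81.0, 89.0, 0.0, 61.0],
})

clean = toy.copy()
# Assumption: a score with no reviews/ratings behind it is missing, not a real 0.
clean['rtAllCriticsScore'] = clean['rtAllCriticsScore'].mask(
    clean['rtAllCriticsNumReviews'] == 0)
clean['rtAudienceScore'] = clean['rtAudienceScore'].mask(
    clean['rtAudienceNumRatings'] == 0)

# NaN propagates through the subtraction, so masked rows get no difference.
clean['ac_diff'] = clean['rtAudienceScore'] - clean['rtAllCriticsScore']

# Rank only the rows where both scores are present.
ranked = clean.dropna(subset=['ac_diff'])
overrated = ranked.nsmallest(2, 'ac_diff')
print(overrated[['title', 'ac_diff']])
```

Movie C, which has a critics score of 100 but no audience ratings, no longer shows up as "overrated". A further refinement worth considering is a minimum number of reviews, since a 0 built from three or four reviews is real but not very reliable.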