Describing Distributions¶

Setup¶

Import our modules again:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

And load the MovieLens data. We’re going to pass the memory_use='deep' to info, so we can see the total memory use including the strings.

movies = pd.read_csv('ml-25m/movies.csv')
movies.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 9.6 MB
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

Quickly preview the ratings frame:

ratings
userId movieId rating timestamp
0 1 296 5.0 1147880044
1 1 306 3.5 1147868817
2 1 307 5.0 1147868828
3 1 665 5.0 1147878820
4 1 899 3.5 1147868510
... ... ... ... ...
25000090 162541 50872 4.5 1240953372
25000091 162541 55768 2.5 1240951998
25000092 162541 56176 2.0 1240950697
25000093 162541 58559 4.0 1240953434
25000094 162541 63876 5.0 1240952515

25000095 rows × 4 columns

Movie stats:

movie_stats = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
movie_stats
mean count
movieId
1 3.893708 57309
2 3.251527 24228
3 3.142028 11804
4 2.853547 2523
5 3.058434 11714
... ... ...
209157 1.500000 1
209159 3.000000 1
209163 4.500000 1
209169 3.000000 1
209171 3.000000 1

59047 rows × 2 columns

movie_info = movies.join(movie_stats, on='movieId')
movie_info
movieId title genres mean count
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.893708 57309.0
1 2 Jumanji (1995) Adventure|Children|Fantasy 3.251527 24228.0
2 3 Grumpier Old Men (1995) Comedy|Romance 3.142028 11804.0
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.853547 2523.0
4 5 Father of the Bride Part II (1995) Comedy 3.058434 11714.0
... ... ... ... ... ...
62418 209157 We (2018) Drama 1.500000 1.0
62419 209159 Window of the Soul (2001) Documentary 3.000000 1.0
62420 209163 Bad Poems (2018) Comedy|Drama 4.500000 1.0
62421 209169 A Girl Thing (2001) (no genres listed) 3.000000 1.0
62422 209171 Women of Devil's Island (1962) Action|Adventure|Drama 3.000000 1.0

62423 rows × 5 columns

Normal Distribution¶

I want to visualize an array of random, normally-distributed numbers.

We’ll first generate them:

numbers = pd.Series(np.random.randn(10000) + 5)
numbers
0       3.996778
1       4.119017
2       4.360605
3       4.850503
4       6.244754
          ...   
9995    4.614897
9996    5.742292
9997    4.271137
9998    4.117684
9999    4.199830
Length: 10000, dtype: float64

And then describe them:

numbers.describe()
count    10000.000000
mean         4.998850
std          1.006624
min          1.473666
25%          4.305575
50%          4.982345
75%          5.690335
max          9.202777
dtype: float64

And finally visualize them:

plt.hist(numbers, bins=25)
(array([6.000e+00, 8.000e+00, 3.500e+01, 5.500e+01, 1.220e+02, 2.280e+02,
        4.070e+02, 6.110e+02, 8.810e+02, 1.033e+03, 1.175e+03, 1.184e+03,
        1.125e+03, 9.890e+02, 8.230e+02, 5.580e+02, 3.160e+02, 1.990e+02,
        1.360e+02, 6.000e+01, 2.800e+01, 1.400e+01, 6.000e+00, 0.000e+00,
        1.000e+00]),
 array([1.47366561, 1.78283007, 2.09199454, 2.401159  , 2.71032347,
        3.01948793, 3.3286524 , 3.63781687, 3.94698133, 4.2561458 ,
        4.56531026, 4.87447473, 5.18363919, 5.49280366, 5.80196812,
        6.11113259, 6.42029706, 6.72946152, 7.03862599, 7.34779045,
        7.65695492, 7.96611938, 8.27528385, 8.58444831, 8.89361278,
        9.20277725]),
 <a list of 25 Patch objects>)
../../_images/2-6-DescribingDistributions_16_1.png

Average Movie Rating¶

To start looking at some real data, let’s look at the distribution of average movie rating:

movie_info['mean'].describe()
count    59047.000000
mean         3.071374
std          0.739840
min          0.500000
25%          2.687500
50%          3.150000
75%          3.500000
max          5.000000
Name: mean, dtype: float64

Let’s make a histogram:

plt.hist(movie_info['mean'])
plt.show()
C:\Users\michaelekstrand\Anaconda3\lib\site-packages\numpy\lib\histograms.py:839: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\Users\michaelekstrand\Anaconda3\lib\site-packages\numpy\lib\histograms.py:840: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
../../_images/2-6-DescribingDistributions_20_1.png

And with more bins:

plt.hist(movie_info['mean'], bins=50)
plt.show()
../../_images/2-6-DescribingDistributions_22_0.png

Movie Count¶

Now we want to describe the distribution of the ratings-per-movie (movie popularity).

movie_info['count'].describe()
count    59047.000000
mean       423.393144
std       2477.885821
min          1.000000
25%          2.000000
50%          6.000000
75%         36.000000
max      81491.000000
Name: count, dtype: float64
plt.hist(movie_info['count'])
plt.show()
../../_images/2-6-DescribingDistributions_26_0.png
plt.hist(movie_info['count'], bins=100)
plt.show()
../../_images/2-6-DescribingDistributions_27_0.png

That is a very skewed distribution. Will it make more sense on a logarithmic scale?

We don’t want to just log-scale a histogram - it will be very difficult to interpret. We will use a point plot.

The value_counts() method counts the number of times each value appers. The resulting series is indexed by value, so we will use its index as the x-axis of the plot. Indexes are arrays too!

hist = movie_info['count'].value_counts()
plt.scatter(hist.index, hist)
plt.yscale('log')
plt.ylabel('Number of Movies')
plt.xscale('log')
plt.xlabel('Number of Ratings')
Text(0.5, 0, 'Number of Ratings')
../../_images/2-6-DescribingDistributions_29_1.png

Penguins¶

Let’s load the Penguin data (converted from R):

penguins = pd.read_csv('penguins.csv')
penguins
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

344 rows × 8 columns

Now we’ll compute a histogram. There are ways to do this automatically, but for demonstration purposes I want to do the computations ourselves:

spec_counts = penguins['species'].value_counts()
plt.bar(spec_counts.index, spec_counts)
plt.xlabel('Species')
plt.ylabel('# of Penguins')
Text(0, 0.5, '# of Penguins')
../../_images/2-6-DescribingDistributions_33_1.png

What if we want to show the fraction of each species? We can divide by the sum:

spec_fracs = spec_counts / spec_counts.sum()
plt.bar(spec_counts.index, spec_fracs)
plt.xlabel('Species')
plt.ylabel('Fraction of Penguins')
Text(0, 0.5, 'Fraction of Penguins')
../../_images/2-6-DescribingDistributions_35_1.png