Reshaping Data¶
This notebook discusses operations to reshape data:
- Collapsing rows through aggregates
- Joining data frames together (basic joins)
- Pivoting between wide and long formats
The Selection notebook discussed how to select subsets of rows or columns; these operations will change what the rows and columns are, or add new items.
This notebook uses the “MovieLens + IMDB/RottenTomatoes” data from the HETREC data.
Setup¶
First we will import our modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Then import the HETREC MovieLens data. A few notes:
- Tab-separated data
- Not UTF-8 - latin-1 encoding seems to work
- Missing data encoded as
\N
(there's a good chance that what we have is a PostgreSQL data dump!)
Movies¶
movies = pd.read_csv('hetrec2011-ml/movies.dat', delimiter='\t', encoding='latin1', na_values=['\\N'])
movies.head()
id | title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | ... | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy story | 114709 | Toy story (juguetes) | http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... | 1995 | toy_story | 9.0 | 73.0 | 73.0 | ... | 100.0 | 8.5 | 17.0 | 17.0 | 0.0 | 100.0 | 3.7 | 102338.0 | 81.0 | http://content7.flixster.com/movie/10/93/63/10... |
1 | 2 | Jumanji | 113497 | Jumanji | http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... | 1995 | 1068044-jumanji | 5.6 | 28.0 | 13.0 | ... | 46.0 | 5.8 | 5.0 | 2.0 | 3.0 | 40.0 | 3.2 | 44587.0 | 61.0 | http://content8.flixster.com/movie/56/79/73/56... |
2 | 3 | Grumpy Old Men | 107050 | Dos viejos gruñones | http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... | 1993 | grumpy_old_men | 5.9 | 36.0 | 24.0 | ... | 66.0 | 7.0 | 6.0 | 5.0 | 1.0 | 83.0 | 3.2 | 10489.0 | 66.0 | http://content6.flixster.com/movie/25/60/25602... |
3 | 4 | Waiting to Exhale | 114885 | Esperando un respiro | http://ia.media-imdb.com/images/M/MV5BMTczMTMy... | 1995 | waiting_to_exhale | 5.6 | 25.0 | 14.0 | ... | 56.0 | 5.5 | 11.0 | 5.0 | 6.0 | 45.0 | 3.3 | 5666.0 | 79.0 | http://content9.flixster.com/movie/10/94/17/10... |
4 | 5 | Father of the Bride Part II | 113041 | Vuelve el padre de la novia (Ahora también abu... | http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... | 1995 | father_of_the_bride_part_ii | 5.3 | 19.0 | 9.0 | ... | 47.0 | 5.4 | 5.0 | 1.0 | 4.0 | 20.0 | 3.0 | 13761.0 | 64.0 | http://content8.flixster.com/movie/25/54/25542... |
5 rows × 21 columns
movies.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10197 entries, 0 to 10196 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 10197 non-null int64 1 title 10197 non-null object 2 imdbID 10197 non-null int64 3 spanishTitle 10197 non-null object 4 imdbPictureURL 10016 non-null object 5 year 10197 non-null int64 6 rtID 9886 non-null object 7 rtAllCriticsRating 9967 non-null float64 8 rtAllCriticsNumReviews 9967 non-null float64 9 rtAllCriticsNumFresh 9967 non-null float64 10 rtAllCriticsNumRotten 9967 non-null float64 11 rtAllCriticsScore 9967 non-null float64 12 rtTopCriticsRating 9967 non-null float64 13 rtTopCriticsNumReviews 9967 non-null float64 14 rtTopCriticsNumFresh 9967 non-null float64 15 rtTopCriticsNumRotten 9967 non-null float64 16 rtTopCriticsScore 9967 non-null float64 17 rtAudienceRating 9967 non-null float64 18 rtAudienceNumRatings 9967 non-null float64 19 rtAudienceScore 9967 non-null float64 20 rtPictureURL 9967 non-null object dtypes: float64(13), int64(3), object(5) memory usage: 1.6+ MB
Movie Info¶
movie_genres = pd.read_csv('hetrec2011-ml/movie_genres.dat', delimiter='\t', encoding='latin1')
movie_genres.head()
movieID | genre | |
---|---|---|
0 | 1 | Adventure |
1 | 1 | Animation |
2 | 1 | Children |
3 | 1 | Comedy |
4 | 1 | Fantasy |
movie_tags = pd.read_csv('hetrec2011-ml/movie_tags.dat', delimiter='\t', encoding='latin1')
movie_tags.head()
movieID | tagID | tagWeight | |
---|---|---|---|
0 | 1 | 7 | 1 |
1 | 1 | 13 | 3 |
2 | 1 | 25 | 3 |
3 | 1 | 55 | 3 |
4 | 1 | 60 | 1 |
tags = pd.read_csv('hetrec2011-ml/tags.dat', delimiter='\t', encoding='latin1')
tags.head()
id | value | |
---|---|---|
0 | 1 | earth |
1 | 2 | police |
2 | 3 | boxing |
3 | 4 | painter |
4 | 5 | whale |
Ratings¶
ratings = pd.read_csv('hetrec2011-ml/user_ratedmovies-timestamps.dat', delimiter='\t', encoding='latin1')
ratings.head()
userID | movieID | rating | timestamp | |
---|---|---|---|---|
0 | 75 | 3 | 1.0 | 1162160236000 |
1 | 75 | 32 | 4.5 | 1162160624000 |
2 | 75 | 110 | 4.0 | 1162161008000 |
3 | 75 | 160 | 2.0 | 1162160212000 |
4 | 75 | 163 | 4.0 | 1162160970000 |
Grouping and Aggregating¶
The group and aggregate operations allow us to collapse multiple rows, that share a common value in one or more columns (the grouping key), into a single row. There are three pieces of a group-aggregate operation:
- The grouping key — one or more columns that are used to identify which rows are in the same group
- The column(s) to aggregate
- The aggregate function(s) to apply
A group/aggregate operation transforms a data frames using thes columns as follows:
- Each unique combination of values in the grouping key becomes one row. If the grouping key is one column, this means each distinct value of the grouping column will yield one row in the final results.
- The columns to aggregate are collapsed into a single value using the specified aggregate function(s).
- Other columns are ignored and do not appear in the output.
Aggregate to Series¶
If we have: a data frame
And we have: a column identifying each observation's group membership
And we want: a series with an entry for each group, whose value is an aggregate function of another column
Then we can: use groupby
with one column and the aggregate function.
For example, if we want to find the timestamp of the first rating for each movie, we can group the ratings table by movieID
and take the min
of the timestamp
column:
ratings.groupby('movieID')['timestamp'].min()
movieID 1 876894948000 2 891835167000 3 931525286000 4 951680805000 5 874593700000 ... 65088 1230851753000 65091 1230918649000 65126 1231028759000 65130 1231061731000 65133 1231034528000 Name: timestamp, Length: 10109, dtype: int64
This breaks down into a few pieces. The groupby
method returns a group-by object:
ratings.groupby('movieID')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000270239AA7F0>
The DataFrameGroupBy
is a iterable over (key, df)
tuples, where key
is a single value of the grouping key, and df
is a data frame containing the group's data (the next
function gets the next value of an iterable):
next(iter(ratings.groupby('movieID')))
(1, userID movieID rating timestamp 556 170 1 3.0 1162208198000 639 175 1 4.0 1133674606000 915 190 1 4.5 1057778398000 1211 267 1 2.5 1084284499000 1643 325 1 4.0 1134939391000 ... ... ... ... ... 850646 71420 1 5.0 948256089000 852838 71483 1 4.0 1035837220000 853507 71497 1 5.0 1188256917000 853757 71509 1 4.0 1018636872000 855327 71529 1 4.5 1162098801000 [1263 rows x 4 columns])
The column selection yields a SeriesGroupBy
, which is a group-by for a series instead of a frame:
ratings.groupby('movieID')['timestamp']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000270239AAD30>
Finally, calling min
on this will apply the min
function to each group, and put the results together into a series.
Multiple Aggregates¶
If we have: a data frame
And we have: a column identifying each observation's group membership
And we want: a data frame with an entry for each group, with a column for each of one or more aggregate functions of a column
Then we can: use groupby
with one column and the .agg
method with aggregate function names
For example, if we want to find the timestamp of the first rating and last ratings for each movie, we can group the ratings table by movieID
and take the min
and max
of the timestamp
column:
ratings.groupby('movieID')['timestamp'].agg(['min', 'max'])
min | max | |
---|---|---|
movieID | ||
1 | 876894948000 | 1231030405000 |
2 | 891835167000 | 1231030891000 |
3 | 931525286000 | 1231099805000 |
4 | 951680805000 | 1229214105000 |
5 | 874593700000 | 1230915209000 |
... | ... | ... |
65088 | 1230851753000 | 1230851753000 |
65091 | 1230918649000 | 1230918649000 |
65126 | 1231028759000 | 1231097168000 |
65130 | 1231061731000 | 1231061731000 |
65133 | 1231034528000 | 1231129793000 |
10109 rows × 2 columns
Note that in both cases, the resulting frame or series is indexed by the grouping column.
Practice: compute the mean and count of the ratings for each movie
Counting Values¶
There's a useful special case of grouped aggregation - counting how many times each distinct value in a column appears.
If we have: a series
And we want: a series whose index keys are the values of the original series, and whose values are the number of times each value appeared in the original series
Then we can: use the value_counts
method:
ratings['rating'].value_counts()
4.0 215773 3.0 155918 3.5 150582 4.5 88652 5.0 71680 2.5 62454 2.0 57188 1.0 21535 1.5 18328 0.5 13488 Name: rating, dtype: int64
We can also sort the results with the sort_index
method:
ratings['rating'].value_counts().sort_index()
0.5 13488 1.0 21535 1.5 18328 2.0 57188 2.5 62454 3.0 155918 3.5 150582 4.0 215773 4.5 88652 5.0 71680 Name: rating, dtype: int64
sort_index
applies to both series and data frames.
Multiple Columns¶
We can also compute an aggregate over multiple columns at the same time. For example, if we want to look at the mean critic and audience scores of movies over time, we can group movies by year and take the mean of multiple columns.
If we have: a data frame
And we have: one or more columns identifying group membership
And we want: a data frame with the same aggregate computed over all other columns in the frame
Then we can: apply groupby
and the aggregate function without selecting columns.
We're first going to select and rename columns, like we did at the end of the Selection notebook.
movie_scores = movies.set_index('id')[['year', 'rtAllCriticsRating', 'rtTopCriticsRating', 'rtAudienceRating']].rename(columns={
'rtAllCriticsRating': 'All Critics',
'rtTopCriticsRating': 'Top Critics',
'rtAudienceRating': 'Audience'
})
movie_scores
year | All Critics | Top Critics | Audience | |
---|---|---|---|---|
id | ||||
1 | 1995 | 9.0 | 8.5 | 3.7 |
2 | 1995 | 5.6 | 5.8 | 3.2 |
3 | 1993 | 5.9 | 7.0 | 3.2 |
4 | 1995 | 5.6 | 5.5 | 3.3 |
5 | 1995 | 5.3 | 5.4 | 3.0 |
... | ... | ... | ... | ... |
65088 | 2008 | 4.4 | 4.7 | 3.5 |
65091 | 1934 | 7.0 | 0.0 | 3.7 |
65126 | 2008 | 5.6 | 4.9 | 3.3 |
65130 | 2008 | 6.7 | 6.9 | 3.5 |
65133 | 1999 | 0.0 | 0.0 | 0.0 |
10197 rows × 4 columns
Now we're going to do the group-by operation:
year_scores = movie_scores.groupby('year').mean()
year_scores
All Critics | Top Critics | Audience | |
---|---|---|---|
year | |||
1903 | 7.600000 | 0.000000 | 0.000000 |
1915 | 8.000000 | 0.000000 | 3.300000 |
1916 | 7.800000 | 0.000000 | 3.800000 |
1917 | 0.000000 | 0.000000 | 0.000000 |
1918 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... |
2007 | 5.141278 | 4.769042 | 3.062162 |
2008 | 4.699357 | 4.383280 | 2.853698 |
2009 | 4.850000 | 4.700000 | 3.192308 |
2010 | 0.000000 | 0.000000 | 0.000000 |
2011 | 0.000000 | 0.000000 | 0.000000 |
98 rows × 3 columns
There are two other equivalents to this operation. First, the agg
function can take one name instead of a list:
movie_scores.groupby('year').agg('mean')
All Critics | Top Critics | Audience | |
---|---|---|---|
year | |||
1903 | 7.600000 | 0.000000 | 0.000000 |
1915 | 8.000000 | 0.000000 | 3.300000 |
1916 | 7.800000 | 0.000000 | 3.800000 |
1917 | 0.000000 | 0.000000 | 0.000000 |
1918 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... |
2007 | 5.141278 | 4.769042 | 3.062162 |
2008 | 4.699357 | 4.383280 | 2.853698 |
2009 | 4.850000 | 4.700000 | 3.192308 |
2010 | 0.000000 | 0.000000 | 0.000000 |
2011 | 0.000000 | 0.000000 | 0.000000 |
98 rows × 3 columns
Second, the agg
function can actually take a function as its argument:
movie_scores.groupby('year').agg(np.mean)
All Critics | Top Critics | Audience | |
---|---|---|---|
year | |||
1903 | 7.600000 | 0.000000 | 0.000000 |
1915 | 8.000000 | 0.000000 | 3.300000 |
1916 | 7.800000 | 0.000000 | 3.800000 |
1917 | 0.000000 | 0.000000 | 0.000000 |
1918 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... |
2007 | 5.141278 | 4.769042 | 3.062162 |
2008 | 4.699357 | 4.383280 | 2.853698 |
2009 | 4.850000 | 4.700000 | 3.192308 |
2010 | 0.000000 | 0.000000 | 0.000000 |
2011 | 0.000000 | 0.000000 | 0.000000 |
98 rows × 3 columns
No matter which method we used, we can then quickly do a line plot, which will default to plotting each column as a different colored line, with the index on the x axis:
year_scores.plot.line()
plt.ylabel('Mean Average Score')
plt.show()
Other Grouping Modes¶
There are a couple of other ways you can use groupby
and aggregates:
- Select multiple columns from the data frame grouper, by using a list of columns.
- Apply a different aggregate to each function by passing a dictionary to
agg
. - Apply multiple aggregates to multiple columns by passing a list to
agg
and selecting a list of columns. The resulting data frame has a hierarchical index for its columns, which is honestly a little annoying to actually use. - Group by multiple columns, resulting in a hierarchical index for the rows
Experiment with these on your own!
Note: if you use multiple columns, and one or more of the columns of the frame is a Categorical
, groupby will produce a row for each possible value, even if one is never observed for that combination of the other grouping keys. To turn this off, pass the observed=True
option to groupby
.
Joining Frames¶
We can join two frames together based on common values of one or more of their columns, possibly using indexes as well.
The simplest way to join two frames without any indexes is with pd.merge
.
Column Join¶
If we have: two data frames with at least one column in common
And we want: a data frame with rows based on matching the column values
Then we can: use pd.merge
to join the frames.
For example, to combine genres with titles:
pd.merge(movies[['id', 'title']].rename(columns={'id': 'movieID'}),
movie_genres, on='movieID')
movieID | title | genre | |
---|---|---|---|
0 | 1 | Toy story | Adventure |
1 | 1 | Toy story | Animation |
2 | 1 | Toy story | Children |
3 | 1 | Toy story | Comedy |
4 | 1 | Toy story | Fantasy |
... | ... | ... | ... |
20804 | 65126 | Choke | Comedy |
20805 | 65126 | Choke | Drama |
20806 | 65130 | Revolutionary Road | Drama |
20807 | 65130 | Revolutionary Road | Romance |
20808 | 65133 | Blackadder Back & Forth | Comedy |
20809 rows × 3 columns
If there is more than one row in one table with a match for a value in the other table, it duplicates the other row. So we can see that “Toy story” is duplicated, once for each genre it appears in.
Pull out the selection & rename on movies
into a new cell for practice, and to see what it does.
Column/Index Join¶
The join
method can do index-index and column-index joins.
If we have: a data frame
And we have: another data frame whose index is values appearing in a column of the first frame
And we want: a frame that merges the two frames by matching column values in the first with the index keys of the second
Then we can: use .join
with the on
option.
To set this up, we want something with a suitable index. Let's create a frame that has mean & count of MovieLens user ratings:
movie_stats = ratings.groupby('movieID')['rating'].agg(['mean', 'count'])
movie_stats = movie_stats.rename(columns={'mean': 'MeanRating', 'count': 'RatingCount'})
movie_stats
MeanRating | RatingCount | |
---|---|---|
movieID | ||
1 | 3.735154 | 1263 |
2 | 2.976471 | 765 |
3 | 2.873016 | 252 |
4 | 2.577778 | 45 |
5 | 2.753333 | 225 |
... | ... | ... |
65088 | 3.500000 | 1 |
65091 | 4.000000 | 1 |
65126 | 3.250000 | 2 |
65130 | 2.500000 | 1 |
65133 | 4.000000 | 3 |
10109 rows × 2 columns
This frame is indexed by movieID. Let's combine that with movie title & year info:
movies[['id', 'title', 'year']].join(movie_stats, on='id')
id | title | year | MeanRating | RatingCount | |
---|---|---|---|---|---|
0 | 1 | Toy story | 1995 | 3.735154 | 1263.0 |
1 | 2 | Jumanji | 1995 | 2.976471 | 765.0 |
2 | 3 | Grumpy Old Men | 1993 | 2.873016 | 252.0 |
3 | 4 | Waiting to Exhale | 1995 | 2.577778 | 45.0 |
4 | 5 | Father of the Bride Part II | 1995 | 2.753333 | 225.0 |
... | ... | ... | ... | ... | ... |
10192 | 65088 | Bedtime Stories | 2008 | 3.500000 | 1.0 |
10193 | 65091 | Manhattan Melodrama | 1934 | 4.000000 | 1.0 |
10194 | 65126 | Choke | 2008 | 3.250000 | 2.0 |
10195 | 65130 | Revolutionary Road | 2008 | 2.500000 | 1.0 |
10196 | 65133 | Blackadder Back & Forth | 1999 | 4.000000 | 3.0 |
10197 rows × 5 columns
Note that we used the id
column, which is in the movies
table, as the on
option. Join automatically matches it with the index in movie_stats
.
Index/Index Join¶
Let's now see an index-index join.
If we have: two data frames with matching indexes
And we want: a frame that merges the two frames matching their index values
Then we can: use .join
with no extra options.
Remember that set_index
will set an index on the movies
frame:
movies.set_index('id')
title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | rtAllCriticsNumRotten | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
1 | Toy story | 114709 | Toy story (juguetes) | http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... | 1995 | toy_story | 9.0 | 73.0 | 73.0 | 0.0 | 100.0 | 8.5 | 17.0 | 17.0 | 0.0 | 100.0 | 3.7 | 102338.0 | 81.0 | http://content7.flixster.com/movie/10/93/63/10... |
2 | Jumanji | 113497 | Jumanji | http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... | 1995 | 1068044-jumanji | 5.6 | 28.0 | 13.0 | 15.0 | 46.0 | 5.8 | 5.0 | 2.0 | 3.0 | 40.0 | 3.2 | 44587.0 | 61.0 | http://content8.flixster.com/movie/56/79/73/56... |
3 | Grumpy Old Men | 107050 | Dos viejos gruñones | http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... | 1993 | grumpy_old_men | 5.9 | 36.0 | 24.0 | 12.0 | 66.0 | 7.0 | 6.0 | 5.0 | 1.0 | 83.0 | 3.2 | 10489.0 | 66.0 | http://content6.flixster.com/movie/25/60/25602... |
4 | Waiting to Exhale | 114885 | Esperando un respiro | http://ia.media-imdb.com/images/M/MV5BMTczMTMy... | 1995 | waiting_to_exhale | 5.6 | 25.0 | 14.0 | 11.0 | 56.0 | 5.5 | 11.0 | 5.0 | 6.0 | 45.0 | 3.3 | 5666.0 | 79.0 | http://content9.flixster.com/movie/10/94/17/10... |
5 | Father of the Bride Part II | 113041 | Vuelve el padre de la novia (Ahora también abu... | http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... | 1995 | father_of_the_bride_part_ii | 5.3 | 19.0 | 9.0 | 10.0 | 47.0 | 5.4 | 5.0 | 1.0 | 4.0 | 20.0 | 3.0 | 13761.0 | 64.0 | http://content8.flixster.com/movie/25/54/25542... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
65088 | Bedtime Stories | 960731 | Más allá de los sueños | http://ia.media-imdb.com/images/M/MV5BMjA5Njk5... | 2008 | bedtime_stories | 4.4 | 104.0 | 26.0 | 78.0 | 25.0 | 4.7 | 26.0 | 6.0 | 20.0 | 23.0 | 3.5 | 108877.0 | 63.0 | http://content6.flixster.com/movie/10/94/33/10... |
65091 | Manhattan Melodrama | 25464 | El enemigo público número 1 | http://ia.media-imdb.com/images/M/MV5BMTUyODE3... | 1934 | manhattan_melodrama | 7.0 | 12.0 | 10.0 | 2.0 | 83.0 | 0.0 | 4.0 | 2.0 | 2.0 | 50.0 | 3.7 | 344.0 | 71.0 | http://content9.flixster.com/movie/66/44/64/66... |
65126 | Choke | 1024715 | Choke | http://ia.media-imdb.com/images/M/MV5BMTMxMDI4... | 2008 | choke | 5.6 | 135.0 | 73.0 | 62.0 | 54.0 | 4.9 | 26.0 | 8.0 | 18.0 | 30.0 | 3.3 | 13893.0 | 55.0 | http://content6.flixster.com/movie/10/85/09/10... |
65130 | Revolutionary Road | 959337 | Revolutionary Road | http://ia.media-imdb.com/images/M/MV5BMTI2MzY2... | 2008 | revolutionary_road | 6.7 | 194.0 | 133.0 | 61.0 | 68.0 | 6.9 | 36.0 | 25.0 | 11.0 | 69.0 | 3.5 | 46044.0 | 70.0 | http://content8.flixster.com/movie/10/88/40/10... |
65133 | Blackadder Back & Forth | 212579 | Blackadder Back & Forth | http://ia.media-imdb.com/images/M/MV5BMjA5MjU4... | 1999 | blackadder-back-forth | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content7.flixster.com/movie/34/10/69/34... |
10197 rows × 20 columns
So we'll set the index, pick a couple columns, and join with movie stats:
movies.set_index('id')[['title', 'year']].join(movie_stats)
title | year | MeanRating | RatingCount | |
---|---|---|---|---|
id | ||||
1 | Toy story | 1995 | 3.735154 | 1263.0 |
2 | Jumanji | 1995 | 2.976471 | 765.0 |
3 | Grumpy Old Men | 1993 | 2.873016 | 252.0 |
4 | Waiting to Exhale | 1995 | 2.577778 | 45.0 |
5 | Father of the Bride Part II | 1995 | 2.753333 | 225.0 |
... | ... | ... | ... | ... |
65088 | Bedtime Stories | 2008 | 3.500000 | 1.0 |
65091 | Manhattan Melodrama | 1934 | 4.000000 | 1.0 |
65126 | Choke | 2008 | 3.250000 | 2.0 |
65130 | Revolutionary Road | 2008 | 2.500000 | 1.0 |
65133 | Blackadder Back & Forth | 1999 | 4.000000 | 3.0 |
10197 rows × 4 columns
This has the same result.
Join Types¶
There are multiple types of joins. The types of Pandas joins correspond to SQL join types, if you are familiar with those.
We refer to the frames as left and right. These correspond to positions in the functions as follows:
pd.merge(left, right)
left.join(right)
- An inner join requires a match to appear in both frames. If there is a row in one frame whose join column value has no match in the other frame, the row is excluded. This applies to both frames.
- A left join includes every row in the left-hand frame at least once; if there is a left-hand row with no match in the right-hand frame, one copy is included with missing values for all of the right-hand columns. Right-hand rows with no matches in the left frame are excluded.
- A right join is the reverse of a left join: it includes every value in the right-hand frame at least once.
- An outer join includes every row from both frames at least once.
pd.merge
defaults to an inner join, and left.join(right)
defaults to a left join. Both take a how=
option to change the merge type.
Join Tips¶
The join I find myself using most frequently is a column-index join. I also usually join on a single column, although multiple-column joins are very well-supported.
Tall and Wide¶
As discussed in the video, data comes in tall and wide formats. If we observations of multiple variables for something, it can be in either format:
- In wide format, each variable is a column
- In tall (or long) format, each variable produces a new row; there is a column for the variable name and another for the value
We can convert back and forth between them.
Tall to Wide¶
The pivot
function converts tall data to wide data.
First, let's make some tall data. The movie-tags table has the tags that apply to each movie, along with their weight: how important that tag is. Let's look (joining to tag
to get the tag names):
movie_tags.join(tags.set_index('id'), on='tagID')
movieID | tagID | tagWeight | value | |
---|---|---|---|---|
0 | 1 | 7 | 1 | funny |
1 | 1 | 13 | 3 | time travel |
2 | 1 | 25 | 3 | tim allen |
3 | 1 | 55 | 3 | comedy |
4 | 1 | 60 | 1 | fun |
... | ... | ... | ... | ... |
51790 | 65037 | 792 | 1 | autism |
51791 | 65037 | 2214 | 1 | internet |
51792 | 65126 | 5281 | 1 | based on book |
51793 | 65126 | 13168 | 1 | chuck palahniuk |
51794 | 65130 | 2924 | 1 | toplist08 |
51795 rows × 4 columns
value
is a funny name, and once we have name we don't need tag ID, so let's clean up these columns:
movie_tag_weights = movie_tags.join(tags.set_index('id'), on='tagID')
movie_tag_weights = movie_tag_weights[['movieID', 'value', 'tagWeight']]
movie_tag_weights = movie_tag_weights.rename(columns={
'value': 'tag',
'tagWeight': 'weight'
})
movie_tag_weights
movieID | tag | weight | |
---|---|---|---|
0 | 1 | funny | 1 |
1 | 1 | time travel | 3 |
2 | 1 | tim allen | 3 |
3 | 1 | comedy | 3 |
4 | 1 | fun | 1 |
... | ... | ... | ... |
51790 | 65037 | autism | 1 |
51791 | 65037 | internet | 1 |
51792 | 65126 | based on book | 1 |
51793 | 65126 | chuck palahniuk | 1 |
51794 | 65130 | toplist08 | 1 |
51795 rows × 3 columns
This is in tall format: for each movie, we have a row for each tag with its weight.
If we want to turn this into a (very!) wide table, with a column for each tag, we can use pivot
.
If we have: a data frame in tall format (one column has variable names, and another their values)
And we want: the same data in wide format (variables split out into columns)
Then we can: use pivot
to pivot the wide data into tall format:
mtw_wide = movie_tag_weights.pivot('movieID', 'tag', 'weight')
mtw_wide
tag | (s)vcd | 007 (series) | 15th century | 16mm | 16th century | 17th century | 1800s | 1890s | 18th century | 1900s | ... | zeppelin | zero mostel | zibri studio | zim | ziyi zhang | zombie | zombie movie | zombies | zoo | zooey deschanel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
movieID | |||||||||||||||||||||
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64993 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
65006 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
65037 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
65126 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
65130 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7155 rows × 5297 columns
We see a lot of missing values (NaN). The describe
method on a data frame will describe all the columns, which will include counting the values. Let's see it:
mtw_desc = mtw_wide.describe()
mtw_desc
tag | (s)vcd | 007 (series) | 15th century | 16mm | 16th century | 17th century | 1800s | 1890s | 18th century | 1900s | ... | zeppelin | zero mostel | zibri studio | zim | ziyi zhang | zombie | zombie movie | zombies | zoo | zooey deschanel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 31.000000 | 23.000000 | 2.0 | 5.0 | 6.0 | 10.000000 | 2.0 | 4.0 | 12.000000 | 7.000000 | ... | 2.0 | 2.0 | 9.000000 | 7.0 | 2.0 | 25.00000 | 3.0 | 43.000000 | 2.0 | 2.0 |
mean | 1.032258 | 1.043478 | 1.0 | 1.0 | 1.0 | 1.300000 | 1.0 | 1.0 | 1.583333 | 1.142857 | ... | 1.0 | 1.0 | 1.111111 | 1.0 | 1.0 | 1.52000 | 2.0 | 3.465116 | 1.0 | 1.0 |
std | 0.179605 | 0.208514 | 0.0 | 0.0 | 0.0 | 0.483046 | 0.0 | 0.0 | 0.996205 | 0.377964 | ... | 0.0 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.87178 | 1.0 | 3.607854 | 0.0 | 0.0 |
min | 1.000000 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.000000 | 1.000000 | ... | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.00000 | 1.0 | 1.000000 | 1.0 | 1.0 |
25% | 1.000000 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.000000 | 1.000000 | ... | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.00000 | 1.5 | 1.000000 | 1.0 | 1.0 |
50% | 1.000000 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.000000 | 1.000000 | ... | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.00000 | 2.0 | 2.000000 | 1.0 | 1.0 |
75% | 1.000000 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.750000 | 1.0 | 1.0 | 2.000000 | 1.000000 | ... | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 2.00000 | 2.5 | 5.000000 | 1.0 | 1.0 |
max | 2.000000 | 2.000000 | 1.0 | 1.0 | 1.0 | 2.000000 | 1.0 | 1.0 | 4.000000 | 2.000000 | ... | 1.0 | 1.0 | 2.000000 | 1.0 | 1.0 | 4.00000 | 3.0 | 19.000000 | 1.0 | 1.0 |
8 rows × 5297 columns
That might look better if we switch columns and rows:
mtw_desc.T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
tag | ||||||||
(s)vcd | 31.0 | 1.032258 | 0.179605 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
007 (series) | 23.0 | 1.043478 | 0.208514 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
15th century | 2.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
16mm | 5.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
16th century | 6.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
zombie | 25.0 | 1.520000 | 0.871780 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
zombie movie | 3.0 | 2.000000 | 1.000000 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 |
zombies | 43.0 | 3.465116 | 3.607854 | 1.0 | 1.0 | 2.0 | 5.0 | 19.0 |
zoo | 2.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
zooey deschanel | 2.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
5297 rows × 8 columns
Ok isn't that kinda cool?
movie_top_rates = movies[['id', 'rtTopCriticsRating']].rename(columns={'rtTopCriticsRating': 'rating'})
movie_top_rates['source'] = 'Top Critics'
movie_all_rates = movies[['id', 'rtAllCriticsRating']].rename(columns={'rtAllCriticsRating': 'rating'})
movie_all_rates['source'] = 'All Critics'
movie_rates = pd.concat([movie_top_rates, movie_all_rates], ignore_index=True)
movie_rates
id | rating | source | |
---|---|---|---|
0 | 1 | 8.5 | Top Critics |
1 | 2 | 5.8 | Top Critics |
2 | 3 | 7.0 | Top Critics |
3 | 4 | 5.5 | Top Critics |
4 | 5 | 5.4 | Top Critics |
... | ... | ... | ... |
20389 | 65088 | 4.4 | All Critics |
20390 | 65091 | 7.0 | All Critics |
20391 | 65126 | 5.6 | All Critics |
20392 | 65130 | 6.7 | All Critics |
20393 | 65133 | 0.0 | All Critics |
20394 rows × 3 columns
The ignore_index=True
option causes pd.concat
to throw away the indexes and make a new RangeIndex
. This is fine because the indexes don't contain any useful information.
We can also concatenate with the index; this is most useful when the frames have disjoint indexes, or when we want to add another index level containing the source. We'll discuss this in the indexing notebook.
Wide to Tall¶
What if we want to go the other way? That's what melt
is for.
Remember our movie stats by year?
year_scores
All Critics | Top Critics | Audience | |
---|---|---|---|
year | |||
1903 | 7.600000 | 0.000000 | 0.000000 |
1915 | 8.000000 | 0.000000 | 3.300000 |
1916 | 7.800000 | 0.000000 | 3.800000 |
1917 | 0.000000 | 0.000000 | 0.000000 |
1918 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... |
2007 | 5.141278 | 4.769042 | 3.062162 |
2008 | 4.699357 | 4.383280 | 2.853698 |
2009 | 4.850000 | 4.700000 | 3.192308 |
2010 | 0.000000 | 0.000000 | 0.000000 |
2011 | 0.000000 | 0.000000 | 0.000000 |
98 rows × 3 columns
If we have: a data frame in wide format
And we want: the same data in tall format
Then we can: use the melt
method or function.
We first have to reset the index, because melt destroys it:
year_scores.reset_index()
year | All Critics | Top Critics | Audience | |
---|---|---|---|---|
0 | 1903 | 7.600000 | 0.000000 | 0.000000 |
1 | 1915 | 8.000000 | 0.000000 | 3.300000 |
2 | 1916 | 7.800000 | 0.000000 | 3.800000 |
3 | 1917 | 0.000000 | 0.000000 | 0.000000 |
4 | 1918 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... |
93 | 2007 | 5.141278 | 4.769042 | 3.062162 |
94 | 2008 | 4.699357 | 4.383280 | 2.853698 |
95 | 2009 | 4.850000 | 4.700000 | 3.192308 |
96 | 2010 | 0.000000 | 0.000000 | 0.000000 |
97 | 2011 | 0.000000 | 0.000000 | 0.000000 |
98 rows × 4 columns
That pops the year back out to being a column, so we can pass it as the id_var
to melt
:
ys_tall = year_scores.reset_index().melt(id_vars='year')
ys_tall
year | variable | value | |
---|---|---|---|
0 | 1903 | All Critics | 7.600000 |
1 | 1915 | All Critics | 8.000000 |
2 | 1916 | All Critics | 7.800000 |
3 | 1917 | All Critics | 0.000000 |
4 | 1918 | All Critics | 0.000000 |
... | ... | ... | ... |
289 | 2007 | Audience | 3.062162 |
290 | 2008 | Audience | 2.853698 |
291 | 2009 | Audience | 3.192308 |
292 | 2010 | Audience | 0.000000 |
293 | 2011 | Audience | 0.000000 |
294 rows × 3 columns
We can give better names to these things:
ys_tall = year_scores.reset_index().melt(id_vars='year', var_name='Source', value_name='MeanAvgScore')
ys_tall
year | Source | MeanAvgScore | |
---|---|---|---|
0 | 1903 | All Critics | 7.600000 |
1 | 1915 | All Critics | 8.000000 |
2 | 1916 | All Critics | 7.800000 |
3 | 1917 | All Critics | 0.000000 |
4 | 1918 | All Critics | 0.000000 |
... | ... | ... | ... |
289 | 2007 | Audience | 3.062162 |
290 | 2008 | Audience | 2.853698 |
291 | 2009 | Audience | 3.192308 |
292 | 2010 | Audience | 0.000000 |
293 | 2011 | Audience | 0.000000 |
294 rows × 3 columns
This is useful for plotting - year is the x axis, score, the y, and source the color or hue:
sns.lineplot('year', 'MeanAvgScore', hue='Source', data=ys_tall)
<matplotlib.axes._subplots.AxesSubplot at 0x270245c6460>
In practice, setting up plots is one of my most frequent use cases for melt
. The Pandas defaults we saw earlier are useful, but this give us much more control.