Building Data
Contents
Building Data#
This notebook discusses how to build a series or data frame, possibly using pieces of other series or data frames, to either make new columns for our data or set up a new data frame.
We are going to use the HETREC MovieLens data again.
Setup and Data Load#
Load our Python modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load the movie data:
movies = pd.read_table('hetrec2011-ml/movies.dat', sep='\t', encoding='latin1', na_values=['\\N'])
movies.head()
id | title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | ... | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy story | 114709 | Toy story (juguetes) | http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... | 1995 | toy_story | 9.0 | 73.0 | 73.0 | ... | 100.0 | 8.5 | 17.0 | 17.0 | 0.0 | 100.0 | 3.7 | 102338.0 | 81.0 | http://content7.flixster.com/movie/10/93/63/10... |
1 | 2 | Jumanji | 113497 | Jumanji | http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... | 1995 | 1068044-jumanji | 5.6 | 28.0 | 13.0 | ... | 46.0 | 5.8 | 5.0 | 2.0 | 3.0 | 40.0 | 3.2 | 44587.0 | 61.0 | http://content8.flixster.com/movie/56/79/73/56... |
2 | 3 | Grumpy Old Men | 107050 | Dos viejos gruñones | http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... | 1993 | grumpy_old_men | 5.9 | 36.0 | 24.0 | ... | 66.0 | 7.0 | 6.0 | 5.0 | 1.0 | 83.0 | 3.2 | 10489.0 | 66.0 | http://content6.flixster.com/movie/25/60/25602... |
3 | 4 | Waiting to Exhale | 114885 | Esperando un respiro | http://ia.media-imdb.com/images/M/MV5BMTczMTMy... | 1995 | waiting_to_exhale | 5.6 | 25.0 | 14.0 | ... | 56.0 | 5.5 | 11.0 | 5.0 | 6.0 | 45.0 | 3.3 | 5666.0 | 79.0 | http://content9.flixster.com/movie/10/94/17/10... |
4 | 5 | Father of the Bride Part II | 113041 | Vuelve el padre de la novia (Ahora también abu... | http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... | 1995 | father_of_the_bride_part_ii | 5.3 | 19.0 | 9.0 | ... | 47.0 | 5.4 | 5.0 | 1.0 | 4.0 | 20.0 | 3.0 | 13761.0 | 64.0 | http://content8.flixster.com/movie/25/54/25542... |
5 rows × 21 columns
And the ratings:
ratings = pd.read_table('hetrec2011-ml/user_ratedmovies-timestamps.dat', sep='\t')
ratings.head()
userID | movieID | rating | timestamp | |
---|---|---|---|---|
0 | 75 | 3 | 1.0 | 1162160236000 |
1 | 75 | 32 | 4.5 | 1162160624000 |
2 | 75 | 110 | 4.0 | 1162161008000 |
3 | 75 | 160 | 2.0 | 1162160212000 |
4 | 75 | 163 | 4.0 | 1162160970000 |
We’re also going to use genres:
movie_genres = pd.read_table('hetrec2011-ml/movie_genres.dat', sep='\t')
movie_genres.head()
movieID | genre | |
---|---|---|
0 | 1 | Adventure |
1 | 1 | Animation |
2 | 1 | Children |
3 | 1 | Comedy |
4 | 1 | Fantasy |
movie_genres['genre'].value_counts()
Drama 5076
Comedy 3566
Thriller 1664
Romance 1644
Action 1445
Crime 1086
Adventure 1003
Horror 978
Sci-Fi 740
Fantasy 535
Children 519
Mystery 497
War 494
Documentary 430
Musical 421
Animation 279
Western 261
Film-Noir 145
IMAX 25
Short 1
Name: genre, dtype: int64
Creating Arrays#
Let’s start with how to build a NumPy array, before we even get to series and frames.
NumPy gives us several ways to create arrays. We can use np.array
to create one from a list, array, data series, or anything else that NumPy thinks looks like an array (the NumPy and Pandas documentation call such objects “array-like”).
So let’s make an array:
np.array([0, 2, 5, 9, -3])
array([ 0, 2, 5, 9, -3])
We can also create an array from a range, with np.arange
:
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Another super useful way to make an array is np.linspace
, which creates an array consisting of evenly-spaced points between two other values, including both of the endpoints:
np.linspace(0, 5, 10)
array([0. , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
2.77777778, 3.33333333, 3.88888889, 4.44444444, 5. ])
Because it’s creating 10 points including 0 and 5, it doesn’t include many integer values. Creating 11 points would fix that:
np.linspace(0, 5, 11)
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
We can also randomly generate arrays. There are a lot of RNG features in NumPy (see the documentation); we’ll start with generating 100 “standard normal” values (normally distributed with mean 0 and variance 1):
rng = np.random.default_rng()
rng.standard_normal(100)
array([-0.33873561, 0.51678273, 0.14711789, 0.19838675, 0.54761942,
0.19131156, -0.4752898 , 0.02024525, -1.33948309, 0.32448087,
-0.19374478, 0.8344856 , -1.2166371 , -0.37482094, 0.86518527,
2.18446418, 0.6075668 , -2.31585428, -0.74720259, 1.19992932,
0.99405759, 2.23903908, 1.81167178, 1.00598331, 0.55223142,
0.67745515, 0.58957695, -0.46088752, 0.33887571, -0.27662875,
-1.32340716, 0.13839638, -0.57016739, -0.58275332, 1.09105615,
-0.61806743, 2.12021099, 1.24762095, -0.91650255, -0.50467559,
-1.01163067, 0.28069416, -0.35173081, -0.89731573, 1.08407209,
-1.69627407, 0.07576669, -0.08138856, 1.46850213, 0.36764678,
0.0687904 , 1.61086781, -0.90819236, -0.60426599, 0.81166916,
0.29945399, -0.56891978, -1.67152667, -0.69264555, -0.45211381,
0.30505253, 0.11921514, -1.26780228, -0.017003 , 0.91574513,
-0.26769486, -0.72651488, 0.04675951, -0.53645195, -0.03073815,
-0.71287014, -0.40548066, -1.16452118, 1.23470357, 1.14105292,
0.20514277, -1.58428461, 1.1993169 , -0.18210597, 0.00771399,
0.45130548, 0.98739545, 1.71867103, -1.28923256, 0.70000841,
-2.64988258, 1.91856809, 2.27360025, -0.50856635, -0.84726367,
0.50152163, -0.4735734 , 0.49285562, -0.970693 , 0.27513337,
0.26262928, 0.73655456, 1.09339154, -0.88900962, 1.62158089])
Creating Series#
Now let’s look at creating pandas Series
objects.
From an Array-Like#
The Series
constructor takes an array-like:
pd.Series(np.arange(5))
0 0
1 1
2 2
3 3
4 4
dtype: int32
This series has no name, no index name, and it has the integer values 0 through 4. Since it has no specified index, it uses a RangeIndex
, which indexes, again from 0 to 4.
We can specify an index as an array-like:
abc = pd.Series([1, 3, 14], index=['apple', 'banana', 'carrot'])
abc
apple 1
banana 3
carrot 14
dtype: int64
Now we have the values 1, 3, and 14 in a series, associated with index keys ‘apple’, ‘banana’, and ‘carrot’ respectively. We can look up values like we’ve seen:
abc.loc['banana']
3
From a Single Value#
The constructor can also take a single value, and create a series with that value for every element. For this use case, we must provide an index.
For example:
pd.Series(0, index=['a', 'b', 'c', 'd'])
a 0
b 0
c 0
d 0
dtype: int64
Building Up#
With this, we can then assign other values at particular places:
s = pd.Series(0, index=['a', 'b', 'c', 'd'])
s
a 0
b 0
c 0
d 0
dtype: int64
s.loc['b'] = 7
s
a 0
b 7
c 0
d 0
dtype: int64
We can do this with boolean masks, too — suppose we want to create a series that indicates if its index keys are odd or even:
s = pd.Series('odd', index=np.arange(10))
s[s.index.values % 2 == 0] = 'even'
s
0 even
1 odd
2 even
3 odd
4 even
5 odd
6 even
7 odd
8 even
9 odd
dtype: object
Example: Ordinal#
Our movies
frame has a year
column, indicating the year.
Certain sci-fi movies were era-defining: they had an unmistakable influence on the movies after them. Let’s, for sake of an argument, define the eras as follows:
Vintage (movies through 1977, when Star Wars came out)
Star Wars (1978–1999)
Post-Matrix (after 1999)
We can do that with the following steps:
Create a
Series
with the index matching the movie frame, and one of the values.Use masks based on conditions selecting the other items to fill in the othe values
(if wanted) Set up the categorical data type with a specified order.
Let’s start by finding the sci-fi movies - all movies with Sci-Fi
as one of their genres:
scifi_ids = movie_genres.loc[movie_genres['genre'] == 'Sci-Fi', 'movieID']
scifi = movies.set_index('id').loc[scifi_ids]
scifi
title | imdbID | spanishTitle | imdbPictureURL | year | rtID | rtAllCriticsRating | rtAllCriticsNumReviews | rtAllCriticsNumFresh | rtAllCriticsNumRotten | rtAllCriticsScore | rtTopCriticsRating | rtTopCriticsNumReviews | rtTopCriticsNumFresh | rtTopCriticsNumRotten | rtTopCriticsScore | rtAudienceRating | rtAudienceNumRatings | rtAudienceScore | rtPictureURL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
24 | Powder | 114168 | Powder - pura energía | http://ia.media-imdb.com/images/M/MV5BMTIwMTQ3... | 1995 | powder | 5.2 | 19.0 | 9.0 | 10.0 | 47.0 | 4.9 | 5.0 | 0.0 | 5.0 | 0.0 | 3.1 | 6650.0 | 67.0 | http://content7.flixster.com/movie/25/59/25590... |
29 | La cité des enfants perdus | 112682 | La ciudad de los niños perdidos | http://ia.media-imdb.com/images/M/MV5BMTI4MDc0... | 1995 | city_of_lost_children | 7.0 | 33.0 | 27.0 | 6.0 | 81.0 | 7.2 | 7.0 | 5.0 | 2.0 | 71.0 | 4.0 | 22420.0 | 89.0 | http://content6.flixster.com/movie/26/78/26782... |
32 | Twelve Monkeys | 114746 | Twelve monkeys: doce monos | http://ia.media-imdb.com/images/M/MV5BMTQwNzg2... | 1995 | prem-no-public-issue | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content6.flixster.com/movie/10/93/02/10... |
66 | Lawnmower Man 2: Beyond Cyberspace | 116839 | El cortador de césped II: más allá del ciberes... | http://ia.media-imdb.com/images/M/MV5BMTg1MjM3... | 1996 | lawnmower_man_2_jobes_war | 2.8 | 9.0 | 1.0 | 8.0 | 11.0 | 0.0 | 3.0 | 1.0 | 2.0 | 33.0 | 2.1 | 952.0 | 18.0 | http://content7.flixster.com/movie/10/86/80/10... |
76 | Screamers | 114367 | Asesinos cibernéticos | http://ia.media-imdb.com/images/M/MV5BMTY5NTQz... | 1995 | 1069696-screamers | 4.7 | 30.0 | 9.0 | 21.0 | 30.0 | 4.9 | 9.0 | 2.0 | 7.0 | 22.0 | 3.0 | 2070.0 | 45.0 | http://content7.flixster.com/movie/27/23/27230... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63617 | Christmas on Mars | 363240 | Christmas on Mars | http://ia.media-imdb.com/images/M/MV5BMTYzNjQ3... | 2008 | christmas_on_mars | 5.5 | 6.0 | 3.0 | 3.0 | 50.0 | 0.0 | 4.0 | 2.0 | 2.0 | 50.0 | 3.3 | 530.0 | 56.0 | http://content6.flixster.com/movie/10/88/98/10... |
63676 | Tanin no kao | 61065 | Tanin no kao | http://ia.media-imdb.com/images/M/MV5BMTk3OTkz... | 1966 | face_of_another | 7.4 | 6.0 | 6.0 | 0.0 | 100.0 | 0.0 | 1.0 | 1.0 | 0.0 | 100.0 | 4.1 | 598.0 | 90.0 | http://content8.flixster.com/movie/93/50/55/93... |
63989 | The Devil-Doll | 27521 | Muñecos infernales | http://ia.media-imdb.com/images/M/MV5BMTU5MTk0... | 1936 | the-devildoll-the-witch-of-timbuctoo | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content6.flixster.com/movie/10/87/02/10... |
64497 | The Day the Earth Stood Still | 43456 | Ultimátum a la Tierra | http://ia.media-imdb.com/images/M/MV5BMTQ2MzYy... | 1951 | 1005371-day_the_earth_stood_still | 8.0 | 50.0 | 47.0 | 3.0 | 94.0 | 7.0 | 5.0 | 4.0 | 1.0 | 80.0 | 3.8 | 8999.0 | 86.0 | http://content7.flixster.com/movie/10/87/31/10... |
64508 | Starship Troopers 3: Marauder | 844760 | Starship Troopers 3: Marauder | http://ia.media-imdb.com/images/M/MV5BMTQyMTcw... | 2008 | starship_troopers_3 | 5.4 | 5.0 | 3.0 | 2.0 | 60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | http://content9.flixster.com/movie/10/85/24/10... |
740 rows × 20 columns
And start making eras:
era = pd.Series('Vintage', index=scifi.index)
era
id
24 Vintage
29 Vintage
32 Vintage
66 Vintage
76 Vintage
...
63617 Vintage
63676 Vintage
63989 Vintage
64497 Vintage
64508 Vintage
Length: 740, dtype: object
Set Star Wars for after 1977 (we won’t bother with the end condition, because we’ll override those values in the next step):
era[scifi.year > 1977] = 'Star Wars'
era
id
24 Star Wars
29 Star Wars
32 Star Wars
66 Star Wars
76 Star Wars
...
63617 Star Wars
63676 Vintage
63989 Vintage
64497 Vintage
64508 Star Wars
Length: 740, dtype: object
And then set ‘Post-Matrix’ for after 1999:
era[scifi.year > 1999] = 'Post-Matrix'
era
id
24 Star Wars
29 Star Wars
32 Star Wars
66 Star Wars
76 Star Wars
...
63617 Post-Matrix
63676 Vintage
63989 Vintage
64497 Vintage
64508 Post-Matrix
Length: 740, dtype: object
Now we can add this as a column in our data frame:
scifi['era'] = era
How many sci-fi movies do we have from each era?
sns.countplot(scifi['era'])
<matplotlib.axes._subplots.AxesSubplot at 0x1fdb6bc0880>
But this era is in the wrong order! That’s no good. We can fix that by making the era
an ordered categorical. We often leave categorical variables as strings, unless we have a very large number of data points, but we don’t get control over their order.
Let’s make it a categorical variable:
era = era.astype('category')
And set the category order:
era = era.cat.reorder_categories(['Vintage', 'Star Wars', 'Post-Matrix'])
scifi['era'] = era
sns.countplot(scifi['era'])
<matplotlib.axes._subplots.AxesSubplot at 0x1fdb744d850>
Creating Data Frames#
In this section, we’re going to see how to create data frames!
From One Series#
We can create a data frame from one series by using the to_frame
method, as seen in Reshaping.
Let’s get a series:
id_s = movies['id']
id_s
0 1
1 2
2 3
3 4
4 5
...
10192 65088
10193 65091
10194 65126
10195 65130
10196 65133
Name: id, Length: 10197, dtype: int64
And convert it to a frame:
mids = id_s.to_frame('movieID')
mids
movieID | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
... | ... |
10192 | 65088 |
10193 | 65091 |
10194 | 65126 |
10195 | 65130 |
10196 | 65133 |
10197 rows × 1 columns
This preserves the index.
From Multiple Series#
If we have multiple series, with the same or compatible indexes, we can create a data frame from a dictionary mapping column names to series:
titles = pd.DataFrame({
'movieID': movies['id'],
'title': movies['title']
})
titles
movieID | title | |
---|---|---|
0 | 1 | Toy story |
1 | 2 | Jumanji |
2 | 3 | Grumpy Old Men |
3 | 4 | Waiting to Exhale |
4 | 5 | Father of the Bride Part II |
... | ... | ... |
10192 | 65088 | Bedtime Stories |
10193 | 65091 | Manhattan Melodrama |
10194 | 65126 | Choke |
10195 | 65130 | Revolutionary Road |
10196 | 65133 | Blackadder Back & Forth |
10197 rows × 2 columns
We can also use NumPy arrays or Python lists instead of Pandas series, so long as they are all the same length.
We can also provide an index (with index=
) in the data frame constructor.
From a List of Rows#
One more way to create a data frame is from a list (or any iterable) of rows, where each row is either a tuple or a dictionary.
The from_records
function does this.
If the rows are dictionaries, their keys are used as column names; if they are tuples, you can specify a column name with columns=['name1', 'name2']
.
One very common source of data like this is when we are reading data that comes to us in a list or file of JSON objects, or some other source of dictionaries (such as the census data, or a MongoDB connection).
For example, the Rent the Runway data comes as a GZIP-compressed list of JSON objects, one per line. We can read this into a list of JSON objects like this:
import json, gzip
with gzip.open('renttherunway_final_data.json.gz', 'r') as zf:
rtr_records = [json.loads(line) for line in zf]
# dump first value of list
rtr_records[0]
{'fit': 'fit',
'user_id': '420272',
'bust size': '34d',
'item_id': '2260466',
'weight': '137lbs',
'rating': '10',
'rented for': 'vacation',
'review_text': "An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.",
'body type': 'hourglass',
'review_summary': 'So many compliments!',
'category': 'romper',
'height': '5\' 8"',
'size': 14,
'age': '28',
'review_date': 'April 20, 2016'}
We can then use from_records
to make a data frame:
rtr = pd.DataFrame.from_records(rtr_records)
rtr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192544 entries, 0 to 192543
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fit 192544 non-null object
1 user_id 192544 non-null object
2 bust size 174133 non-null object
3 item_id 192544 non-null object
4 weight 162562 non-null object
5 rating 192462 non-null object
6 rented for 192534 non-null object
7 review_text 192544 non-null object
8 body type 177907 non-null object
9 review_summary 192544 non-null object
10 category 192544 non-null object
11 height 191867 non-null object
12 size 192544 non-null int64
13 age 191584 non-null object
14 review_date 192544 non-null object
dtypes: int64(1), object(14)
memory usage: 22.0+ MB
rtr.head()
fit | user_id | bust size | item_id | weight | rating | rented for | review_text | body type | review_summary | category | height | size | age | review_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | fit | 420272 | 34d | 2260466 | 137lbs | 10 | vacation | An adorable romper! Belt and zipper were a lit... | hourglass | So many compliments! | romper | 5' 8" | 14 | 28 | April 20, 2016 |
1 | fit | 273551 | 34b | 153475 | 132lbs | 10 | other | I rented this dress for a photo shoot. The the... | straight & narrow | I felt so glamourous!!! | gown | 5' 6" | 12 | 36 | June 18, 2013 |
2 | fit | 360448 | NaN | 1063761 | NaN | 10 | party | This hugged in all the right places! It was a ... | NaN | It was a great time to celebrate the (almost) ... | sheath | 5' 4" | 4 | 116 | December 14, 2015 |
3 | fit | 909926 | 34c | 126335 | 135lbs | 8 | formal affair | I rented this for my company's black tie award... | pear | Dress arrived on time and in perfect condition. | dress | 5' 5" | 8 | 34 | February 12, 2014 |
4 | fit | 151944 | 34b | 616682 | 145lbs | 10 | wedding | I have always been petite in my upper body and... | athletic | Was in love with this dress !!! | gown | 5' 9" | 12 | 27 | September 26, 2016 |
We would then need to go convert a lot of data types, but we have the data!
Note: Pandas also provides functions to read JSON lines from a file. But
from_records
has a lot of uses.