DataFrame, Series, and Index
Most useful Python functions are in modules.
We need to import them.
Standard imports:
import numpy as np
import pandas as pd
Each line imports a module and gives it a shorter name (np, pd).
Many data files are distributed in CSV (comma-separated values) format:
print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
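The helper print_first_lines is not defined in these notes. A minimal sketch of what it might look like, assuming it simply prints the first few lines of a text file (the default n=5 and the first_lines helper are assumptions for illustration):

```python
from itertools import islice
import os
import tempfile


def first_lines(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]


def print_first_lines(path, n=5):
    """Hypothetical stand-in for the helper used in the notes."""
    for line in first_lines(path, n):
        print(line)


# Try it on a small temporary file so the sketch runs without the dataset.
sample = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation\n"
    "2,Jumanji (1995),Adventure|Children\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)
    sample_path = tmp.name

lines = first_lines(sample_path, n=2)
print_first_lines(sample_path, n=2)
os.remove(sample_path)
```

Looking at the raw text first shows the header row and the delimiter before handing the file to pandas.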
movies = pd.read_csv('ml-25m/movies.csv')
movies
 | movieId | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
... | ... | ... | ... |
62418 | 209157 | We (2018) | Drama |
62419 | 209159 | Window of the Soul (2001) | Documentary |
62420 | 209163 | Bad Poems (2018) | Comedy|Drama |
62421 | 209169 | A Girl Thing (2001) | (no genres listed) |
62422 | 209171 | Women of Devil's Island (1962) | Action|Adventure|Drama |
62423 rows × 3 columns
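To try pd.read_csv without downloading the dataset, the same call can be pointed at an in-memory buffer. This is a small sketch: the three rows are copied from the file preview above, and the name sample_movies is only for illustration.

```python
import io

import pandas as pd

# A few rows copied from movies.csv, held in memory instead of on disk.
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

# read_csv accepts a file path or any file-like object.
sample_movies = pd.read_csv(io.StringIO(csv_text))

print(sample_movies.shape)          # (rows, columns)
print(list(sample_movies.columns))  # column names come from the header row
```

The first line of the CSV becomes the column names, and pandas adds a default integer index for the rows.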
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movieId  62423 non-null  int64
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
RangeIndex: indexes $0 \dots n-1$
int64
object: these store strings
How much data do we have?
What kind(s) of data do we have?
What is the data about?
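Besides info(), the first two questions can be answered with a few attributes. The sketch below uses a tiny hand-built frame (sample_movies, an illustrative stand-in for the real table) so it runs on its own:

```python
import pandas as pd

# Small stand-in for the movies table, built by hand for illustration.
sample_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# How much data? shape is (number of rows, number of columns).
n_rows, n_cols = sample_movies.shape
print(n_rows, n_cols)

# What kinds of data? dtypes lists one type per column.
# The exact name shown for string columns can depend on the pandas version.
print(sample_movies.dtypes)

# What is the data about? Looking at a few rows is usually the quickest check.
print(sample_movies.head(2))
```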
movies['title']
0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...
62418                             We (2018)
62419             Window of the Soul (2001)
62420                      Bad Poems (2018)
62421                   A Girl Thing (2001)
62422        Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object
We'll learn about indexes in another video.
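Selecting one column with square brackets gives back a Series, which carries the column name and shares the frame's index. A short sketch on a hand-built frame (sample_movies is illustrative, not the full dataset):

```python
import pandas as pd

sample_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

# A single column is a Series, not a DataFrame.
titles = sample_movies['title']
print(type(titles).__name__)
print(titles.name)

# The Series keeps the same index as the frame it came from.
print(titles.index.equals(sample_movies.index))

# .loc looks a value up by its index label.
first_title = titles.loc[0]
print(first_title)
```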
Let's load the ratings:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   userId     int64
 1   movieId    int64
 2   rating     float64
 3   timestamp  int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
How many instances? 25M
What does each contain?
userId (int)
movieId (int)
rating (float)
timestamp (int)
Data can refer to other data
We can merge tables that refer to each other; we will see that later.
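As a preview, here is a rough sketch of how ratings can refer to movies through the shared movieId column. The two small frames are stand-ins, and the rating values are invented for illustration, not taken from the dataset:

```python
import pandas as pd

sample_movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})

# Invented ratings: each row points at a movie by its movieId.
sample_ratings = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# Attach the matching movie row to every rating, joining on movieId.
merged = sample_ratings.merge(sample_movies, on='movieId')
print(merged)
```

Each rating row now carries the title of the movie it refers to.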