DataFrame, Series, and Index
Most useful Python functions are in modules.
We need to import them.
Standard imports:
import numpy as np
import pandas as pd
Each line imports a module and gives it a shorter name: np for NumPy and pd for pandas.
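As a quick check that the aliases work, the sketch below builds a small array and a small Series through the short names. The values and the variable names `values` and `s` are made up for illustration.

```python
import numpy as np
import pandas as pd

# After the imports, np and pd stand in for the full module names.
values = np.array([1.0, 2.5, 4.0])   # made-up example values
s = pd.Series(values)                # wrap the array in a pandas Series

print(values.mean())   # NumPy method, array created through the np alias
print(s.sum())         # pandas method, Series created through the pd alias
```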
Many data files are distributed in CSV (comma-separated values) format:
print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
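The text calls `print_first_lines` without defining it, and it is not part of pandas or the standard library. One way such a helper might look is sketched below. The name `first_lines`, the default of five lines, and the temporary sample file are assumptions made for illustration, not the course's actual implementation.

```python
import os
import tempfile
from itertools import islice


def first_lines(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding='utf-8') as f:
        # islice stops after n lines, so a large file is never read in full
        return [line.rstrip('\n') for line in islice(f, n)]


def print_first_lines(path, n=5):
    """Hypothetical stand-in for the helper called in the text."""
    for line in first_lines(path, n):
        print(line)


# Try it on a small temporary file holding the first rows shown above.
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, 'sample.csv')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write('movieId,title,genres\n'
                '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n'
                '2,Jumanji (1995),Adventure|Children|Fantasy\n')
    shown = first_lines(sample_path, n=2)
    print_first_lines(sample_path, n=2)
```

Looking at the raw lines before calling `pd.read_csv` is a cheap way to confirm that the file has a header row and uses commas as the separator.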
movies = pd.read_csv('ml-25m/movies.csv')
movies
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 62418 | 209157 | We (2018) | Drama |
| 62419 | 209159 | Window of the Soul (2001) | Documentary |
| 62420 | 209163 | Bad Poems (2018) | Comedy\|Drama |
| 62421 | 209169 | A Girl Thing (2001) | (no genres listed) |
| 62422 | 209171 | Women of Devil's Island (1962) | Action\|Adventure\|Drama |
62423 rows × 3 columns
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movieId  62423 non-null  int64
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
RangeIndex: indexes $0 \dots n-1$
int64
object: these store strings
How much data do we have?
What kind(s) of data do we have?
What is the data about?
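The questions above can also be answered piece by piece with individual attributes instead of reading the full `info()` report. The sketch below uses a three-row sample built from an in-memory string so that it runs without the MovieLens files. The names `csv_text` and `sample` are illustrative.

```python
import io

import pandas as pd

# A few rows in the same layout as movies.csv, kept in memory.
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # how much data: (rows, columns)
print(sample.dtypes)          # what kinds of data: one dtype per column
print(sample.index)           # a RangeIndex, since no index column was given
print(list(sample.columns))   # a first hint at what the data is about
```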
movies['title']
0 Toy Story (1995)
1 Jumanji (1995)
2 Grumpier Old Men (1995)
3 Waiting to Exhale (1995)
4 Father of the Bride Part II (1995)
...
62418 We (2018)
62419 Window of the Soul (2001)
62420 Bad Poems (2018)
62421 A Girl Thing (2001)
62422 Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object
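Selecting one column with square brackets returns a Series rather than a DataFrame, which is why the output above ends with a `Name:` line instead of a table footer. A minimal sketch on made-up sample rows (the names `sample` and `titles` are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

titles = sample['title']      # one column name in brackets selects a Series

print(type(titles).__name__)  # the class of the selected column
print(titles.name)            # the Series keeps the column's name
print(len(titles))            # one entry per row of the DataFrame
```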
We'll learn about indexes in another video.
Let's load the ratings:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   userId     int64
 1   movieId    int64
 2   rating     float64
 3   timestamp  int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
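The memory usage reported by `ratings.info()` can be sanity-checked by hand, because every column is a 64-bit type that takes 8 bytes per value. The sketch below assumes that pandas reports the figure in units of 1024² bytes and that the index adds a negligible amount.

```python
n_rows = 25_000_095    # entries reported by ratings.info()
n_cols = 4             # userId, movieId, rating, timestamp
bytes_per_value = 8    # int64 and float64 both use 8 bytes per value

total_bytes = n_rows * n_cols * bytes_per_value
total_mb = total_bytes / 1024**2   # assumption: "MB" here means 1024 * 1024 bytes

print(total_bytes)
print(round(total_mb, 1))
```

The same arithmetic does not work for the movies table: its object columns store references to Python strings, so `info()` can only give a lower bound there (the `+` in `1.4+ MB`).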
How many instances? About 25 million (25,000,095 rows).
What does each contain?
userId (int)
movieId (int)
rating (float)
timestamp (int)
Data can refer to other data
We can merge the two tables on movieId; we will see that later.
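As a brief preview only, here is a sketch of how a shared `movieId` column lets ratings be matched to movie titles. The rows, user IDs, and the names `movies_small`, `ratings_small`, and `merged` are made up for illustration. The full treatment of merging comes later.

```python
import pandas as pd

# Tiny made-up tables in the same layout as movies.csv and ratings.csv.
movies_small = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})
ratings_small = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# Each rating row picks up the title of the movie whose movieId it refers to.
merged = ratings_small.merge(movies_small, on='movieId')
print(merged)
```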