Introducing Pandas¶
Learning Outcomes¶
- Import Python libraries
- Load a data file into Pandas
- Examine the size and data types of a data frame
- Understand the relationship between DataFrame, Series, and Index
Python Modules¶
Most useful Python functions are in modules.
We need to import them.
Standard imports:
In [1]:
import numpy as np
import pandas as pd
This imports the module and gives it a shorter name.
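As a small illustration of what the aliases give us, the sketch below calls one function through each short name. The variable names `arr` and `ser` are made up for this example.

```python
import numpy as np
import pandas as pd

# The aliases are ordinary names bound to the imported modules,
# so everything in each module is reached through the short prefix.
arr = np.arange(3)      # NumPy array holding 0, 1, 2
ser = pd.Series(arr)    # Pandas series built from that array

print(type(arr).__name__)   # ndarray
print(type(ser).__name__)   # Series
```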
Utilities¶
I'm going to define some utilities for future use.
First, another form of import: with from, we can import a few functions, submodules, or other names into our namespace:
In [2]:
from itertools import islice
And now we're going to define a function that prints the first n lines of a file:
In [3]:
def print_first_lines(fn, n=5):
    """Print the first n lines of the file named fn."""
    with open(fn, 'r') as f:
        for line in islice(f, n):
            print(line.rstrip('\n'))  # drop the trailing newline, if there is one
Comma Separated Values¶
Many data files are distributed in CSV (comma-separated values) format:
In [4]:
print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
Reading CSV¶
In [5]:
movies = pd.read_csv('ml-25m/movies.csv')
movies
Out[5]:
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 62418 | 209157 | We (2018) | Drama |
| 62419 | 209159 | Window of the Soul (2001) | Documentary |
| 62420 | 209163 | Bad Poems (2018) | Comedy\|Drama |
| 62421 | 209169 | A Girl Thing (2001) | (no genres listed) |
| 62422 | 209171 | Women of Devil's Island (1962) | Action\|Adventure\|Drama |
62423 rows × 3 columns
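To try `pd.read_csv` without the MovieLens files on disk, here is a minimal sketch that reads the same layout from an in-memory string. The `csv_text` sample and the name `sample` are stand-ins for illustration, not part of the data set files.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as movies.csv
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
"""

# read_csv accepts a path or any file-like object; the first line
# supplies the column names and the remaining lines become rows.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (2, 3)
print(list(sample.columns))   # ['movieId', 'title', 'genres']
```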
The Data Frame¶
In [6]:
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movieId  62423 non-null  int64
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
- RangeIndex: indexes $0 \dots n-1$
- 3 columns
  - 1 int64
  - 2 object (these store strings)
- 62423 rows
Initial Questions¶
How much data do we have?
- 62423 rows, each with 3 columns
What kind(s) of data do we have?
- integer movieId
- string title
- string genres
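Both questions can also be answered without reading the full `info()` report. The sketch below uses a small made-up frame named `toy` in place of `movies`. Note that the exact dtype shown for text columns (`object` in the output above) may differ between Pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the movies frame
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation', 'Adventure|Children', 'Comedy|Romance'],
})

# How much data? shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)   # 3 3
print(len(toy))         # 3 (len counts rows)

# What kinds of data? dtypes holds one data type per column.
print(toy.dtypes)
```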
Instances¶
What is the data about?
- Movies
- Datasheets for Datasets talks about instances and what they represent
In [7]:
movies['title']
Out[7]:
0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...
62418                             We (2018)
62419             Window of the Soul (2001)
62420                      Bad Poems (2018)
62421                   A Girl Thing (2001)
62422        Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object
Series¶
- An array with an index
- All columns of a data frame share the same index
We'll learn about indexes in another video.
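The two points about series can be checked directly. This is a minimal sketch on a made-up two-row frame (`toy` is a hypothetical stand-in for `movies`).

```python
import pandas as pd

# Hypothetical stand-in for the movies frame
toy = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})

# Selecting one column by name gives a Series.
titles = toy['title']
print(type(titles).__name__)                        # Series

# Every column carries the same index as the frame it came from.
print(titles.index.equals(toy.index))               # True
print(titles.index.equals(toy['movieId'].index))    # True

# The "array" part of the series, without the index.
values = titles.to_numpy()
print(len(values))                                  # 2
```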
Another Frame¶
Let's load the ratings:
In [9]:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   userId     int64
 1   movieId    int64
 2   rating     float64
 3   timestamp  int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
How many instances? 25M
What does each contain?
- userId (int)
- movieId (int)
- rating (float)
- timestamp (int)
Linking Data¶
Data can refer to other data
- Ratings are instances themselves
- But each connects a user to a movie — other types!
- Like relational foreign keys
We can merge linked tables; we will see that later.
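As a preview, here is a rough sketch of such a merge on two tiny made-up frames. The names `movies_toy` and `ratings_toy` and their values are invented for illustration, and joining on `movieId` is an assumption based on the column the two tables share.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
movies_toy = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})
ratings_toy = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# movieId in ratings_toy acts like a foreign key into movies_toy:
# merging on it attaches the matching title to every rating.
joined = ratings_toy.merge(movies_toy, on='movieId')
print(joined)
```

Each rating row keeps its own values and gains the title of the movie it refers to.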
Wrapping Up¶
- A data frame consists of columns
- Each column is a series: array with index
- We want to know quickly:
- how many rows? (instances)
- what columns?
- what data types?