Introducing Pandas¶
Learning Outcomes¶
Import Python libraries
Load a data file into Pandas
Examine the size and data types of a data frame
Understand the relationship between DataFrame, Series, and Index
Python Modules¶
Most useful Python functions are in modules.
We need to import them.
Standard imports:
import numpy as np
import pandas as pd
This imports the module and gives it a shorter name.
Utilities¶
I’m going to define some utilities for future use.
First, another form of import: with from, we can import a few functions, submodules, or other names into our namespace:
from itertools import islice
And now we’re going to define a function that prints the first n lines of a file:
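def print_first_lines(fn, n=5):
    # islice stops after n lines, so the rest of the file is never read
    with open(fn, 'r') as f:
        for line in islice(f, n):
            # strip the trailing '\n' (safe even if the last line has none)
            print(line.rstrip('\n'))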
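A minimal sketch of what islice does, using a small in-memory list as a hypothetical stand-in for the lines of a file (the list contents are made up for illustration):

```python
from itertools import islice

# Hypothetical stand-in for the lines of a file
lines = ["first\n", "second\n", "third\n", "fourth\n"]

# islice takes at most n items from any iterable
it = iter(lines)
head = list(islice(it, 2))

# The iterator was only advanced past the items we asked for
rest = list(it)

print(head)
print(rest)
```

Because islice only pulls the items it needs, print_first_lines should work on very large files without reading them in full.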
Comma Separated Values¶
Many data files are distributed in CSV (comma-separated values) format:
print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
Reading CSV¶
movies = pd.read_csv('ml-25m/movies.csv')
movies
 | movieId | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
... | ... | ... | ... |
62418 | 209157 | We (2018) | Drama |
62419 | 209159 | Window of the Soul (2001) | Documentary |
62420 | 209163 | Bad Poems (2018) | Comedy|Drama |
62421 | 209169 | A Girl Thing (2001) | (no genres listed) |
62422 | 209171 | Women of Devil's Island (1962) | Action|Adventure|Drama |
62423 rows × 3 columns
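If the ml-25m files are not at hand, the same call can be tried on a few lines of text. This is a sketch, assuming the four data rows printed earlier; io.StringIO makes a string behave like an open file, and the name sample is only for illustration:

```python
import io

import pandas as pd

# The header and first four rows shown above, as an in-memory CSV
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
"""

# read_csv accepts a path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the file becomes the column names, and Pandas adds a default index counting up from 0.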
The Data Frame¶
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movieId  62423 non-null  int64
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
RangeIndex: indexes \(0 \dots n-1\)
3 columns
1 int64 column
2 object columns (these store strings)
62423 rows
Initial Questions¶
How much data do we have?
62423 rows, each with 3 columns
What kind(s) of data do we have?
integer movieId
string title
string genres
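Besides info(), these questions can be answered with a few attributes. A minimal sketch on a three-row frame built by hand (the name sample is hypothetical; the rows are the first movies shown above):

```python
import pandas as pd

# Small hand-built frame standing in for the full movies table
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# How much data? shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# len() of a frame is its number of rows
print(len(sample))

# What kinds of data? dtypes lists one data type per column
print(sample.dtypes)
```

Depending on the Pandas version, string columns may be reported as object or as a dedicated string type.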
Instances¶
What is the data about?
Movies
Datasheets for Datasets talks about instances and what they represent
Columns¶
Each column is a Series
Access like a dictionary:
movies['title']
0 Toy Story (1995)
1 Jumanji (1995)
2 Grumpier Old Men (1995)
3 Waiting to Exhale (1995)
4 Father of the Bride Part II (1995)
...
62418 We (2018)
62419 Window of the Soul (2001)
62420 Bad Poems (2018)
62421 A Girl Thing (2001)
62422 Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object
Series¶
An array with an index
All columns of a data frame share the same index
We’ll learn about indexes in another video.
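A small sketch of these two points, assuming a hand-built frame (sample and the variable names are hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

# Selecting one column gives a Series
titles = sample['title']
print(type(titles).__name__)

# A Series is an array of values ...
print(titles.to_numpy())

# ... together with an index
print(titles.index)

# Every column carries the same index as the frame itself
same_index = titles.index.equals(sample['movieId'].index)
print(same_index)
```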
Another Frame¶
Let’s load the ratings:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   userId     int64
 1   movieId    int64
 2   rating     float64
 3   timestamp  int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
How many instances? 25M
What does each contain?
userId (int)
movieId (int)
rating (float)
timestamp (int)
Linking Data¶
Data can refer to other data
Ratings are instances themselves
But each connects a user to a movie — other types!
Like relational foreign keys
We can merge frames on these keys; we will see that later
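As a brief preview, here is a sketch of how the shared movieId column links the two tables. The rating rows below are made up for illustration and are not taken from the MovieLens data; movies_small and ratings_small are hypothetical names:

```python
import pandas as pd

movies_small = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})

# Made-up ratings: each row connects a user to a movie
ratings_small = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# Match each rating to its movie through the shared movieId column
joined = ratings_small.merge(movies_small, on='movieId')
print(joined)
```

Each rating row picks up the title of the movie it refers to, much like following a foreign key in a relational database.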
Wrapping Up¶
A data frame consists of columns
Each column is a series: array with index
We want to know quickly:
how many rows? (instances)
what columns?
what data types?