Introducing Pandas
Introducing Pandas#
Learning Outcomes#
Import Python libraries
Load a data file into Pandas
Examine the size and data types of a data frame
Understand the relationship between DataFrame, Series, and Index
Python Modules#
Most useful Python functions are in modules.
We need to import them.
Standard imports:
import numpy as np
import pandas as pd
This imports the module and gives it a shorter name.
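As a small illustration of what the aliases give us, the sketch below calls one function from each module through its short name. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# np and pd are just shorter names for the numpy and pandas modules
values = np.array([1.0, 2.0, 3.0])
mean_value = np.mean(values)   # the same function as numpy.mean

# build a tiny Series from the array through the pd alias
s = pd.Series(values)
print(mean_value, len(s))
```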
Utilities#
I'm going to define some utilities for future use.
First, another form of import: with from, we can import a few functions, submodules, or other names into our namespace:
from itertools import islice
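For readers who have not met islice before, here is a minimal sketch of what it does: it takes the first few items from any iterable. The range used here is only an illustration.

```python
from itertools import islice

# islice yields the first n items of an iterable and then stops
first_three = list(islice(range(10), 3))
print(first_three)   # [0, 1, 2]
```

Because the name was imported with from, we call it as islice rather than itertools.islice.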
And now we're going to define a function that prints the first n lines of a file:
def print_first_lines(fn, n=5):
    """Print the first n lines of the file named fn."""
    with open(fn, 'r') as f:
        for line in islice(f, n):
            print(line.rstrip('\n'))  # drop the trailing newline, if any
Comma Separated Values#
Many data files are distributed in CSV (comma-separated values) format:
print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
Reading CSV#
movies = pd.read_csv('ml-25m/movies.csv')
movies
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 62418 | 209157 | We (2018) | Drama |
| 62419 | 209159 | Window of the Soul (2001) | Documentary |
| 62420 | 209163 | Bad Poems (2018) | Comedy\|Drama |
| 62421 | 209169 | A Girl Thing (2001) | (no genres listed) |
| 62422 | 209171 | Women of Devil's Island (1962) | Action\|Adventure\|Drama |
62423 rows × 3 columns
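To try read_csv without the data set on disk, a minimal sketch is to parse a few CSV lines held in memory. The text below copies the first rows of the listing above; wrapping it in io.StringIO is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# A few lines copied from the file listing, held in memory instead of on disk
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text becomes the column names, and Pandas adds the row labels 0, 1, 2 on its own.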
The Data Frame#
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 movieId 62423 non-null int64
1 title 62423 non-null object
2 genres 62423 non-null object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
RangeIndex: indexes \(0 \dots n-1\)
62423 rows
3 columns:
1 int64
2 object (these store strings)
Initial Questions#
How much data do we have?
62423 rows, each with 3 columns
What kind(s) of data do we have?
integer movieId
string title
string genres
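Besides info(), these questions can be answered one at a time. The sketch below uses a hypothetical three-row stand-in for the movies frame so that it runs without the data file. The dtype reported for string columns may differ between Pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the movies frame, with the same three columns
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy', 'Comedy|Romance'],
})

n_rows, n_cols = sample.shape   # (rows, columns) tuple
print(n_rows, n_cols)
print(len(sample))              # number of rows
print(sample.dtypes)            # one data type per column
```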
Instances#
What is the data about?
Movies
Datasheets for Datasets talks about instances and what they represent
Columns#
Each column is a Series
Access like a dictionary:
movies['title']
0 Toy Story (1995)
1 Jumanji (1995)
2 Grumpier Old Men (1995)
3 Waiting to Exhale (1995)
4 Father of the Bride Part II (1995)
...
62418 We (2018)
62419 Window of the Soul (2001)
62420 Bad Poems (2018)
62421 A Girl Thing (2001)
62422 Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object
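The dictionary analogy can be checked directly. This is a minimal sketch on a hypothetical two-row frame: indexing by column name returns a Series, and the column names behave like dictionary keys.

```python
import pandas as pd

# Hypothetical two-row frame standing in for movies
sample = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})

titles = sample['title']          # look up a column by name, like a dict key
print(type(titles).__name__)      # Series
print(titles.name)                # title
print(list(sample.keys()))        # the column names
print('title' in sample)          # membership tests the column names
```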
Series#
An array with an index
All columns of a data frame share the same index
We'll learn about indexes in another video.
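A short sketch of both points, again on a hypothetical small frame: a Series can be split into its array of values and its index, and each column's index equals the frame's index.

```python
import pandas as pd

# Hypothetical three-row frame standing in for movies
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

ids = sample['movieId']
print(ids.to_numpy())   # the underlying array of values
print(ids.index)        # the index that labels those values

# every column carries the same index as the frame itself
print(ids.index.equals(sample.index))
print(sample['title'].index.equals(sample.index))
```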
Another Frame#
Let's load the ratings:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 userId int64
1 movieId int64
2 rating float64
3 timestamp int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
How many instances? 25M
What does each contain?
userId (int)
movieId (int)
rating (float)
timestamp (int)
Linking Data#
Data can refer to other data
Ratings are instances themselves
But each one connects a user to a movie, and those are instances of other types!
Like relational foreign keys
We can merge the frames; we'll see that later
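As a preview only, here is a minimal sketch of such a merge. Both frames are hypothetical miniatures with made-up ratings, and the names movies_small, ratings_small, and linked are introduced just for this illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames
movies_small = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})
ratings_small = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# movieId in the ratings refers to movieId in the movies, like a foreign key
linked = ratings_small.merge(movies_small, on='movieId')
print(linked)
```

Each rating row picks up the title of the movie it refers to, so the result has one row per rating.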
Wrapping Up#
A data frame consists of columns
Each column is a series: array with index
We want to know quickly:
how many rows? (instances)
what columns?
what data types?