Introducing Pandas#
Learning Outcomes#
- Import Python libraries 
- Load a data file into Pandas 
- Examine the size and data types of a data frame 
- Understand the relationship between DataFrame, Series, and Index
Python Modules#
Most useful Python functions are in modules.
We need to import them.
Standard imports:
import numpy as np
import pandas as pd
This imports the module and gives it a shorter name.
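As a small illustration of what the alias gives us, the sketch below calls into each library through its short name. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# np and pd are just shorter names for the numpy and pandas modules
values = np.array([1.0, 2.0, 3.0])
series = pd.Series(values)

print(np.__name__)    # the alias still refers to the module named 'numpy'
print(series.mean())  # mean of the made-up values
```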
Utilities#
I’m going to define some utilities for future use.
First, another form of import: with from, we can import a few functions, submodules, or other names into our namespace:
from itertools import islice
And now we’re going to define a function that prints the first n lines of a file:
def print_first_lines(fn, n=5):
    with open(fn, 'r') as f:
        for line in islice(f, n):
            print(line.rstrip('\n'))  # strip the trailing newline, if there is one
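To see how islice behaves here, the following sketch reads a small temporary file. The file and its contents are made up for illustration. It suggests that islice stops after n lines, and simply stops early when the file is shorter than n.

```python
from itertools import islice
import os
import tempfile

# Write a small, made-up file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('a\nb\nc\n')
    path = tmp.name

# islice(f, n) yields at most n lines from the file
with open(path, 'r') as f:
    first_two = [line.rstrip('\n') for line in islice(f, 2)]

# Asking for more lines than exist is not an error
with open(path, 'r') as f:
    up_to_ten = [line.rstrip('\n') for line in islice(f, 10)]

os.remove(path)
print(first_two)
print(up_to_ten)
```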
Comma Separated Values#
Many data files are distributed in CSV (comma-separated values) format:
print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
Reading CSV#
movies = pd.read_csv('ml-25m/movies.csv')
movies
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 62418 | 209157 | We (2018) | Drama |
| 62419 | 209159 | Window of the Soul (2001) | Documentary |
| 62420 | 209163 | Bad Poems (2018) | Comedy\|Drama |
| 62421 | 209169 | A Girl Thing (2001) | (no genres listed) |
| 62422 | 209171 | Women of Devil's Island (1962) | Action\|Adventure\|Drama |
62423 rows × 3 columns
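If the data file is not at hand, the same call can be tried on an in-memory copy of the first few lines printed earlier. This is a minimal sketch: the names sample and sample_movies are made up for the example, and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# The first lines of movies.csv as printed earlier, held in memory
sample = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

# read_csv accepts a file-like object as well as a path
sample_movies = pd.read_csv(sample)
print(sample_movies.shape)          # (rows, columns)
print(list(sample_movies.columns))  # names taken from the header line
```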
The Data Frame#
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
- RangeIndex: the row index, running from 0 to 62422
- 3 columns
  - 1 int64
  - 2 object (these store strings)
- 62423 rows
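Besides info(), the size and column types can be read directly from attributes of the frame. The sketch below uses a tiny stand-in frame, named small here only for illustration, rather than the full movie data.

```python
import pandas as pd

# A tiny stand-in for the movies frame
small = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

n_rows, n_cols = small.shape  # size as a (rows, columns) tuple
print(n_rows, n_cols)
print(len(small))             # len() of a frame is its number of rows
print(small.dtypes)           # one data type per column
```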
Initial Questions#
How much data do we have?
- 62423 rows, each with 3 columns 
What kind(s) of data do we have?
- integer movieId
- string title 
- string genres 
Instances#
What is the data about?
- Movies 
- Datasheets for Datasets talks about instances and what they represent 
Columns#
Each column is a Series
Access like a dictionary:
movies['title']
0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...                
62418                             We (2018)
62419             Window of the Soul (2001)
62420                      Bad Poems (2018)
62421                   A Girl Thing (2001)
62422        Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object
Series#
- An array with an index 
- All columns of a data frame share the same index 
We’ll learn about indexes in another video.
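The relationship between a frame, its columns, and the index can be checked on a small example. This sketch uses a made-up two-column frame, and the variable names are chosen only for illustration.

```python
import pandas as pd

small = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

titles = small['title']  # dictionary-style access returns a Series
ids = small['movieId']

print(type(titles).__name__)
print(titles.index.equals(small.index))  # the column carries the frame's index
print(ids.index.equals(titles.index))    # and so does every other column
```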
Another Frame#
Let’s load the ratings:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
How many instances? 25M
What does each contain?
- userId (int)
- movieId (int)
- rating (float)
- timestamp (int)
Linking Data#
Data can refer to other data
- Ratings are instances themselves 
- But each connects a user to a movie, which are other types of instances
- Like relational foreign keys 
We can merge frames on these shared keys; we will see that later.
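As a preview only, here is a minimal sketch of such a merge on two made-up miniature frames. The frame names and values are assumptions for illustration. It uses DataFrame.merge with movieId as the shared key.

```python
import pandas as pd

# Made-up miniature versions of the two frames
movies_small = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})
ratings_small = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# movieId plays the role of a foreign key from ratings into movies
joined = ratings_small.merge(movies_small, on='movieId')
print(joined[['userId', 'title', 'rating']])
```

Each rating row picks up the title of the movie it refers to, so the result has one row per rating.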
Wrapping Up#
- A data frame consists of columns 
- Each column is a series: array with index 
- We want to know quickly:
  - how many rows? (instances)
  - what columns?
  - what data types?