Introducing Pandas

Learning Outcomes

  • Import Python libraries
  • Load a data file into Pandas
  • Examine the size and data types of a data frame
  • Understand the relationship between DataFrame, Series, and Index

Python Modules

Most useful Python functions are in modules.

We need to import a module before we can use its functions.

Standard imports:

In [1]:
import numpy as np
import pandas as pd

Each statement imports a module and gives it a shorter name (np for numpy, pd for pandas).
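As a small illustration of how the short names are used afterwards, the sketch below builds a NumPy array and wraps it in a Pandas Series. The variable names `values` and `series` are made up for this example.

```python
import numpy as np
import pandas as pd

# The aliases stand in for the full module names.
values = np.arange(3)        # same function as numpy.arange
series = pd.Series(values)   # same class as pandas.Series

print(series.sum())
```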

Comma Separated Values

Many data files are distributed in CSV (comma-separated values) format:

In [4]:
print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
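The definition of print_first_lines is not shown in these slides. One way such a helper might work is sketched below as a hypothetical stand-in named `first_lines`, which returns the first few lines of a text file. It is demonstrated on a small temporary file (the name `demo.csv` is made up) rather than on the real data set.

```python
import itertools
import tempfile
from pathlib import Path

def first_lines(path, n=5):
    """Return the first n lines of a text file (hypothetical stand-in helper)."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in itertools.islice(f, n)]

# Write a tiny file in the same layout as movies.csv, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.csv'
    demo.write_text(
        'movieId,title,genres\n'
        '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n'
        '2,Jumanji (1995),Adventure|Children|Fantasy\n',
        encoding='utf-8',
    )
    lines = first_lines(demo, n=2)

for line in lines:
    print(line)
```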

Reading CSV

In [5]:
movies = pd.read_csv('ml-25m/movies.csv')
movies
Out[5]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
... ... ... ...
62418 209157 We (2018) Drama
62419 209159 Window of the Soul (2001) Documentary
62420 209163 Bad Poems (2018) Comedy|Drama
62421 209169 A Girl Thing (2001) (no genres listed)
62422 209171 Women of Devil's Island (1962) Action|Adventure|Drama

62423 rows × 3 columns
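To try read_csv without downloading the data set, the sketch below parses a few lines in the same layout from an in-memory string. Using `io.StringIO` in place of a file path, and the names `csv_text` and `sample`, are choices made for this example only.

```python
import io
import pandas as pd

# A few lines in the same layout as movies.csv, held in memory.
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

# read_csv accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(list(sample.columns))
```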

The Data Frame

In [6]:
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
  • RangeIndex: indexes $0 \dots n-1$
  • 3 columns
    • 1 int64
    • 2 object — these store strings
  • 62423 rows
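The same facts can also be read from individual attributes instead of the info() report. The sketch below uses a three-row frame built by hand (named `sample` for this example) so that it runs without the data files. Depending on the Pandas version, string columns may be reported as object or as a dedicated string type.

```python
import pandas as pd

# A small hand-built frame with the same columns as the movies table.
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': [
        'Adventure|Animation|Children|Comedy|Fantasy',
        'Adventure|Children|Fantasy',
        'Comedy|Romance',
    ],
})

n_rows, n_cols = sample.shape   # (rows, columns)
print(n_rows, n_cols)
print(sample.dtypes)            # one data type per column
print(sample.index)             # a RangeIndex running from 0 to n_rows - 1
```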

Initial Questions

How much data do we have?

  • 62423 rows, each with 3 columns

What kind(s) of data do we have?

  • integer movieId
  • string title
  • string genres

Instances

What is the data about?

  • Movies
  • Datasheets for Datasets talks about instances and what they represent

Columns

Each column is a Series

Access a column by name, like a dictionary key:

In [7]:
movies['title']
Out[7]:
0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...                
62418                             We (2018)
62419             Window of the Soul (2001)
62420                      Bad Poems (2018)
62421                   A Girl Thing (2001)
62422        Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object

Series

  • An array with an index
  • All columns of a data frame share the same index
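A short sketch of these two points, using a small hand-built frame (the names `sample` and `titles` are made up for this example): selecting a column returns a Series, and every column carries the same index as the frame.

```python
import pandas as pd

sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

titles = sample['title']        # dictionary-style access returns a Series
print(type(titles).__name__)
print(titles.name)              # the Series remembers its column name

# Each column has the same index as the frame, and so the same as each other.
print(titles.index.equals(sample.index))
print(titles.index.equals(sample['movieId'].index))
```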

We'll learn about indexes in another video.

Another Frame

Let's load the ratings:

In [9]:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

How many instances? About 25M (25,000,095 rows)

What does each contain?

  • userId (int)
  • movieId (int)
  • rating (float)
  • timestamp (int)
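The reported memory usage can be checked with a little arithmetic: int64 and float64 values each take 8 bytes, and there are four such columns. The sketch below assumes the report divides by 1024 at each step (so "MB" means $2^{20}$ bytes) and ignores the small overhead of the index.

```python
n_rows = 25_000_095
n_cols = 4
bytes_per_value = 8   # int64 and float64 are both 8 bytes wide

total_bytes = n_rows * n_cols * bytes_per_value
print(total_bytes)
print(round(total_bytes / 2**20, 1))   # size in units of 2**20 bytes
```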

Linking Data

Data can refer to other data

  • Ratings are instances themselves
  • But each connects a user to a movie — other types!
  • Like relational foreign keys

We can merge data frames on a shared column; we will see that later.
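As a preview only, here is a minimal sketch of such a merge on two toy frames. The user IDs and rating values are invented for illustration and do not come from the data set. The names `toy_movies`, `toy_ratings`, and `rated` are likewise made up.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are invented for illustration.
toy_movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})
toy_ratings = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# movieId plays the role of a foreign key: each rating row picks up
# the title of the movie it refers to.
rated = toy_ratings.merge(toy_movies, on='movieId')
print(rated[['userId', 'title', 'rating']])
```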

Wrapping Up

  • A data frame consists of columns
  • Each column is a Series: an array with an index
  • We want to know quickly:
    • how many rows? (instances)
    • what columns?
    • what data types?