Introducing Pandas¶

Learning Outcomes¶

Import Python libraries
Load a data file into Pandas
Examine the size and data types of a data frame
Understand the relationship between DataFrame, Series, and Index

Python Modules¶

Most useful Python functions are in modules.

We need to import them.

Standard imports:

In [1]:

import numpy as np
import pandas as pd

Each statement imports a module and gives it a shorter name (np for numpy, pd for pandas).
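As a small illustration of what the short names buy us, the sketch below calls one function from each module through its alias. The variable names arr and frame are made up for this example and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# np and pd are simply shorter names bound to the imported modules
arr = np.arange(3)                  # NumPy array holding 0, 1, 2
frame = pd.DataFrame({'x': arr})    # one-column data frame built from it

print(type(arr).__name__, type(frame).__name__)
```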

Utilities¶

I'm going to define some utilities for future use.

First, another form of import: with from, we can import a few functions, submodules, or other names into our namespace:

In [2]:

from itertools import islice

And now we're going to define a function that prints the first n lines of a file:

In [3]:

def print_first_lines(fn, n=5):
    with open(fn, 'r') as f:
        for line in islice(f, n):
            print(line.rstrip('\n'))  # strip the trailing newline, if present

Comma Separated Values¶

Many data files are distributed in CSV (comma-separated values) format:

In [4]:

print_first_lines('ml-25m/movies.csv')

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance

Reading CSV¶

In [5]:

movies = pd.read_csv('ml-25m/movies.csv')
movies

Out[5]:

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy
...	...	...	...
62418	209157	We (2018)	Drama
62419	209159	Window of the Soul (2001)	Documentary
62420	209163	Bad Poems (2018)	Comedy\|Drama
62421	209169	A Girl Thing (2001)	(no genres listed)
62422	209171	Women of Devil's Island (1962)	Action\|Adventure\|Drama

62423 rows × 3 columns
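To try read_csv without downloading the data set, the first few lines of the file can be parsed from an in-memory string. This is a minimal sketch: sample_csv and sample are illustrative names, and the text is just the five lines printed earlier, not the full file.

```python
import io
import pandas as pd

# The header and first four records of movies.csv, as printed earlier
sample_csv = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
"""

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(sample_csv))
print(sample)
```

The first line becomes the column names, and each remaining line becomes one row.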

The Data Frame¶

In [6]:

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB

RangeIndex: indexes $0 \dots n-1$
3 columns
- 1 int64
- 2 object — these store strings
62423 rows
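The same facts can also be read off individual attributes instead of the full info() report. The sketch below assumes a small stand-in frame (small is a hypothetical name, holding the first three movies) so that it runs without the data file.

```python
import pandas as pd

# Hypothetical three-row stand-in with the same columns as movies.csv
small = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

n_rows, n_cols = small.shape    # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(small))               # len() gives the number of rows
print(small.dtypes)             # one data type per column
```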

Initial Questions¶

How much data do we have?

62423 rows, each with 3 columns

What kind(s) of data do we have?

integer movieId
string title
string genres

Instances¶

What is the data about?

Movies
Datasheets for Datasets talks about instances and what they represent

Columns¶

Each column is a Series

Access like a dictionary:

In [7]:

movies['title']

Out[7]:

0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...                
62418                             We (2018)
62419             Window of the Soul (2001)
62420                      Bad Poems (2018)
62421                   A Girl Thing (2001)
62422        Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object

Series¶

An array with an index
All columns of a data frame share the same index

We'll learn about indexes in another video.
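To make "an array with an index" concrete, the sketch below pulls one column out of a small stand-in frame and checks that it shares the frame's index. The names small and titles are hypothetical and used only here.

```python
import pandas as pd

# Hypothetical stand-in holding the first three movies
small = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

titles = small['title']                    # selecting one column gives a Series
print(type(titles).__name__)
print(titles.to_numpy())                   # the array part of the Series
print(titles.index.equals(small.index))    # the column's index is the frame's index
print(small['movieId'].index.equals(titles.index))  # and so is every other column's
```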

Another Frame¶

Let's load the ratings:

In [9]:

ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

How many instances? About 25 million (25,000,095 rows)

What does each contain?

userId (int)
movieId (int)
rating (float)
timestamp (int)

Linking Data¶

Data can refer to other data

Ratings are instances themselves
But each connects a user to a movie, which are other types of instances
This works like a foreign key in a relational database

We can merge frames on these shared keys; we will see that later
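As a brief preview of that idea, the sketch below joins two tiny stand-in frames on their shared movieId column. The ratings values here are invented purely for illustration and do not come from the data set; small_movies, small_ratings, and joined are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in frames; the rating rows are made up for illustration
small_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})
small_ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 3, 1],
    'rating': [4.0, 3.5, 5.0],
})

# Each rating row picks up the title of the movie its movieId refers to
joined = small_ratings.merge(small_movies, on='movieId')
print(joined)
```

By default merge performs an inner join, so only ratings whose movieId appears in the movie frame are kept.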

Wrapping Up¶

A data frame consists of columns
Each column is a Series: an array with an index
We want to know quickly:
- how many rows? (instances)
- what columns?
- what data types?