Introducing Pandas¶

Learning Outcomes¶

  • Import Python libraries

  • Load a data file into Pandas

  • Examine the size and data types of a data frame

  • Understand the relationship between DataFrame, Series, and Index

Python Modules¶

Most useful Python functions are in modules.

We need to import them.

Standard imports:

import numpy as np
import pandas as pd

Each import loads the module and binds it to a shorter alias (np, pd).
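As a quick illustration of what the alias means (a sketch, not part of the original notebook): after the import, the short name is just another name for the module object, and functions are reached through it.

```python
import numpy as np
import pandas as pd

# The alias refers to the module itself; its real name is unchanged.
print(np.__name__)  # numpy
print(pd.__name__)  # pandas

# Functions are accessed through the alias.
arr = np.arange(3)  # array holding 0, 1, 2
total = arr.sum()
print(total)        # 3
```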

Utilities¶

I’m going to define some utilities for future use.

First, another form of import: with from, we can import a few functions, submodules, or other names into our namespace:

from itertools import islice

And now we’re going to define a function that prints the first n lines of a file:

def print_first_lines(fn, n=5):
    with open(fn, 'r') as f:
        for line in islice(f, n):
            print(line.rstrip('\n'))  # strip the trailing newline, if present
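To see why islice suits this job, here is a minimal sketch. It assumes an in-memory io.StringIO object as a stand-in for a real file (the names text and fake_file are illustrative only). islice takes just the first n lines and leaves the rest unread.

```python
import io
from itertools import islice

# Stand-in for a file on disk: iterating a file-like object yields lines.
text = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
fake_file = io.StringIO(text)

# Take only the first two lines; islice stops without consuming more.
first_two = list(islice(fake_file, 2))
print(first_two)

# The third line is still waiting to be read.
remaining = fake_file.readline()
print(remaining)
```

Because nothing beyond the requested lines is read, the same approach works on very large files.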

Comma Separated Values¶

Many data files are distributed in CSV (comma-separated values) format:

print_first_lines('ml-25m/movies.csv')
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance

Reading CSV¶

movies = pd.read_csv('ml-25m/movies.csv')
movies
       movieId  title                               genres
    0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
    1        2  Jumanji (1995)                      Adventure|Children|Fantasy
    2        3  Grumpier Old Men (1995)             Comedy|Romance
    3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
    4        5  Father of the Bride Part II (1995)  Comedy
  ...      ...  ...                                 ...
62418   209157  We (2018)                           Drama
62419   209159  Window of the Soul (2001)           Documentary
62420   209163  Bad Poems (2018)                    Comedy|Drama
62421   209169  A Girl Thing (2001)                 (no genres listed)
62422   209171  Women of Devil's Island (1962)      Action|Adventure|Drama

62423 rows × 3 columns
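If the MovieLens files are not at hand, the same call can be tried on a few lines of CSV text. This is a sketch that assumes an in-memory io.StringIO buffer in place of the file path; csv_text and sample are illustrative names, and the rows are the ones printed earlier.

```python
import io
import pandas as pd

# The first lines of movies.csv, as shown above.
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
"""

# read_csv accepts a path or a file-like object; the header row
# becomes the column names.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.columns.tolist())  # ['movieId', 'title', 'genres']
print(len(sample))              # 4
```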

The Data Frame¶

movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
  • RangeIndex: indexes \(0 \dots n-1\)

  • 3 columns

    • 1 int64

    • 2 object — these store strings

  • 62423 rows
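The same facts can also be read off individual attributes. The sketch below uses a small hand-built frame (sample is an illustrative name, with rows copied from the output above) in place of the full movies data. Depending on the pandas version, string columns may be reported as object or as a dedicated string dtype.

```python
import pandas as pd

# Small stand-in for the movies frame.
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)  # 3 3

# dtypes lists the data type of each column.
print(sample.dtypes)

# index holds the row labels.
print(sample.index)
```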

Initial Questions¶

How much data do we have?

  • 62423 rows, each with 3 columns

What kind(s) of data do we have?

  • integer movieId

  • string title

  • string genres

Instances¶

What is the data about?

  • Movies

  • Datasheets for Datasets talks about instances and what they represent

Columns¶

Each column is a Series

Access like a dictionary:

movies['title']
0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...                
62418                             We (2018)
62419             Window of the Soul (2001)
62420                      Bad Poems (2018)
62421                   A Girl Thing (2001)
62422        Women of Devil's Island (1962)
Name: title, Length: 62423, dtype: object
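A few more dictionary-style operations also work on a data frame. This sketch uses a small stand-in frame (sample is an illustrative name) rather than the full movies data.

```python
import pandas as pd

# Small stand-in for the movies frame.
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

titles = sample['title']       # look up a column by name, like a dict key
print(type(titles).__name__)   # Series
print('title' in sample)       # True: `in` tests column names
print(list(sample.keys()))     # ['movieId', 'title']
```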

Series¶

  • An array with an index

  • All columns of a data frame share the same index
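Both points can be checked directly. The sketch below assumes a small stand-in frame (sample, titles, and ids are illustrative names) and compares the index of two columns with the index of the frame.

```python
import pandas as pd

# Small stand-in for the movies frame.
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

titles = sample['title']
ids = sample['movieId']

# A series pairs an array of values with an index of row labels.
print(titles.to_numpy())
print(titles.index)  # RangeIndex(start=0, stop=3, step=1)

# Every column carries the same index as the frame it came from.
same_index = titles.index.equals(ids.index) and titles.index.equals(sample.index)
print(same_index)    # True
```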

We’ll learn about indexes in another video.

Another Frame¶

Let’s load the ratings:

ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

How many instances? About 25 million (25,000,095 rows)

What does each contain?

  • userId (int)

  • movieId (int)

  • rating (float)

  • timestamp (int)

Linking Data¶

Data can refer to other data

  • Ratings are instances themselves

  • But each connects a user to a movie — other types!

  • Like relational foreign keys

We can merge linked data frames; we will see that later.

Wrapping Up¶

  • A data frame consists of columns

  • Each column is a series: array with index

  • We want to know quickly:

    • how many rows? (instances)

    • what columns?

    • what data types?