Week 1 — Question
The key learning outcomes for this week are:
Ask and refine questions that can be answered with data
Install and run the software required for the course
Write and run basic Python code in a Jupyter notebook
Begin to think about the complexity of meaningful questions
This week uses chapters 1–3 from the textbook. If you already know Python, that should mostly be review.
We'll do this through several resources and activities:
This week's videos are available as a Panopto playlist.
Introduction
This video introduces the course and our learning outcomes for the term.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
INTRODUCTION
Video Outcomes
Introduce the class subject
Understand course learning outcomes
Understand structure of this class
Know where to get help
What is Data Science?
The use of data to provide quantitative insights on questions of scientific, business, or social interest.
Data and Meaning
“By itself, some piece of data has no meaning. Data is only given meaning – as evidence – by the people who make use of it.”
Sergio Sismondo, An Introduction to Science and Technology Studies, 2nd Edition. Wiley-Blackwell, 2010, page 133.
Learning Outcomes
Explore a data set to determine whether and how it might illuminate questions of interest.
Define and operationalize a research question such that a data analysis could produce meaningful knowledge.
Use best practices to carry out analyses in a documented, reproducible, and efficient fashion.
Present the results of a data analysis with appropriate visuals and written argument.
Identify weaknesses in a data analysis and assess their impact on the correctness and utility of the results.
Assess ethical implications of an analysis in terms of both classical human subject research ethics and contemporary concerns such as fairness and bias.
Understand the space of data science techniques and applications, and relate future learning to this framework.
Course Components
Videos and readings – this is the content delivery
Zoom meetings – Q&A, discussion, and exercises
Exercises and practice (ungraded)
Assignments
Exams (delivered online)
2 midterms + final
Getting Help
Course forum on Piazza – encouraged!
Office hours – by appointment
Google Calendar appointment slots, link online
Please use Piazza for course inquiries
Wrapping Up
This is going to be a different semester
Learning through videos and reading
Discussion and support on Zoom
Let’s learn!
We’re in the Internet!
Asking Questions
In this video, I introduce questions in their broader context of using data to advance goals.
I also introduce the idea of operationalization, which will be a key concept throughout the class.
Lighting
That video is just a little off, you may notice: I forgot to turn on the lights when recording it.
Learning Outcomes
Understand components of a good question
Understand how questions relate to goals and analyses
Think about data for a question
Key term: operationalization
What is Data Science?
The use of data to provide quantitative insights on questions of scientific, business, or social interest.
Example: Department Interest
Context: changed assignment structures in CS121
Goal: Assess whether new change improved CS121
What does ‘improve’ mean?
What data could inform the assessment?
What could we do with that data to measure improvement?
Operationalization
We call this operationalization
Start with a goal
Refine through intermediate questions
Define a specific measurement to take
What, precisely, is collected
How, precisely, it is computed
operationalize, v.:
The process of doing this.
operationalization, n.:
The result of this process.
Example: Department Interest
Goal: Assess whether new change improved CS121
Question: Are students better prepared to excel at work?
Question: Are students better prepared for the next class?
Goals, Questions, and Analysis
Questions do not have 1 level
May be multiple levels
Answering the question should advance the goal.
Or higher-level question
Carrying out the analysis should answer the question.
Operationalized Questions
They are specific
They are answerable
They can be answered with available data
Questions that fail this are not bad!
Example: Department Interest
Goal: Assess whether new change improved CS121
Question: Are students better prepared for next class?
Question: Are students more likely to pass CS 221?
Data: student grades for CS121 & 221
Caveats: many
Clarifying Questions
Your boss, client, etc. sets goals, may have questions
But they need to be refined before meeting data!
Refine this through clarifying questions!
Wrapping Up
Multiple layers to translate between data and high-level goals or objectives.
Questions bridge this gap.
Multiple levels of questions may be needed.
Further Reading
Questioning Questions
We make our operationalizations better by questioning them. What do they capture? Who or what do they prioritize?
Learning Outcomes
Identify stakeholders and subgroups for a problem
Evaluate the operationalization of a question to see who and what it prioritizes
Key concept: understand a metric by asking what makes it improve
Example from Last Video
Goal: Assess whether new change improved CS121
Question: Are students better prepared for next class?
Question: Are students more likely to pass CS 221?
Data: student grades for CS121 & 221
Caveats: many
Measuring the Question
Option 1: Pass Rate
What fraction of students pass CS221?
Does the new 121 method improve it?
Option 2: Grade
What is the average grade in CS221?
Does the new 121 method improve it?
Letter or course points?
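As a rough sketch of how these two options could be computed, assume we have CS221 grade points (4.0 scale) for students who took each version of CS121. The grade values, the variable names, and the passing threshold of 1.7 are all invented for illustration; they are not real course data or policy.

```python
from statistics import mean

# Hypothetical CS221 grade points; not real course data.
old_method = [0.0, 1.0, 2.0, 2.7, 3.0, 3.3, 3.7, 4.0]
new_method = [1.0, 2.0, 2.3, 2.7, 3.0, 3.3, 3.7, 4.0]

PASSING = 1.7  # assumed passing threshold; the real policy may differ


def pass_rate(grades, threshold=PASSING):
    """Option 1: fraction of students at or above the threshold."""
    passed = [g for g in grades if g >= threshold]
    return len(passed) / len(grades)


def average_grade(grades):
    """Option 2: mean grade points."""
    return mean(grades)


for name, grades in [('old', old_method), ('new', new_method)]:
    print(f'{name}: pass rate {pass_rate(grades):.3f}, '
          f'average grade {average_grade(grades):.3f}')
```

Writing the measurement down this precisely forces the choices the slide asks about, such as whether to use letter grades or course points and where the passing line sits.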
Understanding Measurements
How do I improve this measurement?
Improving Metrics
Option 1: Pass Rate
Increase the number of people who pass the class
The only way to improve it is helping those w/ the most difficulty
Option 2: Grade
Make students do better
Can be improved at the top (A- → A)
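One way to see the difference is to change a single student's grade and watch which metric moves. This is only a sketch: the grade lists and the 1.7 passing threshold are hypothetical.

```python
from statistics import mean

PASSING = 1.7  # assumed passing threshold, for illustration only


def pass_rate(grades):
    """Fraction of grades at or above the passing threshold."""
    return sum(g >= PASSING for g in grades) / len(grades)


baseline = [1.0, 2.0, 3.0, 3.7]         # hypothetical grade points
top_improved = [1.0, 2.0, 3.0, 4.0]     # an A- (3.7) becomes an A (4.0)
bottom_improved = [2.0, 2.0, 3.0, 3.7]  # the failing student now passes

for label, grades in [('baseline', baseline),
                      ('top improved', top_improved),
                      ('bottom improved', bottom_improved)]:
    print(f'{label}: pass rate {pass_rate(grades):.2f}, '
          f'mean grade {mean(grades):.3f}')
```

In this toy data, helping the strongest student raises the mean grade but leaves the pass rate unchanged, while helping the struggling student moves both.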
Measurement
You get what you measure.
Stakeholders
Students
Faculty
Department
Employers
University
High-performing students
Underprepared students
Marginalized students
How Do I Improve This?
Clarifies metric behavior
What changes to improve it?
What can remain the same while it is improved?
How can it be gamed or manipulated?
Metrics are always lossy and reductive, but this question helps evaluate and improve.
Wrapping Up
It is crucial to appropriately define measurements, and question our definitions.
How do we improve a metric?
Who or what does a metric prioritize?
Photo: Nicolos Tissot on Unsplash
Week 1 Quiz
Take the Week 1 Quiz in Blackboard (it's under “Quizzes”).
Tuesday Breakpoint
If you have completed through this point, you are ready for Tuesday's synchronous activities.
First Week Grace
The due date for the Week 1 quiz is flexible. I strongly encourage you to complete it on Monday,
but it is not due until the end of the week.
Subsequent weekly quizzes will be due at the end of the day on Monday.
Tuesday Session
About Me
Working Ahead
While this and following sections are not required before Tuesday's class, there is no reason you can't
get started whenever you want.
In this video, I talk about my background, teaching, and research, as well as the structure of publishing computer science research and my professional involvement with different parts of that.
CVish
Ph.D. 2014, University of Minnesota
Asst. Prof @ Boise State 2016–present
Sr. Member, Association for Computing Machinery
Human-computer interaction, data science
Teaching
Lots of things about data
Storing and querying it (CS 410/510)
Using it for science (CS 533)
Using it to recommend things (CS 538)
Research
Information access systems (search, recommendation, etc.)
Are they fair and equitable?
What are their human impacts, for good or ill?
How do we build them to meet overlooked needs?
Professional Communities
ACM Conference on Recommender Systems (RecSys)
General Chair 2018, Steering Committee, program committee
ACM Conference on Fairness, Accountability, and Transparency (FAccT)
Network Co-chair (coordinating workshops)
Executive Committee
Review for many other conferences
Distinguished Reviewer for TiiS journal
How CS Scholarship Works
Conference
Submit paper (full, short)
Peer-reviewed
Presented at conference
Workshop
Affiliated with conference
Lower-prestige, often WIP
Journal
More developed ideas
Longer treatment
Sometimes expanded version of conference papers
Lots of variance
All of these require committees, labor to function!
Some Favorites
📺 Deep Space Nine, The Good Place
🎥 The Iron Giant
📚 The Broken Earth trilogy (N.K. Jemisin)
🤣 Savage Chickens, Calvin and Hobbes
See you in class!
Introduce Yourself
On Piazza, there is a thread for self-introductions. Reply and introduce yourself!
This counts towards participation points.
Textbook Chapters
The Python material we are working on this week is a subset of the material in chapters 1–3 of the textbook.
I don't expect you to get through all 3 chapters thoroughly this week, and we will be introducing more Python features as we need them throughout
the semester.
I will note specific chapters and sections relevant to videos in their Resources subsections.
Install Software
Make sure you have installed the course software so you can complete the assignment.
Content Structure
Learning Outcomes
Understand the organization of the class content
Photo by Prateek Katyal on Unsplash
A Class in 4 Acts
Act 1 Answering Questions with Data
Act 2 Building and Evaluating Models
Act 3 Sourcing, Cleaning, and Integrating Data
Act 4 More Data Types and Models
Assignments and Dependencies
Class material is (mostly) sequentially dependent
We’ll use what came before to understand the next thing
Assignments and exams semi-cumulative
Within a Week
Each week has a page
Page contains all content
Videos
References to readings
Online
Textbook
Multiple ways to learn some material
Wrapping Up
The course is blocked into pieces.
We’ll build sequentially throughout the semester.
Photo by Markus Spiske on Unsplash
Our First Python Notebook
This video shows you how to start Jupyter, create a notebook, and run Python code.
It also shows you how to prepare a notebook to submit as an assignment.
Learning Outcomes
Start Jupyter
Run Python code
Prepare a notebook for submission
USING ONYX
Resources
Data Types and Control Flow
This video introduces fundamental Python data types and operations, along with variables and basic control flow.
Resources
Control Structures
This video introduces Python control structures and code layout.
Learning Outcomes
Write basic Python control structures
Understand Python block syntax
Know standard practice for whitespace
Key concept: Python uses whitespace to detect blocks, such as the bodies of loops or conditionals.
Photo by gryffyn m on Unsplash
For Loops
for i in range(5):
    print(f'iteration {i}')
print('done')
Iterate over an iterable
range(5) iterates 0, 1, 2, 3, 4
Whitespace delimits blocks
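A short sketch of two common variations: looping directly over a list, and using range with an explicit start. The list contents and variable names are only illustrative.

```python
# Loop over a list directly; no index variable is needed.
courses = ['CS 121', 'CS 221', 'CS 533']
for course in courses:
    print(course)

# range(start, stop) excludes stop, so this visits 1 through 5.
squares = []
for n in range(1, 6):
    squares.append(n * n)
print(squares)
```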
Blocks
Python blocks begin with ‘:’ at the end of a line
if, else, elif, for, while, def, class
Block content is indented one level
Standard practice is 4 spaces (Jupyter defaults to this)
Block ends when indentation returns to previous level
for i in range(5):
    print(f'iteration {i}')
print('done')
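Blocks can be nested, with each level indented a further 4 spaces. The following sketch (variable names are arbitrary) uses comments to mark where each block begins and ends.

```python
evens = []
for i in range(6):
    # One level in: body of the for loop.
    if i % 2 == 0:
        # Two levels in: body of the if statement.
        evens.append(i)
    # Back to one level: the if block has ended, the loop body has not.
# Back to no indentation: the loop has ended, so this runs once.
print(evens)
```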
Comments
Python comments
Begin with ‘#’
Continue until end of line
# this is a comment
print('foo') # can comment here too
If Statements
x = 5
if x >= 10:
    print('big')
elif x >= 5:
    print('medium')
else:
    print('small')
These are false:
False (Boolean value)
None (equiv. to null)
0
Empty list, tuple, set, or dict ([], (), set(), {})
Empty string ('')
Most other things are true.
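You can check how Python treats any value in a condition by passing it to the built-in bool. A small sketch with a few sample values (the list is just for illustration; note that 0.0 and an empty dict are also false):

```python
values = [False, None, 0, 0.0, [], (), set(), {}, '', 'text', [0], 42]

for v in values:
    # bool(v) is the truth value an if statement would use for v.
    print(f'{v!r:>8} -> {bool(v)}')

truthy = [v for v in values if v]
print(truthy)
```

A list containing a zero, [0], is true because it is not empty, even though its only element is false.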
While Loops
while condition:
    # do something
    pass
Iterates until condition is false
‘pass’ does nothing
Here only to make it valid Python
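A concrete loop may make the template easier to read. This sketch (the starting value is arbitrary) halves a number until it drops below 1 and counts how many steps that takes.

```python
value = 20
steps = 0
while value >= 1:
    # The condition is re-checked before every pass through the body.
    value = value / 2
    steps += 1
print(steps, value)
```

The body must eventually make the condition false; otherwise the loop never ends.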
Wrapping Up
Python provides the usual control structures (if, for, while)
Blocks are based on indentation
'#' starts a comment
Photo by Erik Kroon on Unsplash
Scientific Python
This video introduces NumPy ndarray, the fundamental numeric array data structure for scientific computing.
Learning Outcomes
Understand the limitations of core Python data types for data science
Know the three key array data types:
ndarray
Series
DataFrame
Perform basic vectorized operations
Photo by Faris Mohammed on Unsplash
Lists of Numbers
Python:
numbers = [0.3, 9.2, 1.0, 6.7]
Remember that everything is an object?
List of 4 pointers (8 bytes each), with header (16 bytes)
To 4 objects: 8 bytes (double) + 16 bytes (object header)
Total: 144 bytes
Elements can have different types!
Sum our Numbers
total = 0
for x in numbers:
    total = total + x
What’s wrong?
Python is slow – convenient, but slow
Interpreted
Dynamically typed
Pointers to objects cause cache misses
Enter NumPy
NumPy provides efficient numeric array types (‘ndarray’s)
import numpy as np
numbers = np.array([0.3, 9.2, 1.0, 6.7])
All elements have the same type
Stored directly in the array – no indirection, contiguous
Many ways to load or create arrays without going through lists
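A brief sketch of how to inspect the storage of this array and how to create arrays without building a list first. The names zeros and counts are just for this example.

```python
import numpy as np

numbers = np.array([0.3, 9.2, 1.0, 6.7])
print(numbers.dtype)     # element type shared by the whole array
print(numbers.itemsize)  # bytes per element
print(numbers.nbytes)    # bytes of element data, not counting the array header

# Two ways to create arrays directly.
zeros = np.zeros(4)      # four 0.0 values
counts = np.arange(4)    # the integers 0, 1, 2, 3
print(zeros, counts)
```

For four double-precision values the element data takes 32 bytes, compared with the estimate of 144 bytes for the equivalent Python list.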
Sum our Numbers
total = np.sum(numbers)
Shorter (although we could have used Python’s sum earlier)
Implemented in compiled language
NumPy array internal layout is compatible with C and/or Fortran
Don’t loop over arrays.
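The three approaches compute the same quantity; they differ in where the work happens. A sketch comparing them on the same values (variable names are illustrative, and results may differ in the last few floating-point digits, so the comparison uses math.isclose):

```python
import math
import numpy as np

values = [0.3, 9.2, 1.0, 6.7]
numbers = np.array(values)

# Explicit Python loop: every addition goes through the interpreter.
loop_total = 0
for x in values:
    loop_total = loop_total + x

builtin_total = sum(values)     # Python's built-in sum over the list
array_total = np.sum(numbers)   # NumPy sum, run in compiled code

print(loop_total, builtin_total, array_total)
print(math.isclose(loop_total, array_total))
```

On four elements the speed difference is not noticeable; it becomes significant as arrays grow to thousands or millions of values.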
Vectorization
NumPy lets us perform operations on arrays:
scale = np.linspace(0, 1, 4)
numbers + scale
Result:
array([0.3 , 9.53333333, 1.66666667, 7.7 ])
linspace creates an array from 0 to 1 (both inclusive), evenly spaced into 4 elements (same size as numbers)
+ does elementwise addition: adds corresponding elements
Efficiently in compiled code
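Addition is only one of many vectorized operations. A few more, sketched on the same array (the result names are just for this example):

```python
import numpy as np

numbers = np.array([0.3, 9.2, 1.0, 6.7])

doubled = numbers * 2                # a scalar is applied to every element
is_large = numbers > 5               # comparisons produce a Boolean array
large_values = numbers[is_large]     # a Boolean array selects matching elements
centered = numbers - numbers.mean()  # subtract the mean from every element

print(doubled)
print(is_large)
print(large_values)
print(centered)
```

None of these lines contains a Python loop; each one hands the whole array to compiled code.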
About Arrays
Each array has 3 key things:
A data type (dtype) – what kind of elements?
A shape (tuple of ints) – how big?
May be multidimensional, e.g. (100,50) for 100x50 matrix
Elements – the data itself
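Each of these properties can be read directly from an array. A minimal sketch, using an all-zero matrix as a stand-in for real data:

```python
import numpy as np

matrix = np.zeros((100, 50))  # 100 rows by 50 columns, filled with 0.0
print(matrix.dtype)  # float64
print(matrix.shape)  # (100, 50)
print(matrix.ndim)   # 2 (number of dimensions)
print(matrix.size)   # 5000 (total number of elements)

ints = np.array([1, 2, 3])
print(ints.dtype, ints.shape)  # an integer dtype, and a 1-element shape tuple
```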
Pandas
Pandas builds on NumPy with two new data types:
Series – an array with an associated index (element labels)
DataFrame – a table where each column is a series
You’ll see these briefly in A0, and more next week!
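As a small preview, here is a sketch of both types. The student labels and grade values are made up for illustration.

```python
import pandas as pd

# A Series: values plus an index of labels.
grades = pd.Series([3.7, 2.0, 3.0], index=['ana', 'ben', 'cat'])
print(grades['ben'])  # look up a value by its label

# A DataFrame: a table whose columns are Series sharing one index.
frame = pd.DataFrame({
    'cs121': [3.7, 2.0, 3.0],
    'cs221': [3.3, 2.3, 3.7],
}, index=['ana', 'ben', 'cat'])
print(frame)
print(frame['cs221'].mean())  # columns support vectorized operations
```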
Use of Lists and Loops
We’ll still sometimes use lists and loops
Lists of arrays or data frames
Looping over input files or groups of data
But we avoid looping over individual data points.
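The pattern looks roughly like this: a Python loop over a handful of groups, with vectorized operations inside each one. The group names and values below are hypothetical stand-ins for, say, one array per input file.

```python
import numpy as np

# Hypothetical groups of measurements.
groups = {
    'a': np.array([1.0, 2.0, 3.0]),
    'b': np.array([10.0, 20.0]),
}

means = {}
for name, data in groups.items():  # looping over groups is fine
    means[name] = data.mean()      # vectorized over the individual points
print(means)
```

The loop runs once per group, not once per data point, so its overhead stays small even when each array is large.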
Wrapping Up
NumPy provides efficient arrays
These are the backbone of our data processing
Prefer ‘vectorized’ operation whenever possible
Practice: put these in a notebook!
Fortak, Lord High Researcher of Clan Urdnot; Mass Effect 2
Resources
Assignment 0
Complete and submit Assignment 0 by midnight on Sunday, August 30.