Week 15 — What Next?
This is the last week of class. We're going to recap, and talk about what's next, both for learning and for putting what you've learned to practical use.
This week's videos are also available as a Panopto playlist.
Recap
This video reviews the concepts we have discussed this term and puts them into the broader context of data science.
CS 533 INTRO TO DATA SCIENCE
Michael Ekstrand
RECAP
Learning Outcomes (Week)
Wrapping up!
Tie together the class content again
Apply Pandas time series operations and model correlated regression errors
Take the results of data science analysis into production or publication
Know some topics to study further to expand your data science skills
Photo by Lumitar on Unsplash
The Data Science Workflow
Transform / Prepare (ETL)
Raw Source Data
Prepared Data
Inference
Findings
Modeling
Model + Predictions
Data Description
What is Data Science?
The use of data to provide quantitative insights on questions of scientific, business, or social interest.
Data Management
Reading from static files
Processing and integrating with Pandas
To learn more:
CS 510 Databases
Application- and type-specific data in other classes
Mathematical Fundamentals
Probability Theory
Linear algebra (a little)
To learn more:
Math 562 (Probability and Statistics)
Math 503 (Advanced Linear Algebra)
Inference
Basic parametric pairwise comparisons (t-tests)
Bootstrapping
Sampling theory
Linear regression models (OLS & logistic)
To learn more:
Math 562 (Probability and Statistics)
Prediction
Regression: continuous outcomes
Classification: categorical (esp. binary) outcomes
To learn more:
CS 534 (Machine Learning)
Many other data science classes
Evaluation and Tuning
Train/test splits
Classification and continuous prediction metrics
Hyperparameter tuning
To learn more:
CS 534 (Machine Learning)
Unsupervised Learning
Lower-dimensional embedding (matrix decomposition)
Clustering
To learn more:
CS 534 (Machine Learning)
Will appear in other data science classes
Workflows
Data science pipeline
Breaking code into separate scripts & modules
Design patterns for code workflows
You will apply this throughout your classes!
Wrapping Up
This class is designed to lay a conceptual foundation for your future data science studies.
Other classes will build on these concepts and ideas!
Photo by Dave Heere on Unsplash
Data and Concept Drift
This video introduces a fundamental assumption of predictive modeling and the way drift can affect it.
CONCEPT AND DATA DRIFT
Learning Outcomes
Know crucial assumptions of machine learning evaluation and deployment
Understand how models can degrade over time
Photo by guille pozzi on Unsplash
Fundamental Assumption
Deployment Assumption
Drifts and Shifts
Offline Solution: Temporal Splitting
Random train-test split ensures train/test comparable
Assumes test is uniformly drawn from same distribution!
Alternative: temporal split
Select temporally-contiguous test data (e.g. 1 month)
Train on data before test data (no time travel!)
Benefit: simulates actual use
Drawback: temporal data no longer random, inference harder
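A temporal split can be sketched in a few lines of Pandas. This is a minimal illustration, assuming a hypothetical `events` frame with a `timestamp` column; the 30-day test window is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical toy data: one observation per day for 90 days.
events = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=90, freq="D"),
    "value": range(90),
})

# Hold out the final 30 days as a contiguous test period.
cutoff = events["timestamp"].max() - pd.Timedelta(days=30)

# Train only on data at or before the cutoff (no time travel).
train = events[events["timestamp"] <= cutoff]
test = events[events["timestamp"] > cutoff]

print(len(train), "training rows;", len(test), "test rows")
```

Every training row precedes every test row, which is what makes the evaluation resemble actual deployment.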
Online Solution: Continuous Monitoring
Instrument your system in production
Watch key metrics over time
Click-through rate
Classification rate
Regularly re-train and re-evaluate
Train model on new data
Evaluate model on new data
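As a rough sketch of what watching a metric over time might look like, the following computes a daily click-through rate from a made-up impression log and flags days below a threshold. The `log` frame, its column names, and `ALERT_THRESHOLD` are all assumptions for illustration; a real system would read from its own instrumentation.

```python
import pandas as pd

# Hypothetical impression log: one row per impression, clicked is 0 or 1.
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02",
        "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-03",
    ]),
    "clicked": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Daily click-through rate is the mean of the 0/1 click indicator per day.
daily_ctr = log.set_index("timestamp")["clicked"].resample("D").mean()

# Flag days whose rate falls below a made-up alert threshold.
ALERT_THRESHOLD = 0.4
alerts = daily_ctr[daily_ctr < ALERT_THRESHOLD]

print(daily_ctr)
print("Days needing attention:", list(alerts.index.date))
```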
Wrapping Up
ML training and evaluation assumes that the training and test data match real life.
You can’t always rely on that.
Photo by Jackson Douglas on Unsplash
Resources
Time Series Operations
Time is an important kind of data that we haven't spent much time with — this video discusses the fundamental Pandas operations for working with time-series data.
TIME SERIES
Learning Outcomes
Summarize and plot time series data with Pandas
Understand that time series data is often not independent
Photo by Bruno Figueiredo on Unsplash
Time Series
Time Representation
Many representations:
String dates
Years & months
Timestamps (seconds, ms, etc.)
Pandas: must convert to a datetime
pd.to_datetime
Pandas Setup Steps
Create/convert datetime column with instance times
Index data frame by timestamp
Sort index
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
rts = ratings.set_index('timestamp').sort_index()
Time series exception to “prefer unique indexes” guideline
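The setup steps can be tried end to end on a small stand-in table. The `ratings` frame below is invented for illustration (timestamps are Unix seconds, as in the slide's `unit='s'` call); two rows deliberately share a timestamp to show the non-unique index.

```python
import pandas as pd

# Hypothetical stand-in for the ratings table; timestamps are Unix seconds.
ratings = pd.DataFrame({
    "user": [1, 2, 1, 3],
    "rating": [4.0, 3.5, 5.0, 2.0],
    "timestamp": [1277942400, 1262304000, 1264982400, 1262304000],
})

# Step 1: convert the raw seconds into datetimes.
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")

# Steps 2 and 3: index by timestamp, then sort the index.
rts = ratings.set_index("timestamp").sort_index()

# The index is sorted but not unique, which is fine for time series.
print(rts)
print("sorted:", rts.index.is_monotonic_increasing, "unique:", rts.index.is_unique)
```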
Operation: Resampling
Like groupby, but on time intervals
Compute aggregate functions for time intervals
‘on’ option allows a non-index column
monthly_ratings = rts.resample('1M')['rating'].count()
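Here is a small self-contained sketch of resampling, including the `on` option. The data is made up. It uses the `'MS'` (month start) alias rather than the slide's `'1M'`, because the spelling of the month-end alias has changed between Pandas versions; the only difference is how each monthly bin is labeled.

```python
import pandas as pd

# Hypothetical ratings with already-parsed timestamps.
ratings = pd.DataFrame({
    "rating": [4.0, 3.5, 5.0, 2.0, 3.0],
    "timestamp": pd.to_datetime([
        "2010-01-05", "2010-01-20", "2010-02-03", "2010-02-14", "2010-02-28",
    ]),
})
rts = ratings.set_index("timestamp").sort_index()

# Resample on the index: count ratings in each calendar month.
monthly_counts = rts.resample("MS")["rating"].count()

# The `on` option resamples by a regular column instead of the index.
monthly_means = ratings.resample("MS", on="timestamp")["rating"].mean()

print(monthly_counts)
print(monthly_means)
```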
Operation: Plotting
Matplotlib & Seaborn render timestamp X axes well
sns.lineplot(data=monthly_ratings)
Operation: Range Select
Time series indexes support range operations
Select by range: rts.loc['2010-01-01':'2010-12-31']
Includes end point (unlike normal slice!)
Select by period: rts.loc['2010-07']
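Both kinds of selection can be checked on a tiny example. The frame below is invented for illustration; note that the row dated 2010-12-31 is included in the range slice.

```python
import pandas as pd

# Hypothetical ratings indexed by (sorted) timestamp.
rts = pd.DataFrame(
    {"rating": [4.0, 3.5, 5.0, 2.0]},
    index=pd.to_datetime(["2009-12-31", "2010-01-01", "2010-07-15", "2010-12-31"]),
).sort_index()

# Select by range: both end points are included.
year_2010 = rts.loc["2010-01-01":"2010-12-31"]

# Select by period: every row that falls within July 2010.
july_2010 = rts.loc["2010-07"]

print(year_2010)
print(july_2010)
```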
Operation: Diff and Shift
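As a brief illustration of these two operations (on made-up monthly counts, not data from the lecture): `diff` gives the change from the previous observation, and `shift` moves values along the index so each row can see an earlier value.

```python
import pandas as pd

# Hypothetical monthly counts.
monthly = pd.Series(
    [10, 12, 15, 11],
    index=pd.date_range("2010-01-01", periods=4, freq="MS"),
)

# Change from the previous month; the first entry has no predecessor.
change = monthly.diff()       # NaN, 2, 3, -4

# Previous month's value aligned with the current month.
previous = monthly.shift(1)   # NaN, 10, 12, 15

print(pd.DataFrame({"count": monthly, "previous": previous, "change": change}))
```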
Time Effects
Autocorrelation
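One quick way to check for autocorrelation in Pandas is `Series.autocorr`, which correlates a series with a lagged copy of itself. The two toy series below are invented to show the extremes; real data will fall somewhere in between.

```python
import pandas as pd

# A steadily increasing series: each value closely tracks the previous one.
trending = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# A series that flips sign every step: each value opposes the previous one.
alternating = pd.Series([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

# Lag-1 autocorrelation: correlation between x[t] and x[t-1].
trend_ac = trending.autocorr(lag=1)
alt_ac = alternating.autocorr(lag=1)

print("trending lag-1 autocorrelation:", trend_ac)
print("alternating lag-1 autocorrelation:", alt_ac)
```

Values far from zero in either direction suggest the observations are not independent.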
Wrapping Up
Repeated data over time requires particular handling and has distinct operations.
Pandas provides these through datetime columns and indexes.
Photo by Javier Esteban on Unsplash
Resources
Time Series Example
The MovieLens Time Series notebook demonstrates basic time series operations in Pandas.
Weekly Quiz 15
Take the Week 15 Quiz in Blackboard.
Regression models require that the data be independent. This video introduces two kinds of non-independence and methods for addressing them: grouped observations addressed with a mixed-effects model and temporal auto-correlation addressed with ARIMA models.
CORRELATED ERRORS
Learning Outcomes
Understand the idea of correlated errors and how they can be addressed.
Know when you need to look for a model that can handle correlated error.
Know two models to study further when needed.
Photo by Tyler Nix on Unsplash
Correlated Errors
Observations are not necessarily independent:
Time series: autocorrelated
Within-subjects designs: grouped
Other data may also be grouped
Medical data collected by hospital
Grouping Example
Example problem: measuring search engine effectiveness
Two search algorithms
Users issue queries
Results are assessed for relevance (experts or click logs)
Problem: results are not independent
Some queries are harder than others
Naïve Solution
Stratified Errors / Group Effects
Fixed and Random Effects
Basic Idea: Modeling Additional Effects
Random effects capture natural sources of variance shared between observations
Performance: system effectiveness + query difficulty
Remaining variance (residuals) hopefully i.i.d. normal!
Our linear model works again
This applies generally – if we know an effect in the variance structure, we can remove it (“control for” it).
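The idea of removing a known group effect can be illustrated without fitting a full mixed-effects model. The sketch below simply centers each score on its query's mean. The `results` table and its numbers are invented, and this centering is a simplified stand-in for the random-effects model discussed above, not a replacement for it.

```python
import pandas as pd

# Hypothetical search results: two systems evaluated on the same three queries.
# Query difficulty varies a lot; system B is slightly better on every query.
results = pd.DataFrame({
    "query": ["q1", "q1", "q2", "q2", "q3", "q3"],
    "system": ["A", "B", "A", "B", "A", "B"],
    "score": [0.10, 0.15, 0.50, 0.55, 0.90, 0.95],
})

# Estimate each query's effect as its mean score, then subtract it.
query_effect = results.groupby("query")["score"].transform("mean")
results["adjusted"] = results["score"] - query_effect

# Spread within each system before and after controlling for the query.
raw_spread = results.groupby("system")["score"].std()
adj_spread = results.groupby("system")["adjusted"].std()

# The system difference is unchanged, but the noise around it shrinks.
adj_means = results.groupby("system")["adjusted"].mean()
print(raw_spread, adj_spread, adj_means["B"] - adj_means["A"], sep="\n")
```

Before adjustment, query difficulty dominates the variance; afterwards, what remains is the difference between systems.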
When To Use Stratified Errors
Any time your data points come in groups
Multiple measurements for the same object
Trying to understand difference between measurements
If you want to understand differences between the objects themselves, they are fixed effects!
When you have an external feature shared between instances
Autocorrelation
Autoregression
Moving Average
ARIMA
AutoRegressive Integrated Moving Average: ARIMA(p, q, r)
AR(p) model
MA(r) model
Applied to q-order diffs (1 = diff, 2 = diff of diffs)
AR & I can be viewed as a type of feature engineering
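To make the feature-engineering view concrete, here is a minimal sketch that differences a toy series (the "I" step), builds a lagged feature (the "AR" step), and fits an ordinary least-squares model with NumPy. In the notation above this corresponds to p = 1, q = 1, r = 0; many references instead write the orders as (p, d, q) with d as the differencing order. There is no moving-average component, so this is not a full ARIMA fit, and the series is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy series: an upward trend whose steps alternate between +2 and +1.
y = pd.Series([1.0, 3.0, 4.0, 6.0, 7.0, 9.0, 10.0, 12.0, 13.0, 15.0])

# "I" step: first-order differencing removes the trend.
dy = y.diff()

# "AR" step: the previous difference becomes a feature for the current one.
features = pd.DataFrame({"lag1": dy.shift(1), "target": dy}).dropna()

# Ordinary least squares on [intercept, lag1].
X = np.column_stack([np.ones(len(features)), features["lag1"].to_numpy()])
coef, *_ = np.linalg.lstsq(X, features["target"].to_numpy(), rcond=None)
intercept, ar1 = coef

# Forecast the next difference, then undo the differencing.
next_diff = intercept + ar1 * dy.iloc[-1]
next_value = y.iloc[-1] + next_diff
print("intercept:", intercept, "AR(1) coefficient:", ar1, "forecast:", next_value)
```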
Integration with Prediction
Wrapping Up
We often have structured errors in a regression problem.
In some cases, modeling that structure yields a valid model.
Photo by Pawel Janiak on Unsplash
Resources
Publishing Projects
This video talks about going from an analysis and its notebooks to a publishable paper.
PUBLISHING PROJECTS
Learning Outcomes
Understand what is needed to go from analysis to publishable reports.
Outline a research paper.
Photo by Good Good Good on Unsplash
Publication Audiences
Data science document products can be for:
Collaborators
Decision-makers
Other organizations
Scientific community
Lay public
Formats
Written document (electronic or printed)
Presentation (live or recorded)
Interactive online demo/dashboard
Publication Goals
Reader needs to understand:
What you did
What you learned
What they should do / take away
(In general. Not everything is actionable.)
Typical Outline
Introduction
Background & Related Work
Methods
Results
Discussion
Conclusion
Variants:
Some communities put related work at the end
Institutional reports often lead with Executive Summary
1-2 page summary w/ key points
Discussion may be merged w/ Results or Conclusion
Methods may split
Rendering Plots
High-quality images
When practical: vector images (PDF good for LaTeX)
Otherwise: high-resolution images (at least 300dpi, 600 better)
Complex images can overwhelm PDF
Clean and clear labels and captions
Distinct colors, shapes, etc.
Experiment with dimensions & aspect ratio
E.g. 5”-wide image, scale down to 3.75” columns
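A minimal Matplotlib sketch of these recommendations follows: set the figure size explicitly, label everything, and save both a vector PDF and a 300 dpi PNG. The data, labels, and file names are placeholders, and the output goes to a temporary directory so the example does not overwrite anything.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt

# A 5-inch-wide figure; experiment with width and aspect ratio.
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o", label="example series")
ax.set_xlabel("Input size")
ax.set_ylabel("Measured value")
ax.legend()

out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "figure.pdf")
png_path = os.path.join(out_dir, "figure.png")

# Vector output when practical (works well with LaTeX).
fig.savefig(pdf_path, bbox_inches="tight")

# Otherwise a high-resolution raster image.
fig.savefig(png_path, dpi=300, bbox_inches="tight")
plt.close(fig)
```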
Step Outside Yourself
Forget you did the work and wrote the document.
Can you understand it? Could you reproduce it? Is info missing?
Applies everywhere:
Reports
Papers
README
Wrapping Up
Internal and external publications require special attention to writing and visual presentation.
Photo by Kari Shea on Unsplash
Resources
PlotNine is a good plotting library for preparing consistent, publication-ready graphics.
The book gender example also demonstrates the current evolution of my own practices for preparing for publication.
Production Applications
How do you put the results of your data science project into production?
PRODUCTION APPLICATIONS
Learning Outcomes
Understand how data science outcomes can be used.
Think about how to put models and outcomes into production.
Photo by C D-X on Unsplash
Using Data Science
Data-driven reports to inform decisions
Regular forecasts for internal purposes
Data science outputs for real-time decisions
Internal
Customer-facing
Reproducibility
Reproduction is crucial:
Regular reports / predictions re-run
Daily / weekly / monthly reports
Retraining models for online use
Online Use
Many modalities: web, mobile app, desktop app, server infra
Mobile & desktop often use web tech to connect to models
HTTP (often REST) API
Some low-latency exceptions
Multiple audiences
Internal reporting
Internal decision-making
Customer decision-making
Service-Oriented Architecture
Client
Web Server
Services
Databases
Deployment
Predictions made available via a web service
E.g. TensorFlow Serving
Model trained offline on other hardware
Model-training script saves trained model to disk
Web service loads trained model & serves predictions
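The train, save, load, and serve pattern can be sketched with nothing but the standard library. Everything here is a simplified assumption: the "model" is a one-variable least-squares line, it is stored as JSON, and `predict` stands in for the handler a real web service would expose. Production systems would typically use a serving framework and a proper model format.

```python
import json
import os
import tempfile


def train_model(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def save_model(model, path):
    """Offline step: write the trained parameters to disk."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    """Service start-up step: read the trained parameters back."""
    with open(path) as f:
        return json.load(f)


def predict(model, x):
    """What a request handler would compute for one input."""
    return model["slope"] * x + model["intercept"]


# Offline: train on (made-up) data and save a versioned model file.
model_path = os.path.join(tempfile.mkdtemp(), "model-v1.json")
save_model(train_model([0, 1, 2, 3], [1, 3, 5, 7]), model_path)

# Online: load the saved model once, then answer prediction requests.
served = load_model(model_path)
prediction = predict(served, 10)
print("prediction for x=10:", prediction)
```

Putting a version in the file name is one simple way to keep old models around for rollback.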
Useful Infrastructure Capabilities
Train models on live or freshly-exported data
Hold out test data to evaluate new model before deployment
Version your models (including retrain w/ same hyperparams!)
Roll back to old model version
Details depend on institution & infrastructure.
Skills to Learn
Web back-end programming (to build services)
Web front-end programming (to build dashboards)
Performance measurement & tuning
Wrapping Up
Many data science projects result in online production capabilities.
Often done by training a model and deploying it in a web service.
Photo by Massimo Botturi on Unsplash
Topics to Learn
This video goes over some useful topics to learn to fill out more of your data science education.
THINGS TO LEARN
Learning Outcomes
Know topics and software to study for expanding your data science skills.
Select the classes to take to complete your graduate degree.
Photo by Reinhart Julian on Unsplash
Machine Learning and Statistics
CS 534: Machine Learning
Math 562: Probability and Statistics II
Math 572: Computational Statistics
Math 573: Time Series Analysis
Working with Text
CS 536: Natural Language Processing
CS 537: Introduction to Information Retrieval
Read about NLP and IR
Application Areas
Social media: CS 539, read papers in ICWSM & CSCW
Information retrieval: CS 537 & 637
Recommendation & personalization: CS 538
Software Development
CS 573: Advanced Software Engineering
Study programming practices & software engineering
Practice, practice, practice
Think about your code’s readability & effectiveness
Advanced ML
CS 633: Deep Learning
Software: TensorFlow, PyTorch
“Modular differentiable programming”
Bayesian Inference
I’m a Bayesian. Mostly.
Book: Statistical Rethinking (McElreath)
Software: Stan or PyMC3
Causal Inference
Book: Counterfactuals and Causal Inference (Morgan & Winship)
ECON 522: Advanced Econometrics
Critical Perspectives
Book: Data Feminism (D’Ignazio & Klein)
Book: Data, Now Bigger And Better
Book: Introduction to Science and Technology Studies (Sismondo)
Keyword: Critical Data Studies
Wrapping Up
This class lays a foundation for you to integrate further knowledge.
Never stop learning new things.
Photo by Annie Spratt on Unsplash
General Tips
Some final closing tips and suggestions for you to think about as you take the next steps in your data science career.
GENERAL TIPS
Learning Outcomes
Some concluding tips for your data science work.
Photo by Sam Dan Truong on Unsplash
Questions
Good questions are fundamental
Tukey:
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
Photo by Jon Tyson on Unsplash
Work Reproducibly
For many, many reasons.
But accurate description is easier if work is reproducible (and internally reproduced).
Never Lose Context
Keep the big picture in mind
Goals — Questions — Analysis — Data
Keeping the goal in mind helps with:
Contextualizing results
Identifying limitations
Generating next ideas
Photo by Adrian Dascal on Unsplash
Don’t Overlook Detail
Be precise and specific in specifying questions and results
Always know what, precisely, you are measuring
Communicate precise results & definitions
Put them in context
Photo by dorota dylka on Unsplash
Be Curious
Learn more about statistics
Learn more about applications
Learn more about anything interesting
Photo by Siora Photography on Unsplash
Pause and Reflect
Photo by Lukasz Saczek on Unsplash
Wrapping Up
Good data science requires ongoing reflection, study, and practice.
Try to do better science tomorrow than you did yesterday.
Photo by Jan Tinneberg on Unsplash
Farewell
It's been grand!
I would love to hear more feedback on how to improve this material, because I hope to keep using it for a while.
Makeup Exam
The makeup midterm is due Saturday, Dec. 12 at 5:00 PM.
Grade Replacement
If you turn in the makeup exam to be graded, its grade will replace the lower of your Midterm A and B grades, even if that lowers your final grade.
Only turn it in if you think you did better than your worst normal midterm!
Assignment 7
Assignment 7 is due December 13, 2020.
Final Exam
The final exam will be released at 5PM on Monday, Dec. 14, and due at 5PM Thursday, Dec. 17.