Week 14 — Workflow

This week, we talk more about workflows: what does it look like to build a practical data science pipeline?

From Notebooks to Workflows (3m44s)
Scripts and Modules (15m33s)
Introducing Git (12m2s)
Weekly Quiz 14
Git for Data Science (6m52s)
Extract, Transform, Load (6m46s)
Split, Apply, Combine (6m45s)
Tuning Hyperparameters (10m49s)
Tuning Example
Reproducible Pipelines (8m28s)
More Examples
Assignment 7

This week's videos are also available as a Panopto playlist.

From Notebooks to Workflows

In this video, we introduce going beyond notebooks to broader structures for our Python projects.

Video

Slides

CS 533 Intro to Data Science, Michael Ekstrand. From Notebooks to Workflows.
Learning Outcomes (Week): Break code into scripts, modules, and notebooks. Design a data pipeline to run and reproduce an analysis. Use Git to version-control code. (Photo by Jessica Furtney on Unsplash)
Managing Code and Data
Notebooks are great: interactively test code; view results with code; good visualization capabilities; combine discussion, methods, and results.
Limitations of Notebooks: hard to reuse code from one notebook in another; not great for long-running tasks; limited running capabilities.
Moving Beyond: Scripts are Python programs that run on their own. Modules hold Python code to reuse elsewhere. [Diagram labels: Scripts, Notebooks, Other Modules]
Pipelines: [Diagram labels: Raw Source Data, Transform / Prepare (ETL), Prepared Data, Data Description, Inference, Findings, Modeling, Model + Predictions]
Wrapping Up: Significant data science projects usually have multiple components in a pipeline. Git is useful for tracking and versioning the code to generate these components. (Photo by Fran Jacquier on Unsplash)

Scripts and Modules

This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.

Video

Slides

Learning Outcomes: Write a Python script; put Python code in a module; understand the Python module/package structure. (Photo by Simon Goetz on Unsplash)
Scripts: A .py file can be run as a script from the command line: python my-script.py. This runs the code in the file; ‘def’, ‘class’, etc. are just Python statements. Example: read in a file, and write a filtered file. Starts with a docstring (optional).
    """Filter ratings to only real ones"""
    import pandas as pd
    ratings = pd.read_csv('ratings.csv')
    r2 = ratings[ratings['rating'] > 0]
    r2.to_csv('filtered-ratings.csv', index=False)
Scripts and Pipelines: A typical script reads input files; does some processing (Pandas manipulations, SciKit-Learn model training/evaluation); and saves results (data frame as CSV, Parquet, etc.; model as pickle file).
Docstrings: A Python code object (script, class, function, module) can start with a docstring. It documents the code: purpose, function arguments, class fields. Doc renderers & IPython/Jupyter use these.
Configurability: Scripts can take command line arguments: python script.py in.csv out.csv. They arrive in the list sys.argv; element 0 is the name of the program. Libraries help parse: argparse (in standard lib); docopt (uses help message).
    """Filter ratings to only real ones"""
    import sys
    import pandas as pd
    in_file = sys.argv[1]
    out_file = sys.argv[2]
    ratings = pd.read_csv(in_file)
    r2 = ratings[ratings['rating'] > 0]
    r2.to_csv(out_file, index=False)
Import Protection: Python files can be either run as a script or imported as a module. Import-protect your scripts to avoid potential problems & enable code reuse: put all code in functions; call main function in ‘if’ statement at end of script.
    """Filter ratings to only real ones"""
    import sys
    import pandas as pd

    def main():
        in_file = sys.argv[1]
        out_file = sys.argv[2]
        ratings = pd.read_csv(in_file)
        r2 = ratings[ratings['rating'] > 0]
        r2.to_csv(out_file, index=False)

    if __name__ == '__main__':
        main()
Modules: import foo looks for file foo.py in the script’s directory (or local dir for notebooks / console), in the PYTHONPATH environment variable, and in the Python installation. It runs the file to create definitions and exposes them under the ‘foo’ object: def bar()… becomes foo.bar. It exposes all assigned names: variables, functions, classes, other imports…
Packages: Modules can be grouped together into packages. A package is just a directory with a file __init__.py. The init file can be empty, or can have a docstring to document the package. Packages can contain other packages. Let's see an example…
Script Advice: Write a docstring (quickly glance at script to see purpose); with docopt, the docstring is also script usage information. Import-protect scripts. Provide reasonable configurability. If a script has too many different modes, break it apart: multiple scripts, with common code in modules.
Disconnected Runs: What if you lose connection? Can we start a process running, go home, and check it later? The tmux program does this! tmux creates a new session; Ctrl+b d (Ctrl+b followed by ‘d’) detaches; tmux attach re-attaches to the session. Many other capabilities under Ctrl+b.
General Principles: Use packages and modules to organize code for your project (layout; common utilities; presentation themes?). Always refer to relative paths (applies to all code!). Beware excessive configurability, in either functions or scripts. If there are multiple ways to combine pieces, extract the pieces & have different scripts or functions that combine them in different ways.
Wrapping Up: Scripts and modules are useful for organizing code in larger projects. We can reuse code and operations across multiple parts of the project. (Photo by Klára Vernarcová on Unsplash)
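The slides mention argparse as a standard-library option for parsing arguments. The following is a minimal sketch of the import-protected filter script rewritten that way. It is an illustration rather than the script from the video: it reads the CSV with the standard csv module instead of Pandas, and the function name filter_ratings, the default file names, and the missing-file check are assumptions made for the example.

```python
"""Filter ratings to keep only rows with a positive rating."""
import argparse
import csv
import os
import sys


def filter_ratings(in_file, out_file):
    """Copy rows whose 'rating' column is > 0; return the number of rows kept."""
    kept = 0
    with open(in_file, newline="") as src, open(out_file, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row["rating"]) > 0:
                writer.writerow(row)
                kept += 1
    return kept


def main(argv=None):
    """Parse command-line arguments (or an explicit list) and run the filter."""
    parser = argparse.ArgumentParser(description=__doc__)
    # Hypothetical defaults, mirroring the hard-coded names in the first slide example.
    parser.add_argument("in_file", nargs="?", default="ratings.csv",
                        help="input CSV with a 'rating' column")
    parser.add_argument("out_file", nargs="?", default="filtered-ratings.csv",
                        help="where to write the filtered CSV")
    args = parser.parse_args(argv)
    if not os.path.exists(args.in_file):
        print(f"input file not found: {args.in_file}", file=sys.stderr)
        return 1
    kept = filter_ratings(args.in_file, args.out_file)
    print(f"kept {kept} rows", file=sys.stderr)
    return 0


# Import protection: this only runs when the file is executed as a script.
if __name__ == "__main__":
    main()
```

Because the work lives in functions, another script or a notebook can import the file as a module and call filter_ratings directly, without any command-line handling.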

Resources

Python Modules
docopt, a very useful tool for processing command-line arguments
Environment Variables in glossary
LK Demo Experiment, which I used in the demo; this also uses DVC
tmux

Introducing Git

This video introduces version control with Git.

Video

Slides

Learning Outcomes: Use Git to save versions of scripts and notebooks; share code through GitHub; merge code changes from collaborators. This video introduces concepts; see links for hands-on learning. (Photo by Ula Kuźma on Unsplash)
Problems with Saving Files: You make a change that didn’t work and need to get back the old version; you want to save work to make sure you don’t lose it; you need the right version of the file on multiple computers; you want to share changes with collaborators.
Versions: Git is a tool for storing versions of software (a snapshot of the current state; a history of versions). A commit is a snapshot with a pointer to the previous commit(s). The chain of commits forms the history. You can go back to previous commits. Commits form the basis for sharing and merging changes.
Core Concepts: The working tree is your directory of files, ready to run or edit. The index is a staging area for changes to be committed. A repository stores the history: the local repository is in the .git directory in the working tree; remote repositories (e.g. GitHub) don't have working trees; your local copy has the entire history! A branch is a line of development: it points to a commit and is updated as you make new commits; the default branch is either ‘main’ or ‘master’.
Local and Remote: You have a local repository where you work and make changes. It can have configured remotes where you push and pull changes. GitHub is one service for hosting repositories; configure your GitHub repo(s) as remotes on your local repository. Other options include BitBucket and GitLab.
Operations: commit records the current version of your files; clone creates a repository & working tree by copying another; push sends commits (and files) to a remote repository, updating the remote branch to match the local branch; fetch retrieves data from a remote repository; merge updates one branch to include changes from another; pull updates the local branch to include the remote (fetch + merge).
Use Case: Tracking History: Work on your code and notebooks. Commit when you have a version you want to save; do this very frequently. It's useful for commits to successfully run. Result: local history to go back and recover old versions.
Use Case: Multiple Computers: Work on one machine; commit your changes; push to remote repo (e.g. GitHub); pull on the other machine and continue working. Significantly less error-prone than manually copying files. Cannot directly push between machines (can pull, though).
Use Case: Collaboration: Work, committing changes from time to time. Pull from the shared remote to get your collaborator's current work, merging if necessary. Push your work to the shared remote (you can only do this if your branch is current w/ remote changes). Collaborator pulls changes. Always commit before merging.
Ignoring Files: .gitignore files specify files to ignore. These are committed to Git, to share w/ others. Should ignore: editor temp files (e.g. ~, .bak, .swp, etc.); OS weird files (e.g. .DS_Store on Mac); Python bytecode cache (__pycache__ dir, .pyc/.pyo files); compiled files; most generated files (commit source and generators). In data science projects, you may store results of analysis (or even processing).
Tools / Interfaces: The git command-line tool: you’ll need to learn this, even if you primarily use other tools (running code on servers / clusters; sometimes you need to fix things). Integrated support in editor / IDE: I use VS Code for almost all code; it has very good Git support. Dedicated GUI like Tower, SourceTree, or GitKraken (free through GitHub Student Developer Pack).
Wrapping Up: Git allows you to record versions of your code to track history, roll back changes, and share with others. Commit early and often. (Photo by Jed Villejo on Unsplash)
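To make the "snapshot with a pointer to the previous commit" idea concrete, here is a toy in-memory model of a commit chain. It is a sketch for intuition only: the ToyRepo class and its methods are invented for this example, and Git's real storage format and commands work differently.

```python
import hashlib
import json


class ToyRepo:
    """A toy, in-memory model of commits and branches (not Git's real format)."""

    def __init__(self):
        self.commits = {}                # commit id -> commit record
        self.branches = {"main": None}   # branch name -> id of its latest commit

    def commit(self, files, message, branch="main"):
        """Record a snapshot of `files` (name -> text) on `branch`; return its id."""
        record = {
            "files": dict(files),
            "message": message,
            "parent": self.branches[branch],  # pointer to the previous commit
        }
        # The id is derived from the content, so identical records get identical ids.
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        commit_id = hashlib.sha1(payload).hexdigest()
        self.commits[commit_id] = record
        self.branches[branch] = commit_id     # the branch now points at the new commit
        return commit_id

    def log(self, branch="main"):
        """Follow parent pointers from the branch tip back to the first commit."""
        messages = []
        commit_id = self.branches[branch]
        while commit_id is not None:
            record = self.commits[commit_id]
            messages.append(record["message"])
            commit_id = record["parent"]
        return messages

    def checkout(self, commit_id):
        """Return the files exactly as they were in the given commit."""
        return dict(self.commits[commit_id]["files"])


repo = ToyRepo()
first = repo.commit({"analysis.py": "print('v1')"}, "first version")
repo.commit({"analysis.py": "print('v2')"}, "try a new idea")
print(repo.log())            # newest first
print(repo.checkout(first))  # the old version is still recoverable
```

The branch is just a name that moves forward with each commit, and every older snapshot stays reachable through the parent pointers.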

Resources

Git Resources (including my example .gitignore file)
Git Handbook
Resources to Learn Git
Version Control by Example
GitHub Student Developer Pack

Weekly Quiz 14

Take Quiz 14 in Blackboard.

Git for Data Science

How do you use Git effectively in a data science project?

Video

Slides

Learning Outcomes: Understand limits of git; ignore data files; know additional tools to look at for managing data files. (Photo by Lianhao Qu on Unsplash)
Git Strengths: Git is very good at tracking modestly-sized files (less than a few MB) and text files. It is not good for binary files (although small ones are ok), large files (especially binary), or hard-to-merge files (e.g. notebooks).
Ignoring Files: In a data science project, we often ignore more files: data files (e.g. *.csv, *.csv.gz); inputs, intermediate files, and large outputs. Keep the notebooks (and possibly other documents).
Method 1: Inputs + Recreate: Ignore data files. Include a script to fetch input data from a central server (file store, database). Reproduce intermediate files locally by re-running scripts. Optional: commit outputs / summaries. Optional: save results into database / shared repo. Good if analysis is relatively cheap.
Method 2: Data Repository: Ignore data files. Include scripts to fetch current inputs + intermediates from another server (file share, Amazon S3, etc.). Include scripts to update inputs on the other server. May commit: outputs; file versions. Can be bespoke or use standard tools (I use DVC).
Method 3: Large File Storage: Git LFS (Large File Storage) manages large media files with Git; it looks like they are committed. It commits a stub file with a pointer to the actual content. Content is stored on a separate server, not in the git repo. The stub is replaced with content on checkout. You may commit outputs in this mode! If someone changes them, re-commit and push. Caveat: if you use GitHub, limited space + bandwidth.
Notebooks: Notebooks are text, but are complex JSON: hard to compare, hard to merge, and they change when run! Two solutions, roughly: (1) commit as normal (merge by just taking one version; merge with nbdime; coordinate notebook edits); (2) commit without outputs (nbstripout filter).
Wrapping Up: Git works great for data science, but requires a few new tricks. Be thoughtful about how you handle data in Git. Notebooks can be annoying. (Photo by laura adai on Unsplash)
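To show why committing without outputs makes notebooks easier to compare, here is a small sketch that removes outputs from a notebook's JSON using only the standard library. It assumes the version 4 notebook format (a top-level "cells" list whose code cells carry "outputs" and "execution_count"), and it is a simplified illustration of the idea, not the nbstripout implementation; the function names are made up for this example.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and counters cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # drop results, plots, and tracebacks
            cell["execution_count"] = None   # drop the run counter that changes on every run
    return stripped


def strip_file(in_path, out_path):
    """Read a .ipynb file, strip its outputs, and write the result to out_path."""
    with open(in_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1)
        f.write("\n")
```

After stripping, re-running a notebook without editing its code leaves the stored file unchanged, so diffs show only real edits.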

Resources

NoteBook DIff and MErge (nbdime) — tools for diff/merge of notebooks. Available in Conda:
```
conda install nbdime
```
git-lfs

Extract, Transform, Load

The Extract, Transform, Load (ETL) pipeline is a common design pattern for data ingest. Sometimes it is adjusted to Extract, Load, Transform.

Video

Slides

Learning Outcomes: Use standard design patterns to think about data integration and transformation. (Photo by Dom Heartley on Unsplash)
Pipelines: [Diagram labels: Raw Source Data, Transform / Prepare (ETL), Prepared Data, Data Description, Inference, Findings, Modeling, Model + Predictions]
Stages of Transformation: Input: a source of initial, unprocessed data. Extract data (download, export, scrape, etc.). Transform data (into common format, initial cleaning, etc.). Load data into system for analysis (store in DB, save in file). Result: cleaned, integrated data ready for analysis or modeling.
Benefits of Design Patterns: A design pattern is a common structure for (software) design. It gives a common language for documenting & understanding software and a context for developing best practices, and it can benefit from automation support. Another example: SKLearn ‘fit’ design.
ETL in Context: Stand-alone projects: ETL may live in the repo, as one or more ETL scripts that save data ready for subsequent stages of analysis. Organizational resource: ETL may be on its own, as a pipeline to prepare data for use across the organization; many different projects use the results of the common ETL pipeline; data is “loaded” into a shared database (or data warehouse / data “lake”).
Variant: Extract, Load, Transform: Sometimes transformations are done in the database. Extract raw source data; load into (initial) database tables; transform in-DB (e.g. with SQL queries). Use layered schema design (load-side and transform-side tables).
ELT in Practice: Book Data Tools are a shared ELT pipeline. Extract from 6–8 input data sources; load into PostgreSQL; transform (w/ SQL queries) into integrated tables. Any project in the group can use this data.
Wrapping Up: Design patterns provide a common language for talking about software design. Extract, Transform, Load and Extract, Load, Transform are patterns for data pre-processing. (Photo by Arseny Togulev on Unsplash)
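The three ETL stages can be sketched as three small functions. This is a minimal illustration under stated assumptions: the raw data is a made-up in-memory CSV standing in for a downloaded file, the cleaning rules are invented for the example, and the "system for analysis" is an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data: one missing rating and one duplicated row.
RAW_CSV = """userId,movieId,rating
1,10,4.5
2,10,
3,11,3.0
3,11,3.0
"""


def extract(text):
    """Extract: read raw rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a rating, convert types, remove duplicates."""
    seen = set()
    clean = []
    for row in rows:
        if not row["rating"]:
            continue
        record = (int(row["userId"]), int(row["movieId"]), float(row["rating"]))
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


def load(records, conn):
    """Load: store the prepared records in a database table for later analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings "
        "(user_id INTEGER, movie_id INTEGER, rating REAL)"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT COUNT(*), AVG(rating) FROM ratings").fetchone())  # (2, 3.75)
```

In the ELT variant, the raw rows would be loaded into an initial table first, and the cleaning would be written as SQL queries that populate a second, transform-side table.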

Resources

ETL — Understanding It and Effectively Using It
ETL vs. ELT
Wikipedia article on ETL
The Week 7 Example uses an ELT design

Split, Apply, Combine

We've seen group-by operations this semester; they're a specific form of a general paradigm called split, apply, combine.

Video

Slides

Learning Outcomes: Use the split/apply/combine pattern to analyze and transform data. (Photo by Omar Flores on Unsplash)
GroupBy: ratings.groupby('movieId')['userId'].count() will split data by movie ID, apply the operation ‘count user IDs’, and combine results into a data frame (of movie rating counts). This is another pattern!
Split: groupby splits data by values of one (or more) columns. Each group results in a data frame. You can see this by iterating over a groupby: each iteration yields a grouping key & subset data frame.
Apply: In programming languages, you ‘apply’ a function to data. Pandas apply operations: agg applies an aggregate function (returns a single value); transform applies a 1:1 op (output size matches input size); apply applies an arbitrary function, which may return a value, series, or data frame, and should return the same type for every partition.
Combine: Pandas automatically combines results of the applied operation. Value → series indexed by grouping columns. Series → series indexed by grouping columns + result index. DataFrame → DF indexed by grouping columns + result index. Apply should return the same columns for every partition.
Why? Pandas takes care of split/combine bookkeeping. Code in a standard paradigm is easier to understand. It is trivial to parallelize: Dask parallelizes with the same API and runs the apply op on multiple processes or machines.
Related: Map/Reduce: Define two operations: Map transforms an element to key-value pairs; Reduce transforms a key & set of values to a single value (may have partial / repeated reduce, w/ values from previous). A map/reduce framework (e.g. Hadoop) parallelizes. Example: count ratings.
    import numpy as np

    def map(rating):
        yield rating.movieId, 1

    def reduce(id, counts):
        return np.sum(counts)
Wrapping Up: The split/apply/combine pattern lets us transform groups of data. It improves understandability, modifiability, and parallelism. (Photo by Erol Ahmed on Unsplash)
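The differences between iterating over groups, agg, transform, and apply are easier to see on a tiny data frame. The sketch below uses a made-up ratings table; the column names follow the groupby example in the slides, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical ratings data: three ratings for movie 10, two for movie 20.
ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2],
    "movieId": [10, 10, 10, 20, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Split: iterating over a groupby yields (grouping key, subset data frame) pairs.
group_sizes = {movie: len(frame) for movie, frame in ratings.groupby("movieId")}

# agg-style operations: one value per group, combined into a series indexed by movieId.
counts = ratings.groupby("movieId")["userId"].count()
means = ratings.groupby("movieId")["rating"].agg("mean")

# transform: one value per input row, so the result lines up with `ratings`.
centered = ratings["rating"] - ratings.groupby("movieId")["rating"].transform("mean")

# apply: an arbitrary function of each group's data (here, the range of ratings).
spread = ratings.groupby("movieId")["rating"].apply(lambda r: r.max() - r.min())

print(counts)
print(centered)
```

Here counts, means, and spread each have one entry per movie, while centered has one entry per original rating.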

Resources

Split, Apply, Combine at Pandas

Tuning Hyperparameters

How can we move beyond GridSearchCV in our quest to tune hyperparameters?

Video

Slides

Learning Outcomes: Apply different techniques to tune hyperparameters; understand the principle of random search. (Photo by Marten Newhall on Unsplash)
Selecting Hyperparameters: We need to pick good values for hyperparameters: regularization strength; number of trees in the forest; number of neighbors; latent space dimensionality.
Grid Search
Characteristics of Grid Search
Random Search
Why Does Random Search Work? Principle 1: we don’t need best, just good enough. Principle 2: more than one setting is probably good enough. If 5% of the search space is “good enough” and you sample 60 points, the probability you have at least 1 good-enough point is 95%.
Proof: Each independent random sample misses the good-enough region with probability 0.95, so all 60 miss with probability 0.95^60 ≈ 0.046, leaving about a 95% chance of at least one hit.
Random Search Summary: Only needs 60-100 points, regardless of # of parameters. Trivially parallelizable (like grid search). May not find best solution. Requires assumption about size of “good enough” set. SciKit Learn: RandomizedSearchCV.
Hyperparameter Search as Optimization
Bayesian Optimization: Tests model at a few initial points. Maintains a surrogate model to predict performance at new settings. Uses the model to pick next test points. Implemented by scikit-optimize: BayesSearchCV is a SciKit-compatible optimizer; gp_minimize is a general-purpose function minimizer.
Bayesian Optimization Characteristics: Trades off parallelism for optimization ability: next search point(s) depend on results so far; can batch searches (e.g. try 4 new points instead of 1). Useful for complex search spaces if random isn’t good enough. Can be more efficient than random w/ early stopping.
In a Workflow: We’ve been using CV to search while we run. I often use a hyperparameter search script: it runs tuning on training data and saves optimal parameter values to a file (e.g. JSON file); later scripts read & use the settings.
Wrapping Up: Hyperparameter tuning is an expensive optimization problem. Several techniques are useful, with good automation for scikit-learn. (Photo by Clem Onojeghuo on Unsplash)
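The 60-point rule of thumb and the "search script that saves settings to JSON" workflow can both be sketched with the standard library. This is an illustration under assumptions: toy_objective is a hypothetical stand-in for a cross-validated score, the parameter names alpha and beta are invented, and a real project would more likely use RandomizedSearchCV as mentioned in the slides.

```python
import json
import random


def prob_at_least_one_good(good_fraction, n_samples):
    """P(at least one of n independent uniform samples lands in the good region)."""
    return 1.0 - (1.0 - good_fraction) ** n_samples


def random_search(objective, space, n_samples, seed=0):
    """Sample settings uniformly from `space` (name -> (low, high)); keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_samples):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Hypothetical stand-in for a cross-validated score; best at alpha=0.3, beta=0.7."""
    return -((params["alpha"] - 0.3) ** 2 + (params["beta"] - 0.7) ** 2)


def save_params(params, path):
    """Save the chosen settings so later scripts can read and reuse them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)


# 5% of the space is good enough, 60 samples: about a 95% chance of at least one hit.
print(round(prob_at_least_one_good(0.05, 60), 4))  # 0.9539

best, score = random_search(toy_objective, {"alpha": (0.0, 1.0), "beta": (0.0, 1.0)}, 60)
print(best, score)
```

A tuning script would finish by calling save_params(best, ...) with a file name of your choosing, and the training script would load that file with json.load instead of re-running the search.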

Resources

Tuning Example

The Tuning Example notebook demonstrates hyperparameter tuning by cross-validation with multiple techniques.

Reproducible Pipelines

I provide very brief pointers to additional tools you may want for workflow management in more advanced projects.

Video

Slides

Learning Outcomes: Understand the value of a reproducible pipeline for both science and industrial application. Know where to read more about tools to help build and automate them. (Photo by Anne Nygård on Unsplash)
Reproducibility: Cornerstone of current scientific philosophy: a result only observable once is unlikely to be valid (or at least useful). Need to re-run with new data: update forecasts / models for the next month. Check for bugs and sensitivity: an end-to-end re-run catches order-of-operations bugs; re-running with new random seed(s) checks for seed-sensitivity. Helps ensure you actually did what you say you did.
Goal: Rerun the entire analysis end-to-end with a single command (a well-documented set of steps is an acceptable alternative), possibly with new data, software versions, or settings.
Requirements: What steps need to happen? What scripts or notebooks? What arguments? What order do they need to happen in? Optional: is a step up-to-date? Only recomputing out-of-date steps saves time, energy, money. Sounds like make?
Data Version Control (dvc): A pipeline has stages, with input files, output files, and a command to produce outputs from inputs. Stages are defined in DVC files committed to Git. An output-only stage just records the presence of a file.
DVC – Reproduce Stage(s): Checks if the stage is up-to-date (inputs, outputs, command match last recorded run). Re-runs the command if out of date. Recursively updates dependencies first. Like Make, but uses checksums instead of mtimes. Commits checksums to Git. Reproduce the entire pipeline by ensuring final stage(s) are current.
DVC – Manage Data: DVC also manages data. Stage files contain input/output checksums. Git ignores all outputs. DVC copies outputs to/from a data server (e.g. Amazon S3). Easy to ensure you have the current copy of the data.
DVC in Practice: Entire pipeline in DVC. Experiment with manual commands; save to DVC once I have the run figured out. Run expensive models on the university cluster; push results to the data server; pull to another machine for final analysis with notebooks. Easy to make sure we have current data (just as with Git).
Other Tools: MLflow; Make; Gradle (useful for Java-based environments); many others.
Wrapping Up: Fully reproducible data science pipelines help science and practice. Tools such as dvc can help you build them. (Photo by Possessed Photography on Unsplash)
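The "is this stage up to date?" check can be illustrated with checksums and a small state file. This sketch only shows the general idea of comparing inputs, outputs, and the command against the last recorded run. It is not how DVC is implemented, it does not update dependencies recursively, and the run_stage function and its state-file layout are invented for the example.

```python
import hashlib
import json
import os
import subprocess


def file_checksum(path):
    """Return the SHA-256 checksum of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_stage(command, inputs, outputs, state_file):
    """Run `command` (a list of arguments) unless the last recorded run still matches.

    Returns True if the command was run and False if the stage was up to date.
    """
    current = {
        "command": list(command),
        "inputs": {path: file_checksum(path) for path in inputs},
    }
    if os.path.exists(state_file) and all(os.path.exists(p) for p in outputs):
        with open(state_file, encoding="utf-8") as f:
            recorded = json.load(f)
        current["outputs"] = {path: file_checksum(path) for path in outputs}
        if recorded == current:
            return False  # command, inputs, and outputs all match the last run

    subprocess.run(command, check=True)

    # Record what we ran and what it produced, for the next up-to-date check.
    current["outputs"] = {path: file_checksum(path) for path in outputs}
    with open(state_file, "w", encoding="utf-8") as f:
        json.dump(current, f, indent=2)
    return True
```

Because the check compares file contents rather than modification times, touching a file without changing it does not trigger a re-run, which is the contrast with Make that the slides point out.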

Resources

Some software that supports data and/or workflow management:

Data Version Control — I use this
MLflow — support for machine learning workflows

More Examples

My book author gender project is an example of an advanced workflow with DVC.

Assignment 7

Assignment 7 is due December 13, 2020.