Week 14 — Workflow
This week, we talk more about workflows: what does it look like to build a practical data science pipeline?
This week's videos are also available as a Panopto playlist.
From Notebooks to Workflows
In this video, we look beyond notebooks to broader structures for organizing our Python projects.
Scripts and Modules
This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.
- Python Modules
- docopt, a very useful tool for processing command-line arguments
- Environment Variables in glossary
- LK Demo Experiment, which I used in the demo; this also uses DVC
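As a minimal sketch of the script pattern, here is a small command-line entry point using the standard library's argparse (docopt expresses the same options declaratively in a usage docstring). The script name, option names, and the EPOCHS environment variable are illustrative assumptions, not part of the demo experiment:

```python
import argparse
import os

def parse_args(argv=None):
    """Parse command-line options for a hypothetical training script."""
    parser = argparse.ArgumentParser(description="Train a model from a data file.")
    parser.add_argument("data_file", help="path to the input data file")
    parser.add_argument("--epochs", type=int,
                        # environment variable as a fallback default
                        default=int(os.environ.get("EPOCHS", "10")),
                        help="number of training epochs")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"training on {args.data_file} for {args.epochs} epochs")
```

The `if __name__ == "__main__":` guard is what lets the same file work both as a runnable script and as an importable module.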
Version Control
This video introduces version control with Git.
- Git Resources (including my example)
- Git Handbook
- Resources to Learn Git
- Version Control by Example
- GitHub Student Developer Pack
Weekly Quiz 14
Take Quiz 14 in Blackboard.
Git for Data Science
How do you use Git effectively in a data science project?
Notebook Diff and Merge (nbdime) — tools for diff/merge of notebooks. Available in Conda:
conda install nbdime
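Once installed, nbdime can be wired into Git so that notebook diffs are readable. A sketch of typical usage (the notebook filenames are hypothetical):

```shell
# One-time setup: register nbdime as Git's diff/merge driver for .ipynb files
nbdime config-git --enable --global

# Compare two notebooks directly from the command line
nbdiff old-analysis.ipynb new-analysis.ipynb

# After setup, ordinary git diff on .ipynb files goes through nbdime
git diff analysis.ipynb
```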
Extract, Transform, Load
The Extract, Transform, Load (ETL) pipeline is a common design pattern for data ingest. Sometimes it is adjusted to Extract, Load, Transform.
- ETL — Understanding It and Effectively Using It
- ETL vs. ELT
- Wikipedia article on ETL
- The Week 7 Example uses an ELT design
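The three ETL stages can be sketched in pure Python. This is a toy illustration, not a production pipeline; the sample data, column names, and SQLite target table are all made up:

```python
import csv
import io
import sqlite3

def extract(source):
    """Extract: parse raw CSV text into dict records."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: normalize names and convert types."""
    return [(r["name"].strip().title(), int(r["age"])) for r in rows]

def load(records, conn):
    """Load: insert cleaned records into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)

raw = "name,age\n alice ,34\n BOB,29\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
```

In an ELT design, the raw records would be loaded into the database first and the cleanup would run as queries inside it.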
Split, Apply, Combine
We've seen group-by operations this semester; they're a specific form of a general paradigm called split, apply, combine.
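The paradigm can be sketched in plain Python to make the three steps explicit; the sales data and field names are invented for illustration:

```python
from collections import defaultdict

def split_apply_combine(rows, key, value, apply_fn):
    """Split rows into groups by key, apply a function to each
    group's values, and combine the results into one dict."""
    groups = defaultdict(list)                            # split
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: apply_fn(v) for k, v in groups.items()}    # apply + combine

sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
totals = split_apply_combine(sales, "region", "amount", sum)
```

In pandas, the same computation is the familiar one-liner `df.groupby('region')['amount'].sum()`.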
Hyperparameter Tuning
How can we move beyond GridSearchCV in our quest to tune hyperparameters?
The Tuning Example notebook demonstrates hyperparameter tuning by cross-validation with multiple techniques.
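To sketch what "beyond grid search" means, here is a toy comparison of exhaustive grid search with random search. The parameter names and scoring function are made-up stand-ins for cross-validated model accuracy; in practice you would use scikit-learn's GridSearchCV/RandomizedSearchCV or a dedicated tuning library:

```python
import itertools
import random

def grid_search(param_grid, score_fn):
    """Exhaustively score every combination in the grid."""
    best = None
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        s = score_fn(params)
        if best is None or s > best[1]:
            best = (params, s)
    return best

def random_search(param_space, score_fn, n_iter=20, seed=42):
    """Sample n_iter random configurations instead of the full grid."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in param_space.items()}
        s = score_fn(params)
        if best is None or s > best[1]:
            best = (params, s)
    return best

# Toy objective with its peak at alpha=0.1, depth=3 (a stand-in for CV accuracy)
def score(p):
    return -(p["alpha"] - 0.1) ** 2 - (p["depth"] - 3) ** 2

space = {"alpha": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
best_params, best_score = grid_search(space, score)
```

Random search scales much better as the number of hyperparameters grows, since the grid's size is exponential in the number of parameters while the sample budget stays fixed.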
Advanced Tools
I provide very brief pointers to additional tools you may want for workflow management in more advanced projects.
Some software that supports data and/or workflow management:
My book author gender project is an example of an advanced workflow with DVC.
Assignment 7 is due December 13, 2020.