Week 14 — Workflow
This week, we talk more about workflows.
What does it look like to build a practical data science pipeline?
This week's videos are also available as a Panopto playlist.
From Notebooks to Workflows
In this video, we introduce going beyond notebooks to broader structures for our Python projects.
Scripts and Modules
This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.
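As a rough illustration of the script/module distinction, here is a minimal sketch of a single file that can be either imported as a module or run as a script. The file name `summarize.py`, the `summarize` function, and the command-line interface are all hypothetical and are not taken from the video.

```python
"""summarize.py: a hypothetical file usable as both a module and a script."""
import argparse
import statistics


def summarize(values):
    """Return the count, mean, and median of a sequence of numbers."""
    values = list(values)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def main(argv=None):
    """Parse command-line arguments and print a summary of the numbers given."""
    parser = argparse.ArgumentParser(description="Summarize a list of numbers.")
    parser.add_argument("numbers", nargs="*", type=float)
    args = parser.parse_args(argv)
    if not args.numbers:
        # Nothing to summarize: show usage instead of failing on empty input.
        parser.print_usage()
        return 0
    print(summarize(args.numbers))
    return 0


if __name__ == "__main__":
    # True only when the file is executed directly (python summarize.py 1 2 3),
    # not when another file runs `import summarize`.
    main()
```

Other code can then reuse the logic with `from summarize import summarize`, while `python summarize.py 1 2 3` runs it from the command line. Keeping the work in functions and the command-line handling in `main` is one common convention, not the only one.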
Resources
Introducing Git
This video introduces version control with Git.
Resources
Weekly Quiz 14
Take Quiz 14 in Blackboard.
Git for Data Science
How do you use Git effectively in a data science project?
Resources
The Extract, Transform, Load (ETL) pipeline is a common design pattern for data ingest.
Sometimes it is adjusted to Extract, Load, Transform.
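To make the three stages concrete, here is a minimal sketch using only the Python standard library. The CSV contents, the `books` table, and the function names are hypothetical illustrations; a real pipeline would read from files or an API and likely use more robust cleaning.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read a file or an API response.
RAW_CSV = """title,year,rating
Dune,1965,4.3
Neuromancer,1984,
Kindred,1979,4.5
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows (all strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and drop rows with a missing rating."""
    cleaned = []
    for row in rows:
        if not row["rating"]:
            continue
        cleaned.append((row["title"], int(row["year"]), float(row["rating"])))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books (title TEXT, year INTEGER, rating REAL)"
    )
    conn.executemany("INSERT INTO books VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT COUNT(*) FROM books").fetchone()[0])
```

In the Extract, Load, Transform variant, the raw rows would be loaded into the database first and the cleaning would happen afterwards, typically with SQL inside the database.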
Resources
Split, Apply, Combine
We've seen group-by operations this semester; they're a specific form of a general paradigm called split, apply, combine.
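The paradigm can be sketched without any library at all, which makes the three steps explicit. The data and function names below are hypothetical; this is an illustration of the idea, not how pandas implements group-by.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group key, value).
ratings = [
    ("fiction", 4.0),
    ("history", 3.0),
    ("fiction", 5.0),
    ("history", 4.0),
    ("poetry", 2.0),
]


def split(records):
    """Split: partition the values by their group key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def apply_combine(groups, func):
    """Apply func to each group, then combine the results into one dict."""
    return {key: func(values) for key, values in groups.items()}


mean_by_genre = apply_combine(split(ratings), mean)
print(mean_by_genre)
```

With the same data in a pandas data frame with `genre` and `rating` columns, `df.groupby("genre")["rating"].mean()` would express the same computation in one line.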
Resources
Tuning Hyperparameters
How can we move beyond GridSearchCV in our quest to tune hyperparameters?
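One common alternative to an exhaustive grid is random search, which samples configurations from the search space instead of enumerating every combination. The sketch below is a plain-Python illustration of that idea and is an assumption about what "beyond" might include, not necessarily the technique the video covers. The `toy_score` function is a hypothetical stand-in for a cross-validated model score, and the hyperparameter names `alpha` and `depth` are made up.

```python
import random


def random_search(score, space, n_iter=50, seed=0):
    """Sample n_iter random configurations and return the best one.

    score: callable taking a dict of hyperparameters and returning a number
           (higher is better), e.g. a mean cross-validation score.
    space: dict mapping each hyperparameter name to a function that takes
           a random.Random instance and returns a sampled value.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in space.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a CV score; highest at alpha=0.1, depth=5."""
    return -((params["alpha"] - 0.1) ** 2) - 0.01 * (params["depth"] - 5) ** 2


space = {
    "alpha": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform over [0.001, 1]
    "depth": lambda rng: rng.randint(2, 10),        # integer from 2 to 10 inclusive
}

best_params, best_score = random_search(toy_score, space, n_iter=200)
print(best_params, best_score)
```

In practice, `score` would fit a model and return its cross-validation metric. scikit-learn offers `RandomizedSearchCV`, which applies the same idea with an interface similar to `GridSearchCV`.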
Resources
Tuning Example
The Tuning Example notebook demonstrates hyperparameter tuning by cross-validation with multiple techniques.
Reproducible Pipelines
I provide very brief pointers to additional tools you may want for workflow management in more advanced projects.
Resources
Some software that supports data and/or workflow management:
More Examples
My book author gender project is an example of an advanced workflow with DVC.
Assignment 7
Assignment 7 is due December 13, 2020.