Week 14 — Workflow (11/29–12/3)

This week, we are going to talk more about workflows. What does it look like to build a practical data science pipeline?

🧐 Content Overview

🎥 From Notebooks to Workflows

🎥 Scripts and Modules

🎥 Introducing Git

🎥 Git for Data Science

🎥 Split Apply Combine

🎥 Tuning Hyperparameters

🎥 Reproducible Pipelines

📃 Software Environments

1068 words

📃 Yay Reproducibility

1250 words

This week has 1h11m of video and 2318 words of assigned readings. This week’s videos are available in a Panopto folder and as a podcast.

📅 Deadlines

  • Quiz 14, December 2

  • Assignment 7, December 12

This video introduces moving beyond notebooks to broader structures for our Python projects.

This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.
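
As a minimal sketch of that organization, the file below works both as an importable module and as a runnable script. The file name `summarize.py` and the functions `column_mean` and `main` are hypothetical, chosen only for illustration.

```python
"""summarize.py: a hypothetical module that can also be run as a script."""
import sys


def column_mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("column_mean requires at least one value")
    return sum(values) / len(values)


def main(argv):
    """Parse numbers from the command line and print their mean."""
    if not argv:
        print("usage: python summarize.py NUMBER [NUMBER ...]")
        return
    numbers = [float(arg) for arg in argv]
    print(f"mean = {column_mean(numbers)}")


# This guard is true only when the file is executed directly
# (for example `python summarize.py 1 2 3`), not when it is imported.
if __name__ == "__main__":
    main(sys.argv[1:])
```

A notebook or another script in the same directory could then reuse the function with `from summarize import column_mean`, without triggering the command-line code.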


This video introduces version control with Git.

How do you use Git effectively in a data science project?


The Extract, Transform, Load (ETL) pipeline is a common design pattern for data ingest. Sometimes it is adjusted to Extract, Load, Transform.
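
The three stages can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: the inline `RAW_CSV` data, the `temps` table, and the Fahrenheit-to-Celsius conversion standing in for the transform step.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read a file or an API.
RAW_CSV = """city,temp_f
Boise,68
Nampa,71
Boise,75
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and units (Fahrenheit to Celsius)."""
    return [
        (row["city"], round((float(row["temp_f"]) - 32) * 5 / 9, 1))
        for row in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS temps (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO temps VALUES (?, ?)", records)
    conn.commit()


# Run the stages in ETL order against an in-memory database.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM temps").fetchone()[0]
```

In the Extract, Load, Transform ordering, the raw rows would be loaded into the database first and the unit conversion would run afterwards, typically as a query inside the database.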

We’ve seen group-by operations this semester; they’re a specific form of a general paradigm called split, apply, combine.
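
To make the three steps explicit, here is a small sketch in plain Python that computes a per-group mean by hand. The `records` data is invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group, value) records, invented for illustration.
records = [
    ("a", 1.0),
    ("b", 4.0),
    ("a", 3.0),
    ("b", 6.0),
    ("c", 5.0),
]

# Split: partition the values by group key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply: compute a summary independently within each group.
applied = {key: mean(values) for key, values in groups.items()}

# Combine: assemble the per-group results into one sorted table.
combined = sorted(applied.items())
```

With pandas, assuming a DataFrame `df` with columns named `key` and `value`, the same computation collapses to `df.groupby("key")["value"].mean()`; the group-by performs the split and the combine, and the aggregate is the apply.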

How can we move beyond GridSearchCV in our quest to tune hyperparameters?
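
One commonly used alternative to an exhaustive grid is randomized search, which samples hyperparameter values from distributions. The sketch below uses scikit-learn's `RandomizedSearchCV`; whether the video covers this particular technique is an assumption, and the synthetic data, the choice of logistic regression, and the search range for `C` are all illustrative.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the sketch is self-contained.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Sample the regularization strength C from a log-uniform distribution
# instead of enumerating a fixed grid of values.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20,        # number of sampled configurations
    cv=5,             # 5-fold cross-validation for each configuration
    random_state=42,  # make the sampling repeatable
)
search.fit(X, y)

best_C = search.best_params_["C"]
best_score = search.best_score_  # mean cross-validated accuracy
```

The cost of the search is set by `n_iter` rather than by the size of a grid, so adding another hyperparameter to `param_distributions` does not multiply the number of model fits.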

📓 Tuning Example

The Tuning Example notebook demonstrates hyperparameter tuning by cross-validation with multiple techniques.

I provide very brief pointers to additional tools you may want for workflow management in more advanced projects.


Some software that supports data and/or workflow management:

📃 Software Environments

Read Software Environments.

📃 Reproducibility Case Study

Read my case study on reproducibility and bug-hunting.

🚩 Weekly Quiz 14

Take Quiz 14 in Canvas.

📓 More Examples

My book author gender project is an example of an advanced workflow with DVC.

📩 Assignment 7

Assignment 7 is due Sunday, December 12, 2021.