Week 14 — Workflow (11/28–12/2)#

This week, we talk more about workflows: what does it look like to build a practical data science pipeline?

🧐 Content Overview#

Element                            Length
🎥 From Notebooks to Workflows     3m44s
🎥 Scripts and Modules             15m33s
🎥 Introducing Git                 12m2s
🎥 Git for Data Science            6m52s
🎥 ETL                             6m46s
🎥 Split Apply Combine             6m45s
🎥 Tuning Hyperparameters          10m49s
🎥 Reproducible Pipelines          8m28s
📃 Software Environments           1068 words
📃 Yay Reproducibility             1250 words

This week has 1h11m of video and 2318 words of assigned readings. This week’s videos are available in a Panopto folder.

📅 Deadlines#

  • Quiz 14, December 1

  • Assignment 7, December 11

🎥 From Notebooks to Workflows#

In this video, we look at moving beyond notebooks to broader structures for our Python projects.

🎥 Scripts and Modules#

This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.
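As a minimal sketch of the distinction (not taken from the video), the file below works both as a module and as a script. The file name `summarize.py`, the function `summarize`, and the sample scores are hypothetical. Importing the file only defines the function; running it with `python summarize.py` also executes `main()`.

```python
# summarize.py (hypothetical file name)

def summarize(values):
    """Return the count, mean, minimum, and maximum of a list of numbers."""
    n = len(values)
    return {"n": n, "mean": sum(values) / n, "min": min(values), "max": max(values)}


def main():
    # Made-up sample data; a real script would read a file or arguments.
    scores = [3.5, 4.0, 2.5]
    for name, value in summarize(scores).items():
        print(f"{name}: {value}")


# This guard is true only when the file is run as a script,
# not when it is imported from a notebook or another module.
if __name__ == "__main__":
    main()
```

From a notebook in the same directory, `from summarize import summarize` would reuse the function without triggering the printing in `main()`.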

🎥 Introducing Git#

This video introduces version control with Git.

🎥 Git for Data Science#

How do you use Git effectively in a data science project?

🎥 Extract, Transform, Load#

The Extract, Transform, Load (ETL) pipeline is a common design pattern for data ingest. Sometimes it is adjusted to Extract, Load, Transform.
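The three stages can be sketched with the standard library alone. This is an illustrative example under stated assumptions, not the pipeline from the video: the inline CSV text, the `movies` table, and the function names are all made up, and an in-memory SQLite database stands in for the real destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read a file or an API response.
RAW_CSV = """title,year,rating
Alien,1979,8.5
Heat,1995,
Up,2009,8.3
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and drop rows with a missing rating."""
    return [
        (row["title"], int(row["year"]), float(row["rating"]))
        for row in rows
        if row["rating"]
    ]


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies (title TEXT, year INTEGER, rating REAL)"
    )
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT title, year, rating FROM movies ORDER BY year").fetchall()
print(loaded)
```

In the Extract, Load, Transform variant, the raw rows would be loaded into the database first and the cleaning would happen afterwards, typically in SQL.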

🎥 Split, Apply, Combine#

We’ve seen group-by operations this semester; they’re a specific form of a general paradigm called split, apply, combine.
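To make the three steps explicit, here is a small sketch in plain Python. The `(genre, rating)` data and the helper names `split` and `apply_combine` are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (genre, rating) observations.
ratings = [
    ("drama", 4.0),
    ("comedy", 3.0),
    ("drama", 5.0),
    ("comedy", 3.5),
    ("horror", 2.0),
]


def split(rows):
    """Split: partition the values into groups by key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def apply_combine(groups, func):
    """Apply func to each group, then combine the results into one dict."""
    return {key: func(values) for key, values in groups.items()}


genre_means = apply_combine(split(ratings), mean)
print(genre_means)
```

With the same data in a pandas data frame `df`, a group-by such as `df.groupby("genre")["rating"].mean()` performs all three steps in one expression.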

🎥 Tuning Hyperparameters#

How can we move beyond GridSearchCV in our quest to tune hyperparameters?

Note

There is an error on slide 9. Where it says “≤ 0.5” it should say “≤ 0.05”.

📓 Tuning Example#

The Tuning Example notebook demonstrates hyperparameter tuning by cross-validation with multiple techniques.
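The notebook itself is not reproduced here. As a rough, self-contained illustration of one technique beyond an exhaustive grid, the sketch below runs a random search scored by k-fold cross-validation. Everything in it is an assumption made for the example: the synthetic data, the one-dimensional ridge model, the search range for `alpha`, and the function names. In scikit-learn, `RandomizedSearchCV` plays the corresponding role.

```python
import random


def fit_ridge_slope(xs, ys, alpha):
    """Closed-form slope of a no-intercept 1-D ridge regression."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_mse(xs, ys, alpha, n_folds=5):
    """Mean squared error of the ridge fit, averaged over n_folds folds."""
    pairs = list(zip(xs, ys))
    fold_errors = []
    for fold in range(n_folds):
        # Points whose index is congruent to `fold` are held out for testing.
        train = [p for i, p in enumerate(pairs) if i % n_folds != fold]
        test = [p for i, p in enumerate(pairs) if i % n_folds == fold]
        slope = fit_ridge_slope([x for x, _ in train], [y for _, y in train], alpha)
        fold_errors.append(sum((y - slope * x) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / n_folds


def random_search(xs, ys, n_iter=20, seed=42):
    """Score n_iter alphas drawn log-uniformly from [1e-3, 1e3]; best first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_iter):
        alpha = 10 ** rng.uniform(-3, 3)
        trials.append((cv_mse(xs, ys, alpha), alpha))
    return sorted(trials)


# Synthetic data: y = 2x plus noise (made up for illustration).
data_rng = random.Random(0)
xs = [data_rng.uniform(-1, 1) for _ in range(50)]
ys = [2 * x + data_rng.gauss(0, 0.3) for x in xs]

trials = random_search(xs, ys)
best_mse, best_alpha = trials[0]
print(f"best alpha = {best_alpha:.4g}, CV MSE = {best_mse:.4f}")
```

Sampling on a log scale is a design choice: regularization strengths often matter by order of magnitude, so a fixed budget of trials covers the range more evenly than a linear draw would.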

🎥 Reproducible Pipelines#

I provide very brief pointers to additional tools you may want for workflow management in more advanced projects.

Resources#

Some software that supports data and/or workflow management:

📃 Software Environments#

Read Software Environments.

📃 Reproducibility Case Study#

Read my case study on reproducibility and bug-hunting.

📓 Example Script and Notebook#

You can find an example, with a walkthrough of how to run it from the command line on GitHub Codespaces, in this example repo.

🚩 Weekly Quiz 14#

Take Quiz 14 in Canvas.

📓 More Examples#

My book author gender project is an example of an advanced workflow with DVC.

📩 Assignment 7#

Assignment 7 is due Sunday, December 11, 2022.