Week 14 — Workflow (11/28–12/2)#
In this week, we are going to talk more about workflows. What does it look like to build a practical data science pipeline?
🧐 Content Overview#
Element | Length |
---|---|
Video | 3m44s |
Video | 15m33s |
Video | 12m2s |
Video | 6m52s |
Video | 6m46s |
Video | 6m45s |
Video | 10m49s |
Video | 8m28s |
Reading | 1068 words |
Reading | 1250 words |
This week has 1h11m of video and 2318 words of assigned readings. This week’s videos are available in a Panopto folder.
📅 Deadlines#
Quiz 14, December 1
Assignment 7, December 11
🎥 From Notebooks to Workflows#
In this video, we introduce going beyond notebooks to broader structures for our Python projects.
- So in this video, we're going to talk about what we're really looking at this week,
- which is moving beyond what we've been doing in individual notebooks to having a workflow that crosses multiple modules,
- multiple files, and is version controlled. The learning outcomes for this week are for you to be able to break code into scripts, modules, and notebooks,
- to design a data pipeline, to run and reproduce an analysis, and to use Git to version-control your code.
- So, notebooks are great. We've been using them all semester and getting a lot of use out of them.
- They're great for interactively testing code. You can view results right with the code.
- The notebook is great for displaying charts and visualizations, and it can display pandas data structures very nicely.
- We can combine discussion, methods, and results into documents where the computational methods are right there with exactly what we're doing.
- But they don't scale terribly well. There are a few problems with them that we want to try to address.
- One is that it's hard to reuse code from one notebook in another.
- There are mechanisms to import a notebook as a module, but they're a little weird.
- They're also not great for long-running tasks. With a notebook, you kick off a job from the browser;
- you lose your Internet connection, you go home, whatever — it's not a great environment for running a long-running task.
- Those are better run in a Python script directly, without the notebook infrastructure.
- Also, you can run a notebook from the command line,
- but your ability to override options and things like that — to reuse the code in the sense of a program, not just functions that you reuse — is limited.
- So to move beyond this,
- what we're going to look at this week is being able to write scripts,
- which are Python programs that run on their own, and then to take our Python code —
- our functions, our classes, et cetera — and put them in modules that we can then reuse in our scripts, in our notebooks, and in our other modules.
- So in this context, we're going to be thinking about data pipelines.
- We've seen diagrams like this earlier: you've got some raw source data,
- and you have a data integration step that's going to get you some prepared data that you analyze.
- Then you're going to want to do some descriptive analysis. That's a great use for a notebook right there —
- you want to do descriptive analysis of the results of your data integration and transformation.
- You also want to be able to do some statistical inference, and some predictive modeling where you're generating predictions,
- classifications, etc. Maybe you're doing inference on their accuracy.
- So far, we would put all of this in one notebook.
- But in practice, in a lot of projects, you're actually going to want to split that apart so that you have different stages in their own files.
- You'll have a script, or more than one script, that will do your data transformation.
- You'll have a notebook that does data description. You'll have a script that runs one of your predictive models.
- You might have different scripts for different predictive models, etc.,
- so that you can rerun individual pieces and you don't have everything in one big file that's difficult to edit and maintain.
- So to wrap up: significant data science projects usually have multiple components in a pipeline.
- Git is really useful for tracking and versioning the code used to generate these components.
- In the rest of this week's videos, we're going to talk more about how to do these different pieces.
🎥 Scripts and Modules#
This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.
- In this video, we're going to talk about how to use Python scripts and modules to break our analysis apart into smaller pieces and organize our code.
- The learning outcomes are for you to be able to write a Python script, put Python code in a module, and understand the Python module and package structure.
- So a .py file can be run as a script from the command line.
- If we have a file like this, we can run it.
- If it's saved as my_script.py, we can run it with python my_script.py; on some systems,
- you might need to run it with python3 my_script.py. What it does is just run the code in the file from top to bottom.
- If you define a function — def and class in Python are actually just Python statements that define a
- function or a class and save the resulting function or class object in a variable.
- It runs a script from top to bottom. So this example here, it reads in a file,
- it filters it so we only have the rows where the rating is greater than zero, and then it saves the result back out to another file.
- It also starts with a docstring. A docstring is this string at the top.
- I'm using triple quotes, which allow us to have a multi-line string in Python — triple quotes delimit multi-line strings.
- The string at the beginning just tells us what the script is going to do:
- it's going to filter ratings to only real ones.
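A minimal sketch of what such a script might look like; the file and column names here are made up for illustration:

```python
"""
Filter the ratings file to only real ratings (rating > 0).
"""
import pandas as pd

# read the raw ratings (hypothetical file name)
ratings = pd.read_csv('ratings.csv')

# keep only the rows with a positive rating
good = ratings[ratings['rating'] > 0]

# save the filtered ratings for the next pipeline stage
good.to_csv('ratings-filtered.csv', index=False)
```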
- The script is also an example of the typical kinds of things that we usually do with scripts.
- A script is often going to read some input files, do some processing —
- it might do pandas manipulations, it might train a scikit-learn model and make some predictions,
- it might do a statistical inference — and then it's going to save the results. If the results are
- one or more data frames, save them in CSV files. I really like saving data frames in Parquet files, because they're
- more efficient to read and write.
- You can also take an entire scikit-learn model that you've trained and use a library called pickle to save it to a file on disk.
- And then the next stage of the pipeline —
- another script or a notebook — is going to read these outputs that you saved from this script and do something with them.
- For example, you might train a scikit-learn model, predict some test data, and save the results of that;
- then a notebook will load the test data and your predictions of it and compute your accuracy metrics, so that you can separate
- a perhaps very computationally intensive model training and prediction stage from analyzing the results of running your predictor.
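A hedged sketch of that save-and-reload pattern — Parquet for data frames, pickle for a fitted scikit-learn model; the toy data and file names are illustrative:

```python
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy frame standing in for the output of an earlier pipeline stage
train = pd.DataFrame({'x1': [0.1, 0.4, 0.35, 0.8], 'y': [0, 0, 1, 1]})

# save a data frame as Parquet (more efficient to read and write than CSV)
train.to_parquet('train.parquet')

# train a scikit-learn model and pickle it so a later stage can reuse it
model = LogisticRegression().fit(train[['x1']], train['y'])
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# a later script or notebook reloads them
train = pd.read_parquet('train.parquet')
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
```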
- So any Python code object — a script, a class, a function, a module — can start with a docstring.
- All it is, is a string
- at the beginning of the file, or at the beginning of the function or the class.
- What it does is document the code: its purpose, its arguments if it's a function; it might document class fields, et cetera.
- If you've used Java, it's the Python equivalent of Javadoc.
- Both documentation renderers such as Sphinx, and IPython and Jupyter —
- IPython is the Python engine that lives inside of Jupyter —
- use the docstrings when you ask them to document a particular function or class.
- They're also useful for scripts. Scripts can also take command-line arguments.
- So if we run this script with python script.py and we give it two command-line arguments, in.csv and out.csv,
- what it's going to do is pass in.csv
- as sys.argv[1], and it's going to pass out.csv as sys.argv[2].
- We can access them in our script, so that we can make a script that can do the same operation on different data files.
- And so if you have different data sets you want to do the same operation on,
- or maybe different models that you want to run, and you know how to run them given a name on the command line,
- this allows you to make scripts that are parameterized. You can use the same script code to do multiple different tasks.
- The sys.argv variable is in the sys module, so we import that.
- It's a list of command-line arguments. argv[0] is the name of the program, and then argv[1]
- and following are the actual command-line arguments that were passed to your script.
- It does not include any of the command-line arguments that were passed to the Python interpreter itself.
- Python strips those out and sets it up so that your program just sees its name and its arguments.
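For example, a small sketch of a script that takes its input and output files from sys.argv (file names are made up):

```python
"""
Filter a ratings file: python filter_ratings.py IN_FILE OUT_FILE
"""
import sys

import pandas as pd

# sys.argv[0] is the program name; [1] and onward are the arguments
in_file = sys.argv[1]
out_file = sys.argv[2]

ratings = pd.read_csv(in_file)
ratings[ratings['rating'] > 0].to_csv(out_file, index=False)
```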
- Then there are some libraries that help you parse command-line arguments and allow you to build very sophisticated command-line interfaces.
- One is argparse; it's in the Python standard library. Another that I use a lot is called docopt, and it actually uses your help message,
- written in your docstring, to define what options are available.
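A minimal sketch of how docopt is typically used; the usage text and option names here are made up, not the instructor's actual script:

```python
"""
Split data into partitions.

Usage:
    split_data.py [--partitions=<n>] INPUT

Options:
    --partitions=<n>  number of partitions to create [default: 5]
"""
from docopt import docopt

# docopt parses the docstring above to figure out the command-line interface
args = docopt(__doc__)
n_parts = int(args['--partitions'])
input_file = args['INPUT']
print('splitting', input_file, 'into', n_parts, 'partitions')
```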
- Another thing we need to do when we're writing a script is what we call import-
- protecting it, because Python files can either be run as a script or imported as a module,
- and it's a common convention to import-protect. What we do is put all of the code in a function —
- so I've moved all of our code into a main function here —
- and then at the end of the script, you have this kind of line: if __name__ == '__main__', then we call the main function.
- This __name__ is a Python magic variable
- that contains the name of the module that's currently being run or loaded.
- As a special case, if you run a Python file as a script, it sets the name to '__main__'.
- So this is how you detect that your file is being run as a script,
- and what it does is only actually run the code that's going to do your operations if the file is being run as a script.
- If it's not being run as a script, it's just going to define all of the functions. There are a couple of reasons for this.
- One is it allows you to just import a function from another script.
- I don't really recommend that — if two scripts need the same function, I recommend putting that in a module.
- But also, there are some situations where Python may need to re-import your script around certain parallelism techniques.
- I haven't taught you how to do any of them, but some libraries may use them,
- and so import-protecting your scripts just provides this extra protection in case you eventually wind
- up wanting to do something in your code that uses one of these techniques that requires it to be re-imported.
- It's standard practice, though: most Python scripts you're going to find in the wild, particularly in distributed software, are import-protected.
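Putting those pieces together, a minimal import-protected script might look like this (a sketch; file and column names are made up):

```python
"""
Filter ratings to only real ones.
"""
import sys

import pandas as pd


def main(in_file, out_file):
    ratings = pd.read_csv(in_file)
    ratings[ratings['rating'] > 0].to_csv(out_file, index=False)


if __name__ == '__main__':
    # only runs when the file is executed as a script, not when it is imported
    main(sys.argv[1], sys.argv[2])
```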
- So I've mentioned modules — what is a module? When you have the Python statement import foo,
- what it does is look for a file called foo.py, and it looks in a few different places.
- It first looks in the script's directory; or, if you're just running a Python interactive interpreter or a notebook,
- it looks in the notebook's current directory — for a console,
- it's going to look in your current working directory.
- It then searches the directories in an environment variable called PYTHONPATH.
- Environment variables are a mechanism for a process to have information about its environment and then to pass that on to child processes;
- I put just a little bit about them in the glossary online. It also then looks in your Python system directory.
- Then it runs this file to create its definitions — and it runs the whole file, because a Python
- file just runs, and all of your things are statements. def is a statement that defines a function; import is a statement that imports code.
- Then all of the definitions get exposed under the foo object.
- So if foo has a def bar that defines a function called bar,
- then it's available as foo.bar in the code that imports foo. It exposes all of our assigned names —
- variables, functions, classes, other imports. Any variable that's defined gets made available.
- There's no such thing as a truly private variable; the convention is that you prefix it with underscores.
- Anything that's defined in foo is available as foo.whatever. Python modules can also be grouped together into packages,
- so a package is just a directory with a file called __init__.py in it.
- This file can be empty. All it does is signal that the package is there.
- You can also put some code in it to set up some things in the package by default if you want,
- and it can have a docstring to document the package. Packages can contain modules and other packages.
- So if you import foo.bar, what it does is look for the foo module or the foo package —
- a foo directory with the __init__.py file in it — and then it looks for bar.py, or a bar directory with an __init__.py file in it, within that.
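A sketch of that layout; the function name is hypothetical, just to show how the pieces fit:

```python
# Illustrative layout: 'foo' is a package because it contains __init__.py.
#
#   foo/
#       __init__.py    # can be empty; it just marks foo as a package
#       bar.py         # a module inside the package, e.g. defining do_thing()
#
# Code that wants to use it can then write:
import foo.bar

foo.bar.do_thing()

# or, equivalently:
from foo import bar

bar.do_thing()
```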
- So let's see an example of this. This is a project I have for doing some experiments with recommender systems.
- The first file that I'm showing you is a script that I created that splits data.
- In its docstring, I give the usage to say how to run the script:
- you run split-data.py with a partition argument, and you can give it two options.
- docopt will parse this in order to figure out what options to parse out of the command line.
- Then I have my imports, and I have my main function, which takes the already-parsed arguments; the main function does all of the work.
- And then finally, at the end,
- we have the import protection: if __name__ == '__main__', we set up a logger,
- we parse the arguments, and we call the main function.
- I don't usually do very much there — I have no more than three lines or so in my import protection.
- But here, the main function takes the pre-parsed arguments already.
- Now in this project, I also have a directory called lkdemo that has a file __init__.py that's empty, and that makes
- lkdemo a package. Within that, I then have modules. I have the module log, which defines a couple of functions:
- a setup function to set up logging, and a script function that sets up logging for a script.
- So I've defined these functions in this module.
- And then over in split-data, what I did was, from lkdemo,
- I imported the datasets and log modules. So then I can call log.script and it's going to do that initialization process.
- I'm linking to this example code, which was prepared by me and some of my graduate students, in the resources.
- You can go see an actual example of how this kind of code gets laid out.
- So, a few pieces of advice for writing a script. First, always start with a docstring for your script.
- That way, you can quickly look at the top of the file and see what that script is supposed to do.
- Also, I like using docopt: I just write in my docstring how you actually run the script and what its options are,
- and docopt uses that as the ground truth for how to actually parse command-line arguments.
- I'd recommend always import-protecting your scripts.
- I recommend providing reasonable configurability — some options like, OK, I want to run on three different data sets,
- or maybe change a parameter like how many partitions of data to create.
- But if you wind up creating a lot of options that create a lot of different modes —
- so there are different code paths, and you've got a lot of extensive conditionals in order to figure out the right code path through the script —
- I often recommend breaking that into multiple scripts: put the common code in modules,
- and then for each different way you need to combine those functions that you defined in the modules,
- you can write a different script. If you put enough code in the module, then each script is very simple and straightforward,
- and that way you don't have nearly as much code complexity that's easy to break as you're doing future development and maintenance of your code.
- Another thing I want to mention briefly: we've got these scripts and we run them from the command line.
- We can also run them, then disconnect and leave them running on another computer —
- like if you're running on Onyx, or your research group
- has a computer that you can run things on, or you're running on an Amazon node.
- There's a program called tmux that creates a terminal. You log into a machine over SSH,
- you run tmux, and it starts your shell, but it's within tmux.
- So you can run programs in there; you can start your program running.
- You can then detach from tmux — hit Ctrl-B followed by D, and tmux will detach.
- You can log out; your tmux is still running, and the program you're running is still running.
- So then you can go home, you can log back into the machine, and you can run tmux again —
- tmux attach will reattach to an existing tmux session —
- and then you can check on your program.
- It also protects you if you lose your Internet connection: if you're just running a program over SSH and you lose your connection,
- it's going to stop. But if you run your program over SSH through tmux and
- you lose your connection, then when you connect again you can tmux attach, and the program will never know you disconnected.
- So it allows you to run your scripts in a much more robust fashion
- if you've got a script that's going to take a while.
- So, general principles I want to recommend: use packages and modules to organize code for your project.
- There are a variety of things that I put in there: code about how to go find other files, so that I have my file names defined in one place;
- maybe code about, OK, here are the data sets — that might be stored as a variable
- in one of my modules; common utility functions that I use throughout, like those logging functions —
- all of my scripts use those logging functions in order to set up the logging framework.
- I often wind up having a module that has code for doing plots and visualizations, particularly one that has the theme,
- so it's easy for me to have the same layout and the same ability to save images to disk, etc.,
- throughout all of my notebooks. Always refer to all of your files by their relative paths.
- You never want to have an absolute path in a script or in a notebook or in a module,
- because then if someone else is working with the code, or you just check it out in a different location or on a different computer,
- it's not going to run.
- Always have relative paths — relative to the top of the working directory or the top of your repository is usually where I have them from.
- If it's a notebook, the paths need to be relative to the notebook's location, so that you can move code from one place to another and still run it.
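For instance, a relative path built with pathlib (the file names are illustrative):

```python
from pathlib import Path

# good: a path relative to the repository root — works wherever the repo is checked out
ratings_file = Path('data') / 'ratings.csv'

# bad: an absolute path — breaks for anyone else, or on any other machine
# ratings_file = Path('C:/Users/me/project/data/ratings.csv')
```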
- Also, be careful about excessive configurability, either in functions or in scripts.
- If you've got too many different paths through a function or through a script,
- that's a good sign that you need to pull some code off into functions in a module, and make multiple functions or multiple scripts,
- each of which has one of those paths through the code. So to wrap up: scripts and modules are useful for organizing code in larger projects.
- We can reuse code and operations across multiple parts of the project.
Resources#
docopt, a very useful tool for processing command-line arguments
LK Demo Experiment, which I used in the demo; this also uses DVC
🎥 Introducing Git#
This video introduces version control with Git.
- So in this video, I want to introduce using Git to save versions of your project and share code with others.
- The learning outcomes are for you to be able to use Git to save versions of your scripts and notebooks,
- share code to GitHub, and merge code changes from collaborators. This video introduces the concepts;
- it's not hands-on, and I'm not going to walk through the specific details of how to do each operation.
- There are a lot of resources online for learning Git,
- and I expect a number of you probably already know it, although certainly not all of you.
- So I'm going to provide some links in the resources with this video to places where you can do hands-on learning of the details of Git.
- I'm going to be talking about the concepts that you need to know to put it all together.
- So we save files — we have scripts, we have notebooks; they're saved in files.
- But there are a few things we might want to be able to do. Maybe we made a change and
- it didn't work, and we want to get an old version of the file back. Maybe we want to make sure that we have the current version saved as we make changes,
- so we can go back. Maybe we want to make sure that we have the right version of the file on multiple computers:
- you've changed it at home and you've changed it at work, and you want to make sure you've got the same version everywhere.
- We also want to be able to share changes to our files and our projects with collaborators if we're working with others on a project.
- And so Git is a tool for storing versions of software — snapshots of the current state of your code —
- and it has a history of these versions: a first version, second version, et cetera.
- Git does this with what's called a commit.
- A commit is a snapshot of your current code, with a pointer to the previous commit — usually one,
- but it can be two if it's the result of merging two divergent branches of code back together. The chain of commits forms a history;
- you can go back to previous commits. Commits
- then form the basis for sharing and merging changes between multiple computers and multiple collaborators.
- A few of the core concepts of Git. First is the working tree, which is your directory of files, ready to run or edit.
- If you aren't using Git yet, you already have a working tree. Then Git has a repository that stores the history;
- this lives inside a .git directory in your working tree.
- There are also remote repositories, such as GitHub. You have an index, which is a staging area for changes to be committed:
- you've got your working tree, you stage files into the index, and then you commit them to the repository.
- You also then have a branch, which is a line of development that points to a commit.
- So if I've got a first commit A, and then I've got a second commit B, and a third commit C,
- and my branch main points to C — that says C is the tip of the branch, and it has its history going back through B and A. Merging happens when,
- maybe, somebody else makes a B-prime from A,
- and then I say, well, I want both B and B-prime, so I merge the changes, resolving any places where they conflict, and I get a merge commit on main.
- The branch is updated as you make new commits. So if you're on branch main and you make a new commit —
- a new commit D — then it's going to update main to point to D.
- So you have a local repository where you commit your changes.
- Your working tree has a .git directory, and that .git directory contains your local repository.
- You have a complete copy of the history locally. You can then have configured remotes where you push and pull changes.
- So if you're using GitHub, then GitHub will be a remote that your local repository is set up to push to. Other options besides GitHub
- include Bitbucket and GitLab. You can also run your own server to host Git repositories.
- A few of the operations you're going to need to be able to perform — and again,
- I'm going to refer you to the resources I'm linking online to learn more detail about each of these operations.
- The commit operation or command records the current version of your files and creates a new commit.
- Clone creates a repository and working tree by copying another;
- so if there's a repository on GitHub and you want to work on it, you clone it to your local computer.
- Push sends commits from your repository to a remote. So if you've made some changes and you want to share them —
- either push them to a remote repository so you can access them on another computer, or so that your collaborators can access them —
- you use push. You push a branch,
- and what that does is make sure that the remote repository has all the changes in your branch that you have.
- Fetch retrieves changes from a different remote repository; merge merges two lines of development together.
- So if you've made some changes, and someone else has made some changes and pushed them,
- you need to merge those changes together before you can push your code back up and update your common branch. Pull fetches and
- merges together, to update your current local branch to include the remote changes.
- So now let's talk through a few use cases. First, the simple use case is just tracking history in a repository on your local computer.
- You work on your code and notebooks, and you commit when you have a version you want to save.
- I recommend doing this very frequently — multiple times a day, possibly even multiple times an hour.
- That way you can always go back, and you have the security that if you accidentally make a mistake — you delete the wrong file — you can get it back.
- The result of this is you've got a local history to go back and recover old versions.
- I also sometimes see people go two or three days, a complete week's worth of work, without committing;
- I'd strongly recommend commit early and commit often. So, another use case:
- You've got multiple computers. You work on one machine,
- you make your changes, you push to a remote repository — maybe it's GitHub, maybe it's somewhere else.
- Then on the other machine, you pull to make sure you have the latest changes, and you continue working.
- This is significantly less error-prone than manually copying files, because Git is tracking your
- versions, and it knows whether the remote is more current than your current version or vice versa.
- If you're just copying files around, you have to keep track of which one is the latest and current version of the file, to make sure you
- don't accidentally copy an old file on top of a new one. Git, since it's chaining the commits together,
- knows old versus new, and it can also merge: if you made changes in both places at the same time,
- it can help you merge those together. So it's a much more reliable way to share code between multiple computers than copying files around.
- In this setup, you don't push directly between machines — I can't push from my laptop to my desktop;
- I push from my laptop to GitHub and then pull from GitHub to my desktop.
- You can pull directly — like, I could pull from my desktop to my laptop —
- but I don't do that; I always go through GitHub. Another use case, then, is collaboration.
- So you work, you commit changes from time to time, and when you're ready to share your work with your collaborator, you push to a remote
- you both have access to — maybe it's your GitHub repository. Excuse me — first you pull, because you need to make sure you have your collaborator's work;
- if they've pushed changes you don't have, you can't push — Git will say your branch isn't up to date, I can't push.
- And so you pull from your
- remote repository to get your collaborator's work.
- You merge if necessary. If you have tests, run the tests,
- make sure things still work, and then push your work, with the merged changes, to the shared remote.
- Now your collaborator can get them. They'll need to pull before they can push again.
- But you can only push if you have a current copy of everything that's on the remote,
- and so you have to do the pull and the merge before you can push. If you have an especially active collaborator,
- and it takes a little while to do your merge, then you might wind up: OK,
- I pulled, I did the merge, I'm ready to push — oh, they've pushed more code
- since I pulled, so I have to pull again. And then your collaborator can pull down the changes.
- One note here: always commit before merging.
- You can pull with uncommitted changes, but something can go wrong in the merge process. If you've committed everything before
- you start to try to merge your divergent lines of code, then you can always go back and try again if something goes wrong in the merge.
- Another thing you want to pay attention to is ignoring files.
- You don't want to commit every file to Git. A repository usually has a file called .gitignore to specify files to ignore.
- You can also have your own settings: you can have an ignore file that's applied to all repositories
- you work on — say, if you have a text editor that makes a particular kind of backup file,
- I recommend putting that in your personal ignore file. Some of the things you should ignore: editor temp files —
- Emacs, with backup mode, saves a file that ends in a tilde, and Vim creates swap files;
- you want to ignore those so you don't accidentally commit them. macOS likes to create files called .DS_Store;
- you never want to commit that to Git and share it — it's useless to your collaborators.
- Python creates a variety of temporary files
- that cache transformed versions of Python code.
- If your project compiles some code, you don't want to commit
- the compiled files, in part because you'd commit files compiled on a Mac and someone else is going to run on Linux.
- In general, if a file is generated from another file, you probably don't want to commit it.
- A key exception is notebooks, and in some data science projects
- you may store the results of analysis in Git. It's not a hard and fast rule, but
- if you can quickly and easily regenerate a file, it's often a good idea to
- just commit the source — Git is designed for tracking source code, not generated files.
- So these are some of the files you're often going to want to ignore. In the resources,
- I'm going to give you an example .gitignore file that's good to toss into a lot of Python data science projects.
- So, some of the interfaces and tools that you can use to work with Git. First, the Git command-
- line tool — the git command-line tool you can use in a Unix or Windows command shell.
- You're really going to need to learn the command-line tool, even if in most of your day-to-day work you use other tools,
- because if you're running code on a server, on a cluster, or something like that,
- you need to be able to at least pull, and probably make some changes and commit sometimes.
- Also, your repository can get into a state where you need to fix things that aren't easy to fix from a GUI —
- the GUIs are getting better and better about that,
- but occasionally I find myself needing to do repository surgery, or something is just faster on the command line.
- But the big reason you really need to learn it is that when you've got larger projects,
- you're probably going to be doing some running on a remote server, not just on your local machine with a graphical interface,
- and you really need the command line to be able to do that. There's good Git support in a lot of editors and IDEs —
- I use Visual Studio Code for a lot of my editing, and it has very good
- Git support. And then there are dedicated Git GUIs like Tower or Sourcetree or GitKraken.
- Some of those cost money — Tower and GitKraken
- both cost money, but they're both also available for free through the GitHub Student Developer Pack
- if you register with your university email address. To wrap up: Git allows you to record versions of your code, so you can track history,
- roll back changes, and share with others — and with yourself across multiple computers.
- I strongly recommend that you commit early and often to prevent lost work as you're working on your projects.
Resources#
Git Resources (including my example .gitignore file)
🎥 Git for Data Science#
How do you use Git effectively in a data science project?
- In this video, I'm going to talk with you a little bit about using Git specifically for data science projects.
- The learning outcomes are to understand some of the limits of Git, to ignore data files, and to know some additional tools to look at for managing data files.
- So, Git is very good at tracking modestly sized files — up to a few megabytes — and text files.
- It's not so good for binary files or large files, especially large binary files.
- It also has some difficulties with files that are hard to merge, such as notebooks.
- A notebook is stored in text, but its text is a lot of JSON; merging
- all of that is really touchy and easy to get wrong, so it requires a little special care. Committing your notebooks to Git is a really good idea —
- GitHub lets you view them inline online — but you have to take a little care if you're going to need to merge them.
- So first: in a data science project, we often ignore more files.
- A lot of times, the data files — input files, intermediate files, large output files — all of those we're going to ignore.
- So we're going to have ignore lines for CSV files, Parquet files, and the like. We're usually going to keep the notebooks, quite possibly other documents,
- and that may involve keeping outputs — the notebook file contains both the source code and the output.
- We may also store other notebook outputs, etc., in Git, just so that
- you can view the results without having to rerun everything when you check out the Git repository.
- But for dealing with these large input files — you've got your input file, it's maybe two gigabytes,
- and you're going to create a few hundred megabytes of output — there are a few methods.
- One is that you can just expect anyone working with the repository to recreate all the intermediate and output files.
- So you ignore all of your data files; you include either a script or instructions for how to fetch the input data —
- maybe it's a script that downloads data from a database or fetches it from a web site.
- Then you have scripts that reproduce the intermediate files, and you've documented how to run them;
- the README is a very good place to do this. You may commit outputs or summaries of the files,
- or you may save the results into a database or shared repository. If the analysis is relatively cheap, this can work well:
- fetch the project, make sure you have the current input data, rerun. But if the analysis is not so cheap —
- if you've got processes that can take a while to run in there —
- then you also want people to be able to get copies of the intermediate and output files.
- Maybe you've got a classification model that takes four hours to train.
- What you can do then is, again,
- ignore your data files, and include scripts that fetch both the current inputs and intermediate files from another server —
- maybe it's a file share on your network, maybe it's an Amazon S3 server or bucket.
- You also include scripts to update: if you update the input files, you update the intermediate files
- and store the current versions on the other server. You might again commit your outputs;
- you might commit information about the versions of your intermediate files.
- You can do this just by writing the scripts yourself, or you can use a tool that does a lot of it for you.
- I take this approach in a lot of my own work, using a tool called Data Version Control, which I'll talk about more later.
- And then the third method is to use Git Large File Storage (LFS). LFS is a system for managing large media files,
- and it makes them look like they're committed: when you're working with Git, they act like any other file that's committed.
- It's just that all that actually gets committed to Git is a short stub that says what the contents are supposed to be,
- and the actual file content gets stored on a separate server. Git LFS pushes and pulls that content to and from this separate server,
- and when you check out, it replaces the stub with the actual contents, so you have the large file.
- It works great for big files. You might wind up committing outputs if you use this,
- and if you change something, you recreate your new output files and re-push.
- One of the caveats to this is that if you run your own Git hosting,
- you can run your own Git LFS hosting server with all the storage you want,
- but if you use GitHub, their default accounts have limited space and bandwidth, and the pricing for expanding that can go up relatively quickly.
- So it's often not terribly cost-effective to use Git LFS on GitHub for a lot of large data science files, but Git LFS is an option.
- I want to talk just a little bit about notebooks. As I said, notebooks are text, but they're complex JSON,
- and it's hard to compare and merge them. Also, they change when you rerun them: it changes the images,
- which might not be bit-for-bit identical — you might be running with a slightly different version of the software,
- so it compresses differently. Also, Jupyter stores in the JSON how many times each cell has been executed.
- So there are roughly two solutions for dealing with this.
- One is to just commit as normal and merge by taking one version or another, or doing manual merges.
- There's a tool called nbdime — notebook diff and merge — that gives you support for actually merging notebooks.
- It's a little weird, but it does work; I have used it successfully. Also, you can coordinate notebook edits:
- if you're a small pool of collaborators, you can just coordinate — send a message on Slack, say, hey,
- I'm working on this notebook for a little while, maybe don't change it so we don't have a merge problem;
- when you're done, you push it, and the others just stay away. It kind of breaks a little bit of the freedom you usually have with Git,
- but notebooks are a little hard to merge.
- Another option that doesn't fully fix the problem, but makes merges easier, is to only commit the notebook without outputs.
- There's a program called nbstripout that strips the output from a notebook's content.
- You can wire it into Git so that any time you commit, you still have the outputs in the version in your working tree,
- but the version that actually gets committed is as if you ran "clear all outputs" before saving, and that can decrease the amount of conflict,
- because the only things that change are your textual descriptions and your source code.
- They're both options; which one I use really depends on which project I'm working on.
- In a lot of my projects, I commit as normal and use nbdime if I have a notebook merge situation.
- So, to wrap up: Git works great for data science, but it requires a few tricks.
- You need to be thoughtful in how you handle data, notebooks, things like that — Git works great,
- but there are a few things you need to pay attention to, and notebooks can be a little annoying.
Resources#
Notebook Diff and Merge (nbdime) — tools for diff/merge of notebooks. Available in Conda:
conda install nbdime
🎥 Extract, Transform, Load#
The Extract, Transform, Load (ETL) pipeline is a common design pattern for data ingest. Sometimes it is adjusted to Extract, Load, Transform.
- In this video, we're going to talk about the extract, transform, load pattern for handling data transformations and integration.
- The learning outcome is for you to be able to use standard design patterns to think about your data integration and transformation.
- So, we saw the pipeline at the beginning of the week:
- we have some raw source data, and we want to transform it into prepared data of some kind that's ready for further use,
- using the techniques that we've discussed earlier in the semester.
- It turns out that there are standard paradigms for thinking about how we structure that process,
- and it involves breaking it down into three stages.
- The extract, transform, load pattern takes as input a source of initial unprocessed data. The extract process or stage
- gets the data: it exports it from the database, it scrapes it from the web site, it downloads it from wherever the data comes from;
- it's how you get the actual raw data that you're going to be working with.
- Then you transform that data — maybe you're transforming data from multiple sources into a common format,
- you're integrating data, you're doing some initial cleaning, like deleting invalid records.
- And then finally, you load the data into a system for analysis.
- In the setups we've been talking about so far, a lot of times this is going to look like saving
- it in a file, like a Parquet file or CSV file, that you can then load in the later stages.
- But you might load it into a database. And the result is that you have cleaned and integrated data ready for analysis or modeling.
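A minimal sketch of what an ETL script along these lines might look like; the file and column names are hypothetical:

```python
import pandas as pd


def extract():
    # get the raw data: download it, export it from a database, scrape it, etc.
    return pd.read_csv('raw/source-data.csv')


def transform(raw):
    # clean and integrate: drop invalid records, normalize formats
    good = raw[raw['rating'] > 0].copy()
    good['title'] = good['title'].str.strip()
    return good


def load(prepared):
    # store the prepared data where the analysis stages can find it
    prepared.to_parquet('data/prepared.parquet')


if __name__ == '__main__':
    load(transform(extract()))
```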
- This may seem fairly straightforward,
- but it's what we call a pattern, and a design pattern is a common structure for design in general — for our purposes, for software design.
- And it accomplishes a few useful things.
- First, it gives us a common language for documenting and understanding software.
- So if you document, "here's my extract, my transform, my load for processing this data,"
- then others will know what you're talking about, because you're using familiar and standard language.
- It also provides context for developing best practices, because you can document, OK, here are good ways to do extraction,
- here are good ways to do transformation. It can also, in some cases, benefit from automation support:
- there are automated tools that provide extensive support for doing various types of extractions, transformations, and loads.
- Another example of a design pattern — it's a little bit between an interface and a design pattern —
- is the pattern we use in scikit-learn of a model: you fit the model, and it updates in place.
- We can think of that as a design pattern for machine learning models in Python.
- So in context, ETL may live inside a project — it may live in your repository,
- so you have one or more scripts to do your ETL stages —
- you might have separate ones, like an extract, a transform, and a load stage — that save the data in a format that's ready for subsequent stages of analysis.
- You don't have to do getting things into an easy-to-work-with format, deleting invalid records,
- or making sure you have tabular data in your train-a-predictive-model stage, because that's happened in your ETL,
- and you can do multiple things with your clean data. You may also, though,
- have a dedicated ETL pipeline project — a Git repository that just does your ETL pipeline — for
- contexts where the loaded data is going to be used in multiple different projects across an organization.
- So if you've got an organization where you're processing, say, some
- government records in a form that's going to be used for informed decision-making across the organization,
- then you might have an ETL pipeline that fetches the current version of the government
- records from the official source, does whatever transformations
- you need to do with them, and then loads them into a database so that other people in your organization can go and make use of that data.
- And maybe you rerun this once a day, once a week, once a month — whatever the right timeframe is —
- so that anyone in the organization who needs the data can just go to your database.
- Sometimes this is called a data warehouse; a data warehouse is basically a database that's often larger in capacity
- and not designed to support real-time, production-level systems — it's designed for more long-term querying and analysis.
- But anyone can go to the database and get the data as of the most recent import,
- and you've got this ETL pipeline that takes care of that, and you can rerun it from time to time as your source provides new data.
- A variant of ETL is extract, load, transform (ELT).
- Sometimes you're actually going to want to do the transformations in whatever place you're storing the data.
- An example of this is if you want to use SQL queries to do your data transformation:
- you extract your raw source data, then you load it directly into initial database tables, and then you use your database system —
- if it's a SQL database, you use SQL queries; if it's something else, you use its
- native data-processing capabilities — to do your transformations in the database.
- It's usually good to do a layered schema design, so it's clear what tables in your database get the data directly out of the load
- and what tables store the results of your transformations. But it's a very useful variant of ETL
- for situations where you've got the capacity and the desire to do your data transformations in a database.
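A small sketch of the ELT idea, using SQLite so the example is self-contained; the table and column names are made up:

```python
import sqlite3

import pandas as pd

# Extract: get the raw source data (file name is hypothetical)
raw = pd.read_csv('raw/ratings.csv')

# Load: put the raw rows into the database essentially as-is
con = sqlite3.connect('warehouse.db')
raw.to_sql('raw_ratings', con, if_exists='replace', index=False)

# Transform: use SQL inside the database to build the cleaned table
con.execute('DROP TABLE IF EXISTS ratings')
con.execute('''
    CREATE TABLE ratings AS
    SELECT user_id, movie_id, rating
    FROM raw_ratings
    WHERE rating > 0
''')
con.commit()
con.close()
```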
- The book data tools that I used as an example when we were talking about data integration
- are an example of an ELT pipeline, because we extracted from several data sources,
- we piped that data directly into PostgreSQL, in as close to a native format as we could,
- and then we used SQL queries to transform it into integrated tables.
- That has its own Git repository, and we have this database that has our book data;
- any student in my research group can use that same set of book data for whatever book-related projects they want to work on.
- And then if they need new data extractions,
- we'll add them to the transform layer in the core project, so that others can benefit from those changes as well.
- So to wrap up: design patterns provide a common language for talking about software design. Extract, transform, load and extract,
- load, transform are two patterns for doing the data preprocessing stage of a pipeline for a data science project.
Resources#
The 🎥 Week 7 Example uses an ELT design
🎥 Split, Apply, Combine#
We’ve seen group-by operations this semester; they’re a specific form of a general paradigm called split, apply, combine.
- In this video, we're going to talk about another pattern that's useful in building our data science pipelines: split, apply, combine.
- We've actually already seen it a little bit,
- but we're going to talk about split, apply, combine as a general pattern to analyze and transform data.
- You've seen group-by: we group some ratings by movie ID, and then we
- count the user IDs. Group-by is how you do split, apply, combine in pandas:
- we split the data by movie ID, we apply the operation (count user IDs), and then
- we combine the results into a data frame or a series of movie rating counts.
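In pandas, that looks something like this; the tiny data frame and column names are made up for illustration:

```python
import pandas as pd

# a tiny illustrative ratings frame
ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'userId': [10, 11, 10, 12, 13],
    'rating': [4.0, 3.5, 5.0, 2.0, 4.5],
})

# split by movie, apply a count to each group, combine into one series
movie_counts = ratings.groupby('movieId')['userId'].count()
print(movie_counts)
```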
- This is another pattern, and it's a pattern that has direct computational support:
- if you can fit your data transformation into split, apply,
- combine, then pandas will take care of a lot of the bookkeeping for you in order to do a complex data transformation.
- So the split stage splits, or partitions, a data frame into subsets; groupby is how you do this in pandas.
- You give it a column, and it's going to group by distinct values of that column.
- If you want to do something like group by a range of values, then you would create another column that has the original values rounded,
- so that each range is represented by one value, and then you can group by that. Each group results in a data frame.
- You can see this if you iterate over a group-by. If you iterate over a groupby of the data frame by a column,
- then you're going to get a tuple of a grouping key —
- the distinct values of your grouping columns that define that group —
- and a data frame that is the subset of the original data frame containing the entries for that group.
- You then have the apply stage. The name comes from programming language theory, where
- applying a function to data is what you call it when you call a function. Pandas provides several ways to do apply.
- One is agg — we've seen this many times. You give it an aggregate function, and an aggregate function returns a single value.
- Transform applies a one-to-one operation, so the output size matches the input size:
- if it gets 20 rows, it will return 20 rows. Mean-centering data would be an example of a transform, or subtracting a value would
- be an example of a transform. So you could make a split, apply, combine
- that mean-centers data
- by using transform: it takes a series, computes the mean of the series, subtracts the mean from all the values, and returns
- the result — that's going to be a transform. Apply applies an arbitrary function that
- can return anything: it may return a value, a series, or a data frame.
- The only thing is that it should return the same type for every partition — it shouldn't return a value sometimes and a series other times;
- it always needs to return the same kind of thing, but it can do any of the above.
- If you know you have an aggregate or a transform, it's better to use that; it's a little more efficient.
- Pandas actually calls the apply function twice on the first group —
- the first time to figure out what kind of value it's going to get — and then it goes and does the apply to all of the groups.
- Then — you've split the data into partitions, you've applied some operation to each partition —
- combine combines the results back into a final data structure. If your apply stage returns a value,
- then the combine stage of pandas is going to give you a series that's indexed by the grouping columns.
- If you return a series from the apply,
- then you're going to get a series that is indexed by the grouping columns and the index of the result series.
- And if your apply returns a data frame,
- you're going to get a data frame that's indexed by the grouping columns and then the index of your result data frame.
- Your apply data frame should generally return the same columns for every partition.
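A short sketch of the three apply variants on a toy data frame; the data and names are illustrative:

```python
import pandas as pd

ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'rating': [4.0, 3.5, 5.0, 2.0, 4.5],
})
by_movie = ratings.groupby('movieId')['rating']

# agg: one value per group -> a series indexed by the grouping column
means = by_movie.agg('mean')

# transform: one output row per input row (here, mean-centering the ratings)
centered = by_movie.transform(lambda r: r - r.mean())

# apply: an arbitrary function; returning a series per group yields a data frame
summary = ratings.groupby('movieId').apply(
    lambda g: pd.Series({'n': len(g), 'mean': g['rating'].mean()})
)
print(means, centered, summary, sep='\n\n')
```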
- Now, why do we think about this in these terms? If you formulate your transformation in terms of split, apply, combine,
- then first, pandas takes care of a lot of the bookkeeping: all you have to write is your apply stage, and pandas will do everything else.
- It's also easier to understand the code. You could write some loops to do some complex things,
- but then someone has to parse out what the loop is doing. If you write it in terms of split, apply, combine, then someone reading the code sees, oh,
- this applies a split, apply, combine operation, and it's easier for them to understand what you're doing.
- And split, apply, combine is not unique to pandas. Pandas' implementation is with groupby; R has support for split, apply,
- combine; many different platforms and many different environments have support for some kind of split, apply, combine operation.
- Also, it's trivial to parallelize, because there's no reason you have to do the groups one after another;
- you could apply to multiple groups in parallel. Pandas doesn't do this,
- but the package called Dask provides an API that's very similar to pandas —
- it's a subset of the pandas API —
- that does split, apply, combine, and it can run your apply in multiple processes or on multiple machines in a cluster.
- And by putting your code in the split, apply, combine paradigm, you can take advantage of that parallelism very easily.
- I want to talk very briefly about a related paradigm that's also useful for parallelism in some contexts.
- Pandas does not directly support it, but it's the basis of Hadoop. It's called MapReduce, and you define two operations.
- Map transforms a data instance into key-value pairs —
- one or more, or zero or more, actually — and then reduce transforms
- a key and a set of values for that key into a single value. You might need to reduce multiple times,
- because the system might have one machine do the mapping
- and reduce everything for a key on that machine before sending it on to another machine to reduce, so you don't have to transmit as much data over the network.
- And that can be done in parallel across very large-scale systems. If we want to count ratings again, what we would do is, in the map stage,
- we would yield the movie ID and the number one, and then in the reduce stage, we would sum up the counts — we're going to sum up all of these ones,
- and if one of the inputs is the result of another reduce, we're going to sum that up as well.
- So this is not code for any particular MapReduce framework;
- it's just to show you the flavor of how MapReduce works. It's an example of another pattern where, if you can make your code fit the pattern,
- there exist tools to automate and enhance your use of this pattern — for example, by parallelizing it.
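A toy sketch of the idea in plain Python — not any particular MapReduce framework, just the flavor of map and reduce for counting ratings:

```python
from collections import defaultdict

# tiny illustrative data: (movie id, user id, rating)
ratings = [(1, 10, 4.0), (1, 11, 3.5), (2, 10, 5.0), (2, 12, 2.0), (2, 13, 4.5)]


def map_rating(record):
    movie, user, rating = record
    yield movie, 1                      # emit (key, value): one count per rating


def reduce_counts(movie, counts):
    return movie, sum(counts)           # combine all values for one key


# a tiny driver standing in for the framework's shuffle/group step
grouped = defaultdict(list)
for record in ratings:
    for key, value in map_rating(record):
        grouped[key].append(value)

results = [reduce_counts(movie, counts) for movie, counts in grouped.items()]
print(results)   # [(1, 2), (2, 3)]
```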
- So to wrap up: the split, apply, combine pattern lets us transform groups of data, and fitting code into patterns like this
- improves the understandability, the modifiability, and in some cases the parallelism of our code.
Resources#
🎥 Tuning Hyperparameters#
How can we move beyond GridSearchCV in our quest to tune hyperparameters?
- Let's talk about tuning hyperparameters. The learning outcomes are for you to be able to apply different techniques
- to tune the hyperparameters of your model, and to understand the principle of random search.
- As we've seen with some of our models, we need to be able to pick good values for hyperparameters.
- That might be the regularization strength in a regularized linear model; it might be, in a random forest,
- the number of trees or the maximum depth of the trees in the forest; it might be the number of neighbors for a k-NN classifier;
- or, if we're doing something with matrix factorization or SVD, the dimensionality of the latent embedding space.
- So we need to pick good values for these. The general principle is that we try different values,
- we measure the effectiveness in some way — maybe the classifier accuracy,
- maybe the classifier area under the curve, or some other metric — on some tuning data,
- often with cross-validation, where we'll split the training data into multiple partitions.
- We've seen this with LogisticRegressionCV, where we try every combination of lists of values:
- we have some values for our regularization strength — lambda here, or C in
- scikit-learn's logistic regression — and some values for our L1 ratio rho.
- Then we compute the accuracy at each value and we pick the best, as we see it.
- It's built into LogisticRegressionCV.
- Then there's also the GridSearchCV class, which allows us to search any parameter of any scikit-learn model,
- so long as the model provides a score function that returns
- some kind of accuracy or performance score that it can try to optimize.
- So grid search has a couple of issues, but also a couple of advantages.
- It's simple, and it's trivial to parallelize if you have access to multiple processors.
- But it's expensive: if we've got n parameters, each with m values,
- then the number of combinations to test explodes combinatorially, particularly as we add more and more parameters.
- It also only tests selected values:
- here, where we're selecting rhos in intervals of 0.1 — 0.1, 0.2 — if the actual best value is 0.23, we're not going to find it.
- So random search is a way to speed up grid search.
- It doesn't help much with the problem that the best value may lie between the values you try,
- but it does help with the combinatorial
- explosion as we add more and more parameters to sample.
- The idea is that you randomly pick n points, either from a grid or from intervals that you define:
- you define your different parameters and a way to pick values
- for each hyperparameter to try, and you just pick n different combinations.
- You then measure the performance and use the best one.
- You can use n = 60; the default in scikit-learn is n = 10, and if you want to
- do some more, you can do n = 100. The idea of random search, though, is based on a couple of principles.
- First, the idea that we actually don't need the best hyperparameter values for most applications.
- We need good enough hyperparameter values —
- values that are going to get the model to perform well enough for our business or organizational purposes.
- Another principle is that more than one setting is probably good enough.
- And so if we have a search space here — let's say this axis is λ and this axis is ρ —
- and we have some region of the space that's good enough:
- if 5% of our space is good enough and we sample 60 points at random from the space
- (in this case uniformly at random,
- but the argument takes your probability distribution into account — 5% of the probability mass is good enough),
- then if you sample 60 points, you probably have at least one good enough point, and that probability is at least 0.95.
- Here's the reason that's true. Let G be the set of good enough points. Then, if you randomly pick a point,
- the probability of picking one that's good enough is 0.05.
- And this is our fundamental assumption:
- this approach to selecting how many random points you should try assumes that this is what it means
- for five percent of the space to be good enough —
- the probability of a randomly selected point, according to the distribution you're using, being good enough is 0.05.
- Then we say that S₆₀ is the event that 60 randomly selected points include at least one good enough point.
- Actually computing that probability directly is a little bit tricky, because you would need the probability of having one good
- enough point at any one of the positions, the probability of having two good enough points at any pair of positions, and so on.
- It's a fairly complex expression, but it turns out we can turn it around.
- So if there's at least one good enough point,
- that means it's not the case that there are no good enough points.
- So the complement of "there is at least one good enough point" is "there are no good enough points."
- And since the probability that a randomly selected point is not good enough is
- 0.95, we can just take the complement.
- So the probability of at least one good enough point is one minus the probability of no good enough points.
- And the probability of no good enough points is the product, over the 60 points, of the probability that each point is not good enough,
- because the only way none of the points can be good enough is if every point is not good enough.
- So this is the probability of "not good enough" to the 60th power, which is 0.95⁶⁰;
- punch that into a calculator and you get about 0.046, which is less than 0.05 (the full calculation is written out below the transcript).
- So with probability better than 0.95, at least one of your
- sixty randomly selected points is going to be good enough. So random search allows you to get away with only 60 to 100 points,
- regardless of the number of parameters (I sometimes go to 100 just because it lets you get away with a smaller fraction of the search space being good enough).
- It's trivially parallelizable, just like grid search, because you can parallelize over those hundred points.
- It may not find the best solution, and it requires this assumption about "good enough." scikit-learn does provide randomized search:
- RandomizedSearchCV works just like GridSearchCV, except you can give it distributions of points to try (a sketch comparing it with grid search appears after the transcript).
- And then another way to think of hyperparameter search is as optimization.
- When we're training or fitting a model, we're trying to find some parameters that minimize a loss function.
- In hyperparameter search we're really trying to do the same thing — we're trying to find hyperparameters that minimize a loss function —
- but the loss function is "cross-validate the model and compute a misclassification rate, or an accuracy metric, or whatever."
- When we're training a model, we're training one model:
- we have all this training data, and we can just look up a data point, see its actual outcome, and compare it with our prediction.
- But if we want to see the outcome of a
- hyperparameter setting, we have to cross-validate a model using that hyperparameter.
- That's very expensive. And it also has no derivative.
- And so a lot of the techniques that we use for optimizing models and solving this argmin problem don't work anymore,
- or they're prohibitively expensive. So there's a technique called Bayesian optimization that works by testing the model with a few
- initial points to get a starting sense of how accuracy varies over the search space.
- It then maintains what's called a surrogate model that tries to predict the performance of new, as-yet-unobserved hyperparameter settings.
- And it uses this surrogate model to pick the next points it wants to test.
- This allows it to be more targeted in its search than random search,
- and it can sometimes find better solutions. It's implemented by a package called scikit-optimize.
- It has BayesSearchCV, which works like RandomizedSearchCV or GridSearchCV (a sketch appears after the transcript).
- It also has a function called gp_minimize —
- a general-purpose minimizer, like the SciPy optimization functions we saw a while ago, that uses Bayesian optimization.
- It trades off parallelization for optimization ability: it might find better solutions,
- but the next search points depend on the results so far. You can do some batch searches rather than just trying one new point —
- you can say "try four" and then do those in parallel. It's useful for complex search spaces, if random search isn't good enough.
- Also, random search doesn't really have the ability to stop early, because the proof requires you
- to try the whole set of 60, unless you know the threshold for good enough.
- But with Bayesian optimization, since it's continually trying to improve, you can use early stopping to say,
- "okay, our last five runs haven't gotten us anything better, maybe we can go ahead and stop."
- Finally, I want to talk very, very briefly about using hyperparameter tuning in a workflow.
- So far we've just included it in our notebooks:
- we have a pipeline where the cross-validation is just part of the model-fitting process. That can be useful, especially for relatively simple models.
- But we might not want to redo the hyperparameter search every time, say, we update our data.
- We might want to do it on a less frequent basis. What I often do is have a script that does the hyperparameter search.
- And so it'll take the training data,
- it will cross-validate (or I'll use a tuning set), and it will do the hyperparameter search on that tuning set or cross-validation.
- And then at the end of that, it's going to save the optimal values it learned to a file, often a JSON file:
- "here are my parameter values." Then other scripts — my model training or
- my prediction script or whatever — can read those optimal values and use them to train
- the real model that I'm going to use for actually testing on my data (a sketch of this save-and-reload pattern appears after the transcript).
- It works great. So to wrap up: hyperparameter tuning is an expensive optimization problem.
- That's really what it is. Each sample is expensive, because it's costly to evaluate the loss function —
- you have to train models on cross-validated data — and it also has no derivative.
- And so a lot of the techniques that are used by other
- packages in the Python ecosystem to solve optimization problems don't work.
- But there are several techniques that are useful, with good automation for integration into scikit-learn.
Note
There is an error on slide 9. Where it says “≤ 0.5” it should say “≤ 0.05”.
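With the corrected threshold, the calculation from the video works out as follows (S₆₀ is the event that 60 randomly sampled points include at least one good enough point):

```latex
\begin{aligned}
P(\text{a random point is good enough}) &= 0.05 \\
P(\text{no good point among 60 draws}) &= (1 - 0.05)^{60} = 0.95^{60} \approx 0.046 \le 0.05 \\
P(S_{60}) = 1 - 0.95^{60} &\approx 0.954 \ge 0.95
\end{aligned}
```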
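As an illustration of the two search strategies discussed in the video, here is a minimal sketch using scikit-learn's GridSearchCV and RandomizedSearchCV. The synthetic dataset, parameter ranges, and iteration counts are illustrative assumptions, not recommendations.

```python
# Grid search vs. random search over an elastic-net logistic regression.
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)

# Grid search: tries every combination of the listed values (4 x 5 = 20 fits per fold).
grid = GridSearchCV(model, {
    "C": [0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0],
}, cv=5, n_jobs=-1)
grid.fit(X, y)

# Random search: samples n_iter combinations from distributions, so the cost
# stays fixed no matter how many parameters we add.
rand = RandomizedSearchCV(model, {
    "C": loguniform(1e-3, 1e2),
    "l1_ratio": uniform(0.0, 1.0),
}, n_iter=60, cv=5, random_state=42, n_jobs=-1)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```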
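And here is a sketch of the same search using Bayesian optimization via scikit-optimize's BayesSearchCV. This assumes the scikit-optimize package is installed; the search space and n_iter value are again illustrative.

```python
# Bayesian-optimization search with scikit-optimize.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from skopt import BayesSearchCV
from skopt.space import Real

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)

opt = BayesSearchCV(
    model,
    {
        "C": Real(1e-3, 1e2, prior="log-uniform"),
        "l1_ratio": Real(0.0, 1.0),
    },
    n_iter=32,   # points to evaluate; each one is a full cross-validation
    cv=5,
    random_state=42,
)
opt.fit(X, y)
print(opt.best_params_, opt.best_score_)
```

Because each new point depends on the results so far, the search is harder to parallelize than grid or random search, but it can be stopped early once the score stops improving.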
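Finally, a minimal sketch of the save-and-reload workflow described at the end of the video: a tuning script writes the best parameters to a JSON file, and the training script reads them back. The file name, script names, and parameter values here are hypothetical.

```python
# Sketch of splitting tuning and training into separate scripts via a JSON file.
import json

# --- tune.py (run occasionally, e.g. after major data changes) ---
# search = RandomizedSearchCV(...); search.fit(X_train, y_train)
best_params = {"C": 2.7, "l1_ratio": 0.4}   # stand-in for search.best_params_
with open("tuned-params.json", "w") as f:
    json.dump(best_params, f, indent=2)

# --- train.py (run every time the data updates) ---
with open("tuned-params.json") as f:
    params = json.load(f)
# model = LogisticRegression(penalty="elasticnet", solver="saga", **params)
# model.fit(X_train, y_train)
print(params)
```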
Resources#
📓 Tuning Example#
The Tuning Example notebook demonstrates hyperparameter tuning by cross-validation with multiple techniques.
🎥 Reproducible Pipelines#
I provide very brief pointers to additional tools you may want for workflow management in more advanced projects.
- In this video, I want to talk with you about how to make a reproducible data science pipeline.
- I'm not going to get into a lot of the details here —
- we're talking primarily about concepts, and I'm giving you pointers to some software I use that you might find useful and want to study on your own.
- So the learning outcomes are for you to understand the value of a reproducible pipeline for both scientific and industrial purposes,
- and to know where to read more about tools that help build and automate them.
- So, reproducibility. What we mean by reproducibility is that we can rerun the code and get the same conclusions and results out of it.
- Reproducibility and its related concept, replicability, are cornerstones of current scientific philosophy. The basic idea is that
- if you can only observe something once, it's hard to trust its validity; we need to be able to observe it multiple times —
- we need to be able to reproduce the observation. And also, from a practical perspective, in many cases we need to be able to rerun with new data.
- Maybe you're building a forecasting model for
- traffic at your business, and you need to update it based on the new data you collected this month and re-forecast for next month.
- The model already works well; you just need to be able to rerun it each month as new data comes in.
- You also need to be able to check for bugs and sensitivity. It's easy to have a stage where maybe some piece of your pipeline
- ran on an old version of the data, so your results aren't all on the same copy of the data.
- Maybe you reran things in different orders,
- and so your results are actually an artifact of the order in which you ran things.
- Maybe you have sensitivity to your random seed — unlikely, but it can happen. If you can rerun your whole analysis front to back,
- that ensures that your actual final conclusions are based on what you thought you did, because you reran all of the steps in order.
- So the high-level, ultimate goal for reproducibility is that you can rerun an entire analysis, end to end, with one command.
- I have achieved this in one or two of my projects, where there is one command: I run it on the R2 cluster, and
- a day later I have a fresh copy of the results, reproduced from the input data,
- including rerunning the notebooks and regenerating the figures. (That day doesn't include hyperparameter tuning —
- the hyperparameter tuning would take another week.) We want to be able to rerun the pipeline with new data, new software versions, new settings.
- A well-documented set of steps — like a README that says, "OK, do these five steps
- to reproduce the analysis" — also accomplishes a lot; the full one-script reproducibility
- is kind of the ultimate, easy "go rerun the whole thing."
- You can get a lot of the value of reproducibility just by documenting a few steps that a human can run.
- But this is the idea: you can rerun the analysis end to end,
- possibly on another computer, possibly with new data, and you're going to get the same conclusions.
- The requirements to be able to do that: first, you need to know what steps need to happen — what scripts or notebooks you need to run,
- what arguments you need to provide to those scripts, what their inputs and outputs are,
- and what order the steps need to happen in. And then there's an optional thing, for optimization: is my step up to date?
- So it might be that one of the steps hasn't changed, its inputs haven't changed,
- and its outputs are there, so you don't need to run it again. This might sound a lot like make —
- and in fact, you can build these kinds of pipelines with make, if you've used that tool before.
- I use a tool called Data Version Control (DVC) for this. In DVC, the pipeline is made up of stages, each of which has input files,
- output files, and the command that will produce the outputs from the inputs.
- And the stages are defined in DVC files that are committed to Git.
- You can also have a stage that only has outputs, and that basically just records the presence of a file.
- DVC can then reproduce stages, or check whether stages are up to date:
- it records the checksums of the inputs and outputs and the command, so it can see —
- has anything changed since the stage was last run? Am I missing an output file? Has one of my input files changed?
- And if anything's changed, it reruns the command to update the outputs. Before it does that, though, it first makes sure that all of the dependencies are current:
- for each input file, it looks for a stage that produces that file as an output,
- and if there is one, it first makes sure that stage is up to date, so that you're building off of current dependencies.
- And then all of these checksums get committed to Git as part of the
- DVC files that you store in Git: here were the input checksums,
- here was the command that ran, and here were the output checksums (a toy sketch of this checksum idea follows the transcript).
- Then you reproduce the entire pipeline by ensuring that the final stages —
- maybe your notebooks — along with all their dependencies, are up to date.
- And that's what the dvc repro command does. DVC can also help manage data, because it's committing these checksums to Git — it's got an MD5
- checksum of each data file.
- What you do is have Git ignore all the output files — Git is not going to manage them — but DVC will look at those output files,
- anything that's an output of a stage it knows about, and it will copy those outputs to and from a data server like Amazon S3.
- That makes it really easy to ensure you have a current copy of the data, because as long as you have the current DVC files —
- the DVC file says "I need the data output file with this checksum" — DVC knows how to
- go get that file from your data server and make sure you have a local copy of it.
- So all of the things that Git gives us for making sure we have the right version
- of the code, and for working across multiple machines when we're working with collaborators,
- DVC gives us for the data files. Some of my practice:
- I have my entire pipeline in DVC. I experiment with manual commands, and then once I have a script the way I want it,
- I create a DVC stage that's going to run it. And then I run the expensive models on
- one of the university clusters, either R2 or Borah.
- And so I run that, I come back, and I commit my DVC stages.
- I git push, and I dvc push to copy the output files to our data server.
- And then on another machine — either my desktop or our research group server —
- I will pull down those data files, and then I'll do my statistical analysis
- and I'll work with my notebooks and things. Notebooks are annoying to use on the cluster,
- so I just do the big models on the cluster. DVC saves the data and makes sure I know what version of the data I'm supposed to be working with.
- And then I git pull and dvc pull on our research group server —
- which isn't as powerful as the cluster, but it's good enough for Jupyter notebooks —
- and I make sure that I'm running on the current version of the results from the cluster.
- So it makes it easy to make sure that I've got the right version of my data files across all of my different machines,
- and I don't have to worry about accidentally copying a file in the wrong direction.
- There are other tools that can do similar things. There's a tool called MLflow, which is one of DVC's competitors.
- You can build your own pipelines out of make if you're always in a Unix environment.
- If you're doing a Java-based project, Gradle is a really useful integration tool.
- Before I switched all my source code to Python,
- a lot of my expensive models were in Java, and I was using Gradle for my automation at that point.
- There are many other tools as well. Some of them will just do the pipeline management, and you have to take care of the data files yourself;
- others will just do the data management, and you have to do the pipeline with another system.
- DVC does them together. There's a variety of options, but there are tools that can help you build this pipeline
- end to end. I'm providing a few links in the resources. So to wrap up: fully reproducible data science pipelines
- help science and practice, and tools like DVC and make can help you build such reproducible pipelines.
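To make the "is my step up to date?" check concrete, here is a toy sketch of the idea: record checksums of a stage's inputs, its outputs, and its command, and rerun the command only when something has changed. This is purely illustrative — it is not DVC's implementation or file format, and the file names in the usage comment are hypothetical.

```python
# Toy checksum-based stage runner, illustrating the idea behind `dvc repro`.
import hashlib
import json
import subprocess
from pathlib import Path

def md5(path):
    """Checksum of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def run_stage(command, inputs, outputs, state_file="stage-state.json"):
    # Load the checksums recorded the last time this stage ran, if any.
    state = json.loads(Path(state_file).read_text()) if Path(state_file).exists() else {}
    current_inputs = {p: md5(p) for p in inputs if Path(p).exists()}
    outputs_present = all(Path(p).exists() for p in outputs)

    if state.get("inputs") == current_inputs and state.get("command") == command and outputs_present:
        print("stage is up to date, skipping")
        return

    # Something changed (or never ran): rerun the command and record new checksums.
    subprocess.run(command, shell=True, check=True)
    Path(state_file).write_text(json.dumps({
        "command": command,
        "inputs": {p: md5(p) for p in inputs},
        "outputs": {p: md5(p) for p in outputs},
    }, indent=2))

# Hypothetical usage:
# run_stage("python prepare.py",
#           inputs=["data/raw.csv", "prepare.py"],
#           outputs=["data/prepared.parquet"])
```

DVC adds the pieces this sketch leaves out: chaining stages through shared inputs and outputs, storing the checksums in Git-tracked files, and syncing the data files themselves to remote storage.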
Resources#
Some software that supports data and/or workflow management:
Data Version Control — I use this
MLflow — support for machine learning workflows
📃 Software Environments#
Read software environments.
📃 Reproducibility Case Study#
📓 Example Script and Notebook#
You can find an example, with walkthrough of how to run it with the command line on GitHub CodeSpaces, in this example repo.
🚩 Weekly Quiz 14#
Take Quiz 14 in Canvas.
📓 More Examples#
My book author gender project is an example of an advanced workflow with DVC.
📩 Assignment 7#
Assignment 7 is due Sunday, December 11, 2022.