Week 14 — Workflow (11/28–12/2)#
In this week, we are going to talk more about workflows. What does it look like to build a practical data science pipeline?
🧐 Content Overview#
Element | Length |
---|---|
Video | 3m44s |
Video | 15m33s |
Video | 12m2s |
Video | 6m52s |
Video | 6m46s |
Video | 6m45s |
Video | 10m49s |
Video | 8m28s |
Reading | 1068 words |
Reading | 1250 words |
This week has 1h11m of video and 2318 words of assigned readings. This week’s videos are available in a Panopto folder.
📅 Deadlines#
Quiz 14, December 1
Assignment 7, December 11
🎥 From Notebooks to Workflows#
In this video, we introduce going beyond notebooks to broader structures for our Python projects.
- So in this video, we're going to talk about what we're really looking at this week,
- which is moving beyond what we've been doing in individual notebooks to having a workflow that crosses multiple modules,
- multiple files, and is version controlled. The learning outcomes for this week are for you to be able to break code into scripts, modules, and notebooks,
- to design a data pipeline, to run and reproduce an analysis, and to use Git to version-control your code.
- So, notebooks are great. We've been using them all semester and getting a lot of use out of them.
- They're great for interactively testing code. You can view results right with the code.
- The notebook is great for displaying charts and visualizations, and it can display pandas data structures very nicely.
- We can combine discussion, methods, and results into documents where the computational methods are right there with exactly what we're doing.
- But they don't scale terribly well. There are a few problems with them that we want to try to address.
- One is that it's hard to reuse code from one notebook in another.
- There are mechanisms to import a notebook as a module, but they're a little weird.
- They're also not great for long-running tasks. With a notebook, you kick off a job from the browser;
- you lose your Internet connection, you go home, whatever — it's not a great environment for running a long-running task.
- Those are better run in a Python script directly, without the notebook infrastructure.
- Also, you can run a notebook from the command line,
- but your ability to override options and things like that — to reuse the code in the sense of a program, not just functions that you reuse — is limited.
- So to move beyond this,
- what we're going to look at this week is being able to write scripts,
- which are Python programs that run on their own, and then to take our Python code —
- our functions, our classes, et cetera — and put them in modules that we can then reuse in our scripts, in our notebooks, and in our other modules.
- So in this context, we're going to be thinking about data pipelines.
- We've seen diagrams like this earlier: you've got some raw source data,
- and you have a data integration step that's going to get you some prepared data that you analyze.
- Then you're going to want to do some descriptive analysis. That's a great use for a notebook right there —
- you want to do descriptive analysis of the results of your data integration and transformation.
- You also want to be able to do some statistical inference, and some predictive modeling where you're generating predictions,
- classifications, etc. Maybe you're doing inference on their accuracy.
- So far, we would put all of this in one notebook.
- But in practice, in a lot of projects, you're actually going to want to split that apart so that you have different stages in their own files.
- You'll have a script, or more than one script, that will do your data transformation.
- You'll have a notebook that does data description. You'll have a script that runs one of your predictive models.
- You might have different scripts for different predictive models, etc.,
- so that you can rerun individual pieces and you don't have everything in one big file that's difficult to edit and maintain.
- So to wrap up: significant data science projects usually have multiple components in a pipeline.
- Git is really useful for tracking and versioning the code used to generate these components.
- In the rest of this week's videos, we're going to talk more about how to do these different pieces.
🎥 Scripts and Modules#
This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.
- In this video, we're going to talk about how to use Python scripts and modules to break our analysis apart into smaller pieces and organize our code.
- The learning outcomes are for you to be able to write a Python script, put Python code in a module, and understand the Python module and package structure.
- So a .py file can be run as a script from the command line.
- If we have a file like this, we can run it.
- If it's saved as my_script.py, we can run it with python my_script.py; on some systems,
- you might need to run it with python3 my_script.py. What it does is just run the code in the file from top to bottom.
- If you define a function — def and class in Python are actually just Python statements that define a
- function or a class and save the resulting function or class object in a variable.
- It runs a script from top to bottom. So this example here, it reads in a file,
- it filters it so we only have the rows where the rating is greater than zero, and then it saves the result back out to another file.
- It also starts with a docstring. A docstring is this string at the top.
- I'm using triple quotes, which allow us to have a multi-line string in Python — triple quotes delimit multi-line strings.
- The string at the beginning just tells us what the script is going to do:
- it's going to filter ratings to only real ones.
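A minimal sketch of what such a script might look like; the file and column names here are made up for illustration:

```python
"""
Filter the ratings file to only real ratings (rating > 0).
"""
import pandas as pd

# read the raw ratings (hypothetical file name)
ratings = pd.read_csv('ratings.csv')

# keep only the rows with a positive rating
good = ratings[ratings['rating'] > 0]

# save the filtered ratings for the next pipeline stage
good.to_csv('ratings-filtered.csv', index=False)
```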
- The script is also an example of the typical kinds of things that we usually do with scripts.
- A script is often going to read some input files, do some processing —
- it might do pandas manipulations, it might train a scikit-learn model and make some predictions,
- it might do a statistical inference — and then it's going to save the results. If the results are
- one or more data frames, save them in CSV files. I really like saving data frames in Parquet files, because they're
- more efficient to read and write.
- You can also take an entire scikit-learn model that you've trained and use a library called pickle to save it to a file on disk.
- And then the next stage of the pipeline —
- another script or a notebook — is going to read these outputs that you saved from this script and do something with them.
- For example, you might train a scikit-learn model, predict some test data, and save the results of that;
- then a notebook will load the test data and your predictions of it and compute your accuracy metrics, so that you can separate
- a perhaps very computationally intensive model training and prediction stage from analyzing the results of running your predictor.
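A hedged sketch of that save-and-reload pattern — Parquet for data frames, pickle for a fitted scikit-learn model; the toy data and file names are illustrative:

```python
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy frame standing in for the output of an earlier pipeline stage
train = pd.DataFrame({'x1': [0.1, 0.4, 0.35, 0.8], 'y': [0, 0, 1, 1]})

# save a data frame as Parquet (more efficient to read and write than CSV)
train.to_parquet('train.parquet')

# train a scikit-learn model and pickle it so a later stage can reuse it
model = LogisticRegression().fit(train[['x1']], train['y'])
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# a later script or notebook reloads them
train = pd.read_parquet('train.parquet')
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
```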
- So any Python code object — a script, a class, a function, a module — can start with a docstring.
- All it is, is a string
- at the beginning of the file, or at the beginning of the function or the class.
- What it does is document the code: its purpose, its arguments if it's a function; it might document class fields, et cetera.
- If you've used Java, it's the Python equivalent of Javadoc.
- Both documentation renderers such as Sphinx, and IPython and Jupyter —
- IPython is the Python engine that lives inside of Jupyter —
- use the docstrings when you ask them to document a particular function or class.
- They're also useful for scripts. Scripts can also take command-line arguments.
- So if we run this script with python script.py and we give it two command-line arguments, in.csv and out.csv,
- what it's going to do is pass in.csv
- as sys.argv[1], and it's going to pass out.csv as sys.argv[2].
- We can access them in our script, so that we can make a script that can do the same operation on different data files.
- And so if you have different data sets you want to do the same operation on,
- or maybe different models that you want to run, and you know how to run them given a name on the command line,
- this allows you to make scripts that are parameterized. You can use the same script code to do multiple different tasks.
- The sys.argv variable is in the sys module, so we import that.
- It's a list of command-line arguments. argv[0] is the name of the program, and then argv[1]
- and following are the actual command-line arguments that were passed to your script.
- It does not include any of the command-line arguments that were passed to the Python interpreter itself.
- Python strips those out and sets it up so that your program just sees its name and its arguments.
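For example, a small sketch of a script that takes its input and output files from sys.argv (file names are made up):

```python
"""
Filter a ratings file: python filter_ratings.py IN_FILE OUT_FILE
"""
import sys

import pandas as pd

# sys.argv[0] is the program name; [1] and onward are the arguments
in_file = sys.argv[1]
out_file = sys.argv[2]

ratings = pd.read_csv(in_file)
ratings[ratings['rating'] > 0].to_csv(out_file, index=False)
```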
- Then there are some libraries that help you parse command-line arguments and allow you to build very sophisticated command-line interfaces.
- One is argparse; it's in the Python standard library. Another that I use a lot is called docopt, and it actually uses your help message,
- written in your docstring, to define what options are available.
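A minimal sketch of how docopt is typically used; the usage text and option names here are made up, not the instructor's actual script:

```python
"""
Split data into partitions.

Usage:
    split_data.py [--partitions=<n>] INPUT

Options:
    --partitions=<n>  number of partitions to create [default: 5]
"""
from docopt import docopt

# docopt parses the docstring above to figure out the command-line interface
args = docopt(__doc__)
n_parts = int(args['--partitions'])
input_file = args['INPUT']
print('splitting', input_file, 'into', n_parts, 'partitions')
```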
- Another thing we need to do when we're writing a script is what we call import-
- protecting it, because Python files can either be run as a script or imported as a module,
- and it's a common convention to import-protect. What we do is put all of the code in a function —
- so I've moved all of our code into a main function here —
- and then at the end of the script, you have this kind of line: if __name__ == '__main__', then we call the main function.
- This __name__ is a Python magic variable
- that contains the name of the module that's currently being run or loaded.
- As a special case, if you run a Python file as a script, it sets the name to '__main__'.
- So this is how you detect that your file is being run as a script,
- and what it does is only actually run the code that's going to do your operations if the file is being run as a script.
- If it's not being run as a script, it's just going to define all of the functions. There are a couple of reasons for this.
- One is it allows you to just import a function from another script.
- I don't really recommend that — if two scripts need the same function, I recommend putting that in a module.
- But also, there are some situations where Python may need to re-import your script around certain parallelism techniques.
- I haven't taught you how to do any of them, but some libraries may use them,
- and so import-protecting your scripts just provides this extra protection in case you eventually wind
- up wanting to do something in your code that uses one of these techniques that requires it to be re-imported.
- It's standard practice, though: most Python scripts you're going to find in the wild, particularly in distributed software, are import-protected.
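Putting those pieces together, a minimal import-protected script might look like this (a sketch; file and column names are made up):

```python
"""
Filter ratings to only real ones.
"""
import sys

import pandas as pd


def main(in_file, out_file):
    ratings = pd.read_csv(in_file)
    ratings[ratings['rating'] > 0].to_csv(out_file, index=False)


if __name__ == '__main__':
    # only runs when the file is executed as a script, not when it is imported
    main(sys.argv[1], sys.argv[2])
```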
- So I've mentioned modules — what is a module? When you have the Python statement import foo,
- what it does is look for a file called foo.py, and it looks in a few different places.
- It first looks in the script's directory; or, if you're just running a Python interactive interpreter or a notebook,
- it looks in the notebook's current directory — for a console,
- it's going to look in your current working directory.
- It then searches the directories in an environment variable called PYTHONPATH.
- Environment variables are a mechanism for a process to have information about its environment and then to pass that on to child processes;
- I put just a little bit about them in the glossary online. It also then looks in your Python system directory.
- Then it runs this file to create its definitions — and it runs the whole file, because a Python
- file just runs, and all of your things are statements. def is a statement that defines a function; import is a statement that imports code.
- Then all of the definitions get exposed under the foo object.
- So if foo has a def bar that defines a function called bar,
- then it's available as foo.bar in the code that imports foo. It exposes all of our assigned names —
- variables, functions, classes, other imports. Any variable that's defined gets made available.
- There's no such thing as a truly private variable; the convention is that you prefix it with underscores.
- Anything that's defined in foo is available as foo.whatever. Python modules can also be grouped together into packages,
- so a package is just a directory with a file called __init__.py in it.
- This file can be empty. All it does is signal that the package is there.
- You can also put some code in it to set up some things in the package by default if you want,
- and it can have a docstring to document the package. Packages can contain modules and other packages.
- So if you import foo.bar, what it does is look for the foo module or the foo package —
- a foo directory with the __init__.py file in it — and then it looks for bar.py, or a bar directory with an __init__.py file in it, within that.
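A sketch of that layout; the function name is hypothetical, just to show how the pieces fit:

```python
# Illustrative layout: 'foo' is a package because it contains __init__.py.
#
#   foo/
#       __init__.py    # can be empty; it just marks foo as a package
#       bar.py         # a module inside the package, e.g. defining do_thing()
#
# Code that wants to use it can then write:
import foo.bar

foo.bar.do_thing()

# or, equivalently:
from foo import bar

bar.do_thing()
```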
- So let's see an example of this. This is a project I have for doing some experiments with recommender systems.
- The first file that I'm showing you is a script that I created that splits data.
- In its docstring, I give the usage to say how to run the script:
- you run split-data.py with a partition argument, and you can give it two options.
- docopt will parse this in order to figure out what options to parse out of the command line.
- Then I have my imports, and I have my main function, which takes the already-parsed arguments; the main function does all of the work.
- And then finally, at the end,
- we have the import protection: if __name__ == '__main__', we set up a logger,
- we parse the arguments, and we call the main function.
- I don't usually do very much there — I have no more than three lines or so in my import protection.
- But here, the main function takes the pre-parsed arguments already.
- Now in this project, I also have a directory called lkdemo that has a file __init__.py that's empty, and that makes
- lkdemo a package. Within that, I then have modules. I have the module log, which defines a couple of functions:
- a setup function to set up logging, and a script function that sets up logging for a script.
- So I've defined these functions in this module.
- And then over in split-data, what I did was, from lkdemo,
- I imported the datasets and log modules. So then I can call log.script and it's going to do that initialization process.
- I'm linking to this example code, which was prepared by me and some of my graduate students, in the resources.
- You can go see an actual example of how this kind of code gets laid out.
- So, a few pieces of advice for writing a script. First, always start with a docstring for your script.
- That way, you can quickly look at the top of the file and see what that script is supposed to do.
- Also, I like using docopt: I just write in my docstring how you actually run the script and what its options are,
- and docopt uses that as the ground truth for how to actually parse command-line arguments.
- I'd recommend always import-protecting your scripts.
- I recommend providing reasonable configurability — some options like, OK, I want to run on three different data sets,
- or maybe change a parameter like how many partitions of data to create.
- But if you wind up creating a lot of options that create a lot of different modes —
- so there are different code paths, and you've got a lot of extensive conditionals in order to figure out the right code path through the script —
- I often recommend breaking that into multiple scripts: put the common code in modules,
- and then for each different way you need to combine those functions that you defined in the modules,
- you can write a different script. If you put enough code in the module, then each script is very simple and straightforward,
- and that way you don't have nearly as much code complexity that's easy to break as you're doing future development and maintenance of your code.
- Another thing I want to mention briefly: we've got these scripts and we run them from the command line.
- We can also run them, then disconnect and leave them running on another computer —
- like if you're running on Onyx, or your research group
- has a computer that you can run things on, or you're running on an Amazon node.
- There's a program called tmux that creates a terminal. You log into a machine over SSH,
- you run tmux, and it starts your shell, but it's within tmux.
- So you can run programs in there; you can start your program running.
- You can then detach from tmux — hit Ctrl-B followed by D, and tmux will detach.
- You can log out; your tmux is still running, and the program you're running is still running.
- So then you can go home, you can log back into the machine, and you can run tmux again —
- tmux attach will reattach to an existing tmux session —
- and then you can check on your program.
- It also protects you if you lose your Internet connection: if you're just running a program over SSH and you lose your connection,
- it's going to stop. But if you run your program over SSH through tmux and
- you lose your connection, then when you connect again you can tmux attach, and the program will never know you disconnected.
- So it allows you to run your scripts in a much more robust fashion
- if you've got a script that's going to take a while.
- So, general principles I want to recommend: use packages and modules to organize code for your project.
- There are a variety of things that I put in there: code about how to go find other files, so that I have my file names defined in one place;
- maybe code about, OK, here are the data sets — that might be stored as a variable
- in one of my modules; common utility functions that I use throughout, like those logging functions —
- all of my scripts use those logging functions in order to set up the logging framework.
- I often wind up having a module that has code for doing plots and visualizations, particularly one that has the theme,
- so it's easy for me to have the same layout and the same ability to save images to disk, etc.,
- throughout all of my notebooks. Always refer to all of your files by their relative paths.
- You never want to have an absolute path in a script or in a notebook or in a module,
- because then if someone else is working with the code, or you just check it out in a different location or on a different computer,
- it's not going to run.
- Always have relative paths — relative to the top of the working directory or the top of your repository is usually where I have them from.
- If it's a notebook, the paths need to be relative to the notebook's location, so that you can move code from one place to another and still run it.
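For instance, a relative path built with pathlib (the file names are illustrative):

```python
from pathlib import Path

# good: a path relative to the repository root — works wherever the repo is checked out
ratings_file = Path('data') / 'ratings.csv'

# bad: an absolute path — breaks for anyone else, or on any other machine
# ratings_file = Path('C:/Users/me/project/data/ratings.csv')
```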
- Also, be careful about excessive configurability, either in functions or in scripts.
- If you've got too many different paths through a function or through a script,
- that's a good sign that you need to pull some code off into functions in a module, and make multiple functions or multiple scripts,
- each of which has one of those paths through the code. So to wrap up: scripts and modules are useful for organizing code in larger projects.
- We can reuse code and operations across multiple parts of the project.
Resources#
docopt, a very useful tool for processing command-line arguments
LK Demo Experiment, which I used in the demo; this also uses DVC
🎥 Introducing Git#
This video introduces version control with Git.
- So in this video, I want to introduce using Git to save versions of your project and share code with others.
- The learning outcomes are for you to be able to use Git to save versions of your scripts and notebooks,
- share code to GitHub, and merge code changes from collaborators. This video introduces the concepts;
- it's not hands-on, and I'm not going to walk through the specific details of how to do each operation.
- There are a lot of resources online for learning Git,
- and I expect a number of you probably already know it, although certainly not all of you.
- So I'm going to provide some links in the resources with this video to places where you can do hands-on learning of the details of Git.
- I'm going to be talking about the concepts that you need to know to put it all together.
- So we save files — we have scripts, we have notebooks; they're saved in files.
- But there are a few things we might want to be able to do. Maybe we made a change and
- it didn't work, and we want to get an old version of the file back. Maybe we want to make sure that we have the current version saved as we make changes,
- so we can go back. Maybe we want to make sure that we have the right version of the file on multiple computers:
- you've changed it at home and you've changed it at work, and you want to make sure you've got the same version everywhere.
- We also want to be able to share changes to our files and our projects with collaborators if we're working with others on a project.
- And so Git is a tool for storing versions of software — snapshots of the current state of your code —
- and it has a history of these versions: a first version, second version, et cetera.
- Git does this with what's called a commit.
- A commit is a snapshot of your current code, with a pointer to the previous commit — usually one,
- but it can be two if it's the result of merging two divergent branches of code back together. The chain of commits forms a history;
- you can go back to previous commits. Commits
- then form the basis for sharing and merging changes between multiple computers and multiple collaborators.
- A few of the core concepts of Git. First is the working tree, which is your directory of files, ready to run or edit.
- If you aren't using Git yet, you already have a working tree. Then Git has a repository that stores the history;
- this lives inside a .git directory in your working tree.
- There are also remote repositories, such as GitHub. You have an index, which is a staging area for changes to be committed:
- you've got your working tree, you stage files into the index, and then you commit them to the repository.
- You also then have a branch, which is a line of development that points to a commit.
- So if I've got a first commit A, and then I've got a second commit B, and a third commit C,
- and my branch main points to C — that says C is the tip of the branch, and it has its history going back through B and A. Merging happens when,
- maybe, somebody else makes a B-prime from A,
- and then I say, well, I want both B and B-prime, so I merge the changes, resolving any places where they conflict, and I get a merge commit on main.
- The branch is updated as you make new commits. So if you're on branch main and you make a new commit —
- a new commit D — then it's going to update main to point to D.
- So you have a local repository where you commit your changes.
- Your working tree has a .git directory, and that .git directory contains your local repository.
- You have a complete copy of the history locally. You can then have configured remotes where you push and pull changes.
- So if you're using GitHub, then GitHub will be a remote that your local repository is set up to push to. Other options besides GitHub
- include Bitbucket and GitLab. You can also run your own server to host Git repositories.
- A few of the operations you're going to need to be able to perform — and again,
- I'm going to refer you to the resources I'm linking online to learn more detail about each of these operations.
- The commit operation or command records the current version of your files and creates a new commit.
- Clone creates a repository and working tree by copying another;
- so if there's a repository on GitHub and you want to work on it, you clone it to your local computer.
- Push sends commits from your repository to a remote. So if you've made some changes and you want to share them —
- either push them to a remote repository so you can access them on another computer, or so that your collaborators can access them —
- you use push. You push a branch,
- and what that does is make sure that the remote repository has all the changes in your branch that you have.
- Fetch retrieves changes from a different remote repository; merge merges two lines of development together.
- So if you've made some changes, and someone else has made some changes and pushed them,
- you need to merge those changes together before you can push your code back up and update your common branch. Pull fetches and
- merges together, to update your current local branch to include the remote changes.
- So now let's talk through a few use cases. First, the simple use case is just tracking history in a repository on your local computer.
- You work on your code and notebooks, and you commit when you have a version you want to save.
- I recommend doing this very frequently — multiple times a day, possibly even multiple times an hour.
- That way you can always go back, and you have the security that if you accidentally make a mistake — you delete the wrong file — you can get it back.
- The result of this is you've got a local history to go back and recover old versions.
- I also sometimes see people go two or three days, a complete week's worth of work, without committing;
- I'd strongly recommend commit early and commit often. So, another use case:
- You've got multiple computers. You work on one machine,
- you make your changes, you push to a remote repository — maybe it's GitHub, maybe it's somewhere else.
- Then on the other machine, you pull to make sure you have the latest changes, and you continue working.
- This is significantly less error-prone than manually copying files, because Git is tracking your
- versions, and it knows whether the remote is more current than your current version or vice versa.
- If you're just copying files around, you have to keep track of which one is the latest and current version of the file, to make sure you
- don't accidentally copy an old file on top of a new one. Git, since it's chaining the commits together,
- knows old versus new, and it can also merge: if you made changes in both places at the same time,
- it can help you merge those together. So it's a much more reliable way to share code between multiple computers than copying files around.
- In this setup, you don't push directly between machines — I can't push from my laptop to my desktop;
- I push from my laptop to GitHub and then pull from GitHub to my desktop.
- You can pull directly — like, I could pull from my desktop to my laptop —
- but I don't do that; I always go through GitHub. Another use case, then, is collaboration.
- So you work, you commit changes from time to time, and when you're ready to share your work with your collaborator, you push to a remote
- you both have access to — maybe it's your GitHub repository. Excuse me — first you pull, because you need to make sure you have your collaborator's work;
- if they've pushed changes you don't have, you can't push — Git will say your branch isn't up to date, I can't push.
- And so you pull from your
- remote repository to get your collaborator's work.
- You merge if necessary. If you have tests, run the tests,
- make sure things still work, and then push your work, with the merged changes, to the shared remote.
- Now your collaborator can get them. They'll need to pull before they can push again.
- But you can only push if you have a current copy of everything that's on the remote,
- and so you have to do the pull and the merge before you can push. If you have an especially active collaborator,
- and it takes a little while to do your merge, then you might wind up: OK,
- I pulled, I did the merge, I'm ready to push — oh, they've pushed more code
- since I pulled, so I have to pull again. And then your collaborator can pull down the changes.
- One note here: always commit before merging.
- You can pull with uncommitted changes, but something can go wrong in the merge process. If you've committed everything before
- you start to try to merge your divergent lines of code, then you can always go back and try again if something goes wrong in the merge.
- Another thing you want to pay attention to is ignoring files.
- You don't want to commit every file to Git. A repository usually has a file called .gitignore to specify files to ignore.
- You can also have your own settings: you can have an ignore file that's applied to all repositories
- you work on — say, if you have a text editor that makes a particular kind of backup file,
- I recommend putting that in your personal ignore file. Some of the things you should ignore: editor temp files —
- Emacs, with backup mode, saves a file that ends in a tilde, and Vim creates swap files;
- you want to ignore those so you don't accidentally commit them. macOS likes to create files called .DS_Store;
- you never want to commit that to Git and share it — it's useless to your collaborators.
- Python creates a variety of temporary files
- that cache transformed versions of Python code.
- If your project compiles some code, you don't want to commit
- the compiled files, in part because you'd commit files compiled on a Mac and someone else is going to run on Linux.
- In general, if a file is generated from another file, you probably don't want to commit it.
- A key exception is notebooks, and in some data science projects
- you may store the results of analysis in Git. It's not a hard and fast rule, but
- if you can quickly and easily regenerate a file, it's often a good idea to
- just commit the source — Git is designed for tracking source code, not generated files.
- So these are some of the files you're often going to want to ignore. In the resources,
- I'm going to give you an example .gitignore file that's good to toss into a lot of Python data science projects.
- So, some of the interfaces and tools that you can use to work with Git. First, the Git command-
- line tool — the git command-line tool you can use in a Unix or Windows command shell.
- You're really going to need to learn the command-line tool, even if in most of your day-to-day work you use other tools,
- because if you're running code on a server, on a cluster, or something like that,
- you need to be able to at least pull, and probably make some changes and commit sometimes.
- Also, your repository can get into a state where you need to fix things that aren't easy to fix from a GUI —
- the GUIs are getting better and better about that,
- but occasionally I find myself needing to do repository surgery, or something is just faster on the command line.
- But the big reason you really need to learn it is that when you've got larger projects,
- you're probably going to be doing some running on a remote server, not just on your local machine with a graphical interface,
- and you really need the command line to be able to do that. There's good Git support in a lot of editors and IDEs —
- I use Visual Studio Code for a lot of my editing, and it has very good
- Git support. And then there are dedicated Git GUIs like Tower or Sourcetree or GitKraken.
- Some of those cost money — Tower and GitKraken
- both cost money, but they're both also available for free through the GitHub Student Developer Pack
- if you register with your university email address. To wrap up: Git allows you to record versions of your code, so you can track history,
- roll back changes, and share with others — and with yourself across multiple computers.
- I strongly recommend that you commit early and often to prevent lost work as you're working on your projects.
Resources#
Git Resources (including my example .gitignore file)
🎥 Git for Data Science#
How do you use Git effectively in a data science project?
- In this video, I'm going to talk with you a little bit about using Git specifically for data science projects.
- The learning outcomes are to understand some of the limits of Git, to ignore data files, and to know some additional tools to look at for managing data files.
- So, Git is very good at tracking modestly sized files — up to a few megabytes — and text files.
- It's not so good for binary files or large files, especially large binary files.
- It also has some difficulties with files that are hard to merge, such as notebooks.
- A notebook is stored in text, but its text is a lot of JSON; merging
- all of that is really touchy and easy to get wrong, so it requires a little special care. Committing your notebooks to Git is a really good idea —
- GitHub lets you view them inline online — but you have to take a little care if you're going to need to merge them.
- So first: in a data science project, we often ignore more files.
- A lot of times, the data files — input files, intermediate files, large output files — all of those we're going to ignore.
- So we're going to have ignore lines for CSV files, Parquet files, and the like. We're usually going to keep the notebooks, quite possibly other documents,
- and that may involve keeping outputs — the notebook file contains both the source code and the output.
- We may also store other notebook outputs, etc., in Git, just so that
- you can view the results without having to rerun everything when you check out the Git repository.
- But for dealing with these large input files — you've got your input file, it's maybe two gigabytes,
- and you're going to create a few hundred megabytes of output — there are a few methods.
- One is that you can just expect anyone working with the repository to recreate all the intermediate and output files.
- So you ignore all of your data files; you include either a script or instructions for how to fetch the input data —
- maybe it's a script that downloads data from a database or fetches it from a web site.
- Then you have scripts that reproduce the intermediate files, and you've documented how to run them;
- the README is a very good place to do this. You may commit outputs or summaries of the files,
- or you may save the results into a database or shared repository. If the analysis is relatively cheap, this can work well:
- fetch the project, make sure you have the current input data, rerun. But if the analysis is not so cheap —
- if you've got processes that can take a while to run in there —
- then you also want people to be able to get copies of the intermediate and output files.
- Maybe you've got a classification model that takes four hours to train.
- What you can do then is, again,
- ignore your data files, and include scripts that fetch both the current inputs and intermediate files from another server —
- maybe it's a file share on your network, maybe it's an Amazon S3 server or bucket.
- You also include scripts to update: if you update the input files, you update the intermediate files
- and store the current versions on the other server. You might again commit your outputs;
- you might commit information about the versions of your intermediate files.
- You can do this just by writing the scripts yourself, or you can use a tool that does a lot of it for you.
- I take this approach in a lot of my own work, using a tool called Data Version Control, which I'll talk about more later.
- And then the third method is to use Git Large File Storage (LFS). LFS is a system for managing large media files,
- and it makes them look like they're committed: when you're working with Git, they act like any other file that's committed.
- It's just that all that actually gets committed to Git is a short stub that says what the contents are supposed to be,
- and the actual file content gets stored on a separate server. Git LFS pushes and pulls that content to and from this separate server,
- and when you check out, it replaces the stub with the actual contents, so you have the large file.
- It works great for big files. You might wind up committing outputs if you use this,
- and if you change something, you recreate your new output files and re-push.
- One of the caveats to this is that if you run your own Git hosting,
- you can run your own Git LFS hosting server with all the storage you want,
- but if you use GitHub, their default accounts have limited space and bandwidth, and the pricing for expanding that can go up relatively quickly.
- So it's often not terribly cost-effective to use Git LFS on GitHub for a lot of large data science files, but Git LFS is an option.
- I want to talk just a little bit about notebooks. As I said, notebooks are text, but they're complex JSON,
- and it's hard to compare and merge them. Also, they change when you rerun them: it changes the images,
- which might not be bit-for-bit identical — you might be running with a slightly different version of the software,
- so it compresses differently. Also, Jupyter stores in the JSON how many times each cell has been executed.
- So there are roughly two solutions for dealing with this.
- One is to just commit as normal and merge by taking one version or another, or doing manual merges.
- There's a tool called nbdime — notebook diff and merge — that gives you support for actually merging notebooks.
- It's a little weird, but it does work; I have used it successfully. Also, you can coordinate notebook edits:
- if you're a small pool of collaborators, you can just coordinate — send a message on Slack, say, hey,
- I'm working on this notebook for a little while, maybe don't change it so we don't have a merge problem;
- when you're done, you push it, and the others just stay away. It kind of breaks a little bit of the freedom you usually have with Git,
- but notebooks are a little hard to merge.
- Another option that doesn't fully fix the problem, but makes merges easier, is to only commit the notebook without outputs.
- There's a program called nbstripout that strips the output from a notebook's content.
- You can wire it into Git so that any time you commit, you still have the outputs in the version in your working tree,
- but the version that actually gets committed is as if you ran "clear all outputs" before saving, and that can decrease the amount of conflict,
- because the only things that change are your textual descriptions and your source code.
- They're both options; which one I use really depends on which project I'm working on.
- In a lot of my projects, I commit as normal and use nbdime if I have a notebook merge situation.
- So, to wrap up: Git works great for data science, but it requires a few tricks.
- You need to be thoughtful in how you handle data, notebooks, things like that — Git works great,
- but there are a few things you need to pay attention to, and notebooks can be a little annoying.
Resources#
Notebook Diff and Merge (nbdime) — tools for diff/merge of notebooks. Available in Conda:
conda install nbdime
🎥 Extract, Transform, Load#
The Extract, Transform, Load (ETL) pipeline is a common design pattern for data ingest. Sometimes it is adjusted to Extract, Load, Transform.
- In this video, we're going to talk about the extract, transform, load pattern for handling data transformations and integration.
- The learning outcome is for you to be able to use standard design patterns to think about your data integration and transformation.
- So, we saw the pipeline at the beginning of the week:
- we have some raw source data, and we want to transform it into prepared data of some kind that's ready for further use,
- using the techniques that we've discussed earlier in the semester.
- It turns out that there are standard paradigms for thinking about how we structure that process,
- and it involves breaking it down into three stages.
- The extract, transform, load pattern takes as input a source of initial unprocessed data. The extract process or stage
- gets the data: it exports it from the database, it scrapes it from the web site, it downloads it from wherever the data comes from;
- it's how you get the actual raw data that you're going to be working with.
- Then you transform that data — maybe you're transforming data from multiple sources into a common format,
- you're integrating data, you're doing some initial cleaning, like deleting invalid records.
- And then finally, you load the data into a system for analysis.
- In the setups we've been talking about so far, a lot of times this is going to look like saving
- it in a file, like a Parquet file or CSV file, that you can then load in the later stages.
- But you might load it into a database. And the result is that you have cleaned and integrated data ready for analysis or modeling.
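A minimal sketch of what an ETL script along these lines might look like; the file and column names are hypothetical:

```python
import pandas as pd


def extract():
    # get the raw data: download it, export it from a database, scrape it, etc.
    return pd.read_csv('raw/source-data.csv')


def transform(raw):
    # clean and integrate: drop invalid records, normalize formats
    good = raw[raw['rating'] > 0].copy()
    good['title'] = good['title'].str.strip()
    return good


def load(prepared):
    # store the prepared data where the analysis stages can find it
    prepared.to_parquet('data/prepared.parquet')


if __name__ == '__main__':
    load(transform(extract()))
```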
- This may seem fairly straightforward,
- but it's what we call a pattern, and a design pattern is a common structure for design in general — for our purposes, for software design.
- And it accomplishes a few useful things.
- First, it gives us a common language for documenting and understanding software.
- So if you document, "here's my extract, my transform, my load for processing this data,"
- then others will know what you're talking about, because you're using familiar and standard language.
- It also provides context for developing best practices, because you can document, OK, here are good ways to do extraction,
- here are good ways to do transformation. It can also, in some cases, benefit from automation support:
- there are automated tools that provide extensive support for doing various types of extractions, transformations, and loads.
- Another example of a design pattern — it's a little bit between an interface and a design pattern —
- is the pattern we use in scikit-learn of a model: you fit the model, and it updates in place.
- We can think of that as a design pattern for machine learning models in Python.
- So in context, ETL may live inside a project — it may live in your repository,
- so you have one or more scripts to do your ETL stages —
- you might have separate ones, like an extract, a transform, and a load stage — that save the data in a format that's ready for subsequent stages of analysis.
- You don't have to do getting things into an easy-to-work-with format, deleting invalid records,
- or making sure you have tabular data in your train-a-predictive-model stage, because that's happened in your ETL,
- and you can do multiple things with your clean data. You may also, though,
- have a dedicated ETL pipeline project — a Git repository that just does your ETL pipeline — for
- contexts where the loaded data is going to be used in multiple different projects across an organization.
- So if you've got an organization where you're processing, say, some
- government records in a form that's going to be used for informed decision-making across the organization,
- then you might have an ETL pipeline that fetches the current version of the government
- records from the official source, does whatever transformations
- you need to do with them, and then loads them into a database so that other people in your organization can go and make use of that data.
- And maybe you rerun this once a day, once a week, once a month — whatever the right timeframe is —
- so that anyone in the organization who needs the data can just go to your database.
- Sometimes this is called a data warehouse; a data warehouse is basically a database that's often larger in capacity
- and not designed to support real-time, production-level systems — it's designed for more long-term querying and analysis.
- But anyone can go to the database and get the data as of the most recent import,
- and you've got this ETL pipeline that takes care of that, and you can rerun it from time to time as your source provides new data.
- A variant of ETL is extract, load, transform (ELT).
- Sometimes you're actually going to want to do the transformations in whatever place you're storing the data.
- An example of this is if you want to use SQL queries to do your data transformation:
- you extract your raw source data, then you load it directly into initial database tables, and then you use your database system —
- if it's a SQL database, you use SQL queries; if it's something else, you use its
- native data-processing capabilities — to do your transformations in the database.
- It's usually good to do a layered schema design, so it's clear what tables in your database get the data directly out of the load
- and what tables store the results of your transformations. But it's a very useful variant of ETL
- for situations where you've got the capacity and the desire to do your data transformations in a database.
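A small sketch of the ELT idea, using SQLite so the example is self-contained; the table and column names are made up:

```python
import sqlite3

import pandas as pd

# Extract: get the raw source data (file name is hypothetical)
raw = pd.read_csv('raw/ratings.csv')

# Load: put the raw rows into the database essentially as-is
con = sqlite3.connect('warehouse.db')
raw.to_sql('raw_ratings', con, if_exists='replace', index=False)

# Transform: use SQL inside the database to build the cleaned table
con.execute('DROP TABLE IF EXISTS ratings')
con.execute('''
    CREATE TABLE ratings AS
    SELECT user_id, movie_id, rating
    FROM raw_ratings
    WHERE rating > 0
''')
con.commit()
con.close()
```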
- The book data tools that I used as an example when we were talking about data integration
- are an example of an ELT pipeline, because we extracted from several data sources,
- we piped that data directly into PostgreSQL, in as close to a native format as we could,
- and then we used SQL queries to transform it into integrated tables.
- That has its own Git repository, and we have this database that has our book data;
- any student in my research group can use that same set of book data for whatever book-related projects they want to work on.
- And then if they need new data extractions,
- we'll add them to the transform layer in the core project, so that others can benefit from those changes as well.
- So to wrap up: design patterns provide a common language for talking about software design. Extract, transform, load and extract,
- load, transform are two patterns for doing the data preprocessing stage of a pipeline for a data science project.
Resources#
The 🎥 Week 7 Example uses an ELT design
🎥 Split, Apply, Combine#
We’ve seen group-by operations this semester; they’re a specific form of a general paradigm called split, apply, combine.
- In this video, we're going to talk about another pattern that's useful in building our data science pipelines: split, apply, combine.
- We've actually already seen it a little bit,
- but we're going to talk about split, apply, combine as a general pattern to analyze and transform data.
- You've seen group-by: we group some ratings by movie ID, and then we
- count the user IDs. Group-by is how you do split, apply, combine in pandas:
- we split the data by movie ID, we apply the operation (count user IDs), and then
- we combine the results into a data frame or a series of movie rating counts.
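In pandas, that looks something like this; the tiny data frame and column names are made up for illustration:

```python
import pandas as pd

# a tiny illustrative ratings frame
ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'userId': [10, 11, 10, 12, 13],
    'rating': [4.0, 3.5, 5.0, 2.0, 4.5],
})

# split by movie, apply a count to each group, combine into one series
movie_counts = ratings.groupby('movieId')['userId'].count()
print(movie_counts)
```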
- This is another pattern, and it's a pattern that has direct computational support:
- if you can fit your data transformation into split, apply,
- combine, then pandas will take care of a lot of the bookkeeping for you in order to do a complex data transformation.
- So the split stage splits, or partitions, a data frame into subsets; groupby is how you do this in pandas.
- You give it a column, and it's going to group by distinct values of that column.
- If you want to do something like group by a range of values, then you would create another column that has the original values rounded,
- so that each range is represented by one value, and then you can group by that. Each group results in a data frame.
- You can see this if you iterate over a group-by. If you iterate over a groupby of the data frame by a column,
- then you're going to get a tuple of a grouping key —
- the distinct values of your grouping columns that define that group —
- and a data frame that is the subset of the original data frame containing the entries for that group.
- You then have the apply stage. The name comes from programming language theory, where
- applying a function to data is what you call it when you call a function. Pandas provides several ways to do apply.
- One is agg — we've seen this many times. You give it an aggregate function, and an aggregate function returns a single value.
- Transform applies a one-to-one operation, so the output size matches the input size:
- if it gets 20 rows, it will return 20 rows. Mean-centering data would be an example of a transform, or subtracting a value would
- be an example of a transform. So you could make a split, apply, combine
- that mean-centers data
- by using transform: it takes a series, computes the mean of the series, subtracts the mean from all the values, and returns
- the result — that's going to be a transform. Apply applies an arbitrary function that
- can return anything: it may return a value, a series, or a data frame.
- The only thing is that it should return the same type for every partition — it shouldn't return a value sometimes and a series other times;
- it always needs to return the same kind of thing, but it can do any of the above.
- If you know you have an aggregate or a transform, it's better to use that; it's a little more efficient.
- Pandas actually calls the apply function twice on the first group —
- the first time to figure out what kind of value it's going to get — and then it goes and does the apply to all of the groups.
- Then — you've split the data into partitions, you've applied some operation to each partition —
- combine combines the results back into a final data structure. If your apply stage returns a value,
- then the combine stage of pandas is going to give you a series that's indexed by the grouping columns.
- If you return a series from the apply,
- then you're going to get a series that is indexed by the grouping columns and the index of the result series.
- And if your apply returns a data frame,
- you're going to get a data frame that's indexed by the grouping columns and then the index of your result data frame.
- Your apply data frame should generally return the same columns for every partition.
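A short sketch of the three apply variants on a toy data frame; the data and names are illustrative:

```python
import pandas as pd

ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'rating': [4.0, 3.5, 5.0, 2.0, 4.5],
})
by_movie = ratings.groupby('movieId')['rating']

# agg: one value per group -> a series indexed by the grouping column
means = by_movie.agg('mean')

# transform: one output row per input row (here, mean-centering the ratings)
centered = by_movie.transform(lambda r: r - r.mean())

# apply: an arbitrary function; returning a series per group yields a data frame
summary = ratings.groupby('movieId').apply(
    lambda g: pd.Series({'n': len(g), 'mean': g['rating'].mean()})
)
print(means, centered, summary, sep='\n\n')
```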
- Now, why do we think about this in these terms? If you formulate your transformation in terms of split, apply, combine,
- then first, pandas takes care of a lot of the bookkeeping: all you have to write is your apply stage, and pandas will do everything else.
- It's also easier to understand the code. You could write some loops to do some complex things,
- but then someone has to parse out what the loop is doing. If you write it in terms of split, apply, combine, then someone reading the code sees, oh,
- this applies a split, apply, combine operation, and it's easier for them to understand what you're doing.
- And split, apply, combine is not unique to pandas. Pandas' implementation is with groupby; R has support for split, apply,
- combine; many different platforms and many different environments have support for some kind of split, apply, combine operation.
- Also, it's trivial to parallelize, because there's no reason you have to do the groups one after another;
- you could apply to multiple groups in parallel. Pandas doesn't do this,
- but the package called Dask provides an API that's very similar to pandas —
- it's a subset of the pandas API —
- that does split, apply, combine, and it can run your apply in multiple processes or on multiple machines in a cluster.
- And by putting your code in the split, apply, combine paradigm, you can take advantage of that parallelism very easily.
- I want to talk very briefly about a related paradigm that's also useful for parallelism in some contexts.
- Pandas does not directly support it, but it's the basis of Hadoop. It's called MapReduce, and you define two operations.
- Map transforms a data instance into key-value pairs —
- one or more, or zero or more, actually — and then reduce transforms
- a key and a set of values for that key into a single value. You might need to reduce multiple times,
- because the system might have one machine do the mapping
- and reduce everything for a key on that machine before sending it on to another machine to reduce, so you don't have to transmit as much data over the network.
- And that can be done in parallel across very large-scale systems. If we want to count ratings again, what we would do is, in the map stage,
- we would yield the movie ID and the number one, and then in the reduce stage, we would sum up the counts — we're going to sum up all of these ones,
- and if one of the inputs is the result of another reduce, we're going to sum that up as well.
- So this is not code for any particular MapReduce framework;
- it's just to show you the flavor of how MapReduce works. It's an example of another pattern where, if you can make your code fit the pattern,
- there exist tools to automate and enhance your use of this pattern — for example, by parallelizing it.
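A toy sketch of the idea in plain Python — not any particular MapReduce framework, just the flavor of map and reduce for counting ratings:

```python
from collections import defaultdict

# tiny illustrative data: (movie id, user id, rating)
ratings = [(1, 10, 4.0), (1, 11, 3.5), (2, 10, 5.0), (2, 12, 2.0), (2, 13, 4.5)]


def map_rating(record):
    movie, user, rating = record
    yield movie, 1                      # emit (key, value): one count per rating


def reduce_counts(movie, counts):
    return movie, sum(counts)           # combine all values for one key


# a tiny driver standing in for the framework's shuffle/group step
grouped = defaultdict(list)
for record in ratings:
    for key, value in map_rating(record):
        grouped[key].append(value)

results = [reduce_counts(movie, counts) for movie, counts in grouped.items()]
print(results)   # [(1, 2), (2, 3)]
```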
- So to wrap up: the split, apply, combine pattern lets us transform groups of data, and fitting code into patterns like this
- improves the understandability, the modifiability, and in some cases the parallelism of our code.
Resources#
🎥 Tuning Hyperparameters#
How can we move beyond GridSearchCV in our quest to tune hyperparameters?
- Let's talk about tuning hyperparameters. The learning outcomes are for you to be able to apply different techniques
- to tune the hyperparameters of your model, and to understand the principle of random search.
- As we've seen with some of our models, we need to be able to pick good values for hyperparameters.
- That might be the regularization strength in a regularized linear model; it might be, in a random forest,
- the number of trees or the maximum depth of the trees in the forest; it might be the number of neighbors for a k-NN classifier;
- or, if we're doing something with matrix factorization or SVD, the dimensionality of the latent embedding space.
- So we need to pick good values for these. The general principle is that we try different values,
- we measure the effectiveness in some way — maybe the classifier accuracy,
- maybe the classifier area under the curve, or some other metric — on some tuning data,
- often with cross-validation, where we'll split the training data into multiple partitions.
- We've seen this with LogisticRegressionCV, where we try every combination of lists of values:
- we have some values for our regularization strength — lambda here, or C in
- scikit-learn's logistic regression — and some values for our L1 ratio rho.
- Then we compute the accuracy at each value and we pick the best, as we see it.
- It's built into LogisticRegressionCV.
- Then there's also the GridSearchCV class, which allows us to search any parameter of any scikit-learn model,
- so long as the model provides a score function that returns
- some kind of accuracy or performance score that it can try to optimize.
- So grid search has a couple of issues, but also a couple of advantages.
- It's simple, and it's trivial to parallelize if you have access to multiple processors.
- But it's expensive: if we've got n parameters, each with m values,
- then the number of combinations to test explodes combinatorially, particularly as we add more and more parameters.
- It also only tests selected values:
- here, where we're selecting rhos in intervals of 0.1 — 0.1, 0.2 — if the actual best value is 0.23, we're not going to find it.
- So random search is a way to speed up grid search.
- It doesn't help much with the problem that the best value may lie between the values you try,
- but it does help with the combinatorial
- explosion as we add more and more parameters to sample.
- The idea is that you randomly pick n points, either from a grid or from intervals that you define:
- you define your different parameters and a way to pick values
- for each hyperparameter to try, and you just pick n different combinations.
- You then measure the performance and use the best one.
- You can use n = 60; the default in scikit-learn is n = 10, and if you want to
- do some more, you can do n = 100. The idea of random search, though, is based on a couple of principles.
- First, the idea that we actually don't need the best hyperparameter values for most applications.
- We need good enough hyperparameter values —
- values that are going to get the model to perform well enough for our business or organizational purposes.
- Another principle is that more than one setting is probably good enough.
- And so if we have a search space here — let's say this axis is λ and this axis is ρ —
- and we have some region of the space that's good enough:
- if 5% of our space is good enough and we sample 60 points at random from the space
- (in this case uniformly at random,
- but the argument takes your probability distribution into account — 5% of the probability mass is good enough),
- then if you sample 60 points, you probably have at least one good enough point, and that probability is at least 0.95.
- Here's the reason that's true. Let G be the set of good enough points. Then, if you randomly pick a point,
- the probability of picking one that's good enough is 0.05.
- And this is our fundamental assumption:
- this approach to selecting how many random points you should try assumes that this is what it means
- for five percent of the space to be good enough —
- the probability of a randomly selected point, according to the distribution you're using, being good enough is 0.05.
- Then we say that S₆₀ is the event that 60 randomly selected points include at least one good enough point.
- Actually computing that probability directly is a little bit tricky, because you would need the probability of having one good
- enough point at any one of the positions, the probability of having two good enough points at any pair of positions, and so on.
- It's a fairly complex expression, but it turns out we can turn it around.
- So if there's at least one good enough point,
- that means it's not the case that there are no good enough points.
- So the complement of "there is at least one good enough point" is "there are no good enough points."
- And since the probability that a randomly selected point is not good enough is
- 0.95, we can just take the complement.
- So the probability of at least one good enough point is one minus the probability of no good enough points.
- And the probability of no good enough points is the product, over the 60 points, of the probability that each point is not good enough,
- because the only way none of the points can be good enough is if every point is not good enough.
- So this is the probability of "not good enough" to the 60th power, which is 0.95⁶⁰;
- punch that into a calculator and you get about 0.046, which is less than 0.05 (the full calculation is written out below the transcript).
- So with probability better than 0.95, at least one of your
- sixty randomly selected points is going to be good enough. So random search allows you to get away with only 60 to 100 points,
- regardless of the number of parameters (I sometimes go to 100 just because it lets you get away with a smaller fraction of the search space being good enough).
- It's trivially parallelizable, just like grid search, because you can parallelize over those hundred points.
- It may not find the best solution, and it requires this assumption about "good enough." scikit-learn does provide randomized search:
- RandomizedSearchCV works just like GridSearchCV, except you can give it distributions of points to try (a sketch comparing it with grid search appears after the transcript).
- And then another way to think of hyperparameter search is as optimization.
- When we're training or fitting a model, we're trying to find some parameters that minimize a loss function.
- In hyperparameter search we're really trying to do the same thing — we're trying to find hyperparameters that minimize a loss function —
- but the loss function is "cross-validate the model and compute a misclassification rate, or an accuracy metric, or whatever."
- When we're training a model, we're training one model:
- we have all this training data, and we can just look up a data point, see its actual outcome, and compare it with our prediction.
- But if we want to see the outcome of a
- hyperparameter setting, we have to cross-validate a model using that hyperparameter.
- That's very expensive. And it also has no derivative.
- And so a lot of the techniques that we use for optimizing models and solving this argmin problem don't work anymore,
- or they're prohibitively expensive. So there's a technique called Bayesian optimization that works by testing the model with a few
- initial points to get a starting sense of how accuracy varies over the search space.
- It then maintains what's called a surrogate model that tries to predict the performance of new, as-yet-unobserved hyperparameter settings.
- And it uses this surrogate model to pick the next points it wants to test.
- This allows it to be more targeted in its search than random search,
- and it can sometimes find better solutions. It's implemented by a package called scikit-optimize.
- It has BayesSearchCV, which works like RandomizedSearchCV or GridSearchCV (a sketch appears after the transcript).
- It also has a function called gp_minimize —
- a general-purpose minimizer, like the SciPy optimization functions we saw a while ago, that uses Bayesian optimization.
- It trades off parallelization for optimization ability: it might find better solutions,
- but the next search points depend on the results so far. You can do some batch searches rather than just trying one new point —
- you can say "try four" and then do those in parallel. It's useful for complex search spaces, if random search isn't good enough.
- Also, random search doesn't really have the ability to stop early, because the proof requires you
- to try the whole set of 60, unless you know the threshold for good enough.
- But with Bayesian optimization, since it's continually trying to improve, you can use early stopping to say,
- "okay, our last five runs haven't gotten us anything better, maybe we can go ahead and stop."
- Finally, I want to talk very, very briefly about using hyperparameter tuning in a workflow.
- So far we've just included it in our notebooks:
- we have a pipeline where the cross-validation is just part of the model-fitting process. That can be useful, especially for relatively simple models.
- But we might not want to redo the hyperparameter search every time, say, we update our data.
- We might want to do it on a less frequent basis. What I often do is have a script that does the hyperparameter search.
- And so it'll take the training data,
- it will cross-validate (or I'll use a tuning set), and it will do the hyperparameter search on that tuning set or cross-validation.
- And then at the end of that, it's going to save the optimal values it learned to a file, often a JSON file:
- "here are my parameter values." Then other scripts — my model training or
- my prediction script or whatever — can read those optimal values and use them to train
- the real model that I'm going to use for actually testing on my data (a sketch of this save-and-reload pattern appears after the transcript).
- It works great. So to wrap up: hyperparameter tuning is an expensive optimization problem.
- That's really what it is. Each sample is expensive, because it's costly to evaluate the loss function —
- you have to train models on cross-validated data — and it also has no derivative.
- And so a lot of the techniques that are used by other
- packages in the Python ecosystem to solve optimization problems don't work.
- But there are several techniques that are useful, with good automation for integration into scikit-learn.
Note
There is an error on slide 9. Where it says “≤ 0.5” it should say “≤ 0.05”.
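With the corrected threshold, the calculation from the video works out as follows (S₆₀ is the event that 60 randomly sampled points include at least one good enough point):

```latex
\begin{aligned}
P(\text{a random point is good enough}) &= 0.05 \\
P(\text{no good point among 60 draws}) &= (1 - 0.05)^{60} = 0.95^{60} \approx 0.046 \le 0.05 \\
P(S_{60}) = 1 - 0.95^{60} &\approx 0.954 \ge 0.95
\end{aligned}
```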
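As an illustration of the two search strategies discussed in the video, here is a minimal sketch using scikit-learn's GridSearchCV and RandomizedSearchCV. The synthetic dataset, parameter ranges, and iteration counts are illustrative assumptions, not recommendations.

```python
# Grid search vs. random search over an elastic-net logistic regression.
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)

# Grid search: tries every combination of the listed values (4 x 5 = 20 fits per fold).
grid = GridSearchCV(model, {
    "C": [0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0],
}, cv=5, n_jobs=-1)
grid.fit(X, y)

# Random search: samples n_iter combinations from distributions, so the cost
# stays fixed no matter how many parameters we add.
rand = RandomizedSearchCV(model, {
    "C": loguniform(1e-3, 1e2),
    "l1_ratio": uniform(0.0, 1.0),
}, n_iter=60, cv=5, random_state=42, n_jobs=-1)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```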
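And here is a sketch of the same search using Bayesian optimization via scikit-optimize's BayesSearchCV. This assumes the scikit-optimize package is installed; the search space and n_iter value are again illustrative.

```python
# Bayesian-optimization search with scikit-optimize.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from skopt import BayesSearchCV
from skopt.space import Real

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000)

opt = BayesSearchCV(
    model,
    {
        "C": Real(1e-3, 1e2, prior="log-uniform"),
        "l1_ratio": Real(0.0, 1.0),
    },
    n_iter=32,   # points to evaluate; each one is a full cross-validation
    cv=5,
    random_state=42,
)
opt.fit(X, y)
print(opt.best_params_, opt.best_score_)
```

Because each new point depends on the results so far, the search is harder to parallelize than grid or random search, but it can be stopped early once the score stops improving.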
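Finally, a minimal sketch of the save-and-reload workflow described at the end of the video: a tuning script writes the best parameters to a JSON file, and the training script reads them back. The file name, script names, and parameter values here are hypothetical.

```python
# Sketch of splitting tuning and training into separate scripts via a JSON file.
import json

# --- tune.py (run occasionally, e.g. after major data changes) ---
# search = RandomizedSearchCV(...); search.fit(X_train, y_train)
best_params = {"C": 2.7, "l1_ratio": 0.4}   # stand-in for search.best_params_
with open("tuned-params.json", "w") as f:
    json.dump(best_params, f, indent=2)

# --- train.py (run every time the data updates) ---
with open("tuned-params.json") as f:
    params = json.load(f)
# model = LogisticRegression(penalty="elasticnet", solver="saga", **params)
# model.fit(X_train, y_train)
print(params)
```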
Resources#
📓 Tuning Example#
The Tuning Example notebook demonstrates hyperparameter tuning by cross-validation with multiple techniques.
🎥 Reproducible Pipelines#
I provide very brief pointers to additional tools you may want for workflow management in more advanced projects.
- In this video, I want to talk with you about how to make a reproducible data science pipeline.
- I'm not going to get into a lot of the details here —
- we're talking primarily about concepts, and I'm giving you pointers to some software I use that you might find useful and want to study on your own.
- So the learning outcomes are for you to understand the value of a reproducible pipeline for both scientific and industrial purposes,
- and to know where to read more about tools that help build and automate them.
- So, reproducibility. What we mean by reproducibility is that we can rerun the code and get the same conclusions and results out of it.
- Reproducibility and its related concept, replicability, are cornerstones of current scientific philosophy. The basic idea is that
- if you can only observe something once, it's hard to trust its validity; we need to be able to observe it multiple times —
- we need to be able to reproduce the observation. And also, from a practical perspective, in many cases we need to be able to rerun with new data.
- Maybe you're building a forecasting model for
- traffic at your business, and you need to update it based on the new data you collected this month and re-forecast for next month.
- The model already works well; you just need to be able to rerun it each month as new data comes in.
- You also need to be able to check for bugs and sensitivity. It's easy to have a stage where maybe some piece of your pipeline
- ran on an old version of the data, so your results aren't all on the same copy of the data.
- Maybe you reran things in different orders,
- and so your results are actually an artifact of the order in which you ran things.
- Maybe you have sensitivity to your random seed — unlikely, but it can happen. If you can rerun your whole analysis front to back,
- that ensures that your actual final conclusions are based on what you thought you did, because you reran all of the steps in order.
- So the high-level, ultimate goal for reproducibility is that you can rerun an entire analysis, end to end, with one command.
- I have achieved this in one or two of my projects, where there is one command: I run it on the R2 cluster, and
- a day later I have a fresh copy of the results, reproduced from the input data,
- including rerunning the notebooks and regenerating the figures. (That day doesn't include hyperparameter tuning —
- the hyperparameter tuning would take another week.) We want to be able to rerun the pipeline with new data, new software versions, new settings.
- A well-documented set of steps — like a README that says, "OK, do these five steps
- to reproduce the analysis" — also accomplishes a lot; the full one-script reproducibility
- is kind of the ultimate, easy "go rerun the whole thing."
- You can get a lot of the value of reproducibility just by documenting a few steps that a human can run.
- But this is the idea: you can rerun the analysis end to end,
- possibly on another computer, possibly with new data, and you're going to get the same conclusions.
- The requirements to be able to do that: first, you need to know what steps need to happen — what scripts or notebooks you need to run,
- what arguments you need to provide to those scripts, what their inputs and outputs are,
- and what order the steps need to happen in. And then there's an optional thing, for optimization: is my step up to date?
- So it might be that one of the steps hasn't changed, its inputs haven't changed,
- and its outputs are there, so you don't need to run it again. This might sound a lot like make —
- and in fact, you can build these kinds of pipelines with make, if you've used that tool before.
- I use a tool called Data Version Control (DVC) for this. In DVC, the pipeline is made up of stages, each of which has input files,
- output files, and the command that will produce the outputs from the inputs.
- And the stages are defined in DVC files that are committed to Git.
- You can also have a stage that only has outputs, and that basically just records the presence of a file.
- DVC can then reproduce stages, or check whether stages are up to date:
- it records the checksums of the inputs and outputs and the command, so it can see —
- has anything changed since the stage was last run? Am I missing an output file? Has one of my input files changed?
- And if anything's changed, it reruns the command to update the outputs. Before it does that, though, it first makes sure that all of the dependencies are current:
- for each input file, it looks for a stage that produces that file as an output,
- and if there is one, it first makes sure that stage is up to date, so that you're building off of current dependencies.
- And then all of these checksums get committed to Git as part of the
- DVC files that you store in Git: here were the input checksums,
- here was the command that ran, and here were the output checksums (a toy sketch of this checksum idea follows the transcript).
- Then you reproduce the entire pipeline by ensuring that the final stages —
- maybe your notebooks — along with all their dependencies, are up to date.
- And that's what the dvc repro command does. DVC can also help manage data, because it's committing these checksums to Git — it's got an MD5
- checksum of each data file.
- What you do is have Git ignore all the output files — Git is not going to manage them — but DVC will look at those output files,
- anything that's an output of a stage it knows about, and it will copy those outputs to and from a data server like Amazon S3.
- That makes it really easy to ensure you have a current copy of the data, because as long as you have the current DVC files —
- the DVC file says "I need the data output file with this checksum" — DVC knows how to
- go get that file from your data server and make sure you have a local copy of it.
- So all of the things that Git gives us for making sure we have the right version
- of the code, and for working across multiple machines when we're working with collaborators,
- DVC gives us for the data files. Some of my practice:
- I have my entire pipeline in DVC. I experiment with manual commands, and then once I have a script the way I want it,
- I create a DVC stage that's going to run it. And then I run the expensive models on
- one of the university clusters, either R2 or Borah.
- And so I run that, I come back, and I commit my DVC stages.
- I git push, and I dvc push to copy the output files to our data server.
- And then on another machine — either my desktop or our research group server —
- I will pull down those data files, and then I'll do my statistical analysis
- and I'll work with my notebooks and things. Notebooks are annoying to use on the cluster,
- so I just do the big models on the cluster. DVC saves the data and makes sure I know what version of the data I'm supposed to be working with.
- And then I git pull and dvc pull on our research group server —
- which isn't as powerful as the cluster, but it's good enough for Jupyter notebooks —
- and I make sure that I'm running on the current version of the results from the cluster.
- So it makes it easy to make sure that I've got the right version of my data files across all of my different machines,
- and I don't have to worry about accidentally copying a file in the wrong direction.
- There are other tools that can do similar things. There's a tool called MLflow, which is one of DVC's competitors.
- You can build your own pipelines out of make if you're always in a Unix environment.
- If you're doing a Java-based project, Gradle is a really useful integration tool.
- Before I switched all my source code to Python,
- a lot of my expensive models were in Java, and I was using Gradle for my automation at that point.
- There are many other tools as well. Some of them will just do the pipeline management, and you have to take care of the data files yourself;
- others will just do the data management, and you have to do the pipeline with another system.
- DVC does them together. There's a variety of options, but there are tools that can help you build this pipeline
- end to end. I'm providing a few links in the resources. So to wrap up: fully reproducible data science pipelines
- help science and practice, and tools like DVC and make can help you build such reproducible pipelines.
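To make the "is my step up to date?" check concrete, here is a toy sketch of the idea: record checksums of a stage's inputs, its outputs, and its command, and rerun the command only when something has changed. This is purely illustrative — it is not DVC's implementation or file format, and the file names in the usage comment are hypothetical.

```python
# Toy checksum-based stage runner, illustrating the idea behind `dvc repro`.
import hashlib
import json
import subprocess
from pathlib import Path

def md5(path):
    """Checksum of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def run_stage(command, inputs, outputs, state_file="stage-state.json"):
    # Load the checksums recorded the last time this stage ran, if any.
    state = json.loads(Path(state_file).read_text()) if Path(state_file).exists() else {}
    current_inputs = {p: md5(p) for p in inputs if Path(p).exists()}
    outputs_present = all(Path(p).exists() for p in outputs)

    if state.get("inputs") == current_inputs and state.get("command") == command and outputs_present:
        print("stage is up to date, skipping")
        return

    # Something changed (or never ran): rerun the command and record new checksums.
    subprocess.run(command, shell=True, check=True)
    Path(state_file).write_text(json.dumps({
        "command": command,
        "inputs": {p: md5(p) for p in inputs},
        "outputs": {p: md5(p) for p in outputs},
    }, indent=2))

# Hypothetical usage:
# run_stage("python prepare.py",
#           inputs=["data/raw.csv", "prepare.py"],
#           outputs=["data/prepared.parquet"])
```

DVC adds the pieces this sketch leaves out: chaining stages through shared inputs and outputs, storing the checksums in Git-tracked files, and syncing the data files themselves to remote storage.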
Resources#
Some software that supports data and/or workflow management:
Data Version Control — I use this
MLflow — support for machine learning workflows
📃 Software Environments#
Read software environments.
📃 Reproducibility Case Study#
📓 Example Script and Notebook#
You can find an example, with walkthrough of how to run it with the command line on GitHub CodeSpaces, in this example repo.
🚩 Weekly Quiz 14#
Take Quiz 14 in Canvas.
📓 More Examples#
My book author gender project is an example of an advanced workflow with DVC.
📩 Assignment 7#
Assignment 7 is due Sunday, December 11, 2022.