This video introduces Python scripts and modules, and how to organize Python code outside of a notebook.
CS 533: INTRO TO DATA SCIENCE
Michael Ekstrand
SCRIPTS AND MODULES
Learning Outcomes
Write a Python script
Put Python code in a module
Understand the Python module/package structure
Photo by Simon Goetz on Unsplash
Scripts
A .py file can be run as a script from the command line:
python my-script.py
Runs the code in the file
‘def’, ‘class’, etc. are just Python statements
Example: read in a file, and write a filtered file
Starts with a docstring (optional)
"""Filter ratings to only real ones"""import pandas as pdratings = pd.read_csv('ratings.csv')r2 = ratings[ratings['rating'] >0]r2.to_csv('filtered-ratings.csv', index=False)
Scripts and Pipelines
Typical script:
Reads input files
Does some processing
Pandas manipulations
SciKit-Learn model training/evaluation
Saves results
Data frame as CSV, Parquet, etc.
Model as pickle file
Docstrings
A Python code object can start with a docstring
Script, class, function, module
Documents the code
Purpose
Function arguments
Class fields
Doc renderers & IPython/Jupyter use these
Configurability
Scripts can take command line arguments
python script.py in.csv out.csv
In list sys.argv
0 is name of the program
Libraries help parse:
argparse (in standard lib)
docopt (uses help message)
"""Filter ratings to only real ones"""import sysimport pandas as pdin_file = sys.argv[1]out_file = sys.argv[2]ratings = pd.read_csv(in_file)r2 = ratings[ratings['rating'] >0]r2.to_csv(out_file, index=False)
Import Protection
Python files can be either run as a script or imported as a module
Import-protect your scripts to avoid potential problems & enable code reuse:
Put all code in functions
Call main function in ‘if’ statement at end of script
"""Filter ratings to only real ones"""import sysimport pandas as pddef main(): in_file = sys.argv[1] out_file = sys.argv[2] ratings = pd.read_csv(in_file) r2 = ratings[ratings['rating'] >0] r2.to_csv(out_file, index=False)if __name__ == '__main__': main()
Modules
import foo
Looks for file foo.py
In script’s directory (or local dir for notebooks / console)
In PYTHONPATH environment variable
In Python installation
Runs file to create definitions
Exposes definitions under ‘foo’ object
def bar()… becomes foo.bar
Exposes all assigned names: variables, functions, classes, other imports…
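The lookup-and-expose behavior above can be demonstrated in a few lines; this sketch writes a tiny module to a temporary directory (a stand-in for a real project file) and then imports it the way `import foo` normally would.

```python
import sys
import tempfile
from pathlib import Path

# Write a tiny module to disk; in a real project foo.py would
# just live next to your script.
src = "GREETING = 'hi'\ndef bar():\n    return GREETING\n"
mod_dir = Path(tempfile.mkdtemp())
(mod_dir / 'foo.py').write_text(src)

sys.path.insert(0, str(mod_dir))  # like adding the directory to PYTHONPATH
import foo

print(foo.bar())       # the def statement became foo.bar
print(foo.GREETING)    # plain assigned names are exposed too
```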
Packages
Modules can be grouped together into packages
A package is just a directory with a file __init__.py
Init file can be empty
Init can have docstring to document package
Packages can contain other packages
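The package rules above can be sketched by building a package on disk and importing from it; the `mypkg` name and layout here are hypothetical, not from the lecture.

```python
import sys
import tempfile
from pathlib import Path

# A package is just a directory with an __init__.py file;
# packages can nest inside packages.
root = Path(tempfile.mkdtemp())
pkg = root / 'mypkg'
sub = pkg / 'inner'
sub.mkdir(parents=True)
(pkg / '__init__.py').write_text('"""Demo package."""\n')   # docstring documents the package
(sub / '__init__.py').write_text('')                        # empty init is fine
(sub / 'util.py').write_text('def answer():\n    return 42\n')

sys.path.insert(0, str(root))
from mypkg.inner import util

print(util.answer())  # 42
```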
Let's see an example…
Script Advice
Write a docstring (quickly glance at script to see purpose)
With docopt, docstring is also script usage information
Import-protect scripts
Provide reasonable configurability
If script has too many different modes, break apart
Multiple scripts
Common code in modules
Disconnected Runs
What if you lose connection?
Can we start a process running, go home, and check it later?
The tmux program does this!
tmux creates a new session
Ctrl+b d (Ctrl+b followed by ‘d’) detaches
tmux attach re-attaches to session
Many other capabilities under Ctrl+b.
General Principles
Use packages and modules to organize code for your project
Layout
Common utilities
Presentation themes?
Always refer to files by relative paths
Applies to all code!
Beware excessive configurability
In either functions or scripts
If multiple ways to combine pieces, extract pieces & have different scripts or functions that combine them in different ways.
Wrapping Up
Scripts and modules are useful for organizing code in larger projects.
We can reuse code and operations across multiple parts of the project.
Photo by Klára Vernarcová on Unsplash
- In this video, we're going to talk about how to use Python scripts and modules to break our analysis apart into smaller pieces and organize our code.
- The learning outcomes are for you to be able to write a Python script, put Python code in a module, and understand the Python module and package structure.
- So a .py file can be run as a script from the command line.
- So if we have a file like this, we can run it.
- If it's saved as my-script.py, we can run it with python my-script.py; on some systems
- you might need to run it with python3 my-script.py. But what it does is it just runs the code in the file from top to bottom.
- If you define a function: def and class are actually just Python statements that define a
- function or a class and save the resulting function or class object in a variable.
- It runs a script from top to bottom. So this example here, it reads in a file,
- it filters it so we only have the values where the rating is greater than zero, and then it saves the result back out to another file.
- It also starts with a docstring. So a docstring is this:
- it's this string at the top.
- I'm using triple quotes, which allow us to have a multi-line string in Python; triple quotes delimit multi-line strings.
- The string at the beginning just tells us what the script is going to do.
- It's going to filter ratings to only real ones. The script is also an example of the typical kinds of things that we usually do with scripts.
- So a script is often going to read some input files, do some processing,
- and it might do pandas manipulations, it might train a scikit-learn model and make some predictions,
- it might do a statistical inference, and then it's going to save the results. If the results are
- one or more data frames, save them in CSV files. I really like saving data frames in Parquet files because
- they're more efficient to read and write.
- You can also take an entire scikit-learn model that you've trained and use a library called pickle to save it to a file on disk.
- And then the next stage of the pipeline,
- another script or a notebook, is going to read these outputs that you saved from this script and do something with them.
- You might train a scikit-learn model and predict some test data and save the results of that.
- And then a notebook will load the test data and load your predictions over it and compute your accuracy metrics, so that you can separate
- a perhaps very computationally intensive model training and prediction stage from analyzing the results of running your predictor.
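The pickle step described above can be sketched like this; the dict here is a stand-in for a trained scikit-learn model, since any Python object is pickled the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model object
model = {'coef': [0.5, -1.2], 'intercept': 0.1}

# One pipeline stage saves it...
path = Path(tempfile.mkdtemp()) / 'model.pkl'
with path.open('wb') as f:
    pickle.dump(model, f)

# ...and a later script or notebook reloads it.
with path.open('rb') as f:
    loaded = pickle.load(f)

print(loaded == model)  # True
```

Only unpickle files you created yourself; pickle can execute arbitrary code when loading untrusted data.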
- So any Python code object (a script, a class, a function, a module) can start with a docstring.
- It's just a string; all it is, is a string
- at the beginning of the file, or the beginning of the function or the class.
- And what it does is document the code: its purpose, its arguments if it's a function; it might document class fields, et cetera.
- If you've used Java, it's Python's equivalent to Javadoc.
- Both documentation renderers, such as Sphinx, and IPython and Jupyter
- (IPython is the Python engine that lives inside of Jupyter)
- use the docstrings when you ask them to document a particular function or a class.
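As a small illustration of what the tools read, a function's docstring is just its __doc__ attribute; the function here is a hypothetical example, not from the lecture's code.

```python
def filter_positive(ratings):
    """Return only the ratings greater than zero.

    The triple-quoted string at the top is the function's docstring;
    Sphinx and IPython's help (e.g. filter_positive?) read it from __doc__.
    """
    return [r for r in ratings if r > 0]

# The docstring is available programmatically:
print(filter_positive.__doc__.splitlines()[0])
```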
- They're also useful for scripts. Scripts can also take command line arguments.
- So if we run this script with python script.py, and then we give it two command line arguments, in.csv and out.csv,
- then what it's going to do is pass in in.csv
- as sys.argv[1], and it's going to pass out.csv as sys.argv[2].
- And we can access them in our script, so that we can make a script that can do the same operation on different data files.
- And so if you have different data sets that you want to do the same operation on,
- or you have maybe different models that you want to run, and you know how to run them given a name on the command line,
- this allows you to make scripts that are parameterized. You can use the same script code to do multiple different tasks.
- The sys.argv variable is in the sys module, so we import that.
- It's a list of command line arguments. argv[0] is the name of the program, and then argv[1]
- and following are the actual command line arguments that were passed to your script.
- It does not include any of the command line arguments that were passed to the Python interpreter itself.
- Python strips those out and sets it up so that your program just sees its name and its arguments.
- Then there are some libraries that help you parse command line arguments and allow you to build very sophisticated command line interfaces.
- One is argparse; it's in the Python standard library. Another that I use a lot is called docopt, and it actually uses your help message,
- written in your docstring, to define what options are available.
- Another thing we need to do when we're writing a script is what we call import-protecting
- it, because Python files can either be run as a script or imported as a module.
- It's a common convention to import-protect, and what we do is put all of the code in functions.
- So I've moved all of our code into a main function here.
- And then at the end of the script, you have this kind of a line, where if __name__ equals '__main__', then we call the main function.
- And this __name__ is a Python magic variable
- that contains the name of the module that's currently being run or loaded.
- And as a special case, if you run a Python file as a script, Python sets the name to '__main__'.
- So this is how you detect that your file is being run as a script.
- And what it does is it only actually runs the code that's going to do your operations if it's being run as a script.
- If it's not run as a script, it's just going to define all of the functions. There are a couple of reasons for this.
- One is it allows you to import a function from another script.
- I don't really recommend that; if two scripts need the same function, I recommend putting it in a module.
- But also, there are some situations where Python may need to re-import your script, around certain parallelism techniques.
- I haven't taught you how to do any of them, but some libraries may use them.
- And so import-protecting your scripts just provides this extra protection, in case you eventually wind
- up wanting to do something in your code that uses one of these techniques that requires it to be re-imported.
- It's standard practice, though. Most Python scripts you're going to find in the wild, particularly in distributed software, are import-protected.
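The __name__ check described above can be seen in a minimal form; the main function here is a stand-in for a script's real work.

```python
def main():
    """Do the script's real work (a stand-in here)."""
    return 'did the work'

if __name__ == '__main__':
    # Runs only when the file is executed as a script;
    # when the file is imported, __name__ is the module's own
    # name instead, and this block is skipped.
    print(main())
```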
- So I've mentioned modules; what is a module? When you run the Python command import foo,
- what it does is look for a file called foo.py, and it looks in a few different places.
- It first looks in the script's directory, or if you're just running a Python interactive interpreter or a notebook,
- it looks in the notebook's current directory; for a console,
- it's going to look in your current working directory.
- It then searches the directories in an environment variable called PYTHONPATH.
- Environment variables are a mechanism for a process to have information about its environment and then to pass that on to child processes.
- I put just a little bit about them in the glossary online. It also then looks in your Python system directory.
- And then it runs this file to create its definitions, and it runs the whole file, because a
- Python file just runs, and all of your things are statements. def is a statement that defines a function; import is a statement that imports code.
- And then all of the definitions get exposed under the foo object.
- So if foo has a def bar that defines a function called bar,
- then it's available as foo.bar in the code that imports foo, and it exposes all of our assigned names:
- variables, functions, classes, other imports. Any variable that's defined gets made available.
- There's no such thing as a truly private variable; the convention is you prefix it with underscores.
- Anything that's defined in foo is available as foo.whatever. Python modules can also be grouped together into packages,
- so a package is just a directory with a file called __init__.py.
- This file can be empty; all it does is signal that the package is there.
- You can also put some code in it to set up some defaults for the package if you want.
- It can also have a docstring to document the package, and packages can contain modules and other packages.
- So if you import foo.bar, what it does is look for the foo module, or the foo package,
- a foo directory with the __init__ file in it, and then it looks for bar.py, or a bar directory with the __init__ file in it, within that.
- So let's see an example of this. This is a project I have for doing some experiments with recommender systems.
- And here, this first file that I'm showing you is a script that I created that splits data.
- And in its docstring, I give the usage to say how to run the script:
- you run split-data.py, and you can give it options.
- docopt will parse this in order to figure out what options to parse out of the command line.
- And then I have my imports. I have my main function, which takes the arguments that are already parsed; the main function does all of the work.
- And then finally, at the end, we have
- the import protection: if the name is '__main__', we set up a logger,
- we parse the arguments, and we call the main function.
- I don't usually do very much there; I have no more than three lines or so in my import protection.
- But here, the main function takes the pre-parsed arguments already.
- Now in this project, I also have a directory that has an empty file __init__.py, and that makes
- the directory a package. And within that, I have modules. So I have the module log, which defines a couple of functions:
- a setup function to set up logging, and a script function that sets up logging for a script.
- And so I've defined these functions in this module.
- And then over in split-data, what I did was import the datasets and log modules from the project's package.
- And so then I can call log.script, and it's going to do that initialization process.
- I'm linking to this example code, which was prepared by me and some of my graduate students, in the resources.
- You can go see an actual example of how this kind of code gets laid out.
- So, a few pieces of advice for writing a script. First, always write a docstring for your script.
- That way you can quickly just look at the top of the file and see what that script is supposed to do.
- Also, I like using docopt; then I just write in my docstring how you actually run the script and what its options are,
- and docopt uses that as the ground truth for how to actually parse command line arguments.
- I'd recommend always import-protecting your scripts.
- I recommend providing reasonable configurability, so some options like: I want to run on three different data sets,
- or maybe change a parameter like how many partitions of data to create.
- But if you wind up creating a lot of options that create a lot of different modes,
- so there are different code paths, and you've got a lot of extensive conditionals in order to figure out the right code path through the script,
- I often recommend breaking that into multiple scripts: put the common code in modules,
- and then for each different way you need to combine those functions that you defined in the modules,
- you can write a different script; and if you put enough code in the modules, each script is very simple and straightforward.
- And that way you don't have nearly as much code complexity that's easy to break as you're doing future development and maintenance of your code.
- So another thing I want to mention briefly: we've got these scripts, and we run them from the command line.
- We can also run them, then disconnect and leave them running on another computer.
- This is useful if you're running on onyx, or if your research group
- has a computer that you can run things on, or you're running on an Amazon node.
- There's a program called tmux that creates a terminal. You log in to a machine over SSH,
- you run tmux, and it starts your shell, but it's within tmux.
- So you can run programs in there; you can start your program running.
- You can then detach from tmux: hit Ctrl+b followed by d, and tmux will detach.
- You can log out; your tmux is still running, and the program you're running is still running.
- So then you can go home, you can log back into the machine, and you can run tmux again.
- tmux attach will reattach to an existing tmux session,
- and then you can check on your program.
- It also protects you if you lose your Internet connection: if you're just running a program over SSH and you lose your connection,
- it's going to stop. But if you run your program over SSH through tmux and
- you lose your connection, when you connect again you can run tmux attach, and the program will never know you disconnected.
- And so it allows you to run your scripts in a much more robust fashion,
- if you've got a script that's going to take a while.
- So, the general principles that I want to recommend: use packages and modules to organize code for your project.
- There are a variety of things that I put in there. I put in code about how to find other files, so that I have my file names defined in one place;
- maybe code about which data sets there are, which might be stored as a variable
- in one of my modules; common utility functions that I use throughout, like those logging functions;
- all of my scripts use those logging functions in order to set up the logging framework.
- I often wind up having a module that has code for presentation, for doing plots and visualizations, particularly one that has the theme,
- so it's easy for me to have the same layout and the same ability to save images to disk, etc.,
- throughout all of my notebooks. Always refer to files by their relative paths.
- You never want to have an absolute path in a script or in a notebook or in a module,
- because then if someone else is working with the code, or you just check it out in a different location or on a different computer,
- it's not going to run.
- Always use relative paths, relative to the top of the working directory; the top of your repository is usually where I have them from.
- If it's a notebook, it needs to be relative to the notebook's location, so that you can move code from one place to another and still run it.
- Also, be careful about excessive configurability, either in functions or in scripts.
- If you've got too many different paths through a function or through a script,
- then that's a good sign that you need to pull some code off into functions in a module: make multiple functions or multiple scripts,
- each of which has one of those paths through the code. So to wrap up: scripts and modules are useful for organizing code in larger projects.
- We can reuse code and operations across multiple parts of the project.
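The relative-path advice above can be sketched with pathlib; the data directory and file names here are hypothetical.

```python
from pathlib import Path

# Build paths relative to the project layout, never a
# machine-specific absolute path like /home/someone/project/data.
data_dir = Path('data')                    # relative to the working directory
ratings_file = data_dir / 'ratings.csv'    # joins portably across OSes
print(ratings_file.as_posix())             # data/ratings.csv
```

Because the path is relative, the same script runs unchanged when the repository is checked out somewhere else or on another machine.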