Software Environments¶
Throughout most of this class, we have been using packages installed in a single, global Python environment; typically your base Anaconda environment. This is useful for quickly working on things, but it has a few drawbacks for doing data science work in practice:
Unless you take careful notes, you don’t have a record of precisely what packages your project requires. Even if you do take notes, you need to keep those notes up to date, and there are not good, reliable ways to check that it is.
All your projects share the same software installation, so if you update a package for one project, it updates it for all your projects. If the update has incompatible changes, this can break your other projects.
It’s hard for others to make sure they have the same software stack for collaborating with you or trying to reproduce your results.
The way to fix this is to use a separate software environment for each of your projects. I have been doing this all class, with software environments dedicated to the course web site and to the course notebooks. If you have used Python virtual environments in the past, the concept will be familiar.
Creating Conda Environments¶
Conda supports environments, which are separate isolated sets of packages within your Conda installation.
In my own data science work, each of my projects has its own Conda environment.
You can create a Conda environment with the create
command:
conda create -n cs533 python=3.8 pandas seaborn notebook
Once that environment is created, it doesn’t do anything on its own; you have to first activate it.
The activate
command in a Conda-enabled shell prompt switches your current Conda environment:
conda activate cs533
This sets environment variables in your shell so that it looks for programs in your environment first,
so if you run python
it will run the Python in your new environment. For example:
michaelekstrand@abyss:~$ conda activate cs533
(cs533) michaelekstrand@abyss:~$ python
Python 3.9.6 (default, Aug 18 2021, 19:38:01)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print(sys.executable)
/home/MICHAELEKSTRAND/anaconda3/envs/cs533/bin/python
>>>
You can see that it ran the python
executable in my cs533
environment. Each different environment
has a completely isolated Python installation along with all your packages — you can’t accidentally use
a package only installed in one environment in a different environment.
One consequence of this is that Jupyter (notebook
and/or jupyterlab
) needs to be installed separately
in each of your sessions.
Your “primary” Conda installation is accessible as an environment called base
. Jupyter in your base
environment can create notebooks that use any of your installed environments, but I don’t recommend
doing this — it bakes the environment name into your notebook file and hurts portability a little bit;
my practice is to run Jupyter from within my environment.
Once you have activated an environment, conda install
and conda update
will work in that environment.
Environments also include pip
(at least if they have Python in them), so you can pip install
and it
will install packages into the environment instead of the system or elsewhere.
You need to re-activate the environment every time you log in: conda activate
only changes your current
shell session.
Environment Maintenance and Cleanup¶
The conda env
command (and its subcommands) allows you to inspect and clean up your environments.
To list all the environments you have created:
conda env list
To delete an environment:
conda env remove -n cs533
Environment Specification Files¶
You can also write a file that specifies your environment’s requirements and use it to create or update
environments. You can then commit this file to git
to include your project’s software requirements
along with the project itself.
These files are written in YAML, typically called environment.yml
, and look like this:
name: cs533
channels: defaults
dependencies:
- python=3.8
- pandas>=1.0
- seaborn=0.11
- notebook
- scikit-learn>=1.0
With such a file, you can create the environment with conda env create
:
conda env create -f environment.yml
conda env create
and conda create
have similar purposes, but take different arguments; I usually use
create
to create one-off ad-hoc environments, and env create
for working with environment files.
You can also update an environment to make sure it meets the current requirements in a file:
conda env update -f environment.yml
For a more advanced example, see my environment.yml
for the software we use in this
class; it’s the same environment file I use in my private repository for demo and solution
notebooks.
Alternative: Virtual Environments¶
Python provides its own environment system called virtual environments. They aren’t as flexible as Conda environments for data science, because they can only manage Python packages while Conda environments can manage any software installable with Conda (Python packages, R and its packages, arbitrary command-line utilities, etc.), but they are useful in a number of situations.
To create a Python virtual environment, use venv
:
python -m venv cs533-env
Once you have created the environment, which is just a directory, you can activate it. The syntax varies from platform to platform, but on Linux and macOS it is:
source cs533-env/bin/activate
On Windows PowerShell, it is:
& cs533-env\Scripts\activate.ps1
Once your environment is activated, you can install packages with pip
:
pip install pandas seaborn notebook scikit-learn
Requirements Files¶
pip
also supports a file-based package specification system in the form of requirements.txt
files.
These are simple text files that list package specifications, one per line, like so:
pandas>=1.0
seaborn
notebook
scikit-learn>=1.0
You can then install all packages, along with their dependencies, with:
pip install -r requirements-txt
Best Practices¶
All of my projects have an environment.yml
file (or occasionally requirements.txt
) checked in to
Git that I use to list the project’s software requirements. I also don’t generally install packages
directly with conda install
or pip install
; rather, I add them to environment.yml
and run
conda env update
. This ensures that environment.yml
contains all the packages I need.
From time to time, I also delete my environment and re-create it from environment.yml
, and make
sure the project still works. This provides a check that I haven’t accidentally introduced new
dependencies that I forgot about. For many of my projects, since I use dvc to automate the
pipeline, I can usually re-run the entire experiment end-to-end in the new environment.