Save the Environment with Anaconda

How to create, save, and share portable Python environments for reproducible data science.

barrysmyth
Nov 10, 2017 · 4 min read

TLDR;

$ mkdir myproject # create project folder
$ cd myproject

$ conda create --name myproject # create new env.
$ source activate myproject # activate environment
[myproject] $ conda install <package-name> # install package(s)
:
[myproject] $ conda env export > environment.yaml # save env spec.

Introduction

Avoid the temptation to pip install packages into your default Python environment. It will lead to an overly complicated, bloated, and difficult-to-maintain setup, and it will play havoc with the reproducibility of your work.

A better way, the correct way, is to create a project-specific environment and store it as part of your project. Not only does this help to isolate the dependencies of your projects from other work, it also makes it much more straightforward for others to reproduce your local project environment.

These days, creating and managing Python environments is easy. One way is to use Python’s virtualenv. Another is to use the more recent conda, part of the Anaconda data-science platform. Unlike virtualenv, conda is a general-purpose package management system, not just a Python one. Thus, conda is designed to manage packages and dependencies within any software stack. This makes it particularly useful for managing complex data science projects, which may contain a mixture of systems and languages, code and data. In this post we will describe the basics of how to use conda to create, store, and share new project environments.

New Project, New Directory

Every new project, no matter how small, should have its own folder. Even projects that start out small have a tendency to gain a life of their own and can quickly balloon. Extracting a growing project into a home of its own later is an unnecessary burden, so it is always best to give it one from the start.

$ mkdir dsp

You should also establish a sensible folder structure for your project, creating separate folders for data, notebooks, source code, tests, documentation etc. Cookiecutter is the perfect tool for this, making it easy to create arbitrarily complex project structures with a single command, and providing a host of sensible templates out of the box. We will come back to the wonderful world of cookiecutter in a future post.
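Until then, a few lines of Python are enough for a basic skeleton. This is only a minimal sketch, not a substitute for cookiecutter: the folder names (data, notebooks, src, tests, docs) are assumptions based on the list above, and make_skeleton is a hypothetical helper name.

```python
from pathlib import Path

# Folder names are illustrative assumptions, not a standard layout.
SUBFOLDERS = ["data", "notebooks", "src", "tests", "docs"]


def make_skeleton(root):
    """Create the project root and its subfolders; return the folder paths."""
    root = Path(root)
    created = []
    for name in SUBFOLDERS:
        folder = root / name
        # parents=True creates the root too; exist_ok=True makes re-runs safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    for folder in make_skeleton("dsp"):
        print(folder)
```

Running it a second time is harmless, since existing folders are left alone.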

New Project, New Environment

Now that we have a project directory (and structure) we are ready to create the project’s environment. The command below does this, creating an environment with a separate copy of Python (3.6), and installing any necessary packages.

$ conda create --name dsp python=3.6

Deleting an environment is easy:

$ conda remove --name dsp --all

Activating/Deactivating the Environment

Next we need to activate the new environment with the source activate command as below.

$ source activate dsp

To leave the environment you can either source activate another or use source deactivate:

[dsp] $ source deactivate

Finally, you can get a list of the conda environments that have been created using:

$ conda env list
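It can also be handy to check, from inside Python, which environment a script or notebook is actually running in. The sketch below assumes that conda sets the CONDA_DEFAULT_ENV variable when an environment is activated; if the variable is missing, it simply reports that no environment was detected. The function name active_environment is hypothetical.

```python
import os
import sys


def active_environment():
    """Return (conda environment name or None, interpreter install prefix)."""
    # Assumption: conda exports CONDA_DEFAULT_ENV on activation.
    name = os.environ.get("CONDA_DEFAULT_ENV")
    # sys.prefix is the directory the running interpreter is installed in.
    return name, sys.prefix


if __name__ == "__main__":
    name, prefix = active_environment()
    print("conda environment:", name or "(none detected)")
    print("interpreter prefix:", prefix)
```

If the prefix does not point inside your environment's folder, the script is running under a different Python than you expect.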

Installing Packages

A key reason for creating an isolated project environment is to facilitate the installation of new packages without interfering with your system Python installation or any other project installation. To install new packages you can use conda install from within the activated environment:

$ source activate dsp
[dsp] $ conda install pandas

Note that not all packages are available directly through conda — you can use conda search <query> to search for those that are — but it is also possible to use Python’s usual pip install to add packages. To use pip you first need to install it into your new environment:

[dsp] $ conda install pip

At this stage there are two versions of pip: a globally accessible one that is associated with the system Python installation, and a newly installed one in the dsp environment. If a plain pip install <package name> resolves to the global pip, you will be installing into the global Python environment rather than your local environment. Using pip for local installs means running the version of pip that is associated with your new environment, which means something like the following:

[dsp] $ /anaconda/envs/dsp/bin/pip install <package name>

Note that Anaconda usually stores its environments somewhere like /anaconda/envs/<virtual env name> as above, but the location may differ from system to system.
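One way to avoid hard-coding that path is to run pip as a module of whichever Python interpreter is currently active (the python -m pip form). The sketch below only builds and prints such a command so that nothing gets installed by accident; pip_install_command is a hypothetical helper name.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    # sys.executable is the full path of the current Python binary, so the
    # pip that runs is the one installed alongside that interpreter.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    # Print rather than execute; pass the list to subprocess.check_call to run it.
    print(" ".join(pip_install_command("<package name>")))
```

Run from inside the activated dsp environment, the printed command should start with the Python binary that lives in that environment's folder.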

Saving the Environment

Now that you have created your new environment, activated it, and installed all the packages your project will need, it’s time to code, right? Wrong! Before you go anywhere it’s worth one final step: saving your environment specification to a file within your project folder. By doing this you are making your project portable and reusable, because this environment specification file can be used by you and others to recreate your exact environment if they wish to run or build on your project.

Use conda env export to export a YAML specification file containing the detailed dependencies of your environment, and save it in a file called environment.yaml.

[dsp] $ conda env export > environment.yaml
[dsp] $ more environment.yaml
name: dsp
channels:
- conda-forge
- anaconda-fusion
- defaults
dependencies:
- pandas=0.21.0=py36_0
: :

The beauty of this is that the environment.yaml file can be used elsewhere, on another machine for example, to recreate the environment; conda does all the heavy lifting of creating the environment and installing its package dependencies.

$ conda env create -f environment.yaml
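The dependency lines in the exported file pin each package as name=version=build, which is what makes the recreated environment match the original so closely. As a rough illustration of how to read one of those lines, here is a small sketch; parse_dependency is a hypothetical helper, and it handles only this simple three-part form.

```python
def parse_dependency(spec):
    """Split a 'name=version=build' entry into a dict; missing parts are None."""
    # Drop the leading YAML list dash and surrounding whitespace, if present.
    parts = spec.strip().lstrip("-").strip().split("=")
    # Pad so that entries without a version or build still unpack cleanly.
    parts += [None] * (3 - len(parts))
    name, version, build = parts[:3]
    return {"name": name, "version": version, "build": build}


if __name__ == "__main__":
    print(parse_dependency("- pandas=0.21.0=py36_0"))
    # prints {'name': 'pandas', 'version': '0.21.0', 'build': 'py36_0'}
```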

Conclusion

Now you have a portable Python environment that is dedicated to a specific project, fully specifies the dependencies of this project, and can be shared with, and used by, others to recreate your project’s environment in the future.

Remember to activate the environment when working on the project and deactivate it when finished. And if you need to install additional packages then you must also remember to regenerate the environment.yaml file to reflect these changes.

Data Science in Practice

Dedicated to practical aspects of Data Science with a particular emphasis on the Python Data Science Stack.

Written by barrysmyth, Professor of Computer Science at University College Dublin.
