How to make your python environment reproducible (common practices)— Conda environment series[Part-3]

Mostafa Farrag
Published in Hydroinformatics
9 min read, Dec 27, 2022

Make your Python environment reproducible with common practices for creating, updating, and locking it from a YAML file.

This article is part of a series. It started with “Getting Started with Conda environment [Part-1]”, which covers the first steps of using conda to create a Python environment, so anyone who is just getting started with conda should read that article first. The second article, “One step further to improve the Conda environment — Numpy implementation [Part-2]”, shows how to improve the performance of a conda environment and how the BLAS implementation of NumPy changes the size of the environment.

In this tutorial, we take the next step and maintain our environment with a YAML file that keeps track of which packages are installed in the environment and in which versions.

Using YAML files makes your environment easily reproducible, and is a very professional way to share your data science notebook with others.

[Figure: tools for making a Python environment reproducible]

The article is organized as follows:

  • Create an environment
  • Environment.yml
  • Update the environment from environment.yml
  • Conda environment revisions
  • Mamba package manager
  • Create the environment.yml file
  • Lock your environment
  • Recap

So let’s start

Create an environment

  • Create a new environment named hapi with Python 3.9:
conda create -n hapi python=3.9
  • Conda lists the default packages that will be installed; you may need to press “y” to confirm.
[Figure: default packages installed with Python in a conda environment]
  • Now activate the hapi environment:
conda activate hapi
[Figure: activated conda environment]
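To confirm from inside Python that the activation worked, a minimal sketch like the following can help. It assumes that `conda activate` has set the `CONDA_DEFAULT_ENV` variable in the current shell; the helper name `active_conda_env` is hypothetical and only used for illustration.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None.

    Assumption: `conda activate` exports CONDA_DEFAULT_ENV in the shell
    that started this Python process.
    """
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


# sys.prefix points at the directory of the interpreter that is running.
print("interpreter prefix:", sys.prefix)
print("active conda env:", active_conda_env())
```

If the second line does not print hapi, the interpreter was probably started from a shell in which the environment was not activated.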

Environment.yml

First, we need to know what a YAML file is.

  • A YAML file is a plain-text file used to specify configuration information or data. It is most commonly used for configuration files, but it suits any plain-text file that needs to describe data in a structured way.
[Figure: yaml files]
  • YAML files are designed to be easy to read and write. Like Python, they use indentation to express the structure of the data, and they represent data elements as key-value pairs. Here is an example of a simple YAML file:
name: John
age: 30
location: New York
  • In our context, a YAML file is used as the configuration that specifies the packages we want to install.
  • The content of the file looks as follows:
channels:
  - conda-forge
dependencies:
  - python >=3.9,<3.11
  - numpy >=1.23.5
  - pip
  - pyramids >=0.2.10
  - geostatista
  - digitalearth >=0.1.9
  - pip:
    - cleopatra
  • We will not go deep into YAML syntax; the only things we need here are the dictionary (a set of key: value pairs) and the list (items that start with “-”). Everything that belongs to a key is indented beneath it. In the file above there are two top-level keys, channels and dependencies, and each one holds a list.
  • Under channels, list all the conda channels you want to download packages from, for example conda-forge.
  • Under dependencies, list all the packages you want to install with conda (from the channels listed above). You can specify a version constraint (as for the pyramids and digitalearth packages) or leave it out (as for the pip and geostatista packages).
  • Specifying a version constrains conda while it resolves conflicts between packages and their dependencies. Constraints that are too tight, or that conflict with each other, can make the solve take longer or fail.
  • To install a package with pip inside the environment, you cannot define pip as a separate top-level key. Instead, add a pip entry to the dependencies list and indent the pip packages beneath it as a nested list.
dependencies:
  -...
  ...
  - pip:
    - cleopatra
    - pytest >=7.1.3
  • To install a package directly from a GitHub repository, you can also list it under pip: take the https link to the git repository and prepend “git+” to it.
dependencies:
  -...
  ...
  - pip:
    - git+https://github.com/Serapieum-of-alex/cleopatra.git
  • Now we have a YAML file with all the packages that we need to install in our environment.
  • For more information about the YAML file syntax, you can check this website [Link]
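The structure described above (a channels list, a dependencies list, and a nested pip list) can also be generated from Python, which makes the required indentation explicit. This is only a sketch: `render_environment` is a hypothetical helper written for this article, not part of conda, and it builds the text by hand rather than using a YAML library.

```python
def render_environment(channels, conda_deps, pip_deps=(), name=None):
    """Render a conda environment specification as YAML text."""
    lines = []
    if name is not None:
        lines.append(f"name: {name}")
    lines.append("channels:")
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in conda_deps]
    if pip_deps:
        # pip packages form a nested list under a "pip" key that is
        # itself one item of the dependencies list.
        lines.append("  - pip:")
        lines += [f"    - {dep}" for dep in pip_deps]
    return "\n".join(lines) + "\n"


text = render_environment(
    channels=["conda-forge"],
    conda_deps=["python >=3.9,<3.11", "numpy >=1.23.5", "pip"],
    pip_deps=["cleopatra"],
)
print(text)
```

Writing `text` to a file named environment.yml gives a file with the same layout as the one shown earlier, limited to the packages passed in.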

Update the environment from environment.yml

  • So far we have an environment that contains only Python (and its default packages).
  • I have now navigated to the directory that contains the environment.yml file shown above.
[Figure: current working directory]
  • To check the content of the environment.yml file, you can use the cat command in Linux.
[Figure: content of the environment.yml file]

Conda

[Figure: conda package manager]
  • Now use the following command to update the “hapi” environment from the environment.yml file:
conda env update -n hapi --file environment.yml
  • Conda now tries to find versions that satisfy the constraints of all the packages and their dependencies. This step sometimes takes a couple of minutes and sometimes longer; specifying a version for each package narrows the search and can shorten it.
  • Once the list of packages to be downloaded is displayed in your terminal, conda has managed to solve the dependencies of all packages, and all that remains is to download and install them.
[Figure: updating the environment using conda and the environment.yml file]
  • We have now successfully updated our environment from the environment.yml file. When the list of packages is long, however, this update step can take a long time.
  • Before we try an alternative package manager that addresses this waiting time, we need to return the environment to the state in which only Python (with its default packages) was installed.
  • One way to do this is to uninstall the packages listed in the environment.yml file, but that is not practical when the list is long.
  • Another way is to use the revisions feature of conda environments.
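To get a feel for what “finding versions that satisfy the constraints” means, here is a toy check for a single package. It is a simplified illustration, not conda's actual matching logic: it assumes purely numeric dotted versions and only the operators >=, <=, ==, > and <, and the function name `satisfies` is hypothetical.

```python
import operator

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def _parse(version, width=4):
    """Turn '3.9' into (3, 9, 0, 0) so versions compare as tuples."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, spec):
    """Return True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                target = clause[len(symbol):].strip()
                if not _OPS[symbol](_parse(version), _parse(target)):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


candidates = ["3.8.15", "3.9.15", "3.10.8", "3.11.0"]
print([v for v in candidates if satisfies(v, ">=3.9,<3.11")])
# → ['3.9.15', '3.10.8']
```

A real solver has to do this for every package and every dependency at the same time, which is why the step can be slow for long package lists.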

Conda environment revisions

  • To check the environment revisions, activate the hapi environment again, and use the following command.
conda list --revisions
[Figure: revision 0 of our hapi environment in conda]
  • The revision created when we updated the environment from the environment.yml file contains far more packages than you see in this “rev 0”.
  • Now we want to restore rev 0.
conda install --revision 0
  • The command first displays the packages that were installed after this revision and will be removed to return the environment to its earlier state.
[Figure: packages installed in rev 1 that will be removed to restore rev 0]
  • It then displays all the packages that will be removed (the packages from the environment.yml file and their dependencies).
[Figure: list of packages to be removed]
  • To make sure that the packages from the environment.yml file are no longer in the environment, list the installed packages:
conda list
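For a long package list, checking the output by eye is tedious. The sketch below assumes you have saved the output of `conda list --json` (a JSON list of objects that each carry a "name" field) and reports which packages from the environment.yml file are still present. The helper name `leftover_packages` and the sample data are made up for illustration.

```python
import json


def leftover_packages(conda_list_json, expected_absent):
    """Return the packages in `expected_absent` that are still installed."""
    installed = {pkg["name"] for pkg in json.loads(conda_list_json)}
    return sorted(installed & set(expected_absent))


# Hypothetical sample of what `conda list --json` might contain at rev 0.
sample = json.dumps([
    {"name": "python", "version": "3.9.15"},
    {"name": "pip", "version": "22.3.1"},
])
print(leftover_packages(sample, ["numpy", "pyramids", "digitalearth"]))
# → []
```

An empty list means none of the named packages remain in the environment.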

Mamba package manager

  • Now we are back to the main problem: conda can take a long time to solve the dependencies between packages.
  • Luckily, there is another package manager that works with conda environments and can save us a long wait. Mamba is a package manager that solves package dependencies and is usually faster at it (for more about mamba, check their website [here]).
[Figure: Mamba package manager]
  • To use mamba, we first have to install it in our base environment, for example with: conda install -n base -c conda-forge mamba
[Figure: installing mamba in the base environment]
  • We will use the same env update command, but with mamba instead of conda:
mamba env update -n hapi --file environment.yml
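If you script the update step, you can fall back to conda when mamba is not installed. The following is a minimal sketch that only builds the command line; `update_command` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out when testing.

```python
import shutil


def update_command(env_name, env_file="environment.yml", which=shutil.which):
    """Build the env update command, preferring mamba when it is on PATH."""
    manager = "mamba" if which("mamba") else "conda"
    return [manager, "env", "update", "-n", env_name, "--file", env_file]


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(update_command("hapi")))
```

The resulting list can be passed to `subprocess.run` if you want Python to execute the update for you.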

Create the environment.yml file

  • So far we have created an environment from scratch and updated it from an environment.yml file. But what about an environment that we have been using for a long time? After installing and uninstalling many packages, writing the yml file by hand from the output of conda list would be a nightmare.
  • Instead, you can use the “env export” command to export all the packages to a yml file.
conda env export > export-environment.yml
  • If you check the generated export-environment.yml, you will find that it is very different from the original environment.yml file that we used to update the environment. You might expect the two to be similar (apart from the Python version fixed by the create command), but the exported file lists every installed package, including all the dependencies of our packages, each pinned to an exact version and build. It also ends with a prefix line that records the environment’s location on this machine. Here are just a few lines from the top of the export-environment.yml file, together with its last line:
name: hapi
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_kmp_llvm
- affine=2.3.1=pyhd8ed1ab_0
- alsa-lib=1.2.8=h166bdaf_0
- attr=2.5.1=h166bdaf_1
- attrs=22.2.0=pyh71513ae_0
- blosc=1.21.3=hafa529b_0
- boost-cpp=1.78.0=h75c5d50_1
- branca=0.6.0=pyhd8ed1ab_0
- brotli=1.0.9=h166bdaf_8
- brotli-bin=1.0.9=h166bdaf_8
- brotlipy=0.7.0=py39hb9d737c_1005
prefix: /mnt/c/MyComputer/miniconda3/envs/hapi
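The build strings (the third field, such as h166bdaf_0) and the prefix line are specific to the platform and machine that produced the export. One option, sketched below under the assumption that conda entries have the form name=version=build, is to strip both before sharing the file. The function `make_portable` is a hypothetical helper, not a conda feature.

```python
import re

# Matches conda entries such as "- affine=2.3.1=pyhd8ed1ab_0".
# pip entries ("- pkg==1.0") do not match and are left untouched.
_PINNED = re.compile(r"^(\s*-\s+)([^=\s]+)=([^=\s]+)=([^=\s]+)\s*$")


def make_portable(exported_text):
    """Drop the prefix line and the build strings from an exported file."""
    kept = []
    for line in exported_text.splitlines():
        if line.startswith("prefix:"):
            continue  # machine-specific install location
        match = _PINNED.match(line)
        if match:
            indent, name, version, _build = match.groups()
            line = f"{indent}{name}={version}"
        kept.append(line)
    return "\n".join(kept) + "\n"


exported = (
    "name: hapi\n"
    "dependencies:\n"
    "- affine=2.3.1=pyhd8ed1ab_0\n"
    "- attrs=22.2.0=pyh71513ae_0\n"
    "prefix: /mnt/c/MyComputer/miniconda3/envs/hapi\n"
)
print(make_portable(exported))
```

The result still pins every package version, so it stays far longer than the hand-written environment.yml, but it no longer depends on one machine's builds and paths.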

Lock your environment

  • Another way to make your environment reproducible is to use lock files.
  • A lock file contains the list of dependencies and their exact version numbers for a specific environment. It ensures that the same environment can be recreated in the future, even after newer versions of the dependencies have been released.
[Figure: locking a Python environment]
  • To create a lock file for an existing conda environment, first generate a .yml file as described previously, then use the conda-lock tool to create the lock file from it.
  • Install the conda-lock package in your base environment:
conda install -c conda-forge conda-lock
  • Create the lock file from the environment.yml file in the current directory (by default, conda-lock reads environment.yml and writes conda-lock.yml):
conda lock
  • The generated lock file will look as follows (this is not the whole file, just the top).
# This lock file was generated by conda-lock (https://github.com/conda-incubator/conda-lock). DO NOT EDIT!
#
# A “lock file” contains a concrete list of package versions (with checksums) to be installed. Unlike
# e.g. `conda env create`, the resulting environment will not change as new package versions become
# available, unless you explicitly update the lock file.
# Install this environment as "YOURENV" with:
# conda-lock install -n YOURENV --file conda-lock.yml
# To update a single package to the latest version compatible with the version constraints in the source:
# conda-lock lock --lockfile conda-lock.yml --update PACKAGE
# To re-solve the entire environment, e.g. after changing a version constraint in the source file:
# conda-lock -f environment.yml -f C:\MyComputer\01Algorithms\Hydrology\Hapi\environment.yml -f C:\gdrive\01Algorithms\Hydrology\Hapi\environment.yml --lockfile conda-lock.yml
metadata:
  channels:
  - url: conda-forge
    used_env_vars: []
  content_hash:
    linux-64: d4b46d10f102c3e9a7033c8db747d07f4a0ee3426bec22abfc6d19e9199d4d63
    osx-64: 4a9aa6665c96f4ed9c279ec5ade21be588522b9cbca202624ff236b934b45452
    win-64: f818812b5bfc669eec625ff88f57dfecb70e3164cd6fddf4b5c5412081737616
  platforms:
  - linux-64
  - osx-64
  - win-64
  sources:
  - environment.yml
  - C:\MyComputer\01Algorithms\Hydrology\Hapi\environment.yml
  - C:\gdrive\01Algorithms\Hydrology\Hapi\environment.yml
[Figure: locking your environment using conda lock]
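The header of the lock file mentions that package versions are stored with checksums. To illustrate the idea, the sketch below checks a file against an expected SHA-256 digest. It is a generic example of checksum verification, not the code conda-lock itself runs, and `sha256_matches` is a hypothetical helper name.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_matches(path, expected_hex):
    """Return True if the file at `path` has the expected SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large package archives do not fill memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


# Demonstration with a throwaway file standing in for a downloaded package.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "package.bin"
    package.write_bytes(b"example package contents")
    expected = hashlib.sha256(b"example package contents").hexdigest()
    print(sha256_matches(package, expected))
```

A checksum mismatch means the downloaded file is not the one that was recorded when the environment was locked.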

Recap

  • We have created an environment from scratch, updated it from the environment.yml file, restored an earlier revision, exported it, and locked it with conda-lock to produce a lock file. We have also used mamba instead of conda to update the environment, in case you are not satisfied with conda’s performance.
