Creating and Understanding Virtual Environments

Marius
Data Science in Business Blog
5 min read · Sep 3, 2023
A user with access to four different virtual environments.

Imagine you get onboarded to a data science project where you will work together with other colleagues. You will build models to predict machine failure in the context of predictive maintenance. If you had to write every single line of code yourself, it would probably take a very long time to complete the project. But millions of other people write code, and many have made theirs publicly available as software packages for tasks that come up in any data science project, like visualizing data, manipulating tables, or evaluating models.

Before we get into any of those tasks, you need to make sure that everyone on the team has access to the same tools. You can imagine this set of tools as a stack on top of which you write your code. This shared stack on which your team will build is called a virtual environment. It allows you to reuse functionality that has already been built for recurring needs, and its goal is to provide everyone on your team with the same set of tools on top of which you will solve the data science project.

The beauty of virtual environments is that you can create as many of them as you like. In practice, it is common to have a separate environment for every project, which is quite natural since each project requires a different set of tools (stack). I’ll first show you step by step how to create your virtual environment (practice) before explaining what happens under the hood (theory).

Practice | Creating your virtual environment

  1. Create the configuration file

In the root directory of your data science project, you create a new file using your favorite text editor, most likely Visual Studio Code or PyCharm if you work as a data scientist. You can imagine the configuration file as a set of instructions for what the environment should look like, i.e. the name of the environment and which packages should be part of it. By convention, this configuration file is called environment.yml: it has information on your environment and is written in YAML (“YAML Ain’t Markup Language”, .yml), a format that is easy for humans to read. Here’s an example:

name: predictive-maintenance
channels:
- conda-forge
dependencies:
- polars
- scikit-learn

Let’s break it down step-by-step:

name: predictive-maintenance

The first line specifies the name of your virtual environment. The name should be a succinct keyword description of your project. You will use it every time you start to work in your virtual environment (“activate the environment”).

channels:
- conda-forge

Packages are distributed via channels, and here you list all channels that should be searched. conda-forge is the most popular channel for data science projects; almost all relevant packages can be found there.
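If you need packages from more than one channel, you simply list them all; conda searches them in the order given, so put the preferred channel first. A short sketch (the second channel below is only for illustration):

channels:
- conda-forge
- defaults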

dependencies:
- polars
- scikit-learn

Each package that you want to use is listed under dependencies. We install two of the most well-known data science libraries for manipulating data (polars) and modeling (scikit-learn).

It might be the case that you need functionality that has only recently been added to a package, or that only exists in one specific version of a package. Using “>=x” you can specify that you want a version that is at least as recent as x.

polars>=0.18.0

Using “=x” you can specify that you want exactly version x (strictly speaking, conda treats a single “=” as a prefix match, so “polars=0.18” would also match 0.18.1; use “==x” if you need an exact match).

polars=0.18.0
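Inside environment.yml, these constraints go directly on the dependency lines. Here is a short sketch with purely illustrative pins:

name: predictive-maintenance
channels:
- conda-forge
dependencies:
- polars>=0.18.0
- scikit-learn=1.3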

2. Create the environment

Now we can start to create the environment. I recommend installing mamba, a faster reimplementation of conda in C++. Open a terminal in the directory where your environment.yml file is stored and run

mamba env create

This command will

  • Look for the packages in the channels
  • Solve the dependencies (more in the theory section below)
  • Download and install the specific versions of the packages

The environment creation process can take a couple of minutes, depending on the number of packages you require, the complexity introduced via dependencies, and the download size of each package.
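By default, conda and mamba look for a file named environment.yml in the directory where you run the command. If your configuration file has a different name or lives somewhere else, you can point to it explicitly (the path below is just an example):

mamba env create --file path/to/environment.yml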

3. Use your environment

You can easily use (“activate”) your environment every time you work on your project:

conda activate predictive-maintenance

Once you have activated your environment, you have access to all functionality from the installed packages and are ready to move on to the data science part of your work!
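A quick sanity check that the right environment is active and the packages were installed is to query one of them, for example polars:

mamba list polars                                      # show the installed polars version
python -c "import polars; print(polars.__version__)"  # confirm the package can be imported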

Shortcuts Cheat Sheet

Below are some commands I regularly need as a data scientist:

mamba env create                # create a virtual environment from environment.yml
conda activate <env>            # activate environment, replace <env> with the name of your environment
conda deactivate                # deactivate the currently active environment
mamba list                      # list installed packages with their versions
mamba env update                # update installed versions of packages
mamba env remove --name <env>   # delete environment, replace <env> with the name of your environment
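One more command worth knowing: you can export the versions that were actually installed back into a YAML file, which makes it easy to share a reproducible environment with your team (the file name below is just an example; --no-builds drops the build strings so the file stays more portable):

mamba env export --no-builds > environment.lock.yml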

Theory | Solving dependencies

By now you know how to create your virtual environment in practice, but let’s have a look at what happens under the hood. This is not only nice to know, but also helpful when you encounter issues, for example a “dependencies could not be resolved” error during the environment creation process.

Each Python package available on a channel has a version. You can, for example, browse the different versions of a package like polars on the Anaconda website. When you develop a package, you will probably want to reuse functionality from other libraries such as numpy. This introduces a dependency for your package. You can list the dependencies of a package like scikit-learn by running

conda search scikit-learn --info

which yields (among other information) the following output:

dependencies: 
- blas * openblas
- joblib >=1.1.1
- libcxx >=14.0.6
- llvm-openmp >=14.0.6
- numpy >=1.21.5,<2.0a0
- python >=3.9,<3.10.0a0
- scipy >=1.5.0
- threadpoolctl >=2.0.0

For example, scikit-learn requires one out of a range of numpy versions to be installed.

Now, back to the process of creating your environment: when you specify a list of packages in the environment.yml file, you essentially define a dependency graph. For example, scikit-learn depends on numpy, numpy depends on Python, and so on. The core of creating an environment is to find a combination of package versions that

  1. satisfy all required dependencies and
  2. are as recent as possible

Conda achieves this by converting your list of desired packages into a Boolean satisfiability (SAT) problem, a class of problems that is in general very hard to solve (NP-complete). A SAT problem asks for an assignment of the Boolean variables x_i (True/False) that satisfies a logical formula, for example:

(x_1 OR x_2) AND (x_1 OR NOT x_2)

which is satisfied if x_1=True, no matter the assignment of x_2. Back to the process of solving an environment: conda translates the list of required packages into a SAT problem that encodes the two goals from above, and its solution indicates which versions of the packages will be installed.
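To make this concrete, here is a sketch of how a single dependency could be encoded (package names and versions are purely illustrative): introduce one Boolean variable per candidate package version; a constraint like “scikit-learn 1.3 requires numpy>=1.21” then becomes a clause stating that if scikit-learn 1.3 is selected, at least one sufficiently recent numpy version must be selected as well:

(NOT scikit_learn_1_3) OR numpy_1_21 OR numpy_1_22 OR numpy_1_23

Additional clauses ensure that at most one version of each package is chosen, and the solver looks for an assignment that satisfies all clauses while preferring recent versions.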

If you are interested in more detail on what happens when you run mamba env create, I recommend reading Anaconda’s article on Understanding and Improving Conda’s Performance.

Conclusion

Virtual environments form the foundation for managing packages in data science projects. They are essential for providing everyone on a team with the same tools, and the configuration can be easily shared as a YAML file.

If you have any thoughts on how you use virtual environments in your projects, please leave a comment!
