The Case for Switching From Conda to Virtual Environments in Python

Picture Credit: https://www.xkcd.com/1987/

Written by Michael Gomes Vieira, Associate Data Scientist

Recently our Applied Data Science and Machine Learning team at Sainsbury’s has undergone a massive transformation in our ways of working. One of the transformations we made was moving away from the Anaconda Python Distribution towards managing our Python environments ourselves with the use of Virtual Environments. In order to discuss why we made these changes, we will first introduce Python Virtual Environments and explain what they do.

What is a Virtual Environment?

A Python Virtual Environment allows you to create an isolated Python installation and install packages onto it, which is critical for ensuring the reproducibility of your code. Virtual Environments let you install a package, or a specific version of a package, in one environment without worrying about how it will affect your global installation of Python or your other Virtual Environments.

The best example of this is to imagine two projects that you’re working on: project_a and project_b. Both projects use the Python package random-py-package; however, project_a only runs with version 1.2.7, while project_b needs some functionality that is only available in version 2.1.1 and above.

Virtual Environments allow us to have both versions installed at once in different places. I could set up a Virtual Environment for project_a with version 1.2.7 of random-py-package and only activate it when I want to work on, or run, project_a. I could then have version 2.1.1 of random-py-package installed globally, or even better, within a separate Virtual Environment for project_b. This allows both versions to live together on the same computer without interfering with each other.
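The setup above can be sketched from the command line. Note that random-py-package is the hypothetical package from the example, so these exact installs won’t resolve against PyPI, and the environment names are purely illustrative:

```shell
# One isolated environment per project, each pinning its own version.
# random-py-package is the hypothetical package from the example above.
python3 -m venv project_a_env
source project_a_env/bin/activate    # on Windows: project_a_env\Scripts\activate
pip install random-py-package==1.2.7
deactivate

python3 -m venv project_b_env
source project_b_env/bin/activate
pip install "random-py-package>=2.1.1"
deactivate
```

Activating an environment simply puts that environment’s own python and pip first on your PATH, so nothing installed in one project’s environment can leak into the other’s.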

Why we came to adopt Virtual Environments

When doing any Data Science, reproducibility is key. We train our algorithms to predict, classify and automate lots of low-level decisions. To ensure that our algorithms are behaving sensibly, we need to be able to understand and explain why they produced any specific result.

We need to minimise the ways in which reproducibility can be compromised. One such way is a package update in which a change to some function alters the output of an algorithm.

Version 0.21.0 of scikit-learn, for instance, may have introduced changes since version 0.20.0 such that the same data and parameters produce a different model. Previously we were using Anaconda to manage our environments and pin these versions.

One of Anaconda’s most important features is the ability to replicate environments in which your models run. Anaconda also comes with a Python interpreter and most of the packages that you would ever need to train a machine learning model and do some Data Science straight out of the box.

Recently we reorganised our team to include Python Engineers, who are integral to the productionisation of our algorithms. After joining the Applied Data Science & Machine Learning team, they often raised the point that Anaconda adds an extra layer of obscurity to our algorithms, and that they preferred using Virtual Environments. Commands in Anaconda are also distinct from Virtual Environment commands. To export the list of packages that you are using in your environment, you would run

conda list -e > requirements.txt

in Anaconda, but to do this in a Virtual Environment you would run

pip freeze > requirements.txt
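With a Virtual Environment, anyone can then rebuild your setup from the exported file. A minimal sketch of the round trip, assuming a requirements.txt exported as above (the environment name is illustrative):

```shell
# Recreate an environment from an exported requirements file
python3 -m venv project_env
source project_env/bin/activate
pip install -r requirements.txt
```

This is the workflow our Engineers expect: one plain-text file that any standard Python toolchain understands, with no Anaconda-specific format in between.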

We wanted to align our ways of working, so we changed them: the whole team now uses Python Virtual Environments.

Advantages of Virtual Environments over Anaconda

Since we made the jump from Anaconda to Virtual Environments, we have as a team increased our understanding of Python. Our algorithms are more versatile and much easier to productionise, and we do not feel that we have lost anything in moving away from Anaconda.

If you’re doing any Data Science you can just download Anaconda, and the chances are that any packages you need are already installed and ready to be imported. It also includes commonly used utilities like IPython and Jupyter Notebooks. These features make Anaconda a “batteries included” Python install.

However, with all of these batteries comes a lot of bloat. At the time of writing, the smallest Anaconda distribution available on the Anaconda website is 530MB. It includes over 1500 packages for Data Science, many of which most of us will likely never use, and installing it can take well over 10 minutes.

Most Data Science projects use a core set of packages such as Pandas, Scikit-Learn and Matplotlib, as well as Jupyter Notebooks, all of which can easily be pip installed into a Virtual Environment from the command line, as can most packages whose necessity is highly project specific.
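That core setup can be sketched in a few commands, assuming a fresh environment (the environment name is illustrative):

```shell
# A lean, project-specific alternative to the full Anaconda distribution
python3 -m venv ds_env
source ds_env/bin/activate
pip install pandas scikit-learn matplotlib jupyter
```

You download only the handful of packages your project actually needs, rather than the 1500 that ship with the distribution.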

Anaconda loyalists amongst you might be up in arms arguing that Anaconda also manages package dependencies for you, or wondering how you’re going to use Jupyter Notebooks without Anaconda. Sure, Anaconda does manage package dependencies well. However, because of the large number of packages in the distribution, Anaconda often lags behind the latest version of a package, since updating it would break a dependency of another package that you are probably not even using.

If you need the most recent update to a Python package, you will have to wait until Anaconda is updated to work with it. Or you could pip install it as you would in a Virtual Environment, but doing this on top of conda leads to more convoluted package management.

We can also use Jupyter Notebooks with our Virtual Environments in the same way that we would with Anaconda. Installing it is as easy as entering

pip install jupyter

into the command line with your Virtual Environment activated. To use an IPython kernel that is associated with a Virtual Environment, you can then do `pip install ipykernel` followed by:

ipython kernel install --user --name=projectname

More information can be found about using the packages in your Virtual Environment here: https://anbasile.github.io/programming/2017/06/25/jupyter-venv/.

In summary, the switch from Anaconda to Virtual Environments has been a positive one: our algorithms are easier to productionise, our ways of working are aligned with those of our Engineers, and we do not feel that we have lost anything in the move.

Sainsbury’s Data & Analytics

Insights, opinions, tutorials and much more from Data Science, Analytics and Reporting at Sainsbury’s