Running Python/R with Docker vs. Virtual Environment

Rami Krispin
4 min read · Jun 5, 2023

Someone asked me last week about the benefit of developing Python in a Dockerized environment vs. using a virtual environment (VE):

Screenshot from social media

The short answer is the level of reproducibility: Docker provides a higher level of reproducibility than a VE.

Whether you are a software engineer, data engineer, or data scientist, reproducibility is critical when you develop an application and deploy it to a different environment (e.g., a server or the cloud) or work in a collaborative environment.

What is reproducibility?

Reproducibility is the ability to generate exactly the same outcome when running the same code, regardless of the user or machine on which the code is running.

I first encountered the term reproducibility during my bachelor’s degree, where I learned that reproducibility starts and ends with setting a seed number to lock down random numbers. My favorite seed number is 12345.
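As a minimal sketch of what setting a seed does, the snippet below uses Python's standard `random` module. The helper name `draw_numbers` is made up for illustration; the point is only that the same seed yields the same sequence on repeated runs.

```python
import random

SEED = 12345  # the seed number mentioned above; any fixed integer works


def draw_numbers(seed: int, n: int = 5) -> list[float]:
    """Return n pseudo-random floats from a generator initialised with seed."""
    rng = random.Random(seed)  # a local generator avoids touching global state
    return [rng.random() for _ in range(n)]


first_run = draw_numbers(SEED)
second_run = draw_numbers(SEED)
other_seed = draw_numbers(SEED + 1)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```

Using a dedicated `random.Random` instance rather than the module-level `random.seed` is a design choice here: it keeps the seeded stream isolated from any other code that draws random numbers.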

When I started to work as a data scientist, I realized that reproducibility goes beyond setting a seed number. Here are the main elements that can impact code reproducibility:

  • Version control: First and foremost, reproducing the same results starts with the ability to track changes in your code.
  • Randomization: Controlling the random generation of numbers by setting the seed number.
  • Software version: The versions of your Python or R (or any other programming language) and its dependencies (e.g., libraries) impact the outcome of your code. For example, code that was built with pandas v1.0 may not run on v2.0 because deprecated functions were removed.
  • Operating System (OS): Most programming languages, particularly R and Python, rely on compilers (e.g., for C and C++) and other built-in OS components that differ between systems. The type of OS and its version could impact the outcome of your code.
  • Hardware: Last but not least, the type of hardware (or infrastructure) could impact your results (ARM/Intel/Apple processor, etc.).
What could go wrong during the deployment…

The first item above, version control, is handled by Git (and GitHub, GitLab, Bitbucket, etc.). Randomization, in most cases, can be set by using a seed number (with some edge cases that might be related to OS type). Docker and VE tools provide solutions for controlling package versions. In addition, Docker solves potential OS-related issues and supports different hardware configurations.
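One way to make the software, OS, and hardware elements visible is to record them next to your results. The sketch below is an illustration only: the function name `environment_fingerprint` and the package names passed to it are assumptions, while the calls to `platform` and `importlib.metadata` come from the Python standard library.

```python
import json
import platform
from importlib import metadata


def environment_fingerprint(packages: list[str]) -> dict:
    """Collect interpreter, OS, hardware, and package version details."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": versions,
    }


# The package list is an example; replace it with your own dependencies.
fingerprint = environment_fingerprint(["pandas", "numpy"])
print(json.dumps(fingerprint, indent=2))
```

Running this on a laptop and again on the deployment target gives two snapshots that can be compared side by side.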

When even a simple application fails during deployment, the cause is in most cases a mismatch of settings between the development and deployment (or target) environments.
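To make the idea of a mismatch concrete, here is a small sketch that compares two environment snapshots and reports every setting that differs. Both snapshots are hypothetical values invented for this example, and `diff_environments` is an illustrative helper, not part of any tool mentioned in this post.

```python
def diff_environments(dev: dict, target: dict) -> dict:
    """Return {key: (dev_value, target_value)} for every setting that differs."""
    keys = sorted(set(dev) | set(target))
    return {
        key: (dev.get(key), target.get(key))
        for key in keys
        if dev.get(key) != target.get(key)
    }


# Hypothetical snapshots, made up for illustration only.
dev_env = {"python": "3.10.11", "os": "Darwin", "machine": "arm64", "pandas": "2.0.1"}
target_env = {"python": "3.10.11", "os": "Linux", "machine": "x86_64", "pandas": "1.5.3"}

mismatches = diff_environments(dev_env, target_env)
for setting, (dev_value, target_value) in mismatches.items():
    print(f"{setting}: dev={dev_value} target={target_value}")
```

In this made-up case the Python version matches, but the OS, the processor architecture, and the pandas version do not, and any of the three could change how the code behaves.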

Docker vs. VE (or together)

VE is a great tool for setting the environment’s dependencies (e.g., library versions) and sharing it with other users. On the other hand, VE by itself cannot prevent potential OS dependency issues. Code that was developed and tested in a controlled environment on Windows or macOS will not necessarily run on Ubuntu (and the other way around). This is where Docker shines, enabling you to develop, test, and deploy your code with the exact same OS and dependencies, ensuring high reproducibility.

Deploying code with a virtual environment

Like anything in life, there is no free lunch: Docker requires more effort and has a steeper learning curve and more complexity than a typical VE. If you are developing and running your code locally or sharing it with other users with a similar OS (or don’t have experience with Docker), using VE might be a better choice. On the other hand, I highly recommend learning Docker if you are deploying code to a remote environment that supports containers. The long-term benefits outweigh the learning costs and the time spent debugging missing dependencies during deployment (from painful personal experience).

Last but not least, VE could be a great supplement to Docker for setting the Python or R environment during the build time of the image, with venv (or conda) for Python or renv for R. I typically use conda to set my Python environment inside Docker.
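For reference, creating a virtual environment is a single call in Python's standard `venv` module, and the same step could run during an image build. The sketch below is an assumption-laden illustration: the helper `create_virtual_env` and the `demo-env` directory name are hypothetical, and pip is deliberately left out so that the example runs offline.

```python
import tempfile
import venv
from pathlib import Path


def create_virtual_env(env_dir: Path) -> Path:
    """Create a bare virtual environment and return the path to its config file."""
    # with_pip=False keeps the sketch fast and offline; a real build would
    # normally enable pip and then install pinned requirements.
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


# A temporary directory keeps the demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    config_file = create_virtual_env(Path(tmp) / "demo-env")
    config_exists = config_file.exists()
    config_text = config_file.read_text()

print(config_exists)
print(config_text)
```

The `pyvenv.cfg` file records which base interpreter the environment points to. The environment still borrows that interpreter and the OS libraries from the host, which is the gap a Docker image closes.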

Summary

Docker and VE both provide tools for setting up a reproducible environment. The main advantage of Docker over VE is that it solves potential operating system issues and conflicts between local and remote environments. That comes at the cost of a steeper learning curve and more complexity.

Rami Krispin

Senior Manager Data Science and Engineering | Time series and forecasting | MLops | Open source | Author | https://linktr.ee/ramikrispin