Understanding the “Why” of VMs, Containers, & Virtual Environments

Mikiko Bazeley
Ml Ops by Mikiko Bazeley
May 20, 2022

A friendly guide to understanding & incorporating virtual environments, containers, & VMs into your data science projects.

Part of the “Software Fundamentals for Machine Learning Application Developers” Series.


Introduction

It is a truth universally acknowledged, that a data scientist in possession of a trained model, must be in want of a reliable means of productionization and deployment.

Deriving ROI from machine learning models is hard, especially if you believe the estimates that at least 80% of projects fail.

For modern-day da Vincis (data scientists, machine learning engineers, product managers, etc.), success is tied to producing viable assets in the form of reproducible and scalable code, marrying creative experimentation to structured processes and workflows ultimately designed to run bug-free.

But many of us data scientists and data folks come from a wide range of backgrounds. I myself studied anthropology and economics in college and didn’t write code until 2018, more than five years after graduating. (For more on how I became a data scientist and then eventually an MLOps Engineer, check out the links way at the bottom.)

In this blog post readers will learn how to:

  • Define and differentiate between: Virtual environments, containers, virtual machines, dependency management, isolation, reproducibility;
  • Understand some of the most popular tools and libraries like pip, venv, conda, Poetry, pipenv, Docker and the specific pain-points they attempt to solve;
  • Evaluate new tools within the context of an effective development workflow.

Although most of the concepts discussed here are language agnostic, this series is targeted at Python users and data science practitioners who:

  • Would like to implement software engineering best practices in their projects;
  • Lack foundational knowledge in productionizing data science models;
  • Want to start moving beyond Jupyter notebooks;
  • Want to become more effective at managing multiple projects and ensuring more stable & reproducible builds.

The ultimate goal of this series of tutorials is to help set up current (and aspiring) machine learning application developers and engineers for success from Day 1. This series is ideal if you are new to engineering, data science, or switching careers with a non-traditional background (much like me many years ago).


The Life of a Model (or a Code-Based Project)

Let’s illustrate some of the common pain points that arise in the course of developing and productionizing a model.

As a data scientist focused on producing forecasts for an e-commerce business (or even for your portfolio), you’d like to train a model and share the model code (and potentially the model itself) with both model consumers and collaborators, either through a version control platform like GitHub or a package repository like PyPI or Artifactory.

We’ll assume that the model will be written in Python and that you’d like to start analysis and model training locally.

The typical process for most data scientists is as follows:

  1. Perform exploratory data analysis (EDA) through either a notebook environment (like Jupyter notebooks) or IDE (such as VSCode or PyCharm) to extract and interpret trends and signals from the data for use as features;
  2. Train a model locally and experiment with various feature engineering techniques, different model architectures, and different hyperparameter values;
  3. Pickle or save off the model weights or model code;
  4. Package the model code with additional code (such as tests, config files, etc);
  5. Deploy the code (either as a library or a container) to a production server or service as part of a CI/CD (continuous integration — continuous deployment) process;
  6. Serve the model predictions, either via an API endpoint or as pre-computed values looked up in a table. (Models can also be deployed on edge devices, but this is more complicated and requires a specialized framework or SDK such as TensorFlow Lite, PyTorch Mobile, or Apple’s Core ML.)

“Aligning Data Science frameworks” | Image Created by: Mikiko Bazeley | Adapted from: Full Stack Deep Learning, Tutorialspoint: SDLC, Wikipedia: CRISP-DM

This process is not linear as data science projects can face multiple iterations between planning and requirements gathering, data collection and labeling, training and debugging, deployment and testing, and finally maintenance or continuous learning.

For data scientists or engineers who are new to training and productionizing models (or even new to developing software), there are a number of challenges that need to be addressed.

The first set of challenges involves getting your model to work reliably on a single computer or server, across multiple sessions. And by a working model, we mean model code that runs bug-free without manual intervention, as opposed to code that maximizes or minimizes a particular model metric.

The second set of challenges, once you have a model that works reliably, is getting that model to work on a different computing instance.

And of course, throughout this entire process, the goal is to try to minimize the time to delivery (and in the case of cloud computation, costs) as much as possible.

Abstractions All the Way Down

In the following sections we’ll attempt to build the intuition behind the “why” of virtual environments, containers, and VMs.

Although they sound very similar on the surface, especially to new data science practitioners, they are very different tools that offer varying degrees of granularity and control, as well as isolation and reproducibility.

In the diagram below, we try to convey the major differences between the different solutions.

“Comparison of Physical Computer, VM, Container, Virtual Environment” | Image Created by: Mikiko Bazeley | Adapted from: Medium: Dr Stephen Odaibo, Presentation: Distributed & Cloud Computing, Docker: Blog

An important point to note is that virtual machines, containers, and virtual environments aren’t mutually exclusive. There are strong use cases outside of data science for using containers within VMs and virtual environments within containers.

“Environments Within Containers within VMs within a Physical Computer” | Image Created by: Mikiko Bazeley | Adapted from: Medium: Dr Stephen Odaibo, Presentation: Distributed & Cloud Computing, Docker: Blog

Virtual environments and containers are the de-facto tools for the development, productionization, and deployment of data science models and consequently we’ll spend the most time with these areas.

Works 60% of the Time, Every Time

How do we ensure a model works consistently on our local machines or a single instance? (And even on a non-local machine like a Google Colab or AWS SageMaker instance?)

1. Solving Python Version Management

Our first challenge is specifying the version of Python we’d like to use, especially as Python 2 has been sunset and there have been more than ten years of releases between Python 3.1 and Python 3.9 (June 2009 to October 2020). There are a number of reasons to pin the Python version, including contributing to a project that supports multiple versions of Python or to a project that requires a specific version.
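To make this concrete, here’s a minimal sketch using pyenv (one of the tools compared later in this post); it assumes pyenv is already installed, and the version number is just an example:

```bash
# Install a specific interpreter and pin it for the current project
pyenv install 3.9.13     # download and build Python 3.9.13
pyenv local 3.9.13       # writes a .python-version file for this project
python --version         # should now report Python 3.9.13
```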

2. Solving Package & Dependency Management

One of the biggest reasons to use Python (“the second best language for everything”) is the vast ecosystem of 3rd party libraries or packages.

Rather than writing a neural net or random forest implementation from scratch, users have the ability to import these 3rd-party packages & libraries from package repositories like PyPI or Anaconda (or even GitHub, which can be a risky choice without thorough vetting and evaluation). We need to be able to track and manage the immediate packages and applications we install.

Our project dependencies (i.e. the libraries we use directly in our projects) have libraries they rely on as well, known as transitive dependencies. We need to keep track of and manage our dependencies in a way that lets us easily install, update, and remove packages as we need them.

We also want dependency conflicts resolved automatically, for example when the version of TensorFlow being used has two dependencies that each require a different version of NumPy, instead of getting an error and then having to search through the requirements by hand.
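As a small, illustrative sketch with pip (the package name is just an example): installing one library pulls in its transitive dependencies, and `pip show` reveals what a package depends on:

```bash
# Install an immediate dependency; pip resolves and installs its transitive dependencies too
pip install scikit-learn

# Inspect a package's direct dependencies (see the "Requires:" line)
pip show scikit-learn

# List everything that ended up installed, transitive dependencies included
pip list
```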

3. Solving Dependency & Project Isolation using Virtual Environments

We’ve solved the problem of specifying the Python version we’ll use, installing the 3rd party packages & libraries, and managing dependencies.

But without isolating our projects and dependencies, we run the risk of potentially damaging and cluttering up our computers and local environments. This is less of a concern for cloud-based notebook environments like Google’s AI Platform notebooks or AWS Sagemaker notebooks, as instances can always be deleted and recreated easily, but a very big concern for local development and training. By default, most package managers will install applications and libraries for the entire system to use, regardless of the unique needs of your projects.

We’d like a solution that helps separate, or isolate, projects and their dependencies such that we avoid version conflicts and that works for both packages and interpreters.

We can use virtual environments to do this, which can be visualized as separate folders that have their own Python executables (for the specified Python version) & the associated site-packages (i.e. these 3rd party libraries).

While package managers solve the problem of accessing and downloading packages, virtual environments solve the problem of organizing these packages, which together combined with dependency managers go a long way towards ensuring a clean and well-defined development experience.
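A minimal sketch of this workflow with the standard library’s venv module (the folder name .venv and the package are just examples):

```bash
# Create an isolated environment with its own interpreter and site-packages folder
python -m venv .venv

# Activate it (Linux/macOS; on Windows use .venv\Scripts\activate)
source .venv/bin/activate

# Packages now install into .venv rather than the system-wide site-packages
pip install pandas

# Leave the environment when you're done
deactivate
```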

It Worked On My Machine

How do we ensure a model works on a different machine? There are a number of ways we can ensure reproducible environments and application deployments.

1. Solving Reproducibility via Requirements Files

What do we need in order to replicate our model code?

We’d need the immediate code we’ve written. We’d also need a snapshot or log of the 3rd-party libraries and packages we used.

Is it enough to capture our immediate dependencies, however? We can imagine our dependencies requiring specific versions of their dependencies and so on. What happens if there’s a silent update in an upstream library?

One way we could capture these dependencies is with requirements files.

Combined with virtual environments, requirements files allow us to approximately replicate Python environments. A requirements file lists packages (usually pinned to specific versions), and we can maintain different requirements files for different use cases. For example, models in development usually require linters, testing frameworks like PyTest, debuggers, and profilers. But in a production environment, we’d prefer a very lean environment, with only the code required to either call the model for inference or to perform batch computation.
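A common (approximate) pattern looks like the sketch below; the file names are conventions rather than requirements, and the dev-only file is assumed to exist:

```bash
# Snapshot the currently installed packages, pinned to exact versions
pip freeze > requirements.txt

# Recreate roughly the same environment elsewhere
pip install -r requirements.txt

# One convention: keep dev-only tools (linters, PyTest, profilers) in a separate file
pip install -r requirements-dev.txt
```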

2. Solving Reproducibility with Containers

There are limits to reproducibility using requirements files.

  • What if we’d like to run multiple workloads requiring different filesystems and namespaces on a single operating system?
  • What if we need separate apps with different requirements running at the same time and we’d like more robust isolation than what a virtual environment can provide?
  • What if we’d like to ensure that we’re able to fully replicate the entire runtime environment across different environments in a single package, including the dependencies, libraries & binaries, and configuration files?

We can solve this problem with containers. While we won’t go into specific details about the most popular containerization solution out there (Docker), the important concept to understand is that there needs to be a way to define & specify the container (an image), and that a container is a running instantiation of that image.
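To make the image-versus-container distinction concrete, here’s a minimal, illustrative sketch: the Dockerfile defines the image, `docker build` creates it, and `docker run` starts a container from it. The entrypoint script, image name, and port are assumptions for the example:

```bash
# Write out a minimal, illustrative Dockerfile
cat > Dockerfile <<'EOF'
# Base image: a slim OS layer plus a Python 3.9 runtime
FROM python:3.9-slim
WORKDIR /app

# Bake the pinned dependencies into the image
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the model/application code; serve.py is a hypothetical entrypoint
COPY . .
CMD ["python", "serve.py"]
EOF

# Build the image (the recipe), then run a container (a live instance of that image)
docker build -t forecast-model:0.1 .
docker run --rm -p 8080:8080 forecast-model:0.1
```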

Some great reasons to use containers include: deploying multiple instances of a single application, deploying a number of apps on a single server, having a lightweight and robust reproducible environment, and ensuring a reproducible and easily scaled deployment with certain cloud vendors.

3. Solving OS & Kernel Specification With Virtual Machines

As a data science practitioner, you won’t encounter many use cases requiring direct administration and usage of VMs.

However, it’s important to understand why VMs exist and to appreciate that VMs form the backbone of many cloud computing services.

VMs can be thought of as emulations of a physical computer or server.

VMs share the physical resources of the host machine and are also called guest machines.

While much ado has been made of containers, which are certainly a key tool of data science practitioners focused on successfully developing and deploying working models, VMs offer additional benefits.

VMs allow for the installation of multiple OSs (especially important when developing applications that have specific OS requirements) and are isolated from the host OS through the hypervisor. VMs offer better security and isolation for application development and experimentation.

VMs are ideal for situations that require fine-grained control over resources as well as managing a variety of OSs, managing multiple apps on a single server, and running an app that requires all the resources & functionalities of an OS.

Introducing A Crowded Field of Similar Names

We’ve discussed some pain points involved in developing working, reproducible software (including data science models).

We’ve also hinted that there are a number of ways to ensure reproducibility and dependency isolation using virtual environments, containers, & virtual machines.

In this section we’ll do a high-level comparison of the available libraries and packages, which provide everything from Python package installation and dependency management to virtual environments.

Although there are many tools out there (pip, pyenv, pyvenv, virtualenv, venv, Conda, Poetry, pipx, pip-tools, setuptools, pipenv, etc.), our goal is to present an opinionated list of high-quality tools that are either widely used within the data science community, recommended by PyPA (the Python Packaging Authority), or recommended by the Python Software Foundation.

“Comparing Popular Python Packages & Libraries” | Image Created by: Mikiko Bazeley | Adapted from: Twitter: Ned Letcher, SO: Difference between venv, etc, SO: Feature Comparison, Real Python: An Effective Python Environment

Pip and venv

Pip and venv ship with standard Python installations (venv as part of the standard library, pip bundled alongside it) and have well-known usage patterns. They’re also very lightweight, with simple commands, and familiar to data science practitioners who have experience with Python programming outside of data science and machine learning contexts.

Pip and venv are easy to get started with and can be a nice learning step to using more robust options like conda and Poetry.

Pipenv

Pipenv attempts to combine the Python package installation, virtual environment management, dependency management, and dependency resolution powers of pip and virtualenv in as lightweight a package as possible.

Rather than attempting to use multiple tools in conjunction like pip and venv, the goal is to simplify commands. In addition, Pipenv introduces two files: the Pipfile (a replacement for requirements.txt) and the Pipfile.lock (a file that enables deterministic builds by specifying the exact requirements for reproducing an environment, including sub-dependencies).

While Pipenv lacks support for packaging and non-Python package installation, it leaves the user free to leverage other tools for packaging and distribution.
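In practice, a typical Pipenv workflow looks roughly like this (the package names are illustrative):

```bash
# Create/activate a project environment and add a runtime dependency (updates the Pipfile)
pipenv install pandas

# Add a dev-only dependency (recorded under [dev-packages] in the Pipfile)
pipenv install --dev pytest

# Pin the full dependency graph, sub-dependencies included, into Pipfile.lock
pipenv lock

# Recreate the exact environment elsewhere from the lock file
pipenv sync

# Drop into a shell with the environment activated
pipenv shell
```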

Poetry

Aside from managing virtual environments, Python packages, and dependencies, Poetry also includes support for creating and publishing packages. Poetry’s core propositions are deterministic builds, building and packaging projects with a single command, and making publishing to PyPI easy.

For data science practitioners who abhor switching between multiple tools, prefer fewer commands, and are content to stick with Python as their main language in project development, Poetry is the right choice.
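A rough sketch of that single-tool workflow (project and package names are illustrative):

```bash
# Scaffold a new project with a pyproject.toml
poetry new forecasting-demo
cd forecasting-demo

# Add dependencies; versions are resolved and pinned in poetry.lock
poetry add pandas scikit-learn
poetry add --group dev pytest    # "--dev" on older Poetry versions

# Install everything into a Poetry-managed virtual environment
poetry install

# Build a wheel/sdist and (optionally) publish to PyPI
poetry build
poetry publish
```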

Conda (including Miniconda)

On top of Poetry’s offerings, Conda adds support for installing non-Python packages and for bundling common data science and machine learning packages and libraries. For users who would prefer to install their dependencies from scratch, Miniconda is a great alternative, as it’s just conda (the command-line tool) and Python.
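A minimal sketch of the conda workflow (the environment name and packages are illustrative):

```bash
# Create a named environment with a specific Python version
conda create -n forecasting python=3.9

# Activate it and install packages (conda can also manage non-Python binaries)
conda activate forecasting
conda install numpy pandas scikit-learn

# Snapshot the environment so others (or another machine) can recreate it
conda env export > environment.yml
conda env create -f environment.yml
```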

One major difference between Poetry and Conda is in publishing: Poetry makes publishing to PyPI seamless, whereas Conda publishes to and installs from Anaconda channels.

Because of Conda’s language-agnostic capabilities and Poetry’s focus on a seamless Python workflow, some advanced users end up using both. Pip can also be used in conjunction, since pip installs packages from PyPI, but beware of occasional compatibility difficulties, especially when distributing conda packages.

If you’re a new data science practitioner or someone who prefers simplified workflows (and is fine going all-in on a specific suite), starting with Poetry or Conda is the best option.

Take-Aways & Summary

We’ve covered a lot of ground and there’s still more to learn in upcoming tutorials.

The most important takeaways:

  1. Virtual environments, Docker containers, & Virtual Machines represent increasing levels of abstraction and control.
  2. By analogy, you can compare virtual environments to project folders, Docker images to file systems, and virtual machines to everything but the hardware.
  3. Virtual environments are used to isolate Python project dependencies, Docker containers are used to isolate applications and their required binaries and libraries, & Virtual Machines are for managing multiple guest operating systems. These technologies aren’t mutually exclusive and can be used in tandem.
  4. The main value proposition behind tools like pip, venv, pipenv, Poetry, and Anaconda is some combination of dependency management + virtual environment management + package installation + reproducibility + bundled distributions.

Upcoming

We talked about the very beginnings of a project. At the end of the day, machine learning code (whether delivered as an application, a library, or a website) is meant to be shared and consumed.

In upcoming parts of this series, we’ll dive into:

  • The different ways to deploy and serve models, i.e. make them available for use either directly by end-users and/or by other engineers and data scientists;
  • How those different archetypes or patterns of deploying and serving impact what you need to do to get started for your project;
  • The different ways data scientists and engineers design their tool chains and how the data science tech stack changes depending on if you’re at a startup versus a bigger tech company;
  • The different options you have and where some of the more popular libraries for productionizing, deploying, and serving models fit.

Additionally, there will be standalone blog posts (similar to this one) that are more theoretical in nature and meant to dig a bit deeper into a particular prerequisite question or concept important to understanding good software practices around machine learning applications and operations.

About This Series

My goal with this series is to help provide the necessary software engineering context and knowledge that is foundational to productionizing and deploying models, as well as eventually designing whole ML systems and ML native apps.

When I first moved into ML Engineering (and eventually MLOps) I was overwhelmed and confused by how much I didn’t know about the mechanics of writing and shipping great code, especially code meant to support large machine learning pipelines and models.

Through taking courses, mentoring, and speaking on panels about MLOps and ML Engineering, I realized that this gap in background and knowledge was true for many people like myself, who may have come from very non-traditional backgrounds and are scrambling to keep up with a rapidly changing field.

And the gap doesn’t come from a lack of solid technical resources out in the field, but from a lack of bridging material, i.e. material written for curious and self-motivated individuals who are capable of doing their own googling but end up finding material written for senior or experienced engineers that assumes the reader has at least 5+ years of experience with ML, system design, web APIs, etc.

This series doesn’t make that assumption and follows the principle of “all models are wrong, some are useful”. I hope you find this series useful (even if it’s not totally accurate), and please send me your questions, comments, and suggestions at the sites below.

About Me

I’m available on LinkedIn, Youtube, and Twitter!

And if you really love what I wrote, consider buying me a coffee at www.buymeacoffee.com/mmbazel & keep me writing!
