How to Get Docker to Play Nicely With Your Python Data Science Packages

A guide to Dockerizing your data science projects

Timothy Mugayi
Mar 6, 2020 · 8 min read
Photo by Chris Liverani on Unsplash

When it comes to Python package management, you have a couple of choices to handle your Python dependencies. The most widely used are Conda, pip, and pyenv.

In this article, we’ll look at two approaches you can use inside your Docker containers when Dockerizing your data science applications: the conventional pip, which ships with Python by default, and the Conda approach.

In order to understand how to Dockerize your data science projects, you need to understand the key difference between Conda and pip. Making the best decision early on will prevent future rework.

What’s illustrated in the examples below

  • Get started building out your Docker image while exploring the differences between pip and Conda.
  • Learn how to configure Docker to pull from a private PyPI server
  • Explore the idea of toggling between Python versions within your Docker for your data science Python applications

What’s pip?

pip installs Python packages. That’s what most online literature will tell you, but it doesn’t mean you can’t bundle other applications with pip and leverage Python subprocesses to perform non-Pythonic installations.

Recently, I made a GAMS-wrapper Python installer package that installs GAMS on Windows, Linux, and macOS. The package was hosted on a private PyPI server. The macOS install is performed via a DMG image, while the Linux install is carried out via an executable Linux binary package. The Windows install is performed as an unattended installation, with an additional option to run it silently (or quietly). This isn’t part of the standard packaging process, but it’s something I decided to do to standardize all my package installations via pip.

pip packages are usually hosted on PyPI, which can be a private or public repository for open-source or private Python packages. For those not familiar with Python package repositories: when you think of PyPI, think of RubyGems for Ruby, Packagist for PHP, Maven for Java, CPAN for Perl, and npm for Node.js.

pip has limitations, such as a lack of package isolation: out of the box, you can’t run multiple Python versions in an isolated manner. virtualenv solves this very specific problem by allowing multiple Python projects with different (and often conflicting) requirements to coexist on the same computer.
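As a quick sketch of that isolation in practice (the paths and package versions below are illustrative, not from the original article):

```shell
# Two projects with conflicting pandas requirements, each in its own environment
$ pip install virtualenv

$ virtualenv ~/envs/project-a
$ source ~/envs/project-a/bin/activate
(project-a) $ pip install "pandas==0.25.3"
(project-a) $ deactivate

$ virtualenv ~/envs/project-b
$ source ~/envs/project-b/bin/activate
(project-b) $ pip install "pandas==1.0.1"
```

Each environment has its own site-packages directory, so the two pandas versions never collide.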

What’s Anaconda?

Think of it as your Swiss Army knife for all things data science — all consolidated into one place. Anaconda consists of its own package manager similar to pip called Conda.

While pip can install things other than Python packages, this isn’t by design and isn’t part of the way pip was built. Conda solves this by allowing Conda packages to support non-Python library dependencies, such as HDF5, MKL, and LLVM, which don’t have a setup.py in their source code and don’t install files into Python’s site-packages directory.

As opposed to pip, Conda provides isolation of each environment by design; it’s hard to use Conda without it. By default, Conda ships with a base environment where Conda itself is installed. You’re usually encouraged to create and then activate a new, application-specific environment for your own needs.
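For instance, creating and activating an application-specific environment looks like this (the environment name is illustrative):

```shell
$ conda create --name helloworld python=3.7
$ conda activate helloworld
(helloworld) $ conda install pandas

# back to the base environment when you're done
(helloworld) $ conda deactivate
```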

This makes Conda an ideal Python package manager if you’re working on data science–related applications. The ability to switch Python environments seamlessly via a Conda YAML file means you never have to worry about changing your Docker configuration if you need to swap out a package or even the Python version.

Let’s jump into an example that gets your Conda and pip setups working with Docker, using a few data science packages for illustration.

Prerequisites

Make sure Docker is installed by checking its version:

$ docker --version

This should produce the following output:

Docker version 19.03.5, build 633a0ea

Let’s create a sample helloworld.py for illustration purposes. In this example, we’ll write a simple program that uses Dask, a lightweight distributed computing framework for data science that allows you to perform larger-than-memory computation.

In essence, we can parallelize normal Python code or scale pandas and NumPy workloads with Dask, or add parallelism to existing Python workflows. Think of it as a lightweight version of PySpark on top of the data science packages you love.

helloworld.py leverages the Dask framework for data science parallel computing
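The original gist isn’t reproduced here, so below is a minimal sketch of what such a helloworld.py could look like; the function names are illustrative, not the author’s:

```python
# helloworld.py: build a small Dask task graph and execute it in parallel
from dask import delayed


@delayed
def increment(x):
    # Each call becomes a lazy task in the Dask graph
    return x + 1


@delayed
def total(values):
    # Dask resolves the list of delayed results before summing
    return sum(values)


if __name__ == "__main__":
    tasks = [increment(i) for i in range(10)]
    result = total(tasks).compute()  # triggers parallel execution
    print(result)
```

Nothing runs until `.compute()` is called; Dask first assembles the task graph, then schedules the `increment` calls in parallel.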

The pip Way

Most of the Python packages we use are open-source external packages, but there are instances where you need to use a private server.

Dockerfile pip example
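The Dockerfile gist isn’t shown here; the sketch below reflects what the surrounding text describes, with the ARG names and private-server URL taken from the illustrative values used later in this article:

```dockerfile
FROM python:3.6-slim-buster

# Build-time credentials for the private PyPI server; never hardcode these
ARG ARTIFACTORY_USERNAME
ARG ARTIFACTORY_SECRET_TOKEN
ENV ARTIFACTORY_USERNAME=$ARTIFACTORY_USERNAME
ENV ARTIFACTORY_SECRET_TOKEN=$ARTIFACTORY_SECRET_TOKEN

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir \
    --extra-index-url "https://$ARTIFACTORY_USERNAME:$ARTIFACTORY_SECRET_TOKEN@artifactory.com/api/pypi/simple" \
    -r requirements.txt

COPY helloworld.py .
CMD ["python", "helloworld.py"]
```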

In our Dockerfile above, we use Python 3.6's slim-buster image. The slim-buster variants lack the common package layers, so the image sizes tend to be much smaller.

Do keep in mind that the image size is influenced by how much you’ve got going on in your application. Size usually only matters for CI/CD, where build speeds are important; a smaller image doesn’t necessarily equate to better application performance.

In our Dockerfile, we use ARGs to define variables so as not to expose security credentials within the Dockerfile itself. ARGs are only available during the build of a Docker image, not after the image is created and containers are started from it (ENTRYPOINT, CMD). To work around the transient nature of ARG, we assign ARG values to ENV values. ENV values are available to RUN commands during the Docker build, starting with the line where they’re introduced, as well as to containers during their lifetime.

Below is the pip requirements.txt, which consists of two custom packages and common data science packages. Some packages, such as certain PyTorch and TensorFlow builds, can be awkward to install with pip out of the box; that’s where Conda excels. Before Dockerizing your project, it’s always best to dry run your code in a virtual environment so you get your dependencies in order, then export them to a file with the following command:

$ pip freeze > requirements.txt
requirements.txt
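For reference, a requirements.txt matching this article’s setup might look like the sketch below; the version pins are illustrative, and the last two entries stand in for the private packages mentioned earlier:

```
# common data science packages
dask[complete]==2.11.0
numpy==1.18.1
pandas==1.0.1

# custom packages hosted on the private PyPI server (illustrative names)
gams-installer==1.0.0
cplex-wrapper==1.0.0
```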

The below command allows you to build your Docker image while passing in additional build ARGs. For this example, we pass in credential details that allow us to authenticate against a private PyPI server. If you don’t have one, you can omit these arguments.

$ docker build \
--build-arg ARTIFACTORY_USERNAME={YOUR_USERNAME} \
--build-arg ARTIFACTORY_SECRET_TOKEN={YOUR_SECRET_TOKEN} \
--no-cache -t helloworld:latest .

docker build will generate the following output:

Dockerfile build pip

If you’d like to try out the helloworld.py sample, ensure you remove the GAMS and CPLEX entries from the requirements.txt or environment.yml files, since those two packages are only there to illustrate installing pip packages from a private PyPI server.

Execute the run command to get your helloworld.py running:

docker container run helloworld
Executing the docker run helloworld output

The Conda Way

FROM continuumio/miniconda3:4.7.12

This is a minimal version of Anaconda that includes only Conda and its dependencies. It contains the Conda package manager and Python 3.7 by default. Using Miniconda, we get the ability to install various Python versions in the same Docker image. This makes sense in situations where different data science microservices need to work on different versions of Python. Let's go ahead and define an environment.yml file, the equivalent of the pip requirements.txt.

environment.yml
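A sketch of what that environment.yml could contain (channel choices and version pins are illustrative, and the pip entries stand in for the private packages):

```yaml
name: helloworld
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy
  - pandas
  - dask
  - pip:
      - gams-installer    # illustrative private package
      - cplex-wrapper     # illustrative private package
```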

The pip section of our environment.yml file can contain a private pip package or any other packages that might not be supported by Conda.

pip will automatically resolve packages hosted on private PyPI servers via the extra-index-url we’ve defined in the pip.conf file stored in your /etc/ folder.

Conda has a similar concept through the use of channels. Conda channels are the locations where packages are stored; they can be Conda-defined locations that resolve to URL repositories. They serve as the base for hosting and managing packages.
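For example, adding a channel globally or resolving from one for a single install looks like this (conda-forge is a common community channel):

```shell
$ conda config --add channels conda-forge

# or pull from a channel for a single install:
$ conda install -c conda-forge dask
```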

RUN printf "[global]\nextra-index-url = https://$ARTIFACTORY_USERNAME:$ARTIFACTORY_SECRET_TOKEN@artifactory.com/api/pypi/simple\n" >> /etc/pip.conf
Dockerfile example in Conda
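Again, the gist isn’t reproduced here; a sketch of a Conda Dockerfile consistent with the text might be (the environment name and server URL are illustrative):

```dockerfile
FROM continuumio/miniconda3:4.7.12

ARG ARTIFACTORY_USERNAME
ARG ARTIFACTORY_SECRET_TOKEN

# Point pip at the private PyPI server for the pip section of environment.yml
RUN printf "[global]\nextra-index-url = https://$ARTIFACTORY_USERNAME:$ARTIFACTORY_SECRET_TOKEN@artifactory.com/api/pypi/simple\n" >> /etc/pip.conf

WORKDIR /app
COPY environment.yml .
RUN conda env create -f environment.yml

COPY helloworld.py .
# Run the script inside the environment defined in environment.yml
ENTRYPOINT ["conda", "run", "-n", "helloworld", "python", "helloworld.py"]
```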

Executing the same Docker build command we used earlier in the pip example will yield similar output.

If you’ve ever encountered the below warning during the Docker build process …

“Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.”

… then this is how you fix it: add pip to your environment.yml file, as the suggestion says. Recent versions of Conda no longer include pip implicitly, hence you need to add it explicitly.

Conda environment.yml file with pip configured as a dependency
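With the warning addressed, the file gains an explicit pip entry; a sketch:

```yaml
name: helloworld
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy
  - pandas
  - dask
  - pip                   # explicit pip dependency silences the warning
  - pip:
      - gams-installer    # illustrative private package
      - cplex-wrapper     # illustrative private package
```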

To run your Docker instance, execute the below command, a slight variation of our earlier pip variant. Take note that the -d flag runs your Docker instance as a daemon. If you omit the flag, the container runs in the foreground, and you’ll need to press Ctrl+C to interrupt the running Python script.

$ docker container run -d helloworld

If you need to use other Python packages, you can use the conda search command as illustrated below. This assumes your container is running in detached mode.

$ docker ps
# prints the list of running Docker containers with their corresponding container IDs

$ docker exec -it {CONTAINER_ID} /bin/bash
# enters the container by executing an interactive bash shell

$ conda search python
# lists all available Python versions

If you wish to export your packages, similar to how pip freeze works, you can achieve the same thing with the command below, then exit the container:

$ conda env export --from-history > environment.yml
# --from-history exports only top-level dependencies. If you'd like the full-blown Conda export, remove this argument

$ exit
$ docker stop {CONTAINER_ID}
# terminates the executing Docker instance

# Update the environment.yml with whichever Python version you wish to use

Conclusion

Both pip and Conda can power your Dockerized data science projects: pip keeps images lean and simple, while Conda adds environment isolation, non-Python dependencies, and painless switching of Python versions via environment.yml. Whichever you pick, pin your dependencies and keep credentials out of your images by passing them as build ARGs.

Useful tools

  • pipdeptree: view dependency trees of your packages in your environment
  • check-pip-dependencies or pip-conflict-checker: get to the root of dependency conflicts
  • pip-autoremove: removes unused, orphaned dependencies from your environment and can list the top-level requirements
  • conda-depgraph: a command-line utility to plot Conda dependency graphs
  • peep: cryptographically verifies packages so you always get back the same untampered pip package; useful for ensuring security in your Docker images

Happy coding.

Better Programming

Advice for programmers.


Thanks to Zack Shapiro

Timothy Mugayi

Written by

Tech Evangelist, Instructor, Polyglot Developer with a passion for innovative technology, Father & Health Activist
