The need for speed — optimizing Dockerfiles at Unbabel

Over the past year, we have been containerising many of our services at Unbabel. We’ve learned a fair bit in that time, and we’d like to share some of the key lessons that have come out of the process.

Introduction

Containers are a portable and secure way of packaging your applications to run alongside others in a shared environment. Several solutions exist, with Docker being the most widely used.

Docker containers consist of a container image plus runtime configuration (such as external mount points, network definitions and environment variables).

A Dockerfile tells the Docker daemon how to build a container image and may also contain some or all of its runtime configuration. This file is especially useful because it can be checked into version control and provides a repeatable way of building containers.

Today, we would like to share some of the lessons we have learned about overcoming the overhead that building these images imposes: they have to package runtime libraries, language interpreters and source code, which can take a long time to build and occupy large amounts of disk space.

Docker build process

In order to understand what may be taking up time in your build processes or making your container images occupy too much disk space, let’s first understand the Docker build process. Consider the following Dockerfile:

FROM ubuntu:xenial
# Upgrade any ubuntu packages
RUN apt-get update
RUN apt-get upgrade -y
# Get tools to download languagetool with
RUN apt-get install -y unzip
ARG DEBIAN_FRONTEND=noninteractive
# Install oracle java 8
RUN apt-get install -y software-properties-common python-software-properties
RUN add-apt-repository ppa:webupd8team/java
RUN apt-get update
RUN echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections
RUN apt-get install -y oracle-java8-installer
# Download languagetool
ADD https://languagetool.org/download/LanguageTool-3.7.zip .
# Extract languagetool and remove archive
RUN unzip -qq ./LanguageTool-3.7.zip
RUN rm ./LanguageTool-3.7.zip
EXPOSE 8080
ENTRYPOINT [ "java", "-cp" , "languagetool-server.jar", "org.languagetool.server.HTTPServer", "--port" , "8080", "--public" ]

This Dockerfile produces an image of the popular proofreading software LanguageTool. As you can see, it starts building from an ubuntu:xenial base image, fetches and installs any new package updates, installs unzip, installs Oracle Java (LanguageTool’s runtime) and finally downloads and installs LanguageTool. Every RUN or ADD line gets run in its own Docker image layer:

> docker build -t languagetool .
Sending build context to Docker daemon  3.584kB
Step 1/18 : FROM ubuntu:xenial
---> f7b3f317ec73
Step 2/18 : RUN apt-get update
---> Running in 1b5bf2cd1fe6
...
---> 224ae66d6885
...

Every hash you see after a step runs is the hash that identifies a Docker image layer to the Docker daemon and to registries. These layers are all mounted on top of each other to produce the final filesystem structure for your container, with a read/write layer added on top at run time. Because of this, all of the layers have to be saved and packaged for re-use, so that the final filesystem is reproducible no matter what happens in the build steps.
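You can inspect these layers and the size each one adds with docker history (the hashes and sizes will of course differ on your machine):

```shell
# Lists every layer of the image, newest first, with the size each step added
docker history languagetool
```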

Build-time pitfalls

Because all image layers, each representing a “snapshot” of the output of a build step, are mounted on top of each other, if you choose to create files in a build step and delete them afterwards, the space occupied by them will still be used in the final Docker image.
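To make the pitfall concrete, consider this pair of steps (the archive here is a made-up example). The rm runs in a later layer, so it only hides the file from the final filesystem; the bytes written in the previous layer remain part of the image:

```dockerfile
# Layer 1: writes ~100 MB into the image
RUN wget https://example.com/big-archive.zip
# Layer 2: hides the file in the final filesystem, but layer 1 keeps its bytes
RUN rm ./big-archive.zip
```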

This is particularly frustrating when you work with an interpreted language, such as Python or JavaScript (on Node.js), and have to compile libraries that come from PyPI or NPM. You first have to install a compiler toolchain, then make the package manager download the source, and run any build scripts. The compiler toolchain will still occupy space in the final image, even if you choose to remove it in a later build step!

So how do we fix this? Well, we can exploit the fact that Docker “snapshots” every layer at the end of every build step. Consider the size of the image that the aforementioned Dockerfile produces:

> docker images | grep languagetool
REPOSITORY    TAG     IMAGE ID      CREATED              SIZE
languagetool  latest  7d2a4f6549a6  About a minute ago   1.24GB

What if, at the end of each step that produces spurious files, we just delete them? Let’s try this:

FROM ubuntu:xenial
# Upgrade/install any ubuntu packages and delete package lists afterwards
RUN apt-get update; apt-get upgrade -y; apt-get install -y unzip wget; apt-get clean
ARG DEBIAN_FRONTEND=noninteractive
# Install oracle java 8
RUN apt-get install -y software-properties-common python-software-properties; \
add-apt-repository ppa:webupd8team/java; \
apt-get update; \
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections; \
apt-get install -y oracle-java8-installer; \
apt-get clean
# Download and install languagetool
RUN wget https://languagetool.org/download/LanguageTool-3.7.zip; unzip -qq ./LanguageTool-3.7.zip; rm ./LanguageTool-3.7.zip;
EXPOSE 8080
ENTRYPOINT [ "java", "-cp" , "languagetool-server.jar", "org.languagetool.server.HTTPServer", "--port" , "8080", "--public" ]

Note that some of the steps have been optimised to clean the apt package lists after installing packages, and the last step removes the downloaded archive in the same layer that extracts it. Let’s see how we fared:

> docker images | grep languagetool
REPOSITORY    TAG     IMAGE ID      CREATED              SIZE
languagetool  latest  14c6e6bbb18e  5 seconds ago        1.13GB

Well, 110 MB may not seem like much, but it’s just an example of what you can accomplish with this technique. You may even run all of the commands in a single RUN directive, removing any unneeded run-time packages at the end.
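The same trick applies to the interpreted-language case above. As a sketch (the base image and packages here are illustrative, not one of our actual services), a Python image can install the compiler toolchain, build its dependencies and purge the toolchain within a single RUN, so the compilers are never committed to any layer:

```dockerfile
FROM python:3.6-slim
# Toolchain in, uWSGI compiled, toolchain out -- all inside ONE layer
RUN apt-get update; \
    apt-get install -y --no-install-recommends build-essential; \
    pip install --no-cache-dir uwsgi; \
    apt-get purge -y build-essential; \
    apt-get autoremove -y; \
    apt-get clean
```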

CI-time pitfalls

If you have embraced all of the CI/CD methodologies of recent years, you’re probably already building container images in a CI runner, downloading them on your host machines and starting a brand-new container from them in your deployment processes. In the average case, the time a deployment takes is usually bound by the time it takes to download the just-built container image from a Docker registry. Thus, the bigger your image, the longer it takes. But this time can be reduced.

Remember this line from previous Dockerfiles?

FROM ubuntu:xenial

Every image built on top of it shares those base layers, and Docker only downloads the layers a machine does not already have on disk. And this is not just for images built by other people: you can and should build your own intermediate layers! If your runtime (Java, Node.js, Ruby, etc.) version does not change often, you should prepare an intermediate container image containing it and any unchanging tools required at run time. This lets your host and CI runner machines keep that image on disk and only download the layers that change, making deployments much faster.
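As a sketch of the pattern (the image names and registry below are ours for illustration only), the runtime lives in a base image that is rebuilt rarely:

```dockerfile
# base.Dockerfile -- rebuilt only when the Java runtime needs to change
FROM ubuntu:xenial
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update; apt-get install -y openjdk-8-jre-headless; apt-get clean
```

while the service’s own Dockerfile builds FROM it, so deployments only download the thin layers holding the application:

```dockerfile
# service.Dockerfile -- rebuilt on every commit
FROM registry.example.com/unbabel/java8-base:latest
COPY languagetool-server.jar /opt/languagetool/
```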

At Unbabel, we are heavy users of Python 3 and, thus, of uWSGI. For one of our new micro-services, we chose to be especially bleeding-edge and picked Python 3.6. However, our preferred Linux distribution, Ubuntu, does not ship the Python 3.6 plug-in for uWSGI in the package from its official repositories, so we had to build it by hand.

The Dockerfile for this specific micro-service was building up from an Ubuntu image, installing uWSGI and compiling the plug-in. Afterwards, it was adding the micro-service source code, installing its requirements and setting up its configuration. The build time was close to 8 minutes! Yikes! 😅

We managed to reduce this micro-service’s average build time down to 5 minutes just by splitting the Dockerfile in two: one that builds from Ubuntu, adds uWSGI and builds the aforementioned plug-in; and another that adds the micro-service source code, installs its requirements and sets up its configuration. Now, when a host machine or CI runner pulls the image, it doesn’t have to re-download everything; it only downloads the image that changes with the source code.

Conclusion

Docker is an incredible environment for packaging and running applications, but it takes a careful read of its documentation to end up with a lean, mean running machine. We hope this post is as useful to you as its lessons have been to us at Unbabel. See you soon! 👋

Tomás Pinho
DevOps @ Unbabel
