Popper Tutorial 3/3
Customize Your Docker Image
Introduction
Welcome to the third part of my tutorial series on how to build a “Popperized” project in computational science. At the end of the second tutorial we left off with an example project that ran a simple R script in a containerized environment. We used Packrat to restore all R packages from files in an r-base Docker container. Our project was self-contained with Git and Git-LFS, having only Popper and Docker Hub as external dependencies.
You can also find this tutorial series in a repo on GitLab.com. If you don’t have the example project from the second tutorial on your disk anymore, clone it from the tutorial_2 branch:
git clone https://gitlab.com/wtraylor/popper-tutorial.git -b tutorial_2 tutorial
cd tutorial
So far we have used the r-base Docker container as it is provided by Docker Hub. We have managed to include R packages, but what about additional software applications? The best way to add those is to create a custom Docker image. Remember from the first tutorial: A Docker image is the “blueprint” for Docker containers. An image defines what the virtual operating system looks like, and a container is an instantiation of that system. A Docker image is described in a Dockerfile, and that’s what we will look at first.
Creating a Dockerfile
Dockerfiles are a powerful tool for describing a Docker image in detail. The Dockerfile reference describes all available options, but for our purpose we will keep it very simple and take one step at a time. The r-base Docker image from Docker Hub, which we have been using in version 4.0.2, is based on the Linux distribution Debian. The Dockerfile of the r-base image is publicly available. It looks like a lot of complicated stuff, but the essence is: Take a Debian image, install R on top of it, and give that new image the name “r-base.” As you can see, creating a new Docker image is just putting a new layer on top of an existing image. That makes our life easy because we don’t need to start from scratch.
For this tutorial, though, we will build our custom image not on top of r-base, but on plain Debian. Remember that, for reproducibility’s sake, the external dependencies should be as stable as possible. Debian is like an eternal stronghold in the Free Software community. This operating system has been around since 1993 and is the basis for many other popular Linux distributions. That’s why I consider it very likely that the Debian Docker images will also remain available for a long time.
Let’s start by creating a Dockerfile in a new folder docker/. This one doesn’t install anything on top of Debian yet:
FROM debian:stable-20201012-slim
In the FROM instruction we specify an existing image with its version. That’s called the base image. Fortunately for us, Debian releases dated images, which helps us to be really precise in specifying our external dependency here. We choose a “slim” version because we don’t need any of the default software that usually ships with Debian. This way we can save some storage space.
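If you prefer to create the folder and the file from the command line, the following sketch does both. It assumes that you are in the root folder of the project; the folder name, file name, and image tag are the ones used in this tutorial.

```shell
# Create the folder that holds everything Docker needs to build the image.
mkdir -p docker

# Write a one-line Dockerfile that only names the dated Debian base image.
printf 'FROM debian:stable-20201012-slim\n' > docker/Dockerfile

# Print the file to double-check its content.
cat docker/Dockerfile
```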
As usual when creating a new file, you assign a license to it right away with the reuse-tool. You choose your favorite license and put your name and email (or your employer) as the copyright holder.
reuse addheader --copyright="Jane Doe <jane@example.com>" --license="Unlicense" docker/Dockerfile
In the previous tutorials I suggested that you could run reuse in a Docker container if you don’t want to install it. You can create a Bash alias so you can just type reuse to run the container:
alias reuse='docker run --rm -it -v "$(pwd):/data" fsfe/reuse'
Now we can check whether Docker downloads the base image and builds our image on top:
docker build docker/
The docker build command takes as an argument only the directory, not the Dockerfile itself. But later, Popper will take care of those details for us.
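When built like this, the image has no name and can only be addressed by its ID. For experimenting by hand it can be handy to attach a readable tag during the build. The sketch below wraps that in a small shell function; the function name build_image and the tag popper-tutorial:local are placeholders of mine and nothing that Popper expects.

```shell
# Placeholder tag for local experiments; choose any name you like.
IMAGE_TAG="popper-tutorial:local"

# Build the image described by docker/Dockerfile and attach the tag to it.
# The positional argument of "docker build" is the directory, not the file.
build_image() {
    docker build --tag "$IMAGE_TAG" docker/
}
```

After calling build_image in the root folder of the project, docker image ls should list the image under that tag.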
I chose the “stable” version of Debian because we are less likely to run into bugs. The drawback is that we don’t have the latest features. That’s why I can’t use R in version 4, for example, but need to use version 3. You can look up the version of each software package in the Debian package directories. If you type “r-base” into the search field, you see the version of the r-base package for each Debian release. There it shows that only Debian Bullseye or later has R in version 4.
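You can also ask the base image itself which version of r-base its package manager would install. This is only a rough sketch: It assumes that Docker is running and that the container has network access, and the function name is made up for this example. Keep in mind that the answer reflects whatever the Debian repositories serve at the moment you run it, so it can differ from what was true when this tutorial was written.

```shell
# The base image used throughout this tutorial.
BASE_IMAGE="debian:stable-20201012-slim"

# Hypothetical helper: start a throwaway container, refresh the package
# index, and print the r-base version that apt-get would install.
show_r_base_version() {
    docker run --rm "$BASE_IMAGE" \
        sh -c 'apt-get update -qq && apt-cache policy r-base'
}
```

Call show_r_base_version and look at the line starting with “Candidate:” in the output.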
Now let’s install the r-base package into our image and instruct Popper to use it. Packrat also wants the wget package as a “secure download method” (even though it won’t need to download anything in our case). This is the next version of our Dockerfile:
FROM debian:stable-20201012-slim
RUN apt-get update && apt-get install --yes r-base wget
You could try docker build at this point again, but Popper will also do that for us in a bit. This is the next version of the workflow file .popper.yml:
steps:
- uses: "./docker"
  args: ["scripts/bootstrap_packrat.sh"]
- uses: "./docker"
  args: ["scripts/plot_box_and_whisker.R"]
Instead of specifying an image directly from Docker Hub with docker://, we point to the local folder ./docker. Now try out the workflow:
popper run
Popper should instruct Docker to build the image and then execute the workflow. For building the image, the Debian package manager (apt-get) downloads a bunch of packages that are required as dependencies for r-base and wget. Fortunately that is only done once because Docker stores all images it has built or downloaded on your computer to be reused again. Building the image might take a while. If you want to have all output of the build process, you can call docker build manually, as described above.
Let’s create a new commit with our changes:
git add docker .popper.yml
git commit -m 'Replace the r-base image with custom one'
Perhaps you have realized that we have now actually taken a step back: By downloading all these Debian packages we have introduced more external dependencies than before. Even worse: We have little control over the versions coming from the Debian repositories. Our Dockerfile does not specify package versions, and even a pinned version could only be installed for as long as the repository still offers it at the time the Docker image is built.
We could fix that by building the Docker image and uploading it to Docker Hub. Then we would only need to specify the name of our custom image in the Popper workflow, and Popper would download the built image from Docker Hub.
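As a rough preview of that route, the commands would look something like the sketch below. The repository name janedoe/popper-tutorial, the tag 1.0, and the function name are placeholders. You would also need a Docker Hub account and must have run docker login beforehand.

```shell
# Placeholder repository and tag on Docker Hub; replace with your own.
HUB_IMAGE="janedoe/popper-tutorial:1.0"

# Hypothetical helper: build the image and upload it to Docker Hub.
publish_image() {
    # Build from docker/Dockerfile and name the result after the repository.
    docker build --tag "$HUB_IMAGE" docker/ || return 1
    # Upload the image; this only works after "docker login".
    docker push "$HUB_IMAGE"
}
```

A workflow step would then presumably refer to the uploaded image with the docker:// prefix, just like the r-base image in the previous tutorial.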
An alternative would be to ship the built Docker image as a file in the repository. That would guarantee reproducibility, but as of now, Popper doesn’t support this (compare Popper issue #958).
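Independently of Popper, Docker itself can write an image to a file and read it back, which is the building block such a solution would rest on. Here is a minimal sketch. It assumes that an image with the given tag already exists on your computer, for example from docker build --tag; the tag, the file name, and the function names are placeholders of mine.

```shell
# Placeholder names for the local image and the archive file.
IMAGE_TAG="popper-tutorial:local"
IMAGE_FILE="docker/image.tar"

# Write the image with all its layers into a single tar archive.
export_image() {
    docker save --output "$IMAGE_FILE" "$IMAGE_TAG"
}

# Read the archive back in, for example on another computer.
import_image() {
    docker load --input "$IMAGE_FILE"
}
```

Such an archive can easily grow to several hundred megabytes, so in a Git repository it would probably belong in Git-LFS.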
Either solution would be the topic for another tutorial.
You have come a long way: from running a simple script in a pre-made container, through including R packages, all the way to defining your own Docker image for your project. I hope you will be able to apply some parts of what you’ve learned to your own work. It would really be a shame for all the effort and great ideas you’re putting into your research if your scripts were not reproducible and reusable. Keep your skills sharp and enjoy coding!
Appendix: Cleaning up Docker
The more you play around with Docker, the more images accumulate on your system. And these images can take up a substantial amount of disk space! The same is true for the containers. Therefore I clean the slate every now and then.
Check how much space is used by Docker:
docker system df
Delete unused Docker files:
docker system prune
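By default, docker system prune removes stopped containers, unused networks, dangling images, and unused build cache, but it leaves tagged images in place. If you want more targeted control, Docker also has separate prune commands for containers and images. The sketch below bundles them in a function (the name is mine) so that nothing is deleted until you call it. The --force option skips the confirmation prompt.

```shell
# Hypothetical helper for a targeted cleanup.
clean_docker() {
    # Remove all stopped containers.
    docker container prune --force
    # Remove dangling images, that is, images no tag points to anymore.
    docker image prune --force
    # Show how much space Docker still uses.
    docker system df
}
```

For a more radical cleanup, docker system prune --all also removes every image that is not used by a container. After that, the next popper run has to download or build its images again.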
This work is licensed under a Creative Commons Attribution 4.0 International License.