The Docker snippets feature in Gigantum: A beginner’s example

Tyler Whitehouse · Published in Gigantum · 5 min read · Dec 4, 2018

Someone recently shared a Gigantum Project with me that uses NLTK, a popular Python NLP package that relies on a variety of supplementary data sets for things like grammars and stopwords. This data typically must be downloaded separately after pip installing NLTK. While there is a conda package that includes all of the data, we are going to ignore that fact for three reasons.

  • First, the conda package doesn’t let you selectively download the supplemental data;
  • Second, my experience with the NLTK data revealed a confusing aspect of working in containerized environments that Gigantum users may encounter: the container’s read/write layer can make things disappear somewhat arbitrarily;
  • Third, this was a good excuse to write my first post on Gigantum!

In this post I address the confusion caused by the transience of the read/write layer and give a simple approach to getting around it. Specifically, I:

  • Give a high level explanation of the Docker read/write layer;
  • Give an elementary introduction to the Docker snippet feature and show how it can be used to install library data sets.

An example of the Docker snippets feature used to install CERN’s ROOT library.

But first, a brief introduction. Gigantum is a browser-based data science platform that combines a work environment with a publication and collaboration platform. It runs on Docker to let you set up environments, work in them, and then easily share them for collaboration or publication.

The basic element is a Gigantum Project, an augmented Git repository that versions and organizes code, data, and environment while capturing work history in a “who-what-when” timeline. The actual web application, the Gigantum Client, runs locally to manage Projects, render work history in an illustrated and interactive timeline, and provide containerized Jupyter environments attached to the Project. If you already work with Jupyter, then Gigantum pretty much lets you do your thing, only now with added speed, transparency, versioning, and portability. (NB: RStudio Server will soon be available as another environment.) If you are interested, there is a deeper dive into Gigantum here.

Returning to the NLTK example, the basic problem is that work done in a containerized environment doesn’t always persist. This is because it happens in the container’s read/write layer, a temporary layer that sits on top of the Docker image when the container starts. The read/write layer is part of what makes Docker efficient, but that efficiency comes at the cost of something usually taken for granted: changes persisting in the file system.

Ultimately, this isn’t a problem as long as you are aware of it and know how to address it. In most situations Gigantum gracefully accounts for it, making sure that everything that needs to be permanently captured does not disappear when the Project container shuts down. However, there are still occasions with unexpected consequences.

For example, unless you are very intentional about where you put data while working in a Project container, that data will vanish when the container restarts. In my NLTK example, I was downloading the data from a Jupyter notebook using the typical command nltk.download('stopwords'), and thus I had to download it again every time I restarted the Project container. This was a bit annoying, so I decided to make the data permanently available by putting it somewhere other than the read/write layer.

Gigantum provides the /mnt/labbook directory for users to put data that should persist between container runs. For example, this is where the code, input and output directories (visible from the Gigantum and Jupyter tabs) live. Putting something here ensures that it is accessible, won’t vanish when the container stops, and is under Gigantum’s automated version control. You can verify this yourself in various ways. For example, if you run nltk.download('stopwords', '/mnt/labbook') in a Jupyter cell, the data will still be there after restarting the Project container, and all of the usual versioning is applied to it.
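In a notebook cell, the persistent download looks roughly like the sketch below. Inside a Project container you would pass /mnt/labbook itself; here a temporary directory stands in for it so the sketch runs anywhere.

```python
import os
import tempfile

import nltk

# Stand-in for Gigantum's persistent /mnt/labbook directory; inside a
# Project container you would use that path directly.
persistent_dir = tempfile.mkdtemp()

# Passing download_dir keeps the corpus out of the container's
# transient read/write layer.
nltk.download("stopwords", download_dir=persistent_dir)
```

Anything written under the persistent directory survives a container restart, which is exactly what the default download location does not guarantee.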

While that solves the data persistence problem, the next problem is that NLTK can’t find the data there. Remember, NLTK searches a set of specific locations for its data, and you can neither alter those paths nor put the data in the proper location, because you can’t write to those locations from inside the Project container.
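You can inspect that fixed search list yourself; nltk.data.path holds the directories NLTK will look in, and /mnt/labbook is not among them:

```python
import nltk

# NLTK looks for corpora in a fixed list of directories, typically
# including ~/nltk_data and system-wide paths like /usr/share/nltk_data.
# /mnt/labbook is not on this list, so data downloaded there is
# invisible to NLTK unless you point it there explicitly.
print(nltk.data.path)
```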

So, what to do?

One answer is to use the Docker snippets feature to customize the Project environment, in combination with the package managers pip, conda and apt. This feature lets you do pretty much whatever you like to the development environment, with the results baked into the image and persisting between container runs. For my NLTK example, I just needed to use this feature to download the data to the proper place, and the problem was solved.

If you aren’t already familiar with it, Docker is a software development tool that is gaining traction in data science to solve problems around reproducibility and portability. It has its own API and does a variety of complicated things, so it can seem a little esoteric at first. However, if you have some Linux terminal experience then Docker can be fairly accessible, at least for simple things. See an intro here.

For my case, to download the data permanently I used a variant of the proper terminal command to get the stopwords corpus installed when I set up the environment. You can see how easy it was in the GIF below. The Docker command required no more than appending the Python command to RUN. Note that Docker snippet commands run as root, a privilege that you don’t have while working within the Project container.
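If you can’t see the GIF, a minimal sketch of such a snippet might look like the following. The corpus name matches the example above; the default download location is an assumption about the base image.

```dockerfile
# Hypothetical Docker snippet: runs at image build time as root, so the
# corpus lands in a system-wide directory that NLTK searches by default
# (e.g. /usr/share/nltk_data). Assumes nltk was already installed via
# the pip package manager feature.
RUN python -c "import nltk; nltk.download('stopwords')"
```

Because the snippet runs during the image build, its results live in the image itself rather than the read/write layer, so they persist across container restarts.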

Once the Project was rebuilt, the stopwords were available for import in the Jupyter environment, and neither I nor my collaborator had to download them when restarting the container.

So, that is it. This was an elementary example used to address one subtlety of working in a Project container and to show how easy it is to use Docker snippets. There are much more creative things that can be done using the feature, as we will show in future posts. Thanks for reading and if you have any questions send us a message at hello@gigantum.com.
