The Containerized Read-only Database

David Schneider
Published in IHME Tech · Oct 9, 2019 · 12 min read

Containerizing databases is controversial. There are a number of challenges and dangers, most of which have to do with maintaining state. Some authors advise against the practice altogether. Others allow that it can be convenient in a development environment but should not be used in production. Typically, though, these authors assume that the contents of the database will change over its lifetime. I have yet to see a treatment of the subject of containerizing databases that accounts for the case where the database is read-only. Such a database, once created, is never changed, so maintaining state is not an issue. Does having a “stateless” database change the equation at all? Can we avoid the major pitfalls of containerizing databases if they aren’t expected to maintain state? Are there patterns available to us for managing such databases that wouldn’t be possible with stateful databases?

These are some of the questions I’ve been thinking about while developing IHME’s Local Burden of Disease data visualization tool. The demands of this application are a bit unusual compared with the other visualization tools our team supports. Rather than using a single database with a fixed schema, the tool is designed to support multiple datasets or “themes,” each of which gets to define the shape of its data with a small configuration language. Once a theme is defined, with a configuration file and a set of data files, a utility program builds the database for the theme. Any given deployment of the application stack can connect to any number of such theme databases. As IHME researchers produce new themes and update existing ones (at the time of this writing, we’ve created nine themes, and many more are in the works), creating and maintaining databases across multiple deployments has become increasingly challenging.

The approach I’ve developed to deal with these challenges, which is the subject of this article, defies conventional wisdom in a couple of ways. Not only are we deploying databases in Docker containers, we’re actually building these databases offline as static Docker images. This approach, while unusual, offers a number of advantages, as I’ll describe below.

First Approach: Database Initialization Containers

Our first solution to the problem of building and managing these databases involved creating Docker images containing all the raw materials for building a theme, including the configuration file, the data files, and the database creation utility. We’d deploy these database initialization containers, typically one at a time, alongside the Postgres container. The initialization container would run the database creation utility to completion and then stop. Postgres would store its contents in a volume mounted to the container, so that the container could be replaced without having to rebuild the databases.

[Diagram: approach using database initialization containers]

This solution worked okay, but I felt there was room for improvement. Combining the data and the utility program into massive Docker images wasn’t ideal, because an update to either the data or the code required rebuilding the entire image. It also became burdensome to keep launching these ephemeral containers, which had to be deleted manually after they ran to completion, since our orchestration layer wasn’t up to automating this for us.

Second Approach: Database-Creation Utility as Persistent Service

In the next iteration, I converted the database creation utility into a persistent service that would build and update our databases on demand. Because it was now running continuously, we no longer had the problem of creating and deleting initialization containers. This approach also accomplished the goal of decoupling program from data. Instead of having the raw data files baked into Docker images, the database creation utility would now download the relevant data (into a volume) using Git LFS, before starting the process of building or updating a database.

[Diagram: approach using a persistent database creation service]

While this refactor generally represented an improvement over the initial design, it still had some drawbacks:

  • While each database was stateless in itself, the catalogue of databases on the Postgres server was not. With multiple deployments and multiple themes to manage, we still found ourselves frequently wondering which databases were on which servers and what versions of each theme they represented.
  • The database build process typically took a long time. Building these databases in-place on the live database server used by the application presented a number of problems. It meant we’d have to wait quite a while from the time the application was first deployed until it was ready to accept requests. And because it was hard to predict when a database would be available for use by the application, we couldn’t time releases of new themes (or updates of existing themes) with much precision.
  • The database build process placed a heavy computational load on the database server, which was detrimental to query performance for the live application attached to it.
  • The apparatus used for building and updating databases complicated the design of the system tremendously (see diagram below). I implemented a client-server model in which a command-line utility would issue commands to a Celery task queue. The database creation utility was refactored as a worker process consuming tasks from this queue. This change added several new microservices to the application stack. It made the database utility responsible for managing local Git repos and synchronizing with a remote Git server over SSH. The task queue introduced more statefulness into the system, thus increasing the risk of errors.
[Diagram: details of the Celery apparatus and admin CLI]

Goals for a Better Design

After using this system for a while and experiencing its drawbacks, I began formulating a wishlist for the next iteration:

  • Deployments of the application should be ephemeral, easy to create and destroy.
  • The database server should be truly stateless, not just mostly stateless. It should not contain a catalogue of databases.
  • We should not build databases on the same server that needs to respond to requests from the application.
  • It should be trivial to determine which revision of the data was used to build a given database.
  • We should be able to build a database once and deploy it many times, rather than having to build it anew for each deployment.
  • In the deployment environment, a database should be fully populated on startup. We should not have to wait a long time for the database to be ready to accept queries.
  • Databases should be modular. Deployments of the application should be able to connect to (and disconnect from) one or more databases quickly and easily.

Options for Treating Databases as Static Assets

In order to fulfill all the items on this wishlist, it was clear that we needed to be able to build databases “offline” (i.e. not connected to the application), store the result as a static asset, and have a mechanism for deploying these built assets. The options I considered included:

  1. Generate a database dump file after building the database. When deploying, copy the dump file to a volume attached to the database server container, then populate the database server using the dump file.
  2. Store the (compressed) raw disk contents after building the database. When deploying, copy this data to a volume attached to the directory where the containerized database server expects its data.
  3. Store the fully populated database server as a Docker image. Deploy this image instead of an empty database server image.

Using a dump file (option 1) seemed like a reliable, tried-and-true approach. Indeed, dump files exist precisely for the purpose of replicating a database from a static file. Conveniently, the pg_dump utility Postgres provides creates dump files that are portable to newer versions of the database server, so that a dump from Postgres 9, for instance, should still work in Postgres 10. A drawback to this approach, though, is that the database would still need time and computing resources to be rebuilt in the deployment environment. We would need to keep the database server offline (i.e. not accepting queries from the application) until the rebuild was complete, adding complexity to orchestrating deployments. Further, since each host on which we deploy typically has a large number of containers running on it, the resource demands of the database rebuild (e.g. CPU and memory) could impact performance of the services running in those other containers.
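For illustration, the round trip with option 1 would look roughly like this (the host names, database name, and file name are placeholders):

```sh
# On the build host: dump the finished database in compressed custom-archive format
pg_dump -Fc -h build-host -U postgres -f theme_db.dump theme_db

# On the deployment target: recreate the database from the archive
pg_restore -h db-host -U postgres --create -d postgres theme_db.dump
```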

Option 2, replicating a database from the raw disk contents, would eliminate some of the problems associated with repopulating from a dump file. Because the database server wouldn’t need to be running while the data was moved into place, we could simply wait until the move was complete to start the server. When the server finally started, it would be fully populated and ready to accept traffic. Since merely copying data would likely be less resource-intensive than rebuilding from a dump file, the impact on other containers running on the same host should be negligible. Using the raw disk contents, though, would also come with some disadvantages. The total size on disk would likely be greater compared with a dump file in archive format, since pg_dump is able to compress the contents. Although we could compress the raw disk contents ourselves, I would expect pg_dump to be able to compress more aggressively compared with general-purpose compression, because it has detailed knowledge of the data representation. Another drawback of option 2 is that when upgrading the database server itself (from Postgres 9 to 10, for instance), the database files generated with the older version could be rendered unusable and would need to be rebuilt, a potentially expensive endeavor given the large number of databases we maintain.
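A rough sketch of option 2, assuming the build produces a Postgres data directory that we archive and later unpack into the volume backing the deployed server (all paths here are illustrative):

```sh
# On the build host: archive the data directory produced by the build
tar czf theme_db_pgdata.tar.gz -C /build/pgdata .

# On the deployment host: unpack into the volume that will be mounted at the
# server's data directory, then start the Postgres container
tar xzf theme_db_pgdata.tar.gz -C /mnt/volumes/theme_db_pgdata
```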

Option 3, storing the fully populated database server as a container image, seemed to be the simplest solution for our needs. Docker provides easy-to-use facilities for creating, storing, and deploying images, and our team already had infrastructure and a well-established workflow for working with such images. Options 1 and 2 would require:

  • finding a place to store the large files produced, with facilities for copying them over the network securely
  • setting up Docker volumes to hold the data
  • coordinating a multi-phase deployment process

None of these requirements would apply to option 3. We already had a private image registry to store the images, and copying to and from the registry would be as simple as running docker push or docker pull (in fact, our orchestration layer handles the latter automatically during deployment). Since the database container would already have all the data it needed, no volume would be required. Finally, deployments could be accomplished with a single step, telling the orchestration layer to create a Postgres service using the appropriate image.
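As a sketch of the resulting workflow (the registry host, image name, and tag are illustrative, and the deployment command assumes Docker Swarm as the orchestration layer):

```sh
# The image is pulled from our private registry automatically during deployment;
# creating the database service is a single step
docker service create --name theme-a-db registry.example.org/lbd/theme-db:theme_a-1.4.0-abc1234
```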

This is not to say option 3 would be without drawbacks. The size of our container images would almost certainly be larger than the equivalent archive-format dump files, even when subtracting the size of the lower layers of the image (i.e. the size of the parent image). The process of building a database in the top, writeable layer of Docker’s union file system would likely be slower compared with writing to a conventional file system, thus increasing the time needed to build a database image. Finally, I’ve been unable to determine conclusively whether there’s an upper limit on the size of an image Docker and related technologies are able to handle. As the Local Burden of Disease project has progressed, database sizes are getting larger and larger, and I worry that a time may come when a database will exceed the maximum possible image size. For our immediate needs, though, the advantages of building databases as static images seemed to outweigh the disadvantages.

Creating a Database Image

I’ll now describe, on a technical level, how we build databases as static images. The recommended approach for creating a Docker image is to write a Dockerfile. Dockerfiles are great, because they provide:

  1. transparency — allowing us to see the precise instructions that were used to create an image, and
  2. repeatability — an image can be rebuilt by replaying the instructions, and the result should be more or less the same each time, assuming care is taken to control for the variability of external resources, like software downloaded from the internet during the build process.

For the purposes of building a database, though, the Dockerfile approach is less than ideal. We would have to copy the source data files and the database builder utility into the image. That would mean a larger final image size, including many files that were needed in the build process but are not needed at runtime. Of course we could delete those extra files after building the database, but because of the layered design of Docker images, adding files in one layer and then deleting them in a subsequent layer does not actually achieve the desired effect of reducing the image size. Another problem is that we would need to run the database server and the database creation utility concurrently during the build process, and the Dockerfile format does not provide a simple way to do that.
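To make the layering problem concrete, here is a sketch of what such a Dockerfile might look like (the base image, paths, and build script are hypothetical):

```dockerfile
# Illustrative base image name
FROM postgis-base:10
# These two layers permanently add the source data and the builder utility to the image
COPY data/ /build/data/
COPY builder/ /build/builder/
# Deleting /build here only masks the files in a new layer; the COPY layers above
# retain their full size (and, as noted, the database server isn't even running
# during this RUN step)
RUN /build/builder/build-database.sh && rm -rf /build
```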

Luckily, there is another way to build a Docker image. Since a container is essentially just the top, writeable layer of an image, one can run a container, update its contents, and then save the result as an image with the command docker commit. In the general case, this approach is not as good as using a Dockerfile, because we don’t record the instructions that were used to build an image, and the process is therefore not as easy to repeat for subsequent rebuilds. For our purposes, though, it’s preferable, because we can keep the database builder utility and the source data files separate from the database image we’re building. To address the shortcomings in transparency and repeatability, we use strict versioning for the database builder utility, and we tag each database image with both the version of the utility used to create it and the revision of the data itself (i.e. the Git commit hash, since we store our data in Git LFS repositories).

When building a database image, we use docker-compose to run two containers linked together, the PostGIS database server (initially empty) and the database creation utility (choroscope-db). Here’s a simplified sketch of the docker-compose.yml setup (the image names, volume paths, and variable names below are illustrative):
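```yaml
version: "3"
services:
  postgis:
    # An empty PostGIS server image. Note that the image must keep its data
    # directory out of any declared VOLUME, or docker commit won't capture the data.
    image: postgis-server:10
    environment:
      POSTGRES_PASSWORD: "${POSTGRES_PASSWORD}"
  choroscope-db:
    # The database creation utility, strictly versioned
    image: choroscope-db:${BUILDER_VERSION}
    depends_on:
      - postgis
    environment:
      DB_HOST: postgis
      DATA_REVISION: "${DATA_REVISION}"
    volumes:
      # Scratch space for the data downloaded via Git LFS
      - "${DATA_DIR}:/data"
```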

Note that placeholders of the form ${SOME_NAME} allow us to parameterize the file at run time. To run the database build, we invoke docker-compose like this:
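```sh
# The ${...} placeholders in the compose file are supplied from the shell
# environment (or a .env file)
docker-compose up --abort-on-container-exit
```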

This command will run the stack until choroscope-db completes the build process. When choroscope-db exits, its container will shut down. Because we use the flag --abort-on-container-exit, that will also cause the docker-compose command to exit, and with it the PostGIS container. The exit code of docker-compose will be the same as the exit code of the choroscope-db container, making it simple to determine whether the build succeeded or failed. Now we can save the stopped postgis container as an image, supplying a tag incorporating the version of choroscope-db and the revision of the data as described above:
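```sh
# The container name and image tag are illustrative; the tag combines the
# choroscope-db version with the Git revision of the data
docker commit dbbuild_postgis_1 registry.example.org/lbd/theme-db:theme_a-1.4.0-abc1234

# Copy the finished image to our private registry
docker push registry.example.org/lbd/theme-db:theme_a-1.4.0-abc1234
```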

Revised System Design Using Database Images

Once the system for building database images was in place, I turned my attention to revising the application stack so that a single deployment could be linked to an arbitrary number of containerized database servers. Where previously a single containerized Postgres server held all the databases used by the application, and a table on that server maintained a registry of these databases, now we had separate containers for each database. This more modular design would allow us to add and remove databases at will, but we needed somewhere to store the registry of databases connected to the application and a mechanism for creating and updating the registry.

We already had an instance of Redis in our application stack, used by the API server for caching frequently used data. This seemed like an obvious place to store the registry. Since the cache was already stateful, I didn’t see any harm in adding an additional piece of state to it. For managing the registry, I added a new service to the application stack, a utility called cache-sync. cache-sync would be given (via an environment variable) a list of database hosts to which it should connect. On startup, it would request some basic information from each database. Using this information, it would then create a registry of connected databases in Redis, or else update the registry if it already existed. After the startup phase, cache-sync would continue to poll the databases on a schedule, looking for changes to their content. Any change would cause cache-sync to update the registry in Redis and clear any other cache entries related to the database(s) that had changed. The API server, meanwhile, would use this registry to determine what databases were available for it to query.
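As a rough illustration (the key layout and field names here are hypothetical, not the actual schema), the registry cache-sync maintains might look something like this in Redis:

```sh
# One hash per connected database, plus an index of known themes
redis-cli HSET registry:theme_a host theme-a-db port 5432 data_revision abc1234 builder_version 1.4.0
redis-cli SADD registry:themes theme_a
```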

[Diagram: the new modular application design]

Conclusion

The new modular application design using database container images has served us well as the Local Burden of Disease project grows. We’re able to build our databases offline, store them in an image registry, and deploy them reliably wherever and whenever we need them. The image tags we use make it easy to determine the contents of a given database image. The loose coupling between databases and application stack allows for very flexible arrangements. We can have any number of application stacks (including separate deployments for different internal and external audiences) linked to any number of databases. Multiple application stacks can even share some or all of the same databases. Managing state within the application stack proved to be tricky, and there have been a number of bugs to work out related to the cache-sync service. Still, the system is working as designed, and maintaining it is now far easier than at any time in the past, even with the increase in the number of databases and instances of the application stack.

