Ephemeral Databases with ZFS

So you’ve built a scalable container orchestration framework that lets you create new dev environments in just a few seconds. Great! But what do you do about your development database? You know, the one that is ~100 GB and takes you two or three hours to clone each time you want to create a new environment? Here’s how the Greenhouse Engineering team brought that clone time down from hours to milliseconds using Mesos, Docker, and ZFS.

The Background

When Greenhouse started 4 years ago, we chose to use Heroku to host our application. Heroku is great for getting startups up and running fast, but we knew we’d outgrow it one day. To plan for that day, we started working on our own application deployment and infrastructure management platform (known internally as Dajoku) to keep up with our growing customer base and engineering team.

The Problem

One of the challenges we face as a growing engineering team is continuing to ship features regularly. As teams grow, things tend to slow down as more process is introduced and mitigating risk takes priority. To combat this, we’ve tried to keep our process as lightweight as possible. Features are deployed within a few days of development being done. During that time, the feature is deployed to an integration environment for some manual QA before it ships to production.

When we were 5 developers, we had one integration environment (called dev) on Heroku and that was enough. The environment was always free when someone needed it. Then, we started hiring and we needed to create dev-2. A few months later, dev-3. A year later, we were up to dev-9 and had a spreadsheet to coordinate who could use what and when.

This was terrible!

  • QA became a bottleneck as we simply didn’t have enough places to deploy our feature branches for testing.
  • Although these environments were relatively easy to create thanks to heroku fork, they would quickly get out of sync or into inconsistent states. For example, if we introduced a new environment variable, it would need to be manually set in 9+ places.
  • All these integration environments were expensive! We were spending nearly $5,000/month on dev environments alone.

No one was happy about this. We clearly needed a better way to manage things. Ideally, we’d have cheap, unlimited environments that were easy for developers to create and destroy.

The Solution

Well, remember Dajoku? As all this was happening, our internal deployment system was slowly and steadily being built out. It wasn’t quite production-ready yet, but it was complete enough for developers to start playing around with.

Dajoku was built (on Mesos and Marathon) to make creating and deploying applications easy. The only requirement for deploying an application on Dajoku is that it must be dockerized. Once dockerized, applications can be created and deployed at will. Our dev environments should be a perfect use case!

But, we had another requirement. Each of these ephemeral applications would also need to be backed by an ephemeral database.

In development, we load a scrubbed production database, with all Personally Identifiable Information removed, so that developers can work against realistic data. Scrubbing, dumping, and restoring the database takes a non-trivial amount of time. Thankfully, as part of some other work, we already had the scrub and dump (the most time-intensive portion) happening in the background, so a recently scrubbed database is always available. This helps, but each environment would still need to wait 5 hours to restore its own database.

Waiting around for 5 hours is not an option. So, what could we do?

  • Do we really need a scrubbed database? What if we seeded the database with some basic data instead? This was quickly shot down by developers as both ineffective to work with and onerous to maintain.
  • What if we restored the database to an EBS volume, then mounted additional volumes for each environment? Hypothetically possible, but this setup is pretty complex and not something we wanted to maintain.
  • What if we created a Docker image with Postgres and our scrubbed database already loaded? Each environment could then run this image. This turned out to be hopelessly inefficient, as reads and writes to Postgres have to traverse each layer of the Docker image to modify the filesystem.

ZFS to the rescue!

Thinking about filesystems got us thinking about ZFS.

ZFS is an advanced filesystem that’s great for things like data integrity and scaling. ZFS uses a copy-on-write strategy for writing data: when writing new data, it doesn’t overwrite the existing data in place. Instead, it writes what has changed to a new location. This makes creating snapshots of the filesystem cheap and easy.
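
To see why snapshots are so cheap, here’s a quick illustration. This is a minimal sketch; the pool name tank and the file paths are made up for the example.

# "tank" is a hypothetical pool name used only for this demo.
zfs create tank/example
dd if=/dev/urandom of=/tank/example/data.bin bs=1M count=1024              # write 1 GB of data
zfs snapshot tank/example@before                                           # instant; no data blocks are copied
dd if=/dev/urandom of=/tank/example/data.bin bs=1M count=1 conv=notrunc    # overwrite just 1 MB
zfs list -t snapshot tank/example@before                                   # USED grows only by the changed blocks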

So, what if we restored our database to a ZFS volume and took a snapshot of it? That way, we only need to restore once, in the background, and can then quickly clone the snapshot on demand to give each environment its own copy of the database.

Yes. Perfect. Let’s do that!

Here’s our final process:

1. Each day, we create a ZFS volume and restore our scrubbed database there. Doing this restore creates all the Postgres index files on disk. (A fuller sketch of how the restore and clone steps fit together follows this list.)

zfs create greenhouse/scrubbed_db_${date}

2. Then, we take a snapshot of the filesystem in this state.

zfs snapshot greenhouse/scrubbed_db_${date}@scrub

3. When developers create a new environment, we clone the snapshot to create a scrubbed database within minutes!

zfs clone "$(latest_snapshot)" "greenhouse/${DAJOKU_HOSTNAME}"

4. We run a cron job to clean up any unused volumes. If a volume is still mounted, that means its environment is still being used by a developer, and we leave it alone. (A sketch of what the helper below might look like also follows this list.)

unused_scrubbed_filesystems | xargs -n1 -r zfs destroy -r
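
For the curious, here’s roughly how the restore and clone steps fit together end to end. This is a minimal sketch rather than our exact tooling: the mountpoints, the dump path, and the Postgres invocations are illustrative assumptions, and ownership and permission handling is elided.

# Step 1 (daily, in the background): create the dataset and restore into it.
zfs create -o mountpoint=/zfs/scrubbed_db_${date} greenhouse/scrubbed_db_${date}
initdb -D /zfs/scrubbed_db_${date}
pg_ctl -D /zfs/scrubbed_db_${date} -l /tmp/restore.log start
createdb greenhouse_scrubbed
pg_restore -d greenhouse_scrubbed /backups/latest_scrubbed.dump   # hypothetical dump path
pg_ctl -D /zfs/scrubbed_db_${date} stop   # stop cleanly so the snapshot is consistent

# Step 2: snapshot the quiesced data directory.
zfs snapshot greenhouse/scrubbed_db_${date}@scrub

# Step 3 (per environment): clone the snapshot and run that environment's
# Postgres directly on the clone. The clone shares blocks with the snapshot,
# so it appears in milliseconds regardless of database size.
zfs clone "$(latest_snapshot)" "greenhouse/${DAJOKU_HOSTNAME}"
zfs set mountpoint=/zfs/${DAJOKU_HOSTNAME} "greenhouse/${DAJOKU_HOSTNAME}"
pg_ctl -D /zfs/${DAJOKU_HOSTNAME} start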
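
And here’s one way the unused_scrubbed_filesystems helper from the cron job might work. Our real helper may differ; the idea is simply to emit every dataset under the pool that is no longer mounted.

# Hypothetical implementation: a dataset that isn't mounted has no running
# environment attached to it, so it's safe to destroy.
unused_scrubbed_filesystems() {
  zfs list -H -r -o name,mounted greenhouse \
    | awk '$2 == "no" && $1 != "greenhouse" { print $1 }'
}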

This works really well for us! Developers can spin up a running environment within a minute, which lets us create an environment for each feature in progress. The isolation this provides has benefited other aspects of development as well. For example, previously, a developer would test their database schema changes in a development environment and leave the database in an unexpected state. Then, when the next developer used the environment, they would encounter errors and unexpected behavior. This is no longer an issue, since each developer is guaranteed a fresh, prod-like database.


Diana Liu is a Tech Lead on the Product Engineering team.

Mike McClurg is the Director of Infrastructure.

Come work with us. We’re hiring!
