Wrangling FoundationDB: Zero downtime OS upgrades

matthew zeier
2 min read · Nov 12, 2018

I'm an Operations Wrangler leading a production engineering team running Wavefront, and I also help run some of the largest deployments of FoundationDB. This is our story.

[This is a sneak peek of some topics I’ll be talking about at the FoundationDB Summit on Dec 10. You should come.]

Sometimes we add FoundationDB instances to a running cluster. Sometimes we contract.

Sometimes we replace all instances on a running cluster. Sometimes we do that to change instance types.

Sometimes we do that to pick up OS updates or to move from Ubuntu 14.04 to Ubuntu 18.04.

But we do this live, without disruption. And we do this on clusters that sustain > 800,000 writes/second.

[Screenshot: the Fdb Health Dashboard. This cluster is doing 1.2M writes/second.]

Magic

There’s a bit of magic we do that I’ll only mention briefly. You’ll have to come to the Summit to learn more.

On instance boot, a couple things happen:

  1. /etc/init.d/landingparty runs and turns an otherwise stateless Fdb instance into a working member of a cluster (it pulls down fdb.cluster from S3, and on a new cluster it handles bootstrapping and creating the database)
  2. an @reboot cronjob runs ansible-pull to render templated configuration files, including foundationdb.conf (it also ensures EBS volumes have matching tags)

Together, this lets us launch new Fdb instances that configure themselves and are ready to join an existing cluster.
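The boot-time flow might look roughly like this. This is a sketch, not our actual scripts: the bucket name, repo URL, and redundancy mode are illustrative, and the real landingparty logic is more involved.

```shell
# Cron entry (config fragment): run ansible-pull once at boot to render
# templated configs such as /etc/foundationdb/foundationdb.conf.
#   @reboot root ansible-pull -U https://git.example.com/fdb-config.git site.yml

# Simplified outline of what an init script like /etc/init.d/landingparty
# could do on boot (bucket name is hypothetical):
CLUSTER_BUCKET="s3://example-fdb-bootstrap"

# 1. Fetch the shared cluster file so this instance can find the cluster.
aws s3 cp "${CLUSTER_BUCKET}/fdb.cluster" /etc/foundationdb/fdb.cluster

# 2. On a brand-new cluster, create the database once.
if ! fdbcli --exec 'status minimal' | grep -q 'available'; then
  fdbcli --exec 'configure new ssd double'
fi

# 3. Start (or restart) the local FoundationDB processes.
service foundationdb restart
```

The key property is that nothing on the instance is hand-configured: everything it needs to join the cluster is pulled at boot, which is what makes fleet-wide replacement practical.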

How we replace an entire running cluster

In this example we’ll do a live fleet replacement of a relatively small 74GB key-value store across 3 hosts.

This is a test cluster with a relatively small write load, but we routinely do this to production clusters taking over 1M writes/second on key-value stores well over 200TB.

Our tools are a mix of Ansible, bash, jq, and Terraform.

The process

  1. tag running instances with a special planned_removal: true EC2 tag
  2. delete the existing instances from the Terraform state file (with jq)
  3. launch new instances
  4. start Fdb on the new instances
  5. start excludes, so FoundationDB moves data off the old instances
  6. stop the old instances, disable their termination protection, and terminate them
[Screenshot: the cluster after steps 1–3.]
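The steps above could be sketched as the following command sequence. Everything here is illustrative, not our actual tooling: the instance IDs, addresses, and resource names are examples, and the jq path into the state file assumes a Terraform 0.11-era state layout, which varies by version.

```shell
OLD_IDS="i-0123456789abcdef0"   # example: instance(s) being replaced
OLD_IP="10.0.1.5"               # example: their FoundationDB address

# 1. Tag the outgoing instances so tooling knows they are on the way out.
aws ec2 create-tags --resources $OLD_IDS \
  --tags Key=planned_removal,Value=true

# 2. Drop them from the Terraform state so the next apply launches
#    replacements instead of reusing them (state layout is an assumption).
jq 'del(.modules[].resources["aws_instance.fdb.0"])' \
  terraform.tfstate > terraform.tfstate.new

# 3-4. terraform apply launches new instances, which self-configure at
#      boot and join the cluster (see the boot-time magic above).

# 5. Exclude the old processes; fdbcli waits while the cluster
#    re-replicates their data onto the remaining machines.
fdbcli --exec "exclude ${OLD_IP}:4500"

# 6. Stop, unprotect, and terminate the old instances.
aws ec2 stop-instances --instance-ids $OLD_IDS
aws ec2 modify-instance-attribute --instance-id $OLD_IDS \
  --no-disable-api-termination
aws ec2 terminate-instances --instance-ids $OLD_IDS
```

The exclude in step 5 is what makes this zero-downtime: FoundationDB won't consider an excluded process removable until its data is safely held elsewhere.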

Ta da!

Before we started:

ubuntu@red-2b-db3-i-0b1d0ad6fe5a5dda7:~$ lsb_release -d
Description:    Ubuntu 14.04.5 LTS

And after we’re all done:

ubuntu@red-2b-db3-i-0a5cc0af209a1e928:~$ lsb_release -d
Description:    Ubuntu 18.04.1 LTS

It’s a wrap!


matthew zeier

Operations Wrangler @ Wavefront, ex-Apple, ex-Mozilla, recovering network engineer.