Wrangling FoundationDB: Zero downtime OS upgrades
Not only am I an Operations Wrangler leading a production engineering team running Wavefront, but I also work on running some of the largest deployments of FoundationDB. This is our story.
[This is a sneak peek of some topics I’ll be talking about at the FoundationDB Summit on Dec 10. You should come.]
Sometimes we add FoundationDB instances to a running cluster. Sometimes we contract.
Sometimes we replace all instances on a running cluster. Sometimes we do that to change instance types.
Sometimes we do that to pick up OS updates or to move from Ubuntu 14.04 to Ubuntu 18.04.
But we do this live, without disruption. And we do this on clusters that sustain > 800,000 writes/second.
Magic
There’s a bit of magic that we do that I’ll only mention briefly. You’ll have to come to the Summit to learn more.
On instance boot, a couple things happen:
- /etc/init.d/landingparty runs and turns an otherwise stateless Fdb instance into a working member of a cluster (it pulls down fdb.cluster from S3 and on a new cluster will handle bootstrapping & creating the database)
- an @reboot cronjob runs ansible-pull to configure templated configuration files including foundationdb.conf (it also manages ensuring EBS volumes have matching tags)
Together this lets us launch new Fdb instances that auto-configure themselves and are ready to join an existing cluster.
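The boot-time decision that landingparty makes can be sketched roughly as follows. This is not the actual script: the function name, the injected `fetch_shared` and `store_shared` callables (standing in for the S3 download and upload), the default description and ID, and the `single ssd` configuration are all assumptions made for illustration. The only things taken as given are the fdb.cluster line format (`description:ID@ip:port`) and fdbcli's `configure new` command.

```python
def landing_party(fetch_shared, store_shared, my_ip,
                  description="fdb", cluster_id="changeme", port=4500):
    """Return (cluster_file_text, bootstrap_commands) for a booting instance.

    fetch_shared() returns the shared fdb.cluster text, or None if no cluster
    exists yet. store_shared(text) publishes a newly created cluster file.
    Both are hypothetical stand-ins for the S3 calls.
    """
    existing = fetch_shared()
    if existing is not None:
        # A cluster already exists: reuse its file, nothing to bootstrap.
        return existing, []

    # Brand new cluster: this instance becomes the first coordinator.
    text = f"{description}:{cluster_id}@{my_ip}:{port}\n"
    store_shared(text)
    # Creating the database is a one-time step (redundancy mode is assumed).
    return text, [["fdbcli", "--exec", "configure new single ssd"]]
```

The returned text would be written to the local cluster file before FoundationDB starts, and any returned commands would be run once the local process is up.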
How we replace an entire running cluster
In this example we’ll do a live fleet replacement of a relatively small 74GB key-value store across 3 hosts.
This is a test cluster and write load is relatively small, but we routinely do this to production clusters taking over 1m writes/second on key-value stores well over 200TB.
Our tools are a mix of Ansible, bash & jq, and Terraform.
The process
- tag running instances with a special tag:planned_removal : true EC2 tag (output)
- delete the existing instances from the Terraform state file (with jq)
- launch new instances
- start Fdb on new instances (output)
- start excludes (output)
- stop instances, disable termination protection, terminate instances (output)
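To make the steps above more concrete, here is a minimal sketch in Python rather than the bash & jq we actually use. It covers two pieces: dropping the old instances from a Terraform state document, and building the CLI calls for tagging, excluding, and tearing down the old fleet. The function names are hypothetical, and the state handling assumes the version 3 state layout (Terraform 0.11 and earlier), where each module has a `resources` dict and each resource keeps its ID under `primary.id`. Launching the new instances and starting Fdb on them are left out.

```python
import copy


def remove_instances_from_state(state, instance_ids):
    """Return a copy of a Terraform state dict without the given EC2 instances.

    Assumes the version 3 state layout; newer Terraform versions differ.
    """
    doomed = set(instance_ids)
    new_state = copy.deepcopy(state)
    for module in new_state.get("modules", []):
        module["resources"] = {
            name: res
            for name, res in module.get("resources", {}).items()
            if not (res.get("type") == "aws_instance"
                    and res.get("primary", {}).get("id") in doomed)
        }
    # By convention the serial increases whenever the state is modified.
    new_state["serial"] = state.get("serial", 0) + 1
    return new_state


def replacement_commands(old_instances):
    """Build the CLI calls for the tag, exclude, and teardown steps.

    old_instances is a list of (instance_id, private_ip) pairs.
    """
    ids = [instance_id for instance_id, _ in old_instances]
    ips = [ip for _, ip in old_instances]
    commands = [
        # Tag the old fleet with planned_removal=true.
        ["aws", "ec2", "create-tags", "--resources", *ids,
         "--tags", "Key=planned_removal,Value=true"],
        # Ask FoundationDB to move data off the old hosts.
        ["fdbcli", "--exec", "exclude " + " ".join(ips)],
        ["aws", "ec2", "stop-instances", "--instance-ids", *ids],
    ]
    # Termination protection is cleared one instance per call.
    commands += [
        ["aws", "ec2", "modify-instance-attribute",
         "--instance-id", instance_id, "--no-disable-api-termination"]
        for instance_id in ids
    ]
    commands.append(
        ["aws", "ec2", "terminate-instances", "--instance-ids", *ids])
    return commands
```

Each command is a plain argument list, so it can be printed for review first or passed to `subprocess.run(cmd, check=True)`. In practice the stop and terminate calls should only run once the excludes have finished and the old hosts no longer hold data.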
Ta da!
Before we started:
ubuntu@red-2b-db3-i-0b1d0ad6fe5a5dda7:~$ lsb_release -d
Description: Ubuntu 14.04.5 LTS
And after we’re all done:
ubuntu@red-2b-db3-i-0a5cc0af209a1e928:~$ lsb_release -d
Description: Ubuntu 18.04.1 LTS