Part 2: Migrating Snowflake’s Metadata with FDB Replication Service

Part 1 of this blog series introduced Snowflake’s migration of FoundationDB (FDB) and characterized it as surgery on the “heart” of the Data Cloud’s metadata store. This post is about the system Snowflake’s engineering team built to make that surgery possible.

We designed our “Pole Vault” migration from FDB3 to FDB6 to leverage several large-scale features we were already developing at Snowflake, such as FoundationDB Snapshots and, more importantly, FDB Replication Service (formerly known as Snowcannon). While developing these features, we didn’t know we would also need them to migrate from Snowflake’s FDB3 version to the open source version, FDB6. The existence of the replication service helped us decide that Pole Vault would be a migration, rather than the much riskier endeavor of upgrading FDB3 in place to FDB6. This blog post, part 2 in the series, explores the history and design of FDB Replication Service and how integral it was to our migration efforts.

Replication for Availability

The development of FDB Replication Service started well before FoundationDB was open source, so our motivation back then was not the ability to seamlessly switch from one FoundationDB version to another. Rather, we recognized that we needed much better high availability and disaster recovery solutions.

FoundationDB does a good job of handling machine failures. When running across three availability zones, the typical configuration for FoundationDB clusters at Snowflake, we can survive the loss of one availability zone plus one machine in any other availability zone without losing availability and, more importantly, without losing data. In addition, our clusters store all durable state on a block storage device such as AWS Elastic Block Store (EBS), so even if we lose more machines than we can tolerate, chances are very high that we can recover. But even though we expect the risk of losing data on any FoundationDB cluster due to hardware failures to be low, we still have to be prepared for catastrophic failures.

Another possible risk is data corruption due to hardware failures. FoundationDB has some safeguards against this. Mostly, these are checksums associated with data on disk and on the network. Running on block storage also makes the probability of at-rest data corruption low.

The last risk, and the hardest to quantify, is data corruption due to software bugs, either in the application or within FoundationDB itself. Our main safeguard here is testing. But as Edsger W. Dijkstra once famously wrote, “Program testing can be used to show the presence of bugs, but never to show their absence.” In many cases corrupt data can be repaired manually, but if the damage is too extensive, we need a mechanism to recover the data.

Disaster Recovery vs. High Availability

A disaster is any event that causes a database to get into a state in which its data is unrecoverable, missing, or corrupted beyond repair. A disaster recovery solution then is one that allows us to bring back a new database instance with a correct, though usually older, state of the data. High availability solutions are usually not designed to handle all possible disasters. Instead they are designed to provide availability even if large parts of the system go down. One example could be the loss of a whole region. This is usually achieved by replicating data synchronously to a hot standby that potentially runs in another geographical location.

So disaster recovery and high availability provide different benefits. With a high availability solution, we typically don’t expect any data loss or downtime. However, if data is accidentally deleted or corrupted, as with an application bug, it is likely the standby will be corrupted as well, because all transactions are synchronously replicated. A backup mechanism, on the other hand, allows us to restore to an older version of the data from before the corruption happened. But when recovering from a backup, one has to expect a small amount of data loss and a potentially long loss of availability.

It is important to understand that disaster recovery and high availability complement each other and any serious database system will provide mechanisms for both. Our FDB replication service is a high availability solution. For disaster recovery, FoundationDB provides a backup mechanism.

FDB Replication Service Design

The FDB Replication Service is implemented as an external service, but we chose to build it on top of FDB itself, mostly because FDB already solves many of the general problems any distributed system has to solve. Furthermore, we wanted to make use of FDB’s excellent testing capabilities.

The heart of the system is the FDB Replication Cluster. This cluster serves a simple distributed, highly available queue: it runs multiple replication queue processes and manages the incoming streams so that each entry is replicated across multiple machines. An FDB cluster can then be configured to send all writes synchronously to this replication cluster. The Replication Cluster keeps this data durable and sends it asynchronously to another FoundationDB cluster, the secondary, which ingests it (see Figure 1).

FDB Replication Service Overview

Figure 1: Synchronous push from primary and asynchronous push to secondary
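To make the flow concrete, here is a minimal sketch of the idea in Python. All of the names (ReplicationQueue, append, drain_to) are hypothetical illustrations of the scheme in Figure 1, not our implementation: the primary appends each commit to the queue synchronously, while a background worker drains the queue to the secondary asynchronously.

```python
import queue
import threading
import time

class ReplicationQueue:
    """Hypothetical stand-in for the FDB Replication Cluster: a durable,
    highly available queue of committed mutations keyed by commit version."""

    def __init__(self):
        self._entries = queue.Queue()

    def append(self, version, mutations):
        # Synchronous side: in the real service each entry is replicated
        # across multiple replication queue processes before the append
        # is acknowledged back to the primary.
        self._entries.put((version, mutations))

    def drain_to(self, apply_fn):
        # Asynchronous side: a background worker applies queued mutations,
        # in order, to the secondary (or to any other destination).
        while True:
            version, mutations = self._entries.get()
            apply_fn(version, mutations)

# Toy usage: the drain loop runs in the background while the primary appends.
rq = ReplicationQueue()
threading.Thread(
    target=rq.drain_to,
    args=(lambda version, mutations: print("apply to secondary:", version, mutations),),
    daemon=True,
).start()
rq.append(100, [("set", b"key", b"value")])
time.sleep(0.1)  # give the toy drain loop a moment before the script exits
```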

This scheme of synchronous replication to a queue and asynchronous replication to a standby is very powerful, for the following reasons:

  • The standby doesn’t need to exist at all times. It could instead be brought up from a backup as needed, and the queue can be used to roll the restored data forward to the most recent change.
  • Any availability loss to the secondary will not impact the availability of the producer.
  • Failover and failback can be done cheaply and with very minimal downtime.
  • The Replication Cluster can send writes to multiple destinations, and a destination doesn’t have to be another FoundationDB cluster. This can be used, for example, to write an incremental backup to blob storage, or to migrate to a different version of FDB (see the sketch after this list).
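Because the primary only ever talks to the queue, adding a new destination is just a matter of implementing a small “apply these mutations” contract. The interface below is a hypothetical illustration of that idea rather than the actual API; a secondary FDB cluster, a blob-storage backup writer, and the FDB6 adapter described later in this post would each be one implementation of it.

```python
from abc import ABC, abstractmethod

class ReplicationDestination(ABC):
    """Hypothetical contract for a destination of the mutation stream."""

    @abstractmethod
    def apply(self, version, mutations):
        """Apply all mutations of one commit version, atomically."""

    @abstractmethod
    def last_applied_version(self):
        """Report progress so the queue knows what it can safely discard."""

class BlobStorageBackup(ReplicationDestination):
    """Sketch of an incremental backup: each batch is appended to object
    storage. `bucket` is any object with a put(key, data) method."""

    def __init__(self, bucket):
        self.bucket = bucket
        self._last = 0

    def apply(self, version, mutations):
        self.bucket.put(f"mutations/{version:020d}", repr(mutations).encode())
        self._last = version

    def last_applied_version(self):
        return self._last
```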

Partially Committed Transactions and Other Challenges

From a high-level view, this design looks rather simple. However, there are many corner cases the system needs to handle. Probably the most obvious, and also the trickiest, is that in failure cases a transaction could get committed to the Replication Cluster but not to FDB, or the other way around. This can happen because, for performance reasons, we write to the FDB cluster and the Replication Cluster in parallel.
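Here is a rough sketch of why that window exists, with hypothetical `commit` and `append` helpers standing in for the real commit pipeline: because the two writes are issued in parallel, a crash or an error between them can leave a transaction durable on one side but not the other.

```python
from concurrent.futures import ThreadPoolExecutor

def commit_transaction(primary_fdb, replication_cluster, txn):
    """Hypothetical commit path: both writes are issued in parallel so that
    neither adds its latency on top of the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fdb_done = pool.submit(primary_fdb.commit, txn)
        queue_done = pool.submit(replication_cluster.append, txn)
        # If the process dies here, or exactly one of the two futures fails,
        # the transaction may be durable on one side only. FDB's recovery
        # and the Replication Cluster then have to agree on a single version
        # and roll back everything beyond it.
        fdb_done.result()
        queue_done.result()
```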

FDB already implements an algorithm to recover from failures and roll back partially committed transactions. With the FDB Replication Service enabled, this recovery algorithm has to communicate with the Replication Cluster to reach consensus on the version the cluster has to recover to. If an FDB cluster can’t recover, for example because we lost too many machines, the FDB Replication Service can decide to forget about the existing but failed primary and find a proper recovery version on its own, allowing us to fail over to the secondary.
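As a rough illustration of that consensus step (the rule and the names below are assumptions for illustration, not the production algorithm): each replication queue process reports the highest version it has made durable, and the cluster recovers to the highest version known to a quorum, rolling back anything beyond it.

```python
def choose_recovery_version(durable_versions, quorum):
    """durable_versions: highest durable version reported by each reachable
    replication queue process; quorum: minimum number of replicas that must
    have a version before it counts as committed."""
    if len(durable_versions) < quorum:
        raise RuntimeError("not enough replicas reachable to recover")
    ranked = sorted(durable_versions, reverse=True)
    # The quorum-th largest version is durable on at least `quorum` replicas.
    return ranked[quorum - 1]

# Example: five queue processes, majority quorum of three.
print(choose_recovery_version([120, 118, 120, 117, 119], quorum=3))  # -> 119
```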

Integrating the Replication Service into FDB posed more challenges than we could list in a blog post, but we can share some. One challenge was multiple FDB recoveries happening in parallel, a case that can occur if a network partition results in multiple competing leaders. The Replication Service needs to play well with all of these recoveries until one of them succeeds. Avoiding any impact on performance was another challenge; to address it, we implemented a quorum-based replication system. The system also needs to handle disk failures, network partitions, gray failures, and so on. While these are known problems with known solutions, implementing and testing them is non-trivial.

Another problem worth mentioning is that clients need to discover the correct FoundationDB cluster, as this can change whenever we fail over. Writing to the wrong cluster has to be impossible, because that would cause data corruption. To achieve this, the FDB client has to coordinate with the FDB Replication Service, and any non-primary FoundationDB cluster has to reject incoming transactions.
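We mentioned that we implemented a quorum-based replication system to avoid performance impact. As a hedged sketch of what that means (replica counts, method names, and the acknowledgement rule are illustrative assumptions): an entry is acknowledged to the primary once a majority of replication queue processes have made it durable, so losing a minority of replicas neither blocks commits nor loses acknowledged data.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def replicate_entry(replicas, version, mutations):
    """Append one queue entry to all replicas in parallel; acknowledge as
    soon as a majority reports it durable. `replicas` are hypothetical
    objects with an append_durably(version, mutations) method."""
    quorum = len(replicas) // 2 + 1
    acked = 0
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r.append_durably, version, mutations) for r in replicas]
        for done in as_completed(futures):
            if done.exception() is None:
                acked += 1
                if acked >= quorum:
                    # Safe to acknowledge the primary's commit. (For brevity
                    # this sketch still waits for the stragglers on exit.)
                    return True
    return False  # No quorum: the primary's commit must not be acknowledged.
```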

Adapting FDB Replication Service for Pole Vault

When Apple made FDB open source, we were thrilled and a bit scared at the same time. Thrilled, obviously, because we could now make use of all the amazing changes the folks at Apple had implemented while keeping our own. Scared, because we immediately knew that our version of FDB had diverged far from the open source version. So we knew that migrating to the open source version would be much more complex than just integrating our changes into the open source repository and running an upgrade.

But here is where having spent a lot of time on the design of our high availability solution paid off. As mentioned above, FDB Replication Service was able to write a mutation stream to any endpoint, so there was no reason it couldn’t write to an FDB cluster running a very different version. We just needed to build an endpoint that could do that. Internally, we called this the external consumer adapter. (Yes, it’s a mouthful, and yes, we do need our marketing department.) The external consumer adapter is a program that receives mutations and writes them to FDB6, using the FDB6 client. FDB Replication Service controlled the adapter’s lifetime, so that only one of them could ever write data to the primary cluster.
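The real adapter consumes the replication service’s mutation stream; the fragment below is only a minimal sketch of the writing side, using the open source FDB Python bindings (fdb.open, the fdb.transactional decorator, tr.clear_range). The mutation format, batching, and function names are assumptions for illustration, not the actual adapter.

```python
import fdb

fdb.api_version(630)  # or whichever FDB6-era API version the client supports
db6 = fdb.open()      # connects using the FDB6 cluster file

@fdb.transactional
def apply_batch(tr, mutations):
    """Apply one batch of mutations received from the replication stream.
    Each mutation is assumed to be ('set', key, value) or
    ('clear_range', begin, end); the real wire format differs."""
    for mutation in mutations:
        kind = mutation[0]
        if kind == "set":
            _, key, value = mutation
            tr[key] = value
        elif kind == "clear_range":
            _, begin, end = mutation
            tr.clear_range(begin, end)
        else:
            raise ValueError(f"unknown mutation type: {kind!r}")

# Example: a tiny batch forwarded by the replication service.
apply_batch(db6, [("set", b"foo", b"bar"), ("clear_range", b"tmp/", b"tmp0")])
```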

The other problem was that we had designed FDB Replication Service to orchestrate the switch on both clusters, so we could fail over from primary to secondary and vice versa. However, doing this would have required porting FDB Replication Service to the open source version, because it has to be integrated into the commit pipeline. We solved this problem with what we called the switch (again: we’re not the marketing department). The idea is rather simple: Instead of switching the two clusters, we lock the producer, apply all missing mutations to the consumer (a process that takes at most a few seconds), and then unlock the consumer cluster. To make this all work properly, we had to modify the client to behave correctly during the Pole Vault migration. The client is aware of both clusters and only attempts to write to the secondary cluster after it verifies that the original primary is in a locked state. In other words, the client can check whether a failover has happened by trying to write to the old primary. That way, we figured, we could move to a completely different FDB cluster with little to no customer impact.
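On the client side, that check can be sketched as follows, assuming the open source behavior that transactions against a locked database fail with the database_locked error (code 1038); the helper names and the lack of retries are simplifications, not the production client.

```python
import fdb

fdb.api_version(630)
DATABASE_LOCKED = 1038  # error code raised when the cluster is locked

@fdb.transactional
def do_write(tr, key, value):
    tr[key] = value

def write_with_failover(primary_db, secondary_db, key, value):
    """Try the original primary first; if it reports that it is locked,
    the switch has happened and it is safe to use the secondary."""
    try:
        do_write(primary_db, key, value)
    except fdb.FDBError as e:
        if e.code != DATABASE_LOCKED:
            raise
        # The old primary is locked: all missing mutations have been applied
        # to the new cluster, so fail over to it.
        do_write(secondary_db, key, value)
```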

What’s Next?

The first blog post in this series gave an overview of our migration process, and this one describes the core component used to achieve it: FDB Replication Service. But making sure that FoundationDB, the FDB Replication Service, and the application work well together is imperative and requires a lot of testing. Additionally, irreversible changes to a production system are scary, so we built the migration in a way that lets us reverse the change in a worst-case scenario. We detail our testing and rollback strategy in the next blog post.
