Understanding Distributed Databases

Distributed databases are relatively new and extraordinarily complex pieces of software; they rely on a host of distributed algorithms that most programmers have not been exposed to, such as vector clocks & Raft/Paxos. Additionally, a cross-product interface equivalent to SQL has yet to be adopted. These are two of the biggest obstacles facing programmers who are interested in learning more about distributed databases. My goal is to surmount these initials barriers to entry and start building a foundation that will allow you to reason about them effectively.

Before we begin discussing specifics we should decide on a definition for what a distributed database actually is. According to Wikipedia, “A distributed database is a database in which storage devices are not all attached to a common processing unit such as the CPU…” This definition encompasses all networked databases; however, I would like to exclude databases that rely on master-slave replication (more on this later). Thus, we will settle on the following definition: a distributed database is a database which is designed to run on an arbitrarily sized cluster of networked computers.

Since distributed databases run on a cluster of computers instead of a single machine, they can be more fault tolerant and scalable than their relational counterparts. However, this property also introduces some notable tradeoffs that you need to be aware of; most of which are the result of the theoretical limitations of distributed systems. In the next section we will cover a few important algorithms as well as their limitations; afterwords, we’ll put it all together and see where these algorithms fit into a distributed database.

Distributed Consensus is most easily explained through the use of the Two Generals’ Problem. Two generals are attempting to coordinate an attack on a city located in a valley between two hills. The generals are located on the two hills above the city and must decide the exact time to attack the city by sending a messenger through the valley and up to the other general’s hill. There is a non-zero probability that the messenger they send is captured and killed in the city below. Due to the possibility of a general’s acknowledgement being lost, achieving consensus with 100 percent confidence requires an infinite series of messengers. This analogy highlights the impossibility of reaching consensus in a distributed database given the possibility of network failure.

Simplified Categorization of Distributed Databases

The CAP Theorem states that a distributed system cannot simultaneously provide consistency (all nodes agree on the same state), availability (the system is able to determine if a request failed or succeeded), and partition tolerance (the system continues to function under network failure). This theorem has many important implications for distributed databases and limits the effectiveness of the algorithms that run on them. Although the CAP theorem states that these three guarantees cannot be fulfilled at the same time, distributed databases circumvent this constraint by providing a weaker form of consistency known as eventual consistency: operations may take an infinite amount of time with probability decreasing (usually exponentially) as time goes on.

Vector Clocks and Raft/Paxos are the bread and butter of distributed databases; they do most of the work to keep the system state synchronized. Vector clocks are usually used to synchronize individual operations by tracking the replication of the operation throughout the system. Raft or Paxos is used to achieve consensus by establishing a protocol to elect and remove leaders in the system, allowing important global operations like schema changes to be ordered throughout the system. It is beyond the scope of this article to explain these algorithms in greater depth; if you want to learn more, here are some links to get started with Raft and Vector Clocks.

So far, the algorithms we discussed are generically applicable to a wider scope of distributed systems; however, write replication is, for the most part, only applicable to distributed databases. The underlying concept of write replication is related to a fundamental goal of distributed databases: “external” consistency; the purpose of which is to create pairs of operations that are guaranteed to return consistent results. Without write replication it is possible to write to one node and try to read from another, thus failing to retrieve the most recent system state. Lets create a cluster with 3 nodes and describe all of the possible ways to get external consistency from the system. We will do this by specifying a write and read replication factor (RF): how many nodes replicate the given operation.

Replicating Writes and Reads in a Three Node System

This image is essentially a depiction of Dijkstra’s Pigeonhole
Conjecture: if n + 1 pigeons are put into n pigeonholes, there is at least
one pigeonhole with more than one pigeon. For the system to be consistent, write and read RF must be non-zero and add up to n + 1. Intuitively, this means that we must have an overlap of at least one node between the the set of nodes we write to and read from. A typical practice is to distribute the replication cost equally amongst reads and writes in the system. This is accomplished by performing operations at quorum; in other words, operations must always be replicated to a majority of nodes in the system.

Now that we have introduced the basic algorithms that distributed databases use to synchronize data across the system, we can look at the big picture and pinpoint where each algorithm is actually getting used. Lets start by tracing an operation from the client program, through the database, and back to the client.

Client sends a write to N3

The client for a distributed database usually connects to one node in the cluster and uses this node to communicate with the entire system. To start off, the client sends a write to N3 with RF set to quorum (in this case that is 2). N3 replicates this write to N2 and returns a success message to the client. At this point, both N3 and N2 share the correct & updated state of the system. Furthermore, both N3 and N2 are able to tell that their state is the most recent by comparing their vector clocks to other vector clocks in the system.

The client now reconnects to the database, using N1 as an entry point, and sends a read to N1 with RF at quorum. N1 replicates the operation to N3 and realizes that its state is stale. At this point, it has determined the most recent state among 2 nodes in the system, so it can send N3's state back to the client.

The above example illustrates why consistent reads and writes are essential when using a distributed database; if N3 or N1 did not replicate the operation the client could have received stale information. It is important to note that inconsistent behavior is not always an issue; for example, a distributed database that is serving as a cache could simply not provide any consistency guarantees.

So far, we have only discussed distributed databases in perfect conditions without network partition or failure. These failures are very possible and must be accounted for when creating distributed algorithms.

Lets consider the same system of three nodes; however, this time, one of our system admins didn’t have enough coffee and screwed up N1's routing table. Now, N1 can no longer reach N2 and N3, but the client can still communicate with N1. In this state, the client will still be able to perform operations on the database if it connects to N2 or N3, but connecting to N1 will cause all requests to hang indefinitely (or until a timeout is reached).

In many systems there are operations that require global consensus before they can be performed; for example, schema changes and configuration changes (like adding a new node to the cluster) must be ordered by all nodes before the operation can be performed. Supporting global ordered operations is significantly more difficult than guaranteeing consistency for individual operations, because they require the use of a distributed consensus algorithm such as Raft or Paxos. These algorithms describe a protocol for reaching consensus on the ordering of a log, a sequence of causally ordered events. As I mentioned earlier, it is impossible to provide an algorithm that achieves this goal in a finite amount of time, so Raft and Paxos must be eventually consistent.

Split-Brain Schema Example

Let’s go over an example that requires Raft or Paxos. Your database is running on a cluster of four nodes which are all accepting writes. N2 is exhibiting write failures and needs to be taken down and replaced with a new database node. If any of the nodes aren’t alerted of this change, then the system could exhibit erratic behavior (like replicating operations to a node that doesn’t actually exist anymore). Now, at the same time we have another hardware issue with N4 and need to replace it as well. If we try to remove these nodes by concurrently applying two separate schema changes we could end up in a “split-brain” state. In this example, the system cannot make progress because the two schema changes have no causal ordering, which means that the database cannot determine whether or not Schema A follows Schema B.

Many systems are built using two relational databases, where one serves as the master and the second as a slave. This system can work well in many cases, but often lacks the consistency guarantees you expect. If you rely on one of these systems in production and the master fails there is often no guarantee that the slave is going to have the most recent state of the system. However, the same system of two nodes running a distributed database is formally proven to be consistent for all reads performed after either of the machines fails (given that writes were performed at quorum). Additionally, a distributed database can accept client connections from both machines where as a master-slave system can only perform writes on the master.

Distributed databases are wonderful data stores in theory, but in practice tend to be far buggier than established relational databases. Some products claim to provide strong consistency guarantees, but exhibit non-deterministic inconsistent behavior which results in you losing your data. Another issue is that many products do not make their default settings evident in their respective APIs. MongoDB is an example of a product that is that has struggled with both a lack of clarity and long living consistency bugs (one of these bugs was in their product for almost 2 years). This cautionary note is not intended to discourage you from using a distributed database, but instead to encourage you to thoroughly vet the one you decide to go with.

Thanks for reading! Please leave a comment if you have any questions.