Understanding Cluster Replication Scalability
CLUSTER REPLICATION AND LOGARITHMIC SCALABILITY
If you have been using cluster replication with open source operational databases, you might have noticed that they do not scale out well. If you are interested in knowing why, this is the post to read.

Cluster replication was introduced in the mid-1990s as a way to scale out databases. The basic idea, called full replication (most commonly known as cluster replication), is to have a cluster of server nodes, each of them running a database engine with a full copy of the database. But how do we keep all replicas consistent and up to date? The strategy typically used to update the replicas is ROWAA (Read One Write All Available), where each read operation is executed on any one replica while a write operation is executed on all replicas.

So, what is the scalability of cluster replication? On one extreme of the scalability spectrum, if we only have writes in the workload, we have null scalability, since all replicas do the same work and the cluster throughput is the same as that of a single node, i.e., it does not scale. On the other extreme, if we only have reads, assuming a uniform load across replicas, we have linear scalability, i.e., a cluster with n replicas has a global throughput equal to n times that of a single node. In between, we have logarithmic scalability, that is, the cluster throughput only grows logarithmically when increasing the number of nodes. The reason is that the bigger the cluster size, the higher the wasted capacity per node. Figure 1 depicts graphically what happens. On the lower part, we see how many servers we have for a particular cluster size. The orange line indicates how much capacity of the servers is wasted, i.e., the space between the x axis and the orange line is the wasted capacity.
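To make the two extremes concrete, here is a minimal sketch of the ROWAA capacity accounting (our own illustration, not code from the post; the function name and the assumption that reads and writes cost the same are ours):

```python
# Minimal sketch (our own illustration): client throughput a ROWAA cluster can
# sustain when every node executes `capacity` operations per second and reads
# and writes cost the same.
def rowaa_throughput(n_nodes, write_fraction, capacity=1.0):
    # Average work generated per client operation: a read runs on one replica,
    # a write runs on all n replicas.
    work_per_op = (1 - write_fraction) * 1 + write_fraction * n_nodes
    return n_nodes * capacity / work_per_op

for n in (1, 2, 4, 8, 16):
    print(n,
          rowaa_throughput(n, write_fraction=1.0),   # write-only: stuck at 1.0
          rowaa_throughput(n, write_fraction=0.0))   # read-only: grows as n
```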
SCALABILITY FACTOR
But how can we actually quantify scalability? We devote the rest of the post to answering that question. Let's develop our analytical model: first intuitively, and then formalized mathematically. A database with cluster replication is able to process a number of read and write operations, that is, it is able to deliver a certain maximum throughput. We can make that throughput relative to the throughput of a single node; this is what is actually called the scale-out factor [Jiménez-Peris et al. 2003]. To get the scale-out factor, f, we simply divide the useful work, which is the actual throughput, by the total amount of work done per node, and multiply by the number of nodes (see Figure 2). The optimal scale-out factor is the size of the cluster; that is, for a cluster of n nodes, the optimal scale-out is n.
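Stated as a formula (our restatement, consistent with the worked examples that follow), the scale-out factor of an n-node cluster is:

\[
f \;=\; n \cdot \frac{\text{useful work per node}}{\text{total work per node}}, \qquad 1 \le f \le n.
\]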
EFFICIENCY OF 1-NODE CLUSTER
Let’s consider a workload with 50% reads and 50% writes. For simplicity, assume the cost of reads and writes is the same. The single node will devote half of its capacity to execute writes and the other half to execute reads. If we execute a read and a write, the throughput will be 2 operations (the read and the write) and the work done will also be 2 operations (the read and the write), so f = 2/2 = 1. This is easy (see Figure 1).
EFFICIENCY OF 2-NODE CLUSTER
Let’s now look at a cluster of two nodes. Each node wants to execute one read and one write. However, each write executed at the other node must also be executed locally; we call it a remote write. Thus, each node will do its local read, its local write, plus a remote write, meaning that 2/3 of the capacity of each node is employed for useful work. Note that this means we are wasting 1/3 of the capacity of each node doing remote writes. This is the price of full replication: executing writes everywhere. So each of the two nodes does three operations, one read and one write plus the remote write, thus: f = 2·(2/(1+1+1)) = 4/3 ≈ 1.33. In other words, the two nodes deliver the same throughput as one node plus one third of a node.
EFFICIENCY OF 3-NODE CLUSTER
Let’s take a look at a three-node cluster, and from there we can easily generalize the formula for an arbitrary cluster size. If we have 3 replicas, each replica processes 1 read and 1 write, but will also have to execute two remote writes corresponding to the writes from the other two replicas. Therefore, each replica executes four operations (the read, the write and two remote writes), but only two are useful work: f = 3·(2/(2+1+1)) = 6/4 = 1.5. With 3 replicas we attain a throughput 1.5 times that of a single node; that is, half of the 3-node cluster's capacity is wasted.
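The three worked examples follow the same per-node accounting, so a small sketch (our own illustration, under the same 50% read / 50% write, uniform-cost assumptions) reproduces them before we generalize:

```python
# Sketch of the per-node accounting used above (our own illustration): each node
# does 1 local read + 1 local write of useful work, plus one remote write for
# each of the other replicas.
def scale_out_factor(n_nodes):
    useful_ops = 2                    # 1 read + 1 write
    total_ops = 2 + (n_nodes - 1)     # plus (n - 1) remote writes
    return n_nodes * useful_ops / total_ops

for n in (1, 2, 3):
    print(n, round(scale_out_factor(n), 2))   # 1.0, 1.33, 1.5
```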
SCALABILITY ANALYTICAL MODEL
In the box below, we generalize for a cluster with n replicas and derive the mathematical analytical model. If you are not interested in the math (Figure 6 shows the derivation of the model in a visually intuitive way), just skip the box below.
With a one-node cluster, the useful work (the actual throughput) is 2 operations (1 read and 1 write) and the total work done is also 2 operations, so the scale-out factor is the amount of useful work (the two operations) divided by the total amount of work, as depicted in line “One Node” in Figure 6. With two nodes, each node still performs the same two operations of useful work, but the work actually done per node is two writes and one read, as shown in line “Two Nodes” in Figure 6. And with three nodes, each node still does two operations of useful work but a total of three writes and one read, as seen in line “Three Nodes” in Figure 6. The generalization for n nodes is easy and can be seen in Figure 6. We can also make the figures relative to one operation (we were considering two operations, one read and one write), as shown in Figure 6: the first 0.5 corresponds to the fraction of writes (50%) and the second to the fraction of reads (also 50%). To make it general for an arbitrary mix of reads and writes, if the fraction of writes is w, then the fraction of reads is (1-w). The final formula is shown in the blue box in Figure 6.
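We cannot reproduce the blue box from Figure 6 here, but following the same per-node accounting (a reconstruction under the post's assumptions: write fraction w, read fraction 1-w, uniform operation cost), the scale-out factor of an n-node cluster works out to:

\[
f(n, w) \;=\; n \cdot \frac{(1-w) + w}{(1-w) + w \cdot n} \;=\; \frac{n}{1 + w\,(n-1)}
\]

As a sanity check against the worked examples, with w = 0.5 this gives f(2, 0.5) = 4/3 and f(3, 0.5) = 3/2, and the curve flattens quickly as n grows, which is the wasted-capacity behaviour depicted in Figure 1.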