Quantifying Trade-Offs Between Active and Passive Clusters

Koonseng Lim
Published in DBS Tech Blog · Nov 19, 2020

Should you put all your eggs in one basket or build a chain of links?


Find other interesting tech blogs written by some of my colleagues at DBS Tech Blog

I’m a site reliability engineer, and so notions of reliability are something that I profess to be somewhat obsessed with. I was therefore very interested when I overheard a recent water cooler conversation between 2 engineers (whom I’m not acquainted with) that went something like this:

Kim: The users wanted more reliability, so we refactored our app into a hundred independent microservices! …

David: Wouldn’t that make it more prone to failure since you now have more possible points of failure?

Kim: No, not really, we’re running them on an active-active setup. That’s supposed to give you maximum reliability, right?

David: If you spread them out everywhere then all it takes is one machine to fail and your service could potentially fail, no?

Kim: But spreading them out makes sense also, haven’t you heard of the saying “Don’t put all your eggs in one basket”?…

David: Well, in that case, may I remind you that “The weakest link breaks the chain”?

As I listened to them, each passionately making their case, I realized that in a way, both sides were correct to some extent. But whose solution is better and under what circumstances? What are the hidden assumptions that they each make?

This led to the following chain of thoughts in my head (pun intended). Consider the following scenario:

You have 2 identical servers (A and B), each of which independently fails 10% of the time. Your application consists of 2 microservices, s1 and s2. Both microservices have to be up for the application to function. Assume that a microservice will be up if the server on which it runs is up.

In Kim’s case, you would run one microservice on each server. In David’s case, you would run both on one server and keep the other as a spare. This is illustrated by the 2 figures below:

Kim’s Approach

If we followed Kim’s approach, then the application availability is:

P(app up) = P(A up AND B up)

Because of independence between A and B, we can express this as:

P(app up) = P(A up) × P(B up) = 0.9 × 0.9 = 0.81
David’s Approach

If we followed David’s approach, we would run both s1 and s2 on server A, and if server A fails, run s1 and s2 on server B. The availability of the application is thus:

P(app up) = 1 − P(A down AND B down) = 1 − 0.1 × 0.1 = 0.99

So, David’s solution was better! In this case, his configuration provided a much higher availability (0.99) than Kim’s setup (0.81).
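
To make the comparison concrete, here is a minimal sketch in Python (my own illustration, not part of the original conversation) that reproduces both numbers:

```python
# Two servers, each independently up with probability p = 0.9.
p = 0.9

# Kim: one microservice per server, so the app needs BOTH servers up.
kim = p * p                  # 0.81

# David: everything on server A, with B as a spare; the app is down
# only when BOTH servers are down.
david = 1 - (1 - p) ** 2     # 0.99

print(f"Kim (active-active split): {kim:.2f}")
print(f"David (active-passive):    {david:.2f}")
```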

A word on assumptions …

In order to keep things tractable and to focus the discussion on concepts rather than secondary details, we assume that the application runs 24/7 and that the servers do not require maintenance. Furthermore, we assume that requests (coming to the microservices) are uniformly distributed throughout the life span of the application.

General Service Availability for given p and N

From the above, we can generalize this expression for M microservices, N servers (M > N), and a probability p of each server being up:

Availability = p^N

In our case (p = 0.9), this availability falls below 0.1 for N > 21, meaning that with more than 21 servers, the service availability is less than 10%: a far cry from the 90% of a single server, and an order of magnitude less than the 99% of David’s setup.

For David’s setup, the equivalent expression is:

Availability = p + p(1 − p) + p(1 − p)² + … + p(1 − p)^(N−1)

Intuitively, the k-th term is the probability that the first k − 1 servers fail before the k-th server succeeds (a negative binomial distribution with parameters p and r = 1, truncated at N; see https://en.wikipedia.org/wiki/Negative_binomial_distribution).

Alternatively, summing the series, it can also be seen as one minus the probability that all N servers are down:

Availability = 1 − (1 − p)^N
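
A quick way to sanity-check both general expressions, including the N > 21 claim above, is to evaluate them directly. The helper names below are my own:

```python
def availability_split(p: float, n: int) -> float:
    """Kim's approach: microservices spread over n servers, ALL must be up."""
    return p ** n

def availability_failover(p: float, n: int) -> float:
    """David's approach: the app fails only if ALL n servers are down."""
    return 1 - (1 - p) ** n

p = 0.9
for n in (1, 2, 5, 10, 21, 22):
    print(f"N={n:2d}: split={availability_split(p, n):.4f}  "
          f"failover={availability_failover(p, n):.6f}")
# availability_split(0.9, 22) ≈ 0.0985 is the first value below 10%,
# matching the N > 21 observation above.
```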

The two plots below show the application availability for various values of p and N.

Service Availability Comparison between Kim’s vs David’s Approach

From the chart, it’s evident that David’s approach is almost a mirror image of Kim’s: its availability increases with more servers, while Kim’s decreases.

At this point, astute readers would have spotted the primary weakness in Kim’s approach. Namely, the constraint:

The service fails if ANY microservice fails

What if we introduced redundancy so that each microservice is replicated?

Replication in Active-Active Configuration

In such a scenario, Kim’s approach would be to replicate both microservices s1 and s2 on both servers.

Active-Active Replication

If a request came in for a microservice on a machine that had failed, the request would be routed to the second machine.

The probability of the application working would now be:

P(app up) = 1 − P(both servers down) = 1 − 0.1 × 0.1 = 0.99

Wow! What an improvement! This brings her reliability score up to par with David’s Active-Passive setup.

In general, for N servers, the Active-Active setup offers the same availability as Active-Passive, which is:

Availability = 1 − (1 − p)^N
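
If you would rather not trust the algebra, a small Monte Carlo simulation (my own sketch, with hypothetical function names) agrees with the closed form:

```python
import random

def simulate_replicated(p: float, n_servers: int, trials: int = 100_000) -> float:
    """Monte Carlo check: with every microservice replicated on every server,
    the application is up as long as at least ONE server is up."""
    up = sum(
        1 for _ in range(trials)
        if any(random.random() < p for _ in range(n_servers))
    )
    return up / trials

p, n = 0.9, 2
print(f"Simulated:   {simulate_replicated(p, n):.4f}")  # ≈ 0.99
print(f"Closed form: {1 - (1 - p) ** n:.4f}")           # 0.99
```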

Cost of Running Servers

Although both Active-Active and Active-Passive configurations yield similar availability, let’s examine the cost of running them on infrastructure that offers pay-per-use charging (e.g. cloud providers such as AWS, GCP, etc.).

Let’s assume that the cost of running a server per unit time is C.

For Active-Active, the expected cost of operating 2 servers is thus (2 × 0.9)C = 1.8C. Note that we do not pay for the 10% of the time that each server is down.

For Active-Passive, on the other hand, since you pay for server B only when server A fails, the cost is 0.9C + (0.9 × 0.1)C = 0.99C. This is because server A is up (and billed) 90% of the time, while server B covers 90% of the remaining 10%.

Thus Active-Active is almost twice as expensive as Active-Passive for the same level of availability.
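
These two cost figures are easy to verify with a few lines of Python (again, my own back-of-envelope sketch):

```python
# Expected pay-per-use cost per unit time, in multiples of C, for the
# 2-server case with p = 0.9.
p = 0.9

# Active-Active: each of the 2 servers is billed whenever it is up.
cost_aa = 2 * p                       # 1.8 C

# Active-Passive: A is billed while up (p); B is billed only while
# A is down and B itself is up ((1 - p) * p).
cost_ap = p + (1 - p) * p             # 0.99 C

print(f"Active-Active:  {cost_aa:.2f} C")
print(f"Active-Passive: {cost_ap:.2f} C")
print(f"Ratio:          {cost_aa / cost_ap:.2f}x")  # ≈ 1.82x
```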

Generalizing this to N servers:

Cost(Active-Active) = N × p × C

Cost(Active-Passive) = (1 − (1 − p)^N) × C

(in the Active-Passive case, exactly one server is running, and billed, whenever the service is up). The cost ratio between Active-Active and Active-Passive is thus:

Np / (1 − (1 − p)^N)

which, in the limit, tends to Np, because p < 1 and (1 − p)^N tends to 0 for large N.
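
A short sketch (my own) makes the limiting behaviour visible:

```python
def cost_ratio(p: float, n: int) -> float:
    """Active-Active cost (N*p*C) divided by Active-Passive cost ((1-(1-p)^N)*C)."""
    return (n * p) / (1 - (1 - p) ** n)

p = 0.9
for n in (2, 5, 10, 20):
    print(f"N={n:2d}: ratio={cost_ratio(p, n):.3f}  (N*p={n * p:.1f})")
# As N grows, (1-p)^N vanishes and the ratio tends to N*p.
```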

The chart below illustrates this relationship.

This is quite alarming, as Active-Active is clearly far more expensive than Active-Passive. Why would anyone design Active-Active systems, then?

The Benefit of Active-Active vs Active-Passive

In our calculations above, we took the liberty of omitting one fact for the sake of simplification, and it worked to David’s benefit (can you guess what it was?)

We assumed that switching on server B to take over the load when server A is down is instantaneous. Clearly this is unrealistic. The time needed to power up a failover server represents downtime and should be counted against service availability.

For example, suppose bringing up a server takes 5 minutes. If we take a unit of time to be a 30-day period (60 × 24 × 30 = 43,200 mins), this constitutes roughly 0.011% of that time. Let’s denote this fraction by f.

Hence, the actual availability of David’s approach, accounting for one restart, should be:

Availability = 0.99 − f = 0.99 − 0.00011 ≈ 0.98989

In general, for N servers (and hence up to N − 1 failovers), the availability of an Active-Passive system is then:

Availability = 1 − (1 − p)^N − (N − 1) × f
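
Here is a small sketch (mine, and it assumes, per the formula above, that every failover costs one full restart window f of downtime) that reproduces these numbers:

```python
# Restart penalty for Active-Passive.
restart_mins = 5
period_mins = 60 * 24 * 30            # 43,200 minutes in a 30-day period
f = restart_mins / period_mins        # ≈ 0.000116, the ~0.011% quoted above

p, n = 0.9, 2
availability = 1 - (1 - p) ** n - (n - 1) * f
print(f"f = {f:.6f}")
print(f"Adjusted availability: {availability:.5f}")  # ≈ 0.98988
```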

The significance of this difference really depends on the business value of the application and the loss of revenue during the interim period when the secondary server is being brought up.

One way to capture this in our calculations is to express the business value of the application as a factor of C, the infrastructure cost of a server. Let’s denote this by V per unit time.

For both approaches to be equivalent, the cost of the loss of availability due to server restarts must be offset by the gain in cost savings of not having the servers always up.

Thus,

V × f = Cost(Active-Active) − Cost(Active-Passive)

For our 2-server system, this works out to V = (1.8C − 0.99C) / 0.00011 ≈ 7,363C. Simply put, if the value generated by the application in a 30-day period is more than 7,363 times the cost of running a server over the same period, then a 2-server Active-Active configuration will, in effect, cost less than a 2-server Active-Passive configuration: the loss of revenue from a single server restart would exceed the cost of running the additional server full time.

To put this into perspective, let’s consider a cloud provider that offers a VM with 8 vCPUs and 32 GiB of RAM at USD $0.3328/hour. Over a 30-day period, this works out to USD $239. If our service generates more than 7,363 × 239 = USD $1,759,757 during that period, the loss of revenue from 5 minutes of downtime will exceed the monthly bill for running the secondary server. The figure below plots this threshold for varying levels of server reliability and numbers of servers.
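
A final sketch (my own, using the VM price assumed above) ties the dollar figures together:

```python
# Break-even business value V per 30-day period, using the article's figures:
# 2 servers, p = 0.9, f = 0.00011, and a VM at USD 0.3328/hour.
hourly_rate = 0.3328
C = hourly_rate * 24 * 30             # ≈ USD 239 per 30-day period
p, f = 0.9, 0.00011

cost_aa = 2 * p * C                   # Active-Active:  1.8 C
cost_ap = (1 - (1 - p) ** 2) * C      # Active-Passive: 0.99 C

V = (cost_aa - cost_ap) / f
print(f"C = USD {C:,.0f}")
print(f"V = {V / C:,.1f} x C = USD {V:,.0f} per 30-day period")  # ≈ 7,363.6 C ≈ USD 1.76M
```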

As server reliability increases, so does the running cost of a server (because it is up more of the time), and correspondingly the amount of revenue the application must generate to justify the expense. Understandably, this also goes up with the number of servers, but only sub-linearly. This is because the probability of failure of the whole application decreases exponentially with each additional passive server: a new server is only started if ALL earlier servers have failed. It is interesting to note that for this scenario, the revenue requirement tops out at around USD $2.2M, regardless of the number of servers or the reliability of each VM.

While it is common knowledge that lower revenue-generating applications warrant lower reliability requirements and can tolerate more downtime, the models above give us a quantitative tool to tie these two measures together precisely. Beyond raw server compute costs, the models can be embellished to account for other operational costs, such as monitoring and manpower, or risk measures such as reputational risk. As site reliability engineers, it is imperative that we understand these trade-offs well.

I never did stick around to overhear the rest of the conversation between Kim and David. However, I did make a point of hanging around the water cooler more frequently, just to see if I might chance upon them again, so I could tell them that I’ve found a way to decide whether putting your eggs in one basket or building a chain of links is the wiser choice 😊.


Koon Seng leads the SRE group at DBS Middle Office where they build cool stuff, experiment with AI DevSecOps and learn a new thing or two every day!