Planned outages can make systems at Google more reliable.
In Embracing Risk, I wrote about how making sure a system is unavailable on a regular basis lets you be confident that people know how to cope when it's down. The context there was a system that management judged deserving of only 99% reliability.
Chubby is a Google system at the other end of the spectrum. (Apache ZooKeeper is an example of an open source lock server.)
Imagine that Chubby runs with an SLO of 99.99%: that allows roughly 13 minutes of downtime per quarter.
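For a sense of scale, the downtime budget implied by an availability SLO is simple arithmetic. A minimal sketch, assuming a 90-day quarter (the exact figure shifts slightly with the quarter's length):

```python
# Downtime allowed by an availability SLO over a 90-day quarter.

def allowed_downtime_minutes(slo: float, days: int = 90) -> float:
    """Minutes of downtime permitted over the period by the SLO."""
    total_minutes = days * 24 * 60
    return (1 - slo) * total_minutes

print(round(allowed_downtime_minutes(0.9999), 1))  # ~13 minutes per quarter
```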
Fortunately, Chubby is so stable and well behaved that it is often far more reliable than that.
Google runs two variants:
- Local Chubby, which serves consistent data inside a single Google datacenter.
- Global Chubby, which provides consistent global data storage.
The Global Chubby Planned Outage
From the SRE book, Chapter 4.
Written by Marc Alvidrez
Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region. Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
When a service depends on Global Chubby being up in order to serve requests, any Global Chubby outage becomes an outage for that service too. It is therefore essential that services be resilient to these failures.
Even worse than a service failing outright because of a global dependency like this are the secondary effects. Imagine impacts like:
- Losing consistent global storage could result in widespread data corruption.
- Servers running out of memory because too many requests block waiting for data that is delayed while the lock service is not responding.
- Multiple unrelated products all having problems at the same time, making it seem like every product broke at once.
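The memory-exhaustion failure mode above has a well-known defence: bound how many requests may wait on a slow dependency at once and shed the overflow, rather than letting blocked work pile up. A hypothetical sketch (the `BoundedDependency` wrapper and its names are invented for illustration):

```python
# Hypothetical load-shedding sketch: cap the number of in-flight calls
# waiting on a slow dependency, rejecting the excess instead of letting
# blocked requests accumulate until the server runs out of memory.

import threading

class BoundedDependency:
    def __init__(self, fetch, max_waiters: int = 100):
        self._fetch = fetch  # the call into the (possibly slow) dependency
        self._slots = threading.BoundedSemaphore(max_waiters)

    def call(self, *args):
        # Refuse immediately when all waiter slots are taken.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("overloaded: too many waiters")
        try:
            return self._fetch(*args)
        finally:
            self._slots.release()
```

Failing fast like this turns a slow-dependency incident into bounded, visible errors instead of an unbounded memory leak of parked requests.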
Clearly a solution is required. Saying “please don’t do this” is one approach, but unexpected systemic failures are easy to introduce, so a real defence is needed.
The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
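The decision behind a synthesized outage can be expressed as error-budget arithmetic. A hypothetical sketch, not Google's actual mechanism; the names and the 90-day quarter are illustrative assumptions:

```python
# Hypothetical sketch of the planned-outage decision: if real failures
# haven't consumed the quarter's error budget, burn the remainder with
# a controlled outage. Numbers and names are illustrative only.

QUARTER_MINUTES = 90 * 24 * 60
SLO = 0.9999

def planned_outage_minutes(observed_downtime_min: float) -> float:
    """Minutes of controlled downtime to synthesize this quarter."""
    budget = (1 - SLO) * QUARTER_MINUTES   # ~13 minutes per quarter
    remaining = budget - observed_downtime_min
    return max(0.0, remaining)

# A quarter with only 2 minutes of true downtime leaves ~11 minutes
# of budget to burn deliberately; a bad quarter leaves nothing.
print(round(planned_outage_minutes(2.0), 1))
print(planned_outage_minutes(20.0))
```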
The Chubby planned outages were the first of their kind at Google that I am aware of, but they are not the last. This solution works best when the system is the combination of:
- Reliable enough that a casual observer might think that it’s up 100% of the time.
- Important because multiple similarly important systems depend on it.
- Possible to cope without it, through a fallback mechanism.
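The last point, a fallback mechanism, can take many shapes. One common pattern is serving a last-known-good value when the dependency is down; a minimal sketch, where the `LockService` API and `ConfigReader` wrapper are invented for illustration:

```python
# Hypothetical fallback sketch: read through the lock service with a
# deadline, falling back to the last known-good value when the service
# is unavailable. The LockService interface here is an assumption.

class LockServiceUnavailable(Exception):
    pass

class ConfigReader:
    def __init__(self, lock_service):
        self._lock_service = lock_service
        self._last_known_good = None  # possibly stale, but serviceable

    def read(self, key: str, deadline_s: float = 0.5):
        try:
            value = self._lock_service.get(key, timeout=deadline_s)
            self._last_known_good = value  # refresh the fallback copy
            return value
        except LockServiceUnavailable:
            if self._last_known_good is None:
                raise  # no fallback yet: fail fast rather than block
            return self._last_known_good
```

Serving stale data is a deliberate trade-off: for many uses of a lock or config service, a slightly out-of-date answer is far better than no answer, though it is not appropriate where strict consistency is the whole point.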
The goal is to create a developer culture of fault tolerance between independent systems, instead of the shared fate that tightly coupled systems suffer.
I am a Site Reliability Engineer at Google. The opinions stated here are my own, not those of my company.