As our system grows in scale, we will inevitably come to the conclusion that serving all those incoming traffic from just a single machine containing our app is no longer feasible. Upgrading your server’s hardware could easily mean that your operational cost would go through the roof. It could hurt your business.
In such situations, one of the solutions that people often choose is to put the components of the app into their own separate machines, or clusters of machines. That way, scaling and optimization can be handled independently for each component. Those machines will be connected through a network and communicate to each other via message passing (for example through HTTP).
Many recognize this approach as distributed computing, in particular a microservices architecture, where your app is decomposed into network-connected smaller apps each with their own well-defined responsibility, so that each part of the system is easier to understand, develop, and test. This approach works well and is common at big companies such as Netflix, Facebook, and even Traveloka.
However, those solutions do not come for free — there are costs involved.
For one, although it will be easier to understand what each component does, it is often harder to have a holistic understanding of how the whole system works. That includes the challenge of operational stuff such as deployment, monitoring, and logging.
Another important cost, which we are going to focus on in this article, is the fact that they are connected through networks.
In software engineering world, there are a set of assertions known as the fallacies of distributed computing. They are false assumptions one new to developing distributed applications commonly have. Although there are eight fallacies in total (see the linked wikipedia article for more), I want to draw your attention to the first three of them:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
As the name suggests, these are fallacies, and in the world of distributed systems, they are not true. Network experiences outages. Latencies go to seconds, more so if multiple network hops is involved. Limited bandwidth causes bottlenecks. They translate into one thing: failure.
Dependencies fail, (almost) all the time
Let’s consider a single component, say a service, in a distributed system. For it to serve a specific purpose, the service may need to connect to other components, such as a database, another service, or third-party APIs. We often call these components the dependencies of the service.
If there’s one thing every software engineer must know about distributed systems, it’s that dependencies fail. If they don’t, they will.
It could be the case of a faulty deployment of a service, a sudden surge of traffic, a maintenance of the third party APIs, or even one of the engineers’ bad day. Therefore, it is necessary to build your services in a way that they are resilient not only to logic failures, but also network ones involving their dependencies.
Types of Dependency Failures
In essence, failures related to dependencies can be categorized into three:
- General errors, which in most programming languages are usually represented by exceptions.
- High latency, when your usual 3ms calls shoot up to 2s or more.
- System overload, when your service is calling a dependency more often than the dependency can handle.
A huge volume of these failures can easily make your service go haywire, such as having high open files, maxed CPU and memory utilization, and many other anomalies which inevitably would cause your service to go down under, and in the worst case may cause downtime or outages.
(Seriously, I have seen the latency of a service degraded and triggered high open files alerts because one of its connected cache clusters was having a hardware degradation. Ironically, the cache is intended to boost performance and is not really crucial, so there’s clearly something wrong.)
In this case, we say that the failures cascaded from the dependencies to the service itself. But it shouldn’t be that way!
For companies doing online business, outages can cost you dearly: in revenue, or worse, in customers’ trust. Hence, it makes the most logical sense to protect your systems from anything that may potentially cause cascading failures.
That is where circuit breakers come into the picture.
Just as how software design patterns took some ideas from the constructions of buildings, the concept of circuit breakers in software development was adopted from its counterpart in electrical engineering. An electrical circuit breaker is designed to protect a circuit from damage caused by a short circuit by interrupting the flow if a fault is detected.
The basic premise of circuit breakers in software engineering is quite straightforward:
If a dependency fails, flooding it with requests is unlikely to bring it back up.
In short, if a dependency is experiencing failures, the best thing a client can do is to stop sending requests to it and give the dependency a chance to (hopefully) recover. Flooding it with requests may actually make it worse. Much like how you cool down your laptop when it is overheating — you don’t keep spawning programs and hope that it will get better.
The process of stopping requests to the dependency is called short-circuiting or tripping. If a circuit breaker is short-circuiting, it is said that the circuit is open, whereas the opposite state where requests can go through is called closed state.
There are many open-source circuit breaker implementations. One of the most popular is Hystrix from the folks at Netflix. There is also Resilience4j which provides circuit breaking capability and more resiliency features. The two are Java-based, but you can most likely find implementations in your preferred language by doing a Google search for
circuit breaker <language>.
So how do circuit breaker implementations detect a fault? That is, how should it know when to trip the circuit so that subsequent requests are blocked?
The answer to this question is related to the three types of failures we talked about previously. Requests are deemed faulty if it returned failures. The three types of failures are measured differently, namely:
- General errors are measured by counting the number of exceptions occured.
- High latencies are measured by counting requests that exceed specified timeouts.
- System overloads are measured by the use of size-controlled thread pools and counting rejected executions due to pool exhaustions.
Most implementations have the concept of error rate threshold, which is the percentage of request failures at a given timespan. The default value for this percentage is usually 50%, meaning that if 50% or more requests ended up in failures, the circuit breaker should short-circuit and block further requests.
Some libraries also allow clients to define a request volume threshold, which is the threshold after which the circuit breaker should consider short-circuiting.
Suppose that a call to a dependency fails, either it timed out, excepted, or short-circuited. What should we do?
Most circuit breaker implementations allow you to define fallbacks, failover computations that should be done in case of failures. The tricky point about fallbacks is that they are largely contextual — most of the time, only clients know what they should do in case of such error: should they return
null? Default values? Empty object? Should they propagate the exception? Should they get a fallback value from a secondary storage?
In situations where the result of the call is not critical for the core business flow, for example in getting user reviews of a product, we might be able to get away with returning empty results or null values. For more critical flows, such as a call to payment service to process user’s payment, it doesn’t make sense to “silently fail” and proceed. In that case, it probably would be better to abort the operation, e.g. throwing an HTTP 500 error.
This is one important point in programming defensively against failures: we have to include the possibility of failures in our design, and plan the fallback mechanism accordingly. This would lead us to engineer a more robust and resilient system.
Remember that circuit breakers are not about “not throwing errors”, but rather isolating your components from dependency errors. It is perfectly okay to propagate an exception if the situation needs it.
So far we have only talked about short-circuiting and blocking requests in case of dependency failures. What happens when the dependency recovers?
Fortunately, most if not all circuit breaker implementations have an auto-recovery mechanism. It works by going into what is called the half-open state, where it allows a small number of requests to go through and act as “scouts”. If they can successfully call the dependency, they will signal the circuit breaker that all is well and the circuit can be closed again. Other than those few requests, all the other ones are still short-circuited when the circuit is half-open.
Most libraries also allow clients to determine how long the circuit breaker should “sleep” after short-circuit is triggered before going half-open, and how many successful tries it should take before indicating that the circuit is safe to be closed.
This auto-recovery mechanism allows us to confidently fail fast and rapidly recover. In a world where failures are inevitable, it is a really desired feature.
Is it a silver bullet?
Hopefully we have come to the same conclusion that circuit breakers can help us build a complex distributed system that is resilient to failures. Unfortunately, it is not a silver bullet — at least not one that we can easily shoot just like that.
We cannot just put circuit breakers with default settings into all our external dependency calls and hope for the best. We should first do our due diligence in determining the importance and criticality of each dependency and how they should degrade gracefully via fallbacks.
Determining the error measurements is also important: how long should the timeouts be for a given operation? How large should the thread pool size be? We must do the calculations beforehand (hint: Little’s Law can prove to be helpful), and in some cases you might need to include the “business people” in your assessments and discussions.
That being said, if you work on a complex enough distributed system, circuit breakers should definitely be in your toolbox to deal with unreliable dependencies over the network.
Embrace failures. Design for resiliency.
I acquired all this knowledge as I’m dealing with real-life engineering problems with my team while working at Traveloka, one of the largest online travel companies in Southeast Asia. If you’re a software engineer interested in tackling real scalability and reliability issues in a large distributed system which allows users to create moments with their loved ones, have a look at the opportunities on Traveloka’s careers page!
Suggestions and corrections for this article are welcome.