Understanding the ins and outs of distributed systems

Miah Md Shahjahan
10 min read · Apr 27, 2024

Knowing the ins and outs of distributed systems is key for backend engineers and anyone else working with large-scale systems. These systems might handle heavy loads, store a lot of data, or focus on being fast and reliable. One of the most interesting books I have found on this topic is Understanding Distributed Systems.

In this article, I will continue exploring how to build resiliency into complex distributed systems by explaining downstream and upstream resiliency.

Downstream Resiliency

Downstream resiliency refers to the ability of a component or subsystem in a distributed system to gracefully handle failures or disruptions in the components it depends on, known as its downstream dependencies. In other words, downstream resiliency ensures that a component can continue to function correctly even if the components it relies on experience issues. In this section, I will discuss patterns that protect a service from failures of its downstream dependencies:

  • Timeout
  • Retry: exponential backoff, retry amplification
  • Circuit breaker

Timeout

When a network call is made, it’s best practice to configure a timeout. If the call is made without a timeout, there is a chance it will never return. Network calls that don’t return lead to resource leaks. The role of timeouts is to detect connectivity faults and stop them from cascading from one component to another. In general, timeouts are a must-have for operations that can potentially never return, like acquiring a mutex.

Modern HTTP clients in languages such as Java, .NET, and Go do a better job and usually come with default timeouts. For example, .NET Core's HttpClient has a default timeout of 100 seconds.

How do we determine a good timeout duration?

One way is to base it on the desired false timeout rate. For example, suppose we have a customer service calling an order service, and we are willing to accept that 0.1% of downstream requests that would have eventually returned a response time out (i.e., a 0.1% false timeout rate). To accomplish that, we can configure the timeout based on the 99.9th percentile of the downstream service's response time.
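As a concrete sketch, here is how a client-side timeout might be configured in Go. The 2-second value and the URL are illustrative assumptions; in practice the timeout would come from the percentile analysis described above.

package main

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

func main() {
    // Client-wide timeout covering the whole request lifecycle
    // (connecting, sending the request, reading the response).
    client := &http.Client{Timeout: 2 * time.Second}

    // A per-request deadline can also be set with a context.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com/orders", nil)
    if err != nil {
        fmt.Println("building request failed:", err)
        return
    }

    resp, err := client.Do(req)
    if err != nil {
        // A timeout surfaces here as an error instead of hanging forever.
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}

The client-wide Timeout bounds every request made through the client, while the context deadline lets an individual call set a tighter limit.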

It's also important to have good monitoring in place to measure the entire lifecycle of a network call. The main point is that we have to measure what happens at the integration points of our systems, or we are going to have a hard time debugging production issues.

Retry

By now, it’s established that a client should set a timeout when initiating a network request. However, what should we do if the request fails or times out? At this point, the client faces two options: it can either opt for a fail-fast approach or attempt to retry the request. If a short-lived connectivity issue caused the failure or timeout, then retrying after some backoff time has a high probability of succeeding.

However, if the downstream service is overwhelmed, retrying immediately after will only worsen matters. This is why retrying needs to be slowed down with increasingly longer delays between the individual retries until either a maximum number of retries is reached or enough time has passed since the initial request.

Exponential Backoff

To set the delay between retries, we can use a capped exponential function, where the delay is derived by multiplying the initial backoff duration by a factor that doubles after each attempt, up to some maximum value (the cap):

delay = min(cap, initial_backoff × 2^attempt)

For example, if the cap is set to 8 seconds, and the initial backoff duration is 2 seconds, then the first retry delay is 2 seconds, the second is 4 seconds, the third is 8 seconds, and any further delay will be capped to 8 seconds.

Although exponential backoff does reduce the pressure on the downstream dependency, it still has a problem. When the downstream service is temporarily degraded, multiple clients will likely see their requests failing around the same time. This will cause clients to retry simultaneously, hitting the downstream service with load spikes.

To avoid this herding behavior, we can introduce random jitter into the delay calculation. This spreads retries out over time, smoothing out the load to the downstream service:

delay = random(0, min(cap, initial_backoff × 2^attempt))
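To make the formulas concrete, here is a small Go sketch of capped exponential backoff with full jitter. The function name is mine, and the 2-second and 8-second values simply mirror the example above rather than being recommendations.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

const (
    initialBackoff = 2 * time.Second
    maxBackoff     = 8 * time.Second // the cap
)

// backoffDelay returns a random delay in [0, min(cap, initialBackoff * 2^attempt)).
func backoffDelay(attempt int) time.Duration {
    exp := initialBackoff
    for i := 0; i < attempt && exp < maxBackoff; i++ {
        exp *= 2 // initialBackoff * 2^attempt, doubling step by step
    }
    if exp > maxBackoff {
        exp = maxBackoff
    }
    // Full jitter: pick a random point in the interval to spread retries out.
    return time.Duration(rand.Int63n(int64(exp)))
}

func main() {
    for attempt := 0; attempt < 5; attempt++ {
        fmt.Printf("attempt %d: sleeping %v before retrying\n", attempt, backoffDelay(attempt))
    }
}

A retry loop would sleep for backoffDelay(attempt) before reissuing the request, giving up after a maximum number of attempts.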

Alternatively, a process can park a failed request into a retry queue. The same process, or possibly another, can read from the same queue later and retry the failed requests.

When should we refrain from Retrying?

Just because a network call can be retried doesn’t mean it should be. If the error is not short-lived, for example, because the process is not authorized to access the remote endpoint, it makes no sense to retry the request since it will fail again. In this case, the process should fail fast and cancel the call right away.

Retry amplification

Imagine that handling a user request requires going through a chain of three services. The user's client calls the customer service, which calls the order service, which in turn calls the product service.

If the intermediate request from the order service to the product service fails, should the order service retry it or not? Well, if the order service does retry it, the customer service will perceive a longer execution time for its request, making it more likely to hit its own timeout. If that happens, the customer service retries the request, making it more likely for the client to hit its timeout and retry in turn.

Retry amplification in action

Having retries at multiple levels of the dependency chain can amplify the total number of retries: the deeper a service is in the chain, the higher the load it will be exposed to due to retry amplification. And if the pressure gets bad enough, this behavior can easily overload downstream services. That's why, when we have long dependency chains, we should consider retrying at a single level of the chain and failing fast in all the others.

Circuit Breaker

Imagine the customer service encounters persistent failures from a downstream dependency. Retrying requests will degrade performance for its clients and potentially impact the entire system, so it needs a mechanism to handle non-transient failures effectively.

To deal with non-transient failures, we need a mechanism that detects long-term degradations of downstream dependencies and stops new requests from being sent downstream in the first place. After all, the fastest network call is the one we don’t have to make. The mechanism in question is the circuit breaker, inspired by the same functionality implemented in electrical circuits.

The circuit breaker aims to prevent a failing subsystem from dragging down the rest of the chain, such as the customer service calling the order service, which calls the product service. It achieves this by temporarily stopping requests to the failing subsystem. Once the subsystem recovers, the circuit breaker resumes allowing requests.

A circuit breaker can be implemented as a state machine with three states: open, closed, and half-open.

Circuit breaker state machine

When the circuit breaker is closed, it allows network calls to pass through. It monitors failures like errors and timeouts, and if they exceed a certain threshold within a set time, the circuit breaker opens.

In the open state, network calls fail immediately. Since an open circuit breaker can impact the business, it's important to handle non-critical dependencies gracefully. Instead of halting entirely, services should degrade smoothly, ensuring essential functions continue even if non-critical features are unavailable.

After a period of time, the circuit breaker moves to the half-open state, giving the downstream dependency another opportunity. In this state, the next call is sent to the downstream service. If it succeeds, the circuit breaker returns to the closed state; if it fails, it reverts to the open state.
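A minimal Go sketch of such a state machine might look like the following. The field names, the consecutive-failure threshold, and the single-probe half-open behavior are simplifying assumptions, not a canonical implementation.

package breaker

import (
    "errors"
    "sync"
    "time"
)

type state int

const (
    closed state = iota
    open
    halfOpen
)

var ErrOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    mu               sync.Mutex
    state            state
    failures         int
    failureThreshold int           // consecutive failures before opening
    openTimeout      time.Duration // how long to stay open before probing
    openedAt         time.Time
}

func New(failureThreshold int, openTimeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{failureThreshold: failureThreshold, openTimeout: openTimeout}
}

// Call runs fn through the breaker, tracking successes and failures.
func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    if cb.state == open {
        if time.Since(cb.openedAt) < cb.openTimeout {
            cb.mu.Unlock()
            return ErrOpen // fail fast without calling downstream
        }
        cb.state = halfOpen // give the dependency another chance
    }
    cb.mu.Unlock()

    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        if cb.state == halfOpen || cb.failures >= cb.failureThreshold {
            cb.state = open
            cb.openedAt = time.Now()
        }
        return err
    }
    // Success: close the breaker and reset the failure count.
    cb.state = closed
    cb.failures = 0
    return nil
}

A caller would wrap each downstream call, for example err := cb.Call(func() error { return callProductService() }) with a hypothetical callProductService, and treat ErrOpen as a signal to degrade gracefully.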

Upstream Resiliency

Upstream resiliency refers to a component's capability to withstand failures or excessive load coming from the components that depend on it. In other words, upstream resiliency ensures that a component remains robust and operational even if the components relying on it misbehave or overwhelm it with requests. In this section, I will discuss the following patterns:

  • Load shedding
  • Load leveling
  • Rate limiting: single process and distributed implementations

Load Shedding

A server can become overwhelmed with requests, leading to slow performance or even unavailability. To manage this, it can reject excess requests when it reaches its capacity. This process, known as load shedding, involves the server quickly refusing incoming requests and returning a “Service Unavailable” status code (503). The server might prioritize which requests to reject based on factors like their priority or how long they’ve been waiting. However, even rejecting requests incurs some overhead, like opening a connection or reading the request, which can still strain the server. As a result, while load shedding can provide some relief, if the load continues to increase, the server’s performance will ultimately degrade.
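A rough sketch of this idea in Go, using an in-flight request counter and a 503 response once an assumed capacity limit is exceeded (the limit and the handler are hypothetical):

package main

import (
    "net/http"
    "sync/atomic"
)

// maxInFlight is an assumed capacity limit; in practice it would be
// derived from load testing the service.
const maxInFlight = 100

var inFlight int64

// shed rejects requests with 503 Service Unavailable once the server is at capacity.
func shed(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if atomic.AddInt64(&inFlight, 1) > maxInFlight {
            atomic.AddInt64(&inFlight, -1)
            http.Error(w, "service unavailable", http.StatusServiceUnavailable)
            return
        }
        defer atomic.AddInt64(&inFlight, -1)
        next.ServeHTTP(w, r)
    })
}

func main() {
    hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    http.ListenAndServe(":8080", shed(hello))
}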

Load Leveling

Another option besides load shedding is called load leveling. Here, a messaging channel is set up between clients and the service. This channel helps manage the flow of requests, allowing the service to handle them at its own pace. Load leveling works well for smoothing out sudden spikes in demand, but if the service can’t catch up, it will accumulate a backlog of requests, which can cause problems of its own.

The channel smooths out the load for the consuming service
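The following Go sketch only illustrates the idea with an in-process buffered channel; in a real system the channel would be an external messaging service sitting between clients and the service.

package main

import (
    "fmt"
    "time"
)

func main() {
    // The buffered channel plays the role of the messaging channel,
    // absorbing bursts that the consumer cannot process immediately.
    queue := make(chan string, 100)

    // The consumer drains the queue at its own pace.
    go func() {
        for job := range queue {
            time.Sleep(100 * time.Millisecond) // simulated processing time
            fmt.Println("processed", job)
        }
    }()

    // A burst of requests lands in the queue instead of hitting the
    // consumer all at once.
    for i := 0; i < 20; i++ {
        queue <- fmt.Sprintf("request-%d", i)
    }
    close(queue)
    time.Sleep(3 * time.Second) // wait for the consumer to catch up
}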

Load shedding and load leveling don't directly tackle an increase in demand; they only protect the service from being overwhelmed. To handle more demand, the service must be scaled out. That's why these protection methods are often used alongside auto-scaling, which detects when the service is under heavy load and automatically increases its capacity to manage the extra demand.

Rate-Limiting

Rate-limiting, or throttling, is a mechanism that rejects a request when a specific quota is exceeded. A service can have multiple quotas, e.g., for the number of requests or bytes received within a time interval. Quotas are typically applied to specific users, API keys, or IP addresses.

For example, if a service with a quota of 10 requests per second per API key receives on average 12 requests per second from a specific API key, it will, on average, reject 2 requests per second from that API key.

Single-process rate limiting implementation

We will start with a single-process implementation first and then extend it to a distributed one.

To limit requests to 2 per minute per API key, we can use a method that tracks request times. Initially, we might use a list to store timestamps for each request. Periodically, we clean out timestamps older than a minute. However, this method becomes memory-intensive with many requests. To save memory, we can divide time into fixed intervals, like 1-minute buckets, and count requests in each bucket instead.

Buckets divide time into 1-minute intervals, which keep track of the number of requests seen.

A bucket contains a numerical counter. When a new request comes in, its timestamp is used to determine the bucket it belongs to. For example, if a request arrives at 12:00:18, the counter of the bucket for minute 12:00 is incremented by 1.


Bucketing compresses request information without growing with the number of requests. Now, with this memory-efficient setup, how can we enforce rate limits? We can implement rate limiting using a sliding window that moves across buckets in real time, tracking the requests within it. The window’s length aligns with the quota time unit, like 1 minute. However, because the sliding window can overlap with multiple buckets, we must calculate a weighted sum of bucket counters to determine the requests within the window.

A bucket’s weight is proportional to its overlap with the sliding window

Although this is an approximation, it’s a reasonably good one for our purposes. And it can be made more accurate by increasing the granularity of the buckets. To summarize, this approach requires two counters per API key, which is much more efficient in terms of memory than the naive implementation storing a list of requests per API key.
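Here is a single-process sketch of the two-bucket sliding-window approximation in Go. The type and method names are my own, and details such as evicting idle API keys are left out.

package ratelimit

import (
    "sync"
    "time"
)

// SlidingWindowLimiter keeps only the current and previous 1-minute bucket
// per API key, as described above.
type SlidingWindowLimiter struct {
    mu      sync.Mutex
    limit   int // max requests per minute per API key
    buckets map[string]*keyBuckets
}

type keyBuckets struct {
    currentMinute int64 // Unix minute the current bucket refers to
    current       int   // requests seen in the current bucket
    previous      int   // requests seen in the previous bucket
}

func New(limit int) *SlidingWindowLimiter {
    return &SlidingWindowLimiter{limit: limit, buckets: make(map[string]*keyBuckets)}
}

// Allow reports whether a request for the given API key fits in the quota.
func (l *SlidingWindowLimiter) Allow(apiKey string, now time.Time) bool {
    l.mu.Lock()
    defer l.mu.Unlock()

    minute := now.Unix() / 60
    b, ok := l.buckets[apiKey]
    if !ok {
        b = &keyBuckets{currentMinute: minute}
        l.buckets[apiKey] = b
    }

    // Roll the buckets forward if time has moved on.
    switch {
    case minute == b.currentMinute+1:
        b.previous, b.current, b.currentMinute = b.current, 0, minute
    case minute > b.currentMinute+1:
        b.previous, b.current, b.currentMinute = 0, 0, minute
    }

    // Weight the previous bucket by how much of it still overlaps the
    // sliding 1-minute window ending now.
    elapsed := float64(now.Unix()%60) / 60.0
    estimate := float64(b.previous)*(1-elapsed) + float64(b.current)

    if estimate >= float64(l.limit) {
        return false
    }
    b.current++
    return true
}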

Distributed rate limiting implementation

When multiple processes handle requests, local state isn't enough to enforce quotas across all instances, so a shared data store is needed to track the total number of requests per API key. A first idea is to store two integers per API key per bucket in that store; however, concurrent updates from different instances can cause race conditions.

We could avoid this with transactions, but they're slow and costly. Atomic operations such as get-and-increment or compare-and-swap are better options. And rather than updating the store on every request, updates can be batched in memory and flushed asynchronously, reducing both the load on the data store and the service's dependency on it.

Servers batch bucket updates in memory for some time, and flush them asynchronously to the data store at the end of it.
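The batching idea could be sketched in Go roughly as follows; the Store interface and its AtomicIncrement method are hypothetical placeholders for whatever shared data store and atomic operation the system actually uses.

package ratelimit

import (
    "sync"
    "time"
)

// Store abstracts the shared data store; AtomicIncrement stands in for an
// atomic get-and-increment operation offered by the real store.
type Store interface {
    AtomicIncrement(apiKey string, bucket int64, delta int) error
}

// BatchingCounter accumulates per-bucket increments in memory and flushes
// them to the store periodically, instead of once per request.
type BatchingCounter struct {
    mu      sync.Mutex
    pending map[string]map[int64]int // apiKey -> bucket -> count
    store   Store
}

func NewBatchingCounter(store Store, flushEvery time.Duration) *BatchingCounter {
    c := &BatchingCounter{pending: make(map[string]map[int64]int), store: store}
    go func() {
        for range time.Tick(flushEvery) {
            c.flush()
        }
    }()
    return c
}

// Record notes one request for the API key in the given bucket.
func (c *BatchingCounter) Record(apiKey string, bucket int64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.pending[apiKey] == nil {
        c.pending[apiKey] = make(map[int64]int)
    }
    c.pending[apiKey][bucket]++
}

// flush pushes the accumulated deltas to the shared store.
func (c *BatchingCounter) flush() {
    c.mu.Lock()
    batch := c.pending
    c.pending = make(map[string]map[int64]int)
    c.mu.Unlock()

    for apiKey, buckets := range batch {
        for bucket, delta := range buckets {
            // Errors are ignored for brevity; a real implementation would
            // retry or fall back to the last known state, as discussed below.
            _ = c.store.AtomicIncrement(apiKey, bucket, delta)
        }
    }
}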

What happens if the data store is down?

Remember the CAP theorem’s essence: when there is a network fault, we can either sacrifice consistency and keep our system up or maintain consistency and stop serving requests. In our case, temporarily rejecting requests just because the data store used for rate-limiting is not reachable could damage the business. Instead, it’s safer to keep serving requests based on the last state read from the store.

Conclusion

In summary, failures will happen, but by designing systems resiliently, we can mitigate their impact. By incorporating downstream and upstream resiliency, we can maintain system stability, reliability, and availability despite challenges and failures.

I’m constantly delighted to receive feedback. Whether you spot an error, have a suggestion for improvement, or just want to share your thoughts, please don’t hesitate to comment/reach out. I truly value connecting with readers!

Reference

Understanding Distributed Systems, Roberto Vitillo

How to build resilience in large-scale distributed systems, Gojek Blog

The Pragmatic Engineer Blog

Bring Resilience to Modern Distributed Systems
