Load Shedding in Web Services

Mourjo Sen
helpshift-engineering
Sep 4, 2020 · 10 min read

Load shedding is a design pattern used by high performance web services to detect and fail gracefully when there is traffic congestion. When the service cannot handle incoming requests, load shedding is a strategy to identify the requests we should serve in order to prevent larger cascading problems across our distributed systems.

This is a two-part series. In this post, we discuss the different aspects of load shedding and why it should be part of the design of web services. In the second post, we will implement our strategy and corroborate it with experiments.

Effect of insurmountable load

Web services play a pivotal role as they are the juncture between demand from users and back pressure from dependent services. Other than the service itself malfunctioning, outages in web services may arise due to the following causes:

  1. Surge in incoming traffic or slashdotting: A sudden increase in demand for a service can delay responses, which leads to a growing backlog of pending requests that the service cannot keep up with.
  2. Degradation of downstream dependency: If a database becomes slow in responding to queries from a web service, requests will start getting queued up at the web service even if the rate of incoming traffic is the same. Degraded dependencies cause a back pressure which the service has to cope with.

Sometimes both causes form a self-reinforcing feedback loop which breaks the system completely. For example, if the general rate of requests increases, it can cause the database to degrade in terms of response time, which can slow down request service time even further, a vicious cycle that is hard to recover from. Collectively, we refer to these two effects as traffic congestion.

Web services are sandwiched between demand from users and back pressure from dependent services.

Robust web services demonstrate the following qualities:

  • Save the service from dying: Load can have an adverse effect on the health of the web service, for example: out of memory errors, unresponsiveness due to continuous garbage collection, thrashing.
  • Graceful communication: A graceful error code indicating that the server cannot process a request is a better experience than the client timing out or having its connections refused by the service. Co-operative clients can then use these error codes to back off and retry at a later time. For HTTP services, these error codes can be 429 (too many requests) or 503 (service temporarily unavailable), potentially with headers that indicate a backoff period, like the “Retry-After” header (a minimal sketch follows this list). This also allows the web service to record the necessary information for future assessment, like logging of unserviced requests.
  • Prevent load from cascading into dependencies: A surge can impact internal services like databases, which can cause a much bigger incident which is significantly harder to recover from.
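As an illustration of graceful communication, here is a minimal sketch of what such a rejection could look like, written against the Java Servlet API; the class name, the 30-second retry hint and the choice of status code are our own illustrative assumptions, not part of any particular framework.

    import java.io.IOException;
    import javax.servlet.http.HttpServletResponse;

    // A minimal sketch of a graceful rejection. The names and the retry hint
    // are illustrative assumptions, not taken from a specific implementation.
    public final class ShedResponse {

        // Respond with 429 (too many requests) and a Retry-After hint so that
        // co-operative clients can back off instead of timing out on their own.
        public static void reject(HttpServletResponse response) throws IOException {
            response.setStatus(429);                  // or 503, depending on the situation
            response.setHeader("Retry-After", "30");  // seconds; an illustrative value
            response.getWriter().write("Overloaded, please retry later");
            // This is also a natural place to log the unserviced request for
            // future assessment.
        }
    }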

To sum up, a robust web service’s responsibility in real-time incidents goes much beyond ensuring its own good health. It needs to effectively communicate its unserviceability to clients while protecting critical internal services.

Load shedding is a pre-mortem strategy to deliberately reduce service during traffic congestion, in the interest of preventing a wider incident or a total collapse of the service.

Load shedding goals

Dealing with load is a two-fold endeavour. Firstly, we need to define a threshold in terms of system resources that constitutes the maximum “load” the current system will serve normally; anything beyond this is a candidate for “shedding”. Secondly, we need to define the behaviour of the system when “shedding” starts.

A server comprises a queue of requests waiting to be served and threads processing those requests. Together, the waiting requests and the in-process requests are called concurrent requests.

We will consider “load” to be defined in terms of two properties: the number of concurrent (in-flight) requests and the time a request spends waiting to be serviced. We will explain why we chose these two properties rather than more commonly measured properties like CPU usage or memory utilisation. Our load-shedding service should therefore have the following traits:

  1. Limiting the number of concurrent requests: The number of open requests in the system should be bounded so that the amount of work a server has to do does not pile up.
  2. Bounded wait-time: The time a request spends in the queue waiting for a server thread is often overlooked, primarily because the request queue is maintained not by the application but by the web server container, such as Jetty or Tomcat. However, when the load is high, the wait-time becomes the dominant factor. Just limiting the number of concurrent requests is not enough, because requests still spend time waiting for service, which can have an adverse effect on the quality of service, as we explain later in this post.
  3. Disaster recovery by unblocking clients: Once the system has detected the condition for shedding load, the service should communicate this to the client. This has a three-fold utility. Firstly, this helps clear out system resources like open TCP connections and request context in memory. Secondly, this provides a way to communicate to clients that the service is under load, which in the absence of a load-shedding strategy, could mean that the client never sees a response from the server and is forced to time out on its own. Thirdly, by closing client connections for old requests, it frees up resources that can be utilised to respond to new requests.

#1 Limiting the number of concurrent requests

In queuing theory, Little’s Law says that in the stationary (stable) state of a system, the following condition holds true:

L = λ W

L = Average number of requests being processed
λ = Average rate of incoming requests
W = Average time a request takes

If the average time to serve a request (W) increases, or if the average request rate (λ) increases, the average number of requests that are open in the system (L) will also increase. In other words, if the request rate increases or the time taken to serve requests increases, the load in the system will build up. Since real-world servers have finite resources — total number of threads, maximum queue size — load can render the service sluggish at best and unusable at worst.
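To make this concrete, here is a small worked example with illustrative numbers: if a service receives λ = 100 requests per second and each request takes W = 0.2 seconds on average, then L = 100 × 0.2 = 20 requests are open in the system at any time. If a degraded database pushes the average request time up to W = 2 seconds, the same traffic keeps L = 100 × 2 = 200 requests open, a ten-fold increase in concurrent work with no change in demand, which a server sized for a few dozen worker threads cannot hold.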

Little’s Law is independent of the number of servers or the number of threads on a server or how powerful the servers are or any other detail about the hardware or software the server runs on.

Little’s Law is independent of such factors and can be applied to systems as a whole or to individual servers. If the rate of incoming requests to a single server increases, that server must handle more requests concurrently; failing to do so means requests get queued up. This applies to a single server, to the system as a whole, and even to individual CPU cores. Little’s Law is profound because, as David Simchi-Levi puts it, it is “not influenced by the arrival process distribution, the service distribution, the service order, or practically anything else”.

As an example, a tried and tested application of Little’s Law is TCP’s flow control strategy. Irrespective of the number of packets pending to be sent, the number of in-flight packets at any time is bounded by the window size, so that neither the receiver nor the network is overwhelmed. Unlike TCP, the number of requests made to our services cannot be flow controlled. Hence the service itself must shed load as a defence mechanism against surges in demand or degraded service time of dependent services. Limiting the number of concurrent requests is the first step in that direction.
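As a minimal sketch of that first step, the following Java servlet filter bounds the number of concurrent requests with a semaphore; the bound of 100, the class name and the Retry-After hint are illustrative assumptions, and a real limit would come from capacity planning.

    import java.io.IOException;
    import java.util.concurrent.Semaphore;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;

    // A sketch of bounding concurrent requests with a semaphore. The bound of
    // 100 is illustrative; a real value would come from capacity planning.
    public final class ConcurrencyLimitFilter implements Filter {

        private final Semaphore permits = new Semaphore(100);

        @Override
        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            // tryAcquire() does not block: if all permits are taken, we shed
            // immediately instead of letting work pile up.
            if (!permits.tryAcquire()) {
                HttpServletResponse http = (HttpServletResponse) response;
                http.setStatus(503);
                http.setHeader("Retry-After", "10"); // an illustrative hint, in seconds
                return;
            }
            try {
                chain.doFilter(request, response);
            } finally {
                permits.release();
            }
        }
        // init() and destroy() omitted; Servlet 4.0+ provides default implementations.
    }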

#2 Bounded queueing time

After entering the system, a request spends some time in a queue before it can be served (depending on how loaded the system is); once it is picked up, a thread calls the application handler function to generate the response. The total time to respond to a request is:

R = W + S

R = Total time taken to respond to a request
W = Time spent in the request queue
S = Time taken by the handler to process the request

At a billing counter, the time to process each customer’s bill may be a constant 5 minutes, but the time a customer experiences can be a lot more if there are many people waiting in line.
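The same holds for requests, again with illustrative numbers: if the handler takes S = 100 milliseconds but the request spent W = 900 milliseconds in the queue, the client observes R = 1 second; with a client-side timeout of, say, 500 milliseconds, the response arrives after the client has already given up.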

Request processing does not begin until the server container can provide a thread to process it. Therefore, the client can observe a much longer response time than the application handler does. During traffic congestion, the time a request spends in the queue can become the dominant factor in the total time taken to get a response. If a request has spent too long in the queue, it means that there were many pending requests before it which the server could not respond to fast enough.
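One way to bound this wait, sketched below in Java, is to have a fronting proxy or load balancer stamp each request with its arrival time and have the service shed requests that have already waited too long. The “X-Request-Start” header, its epoch-millisecond format and the 2-second budget are assumptions made purely for this illustration.

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // A sketch of bounding queueing time. It assumes a fronting proxy stamps each
    // request with its arrival time (epoch milliseconds) in an "X-Request-Start"
    // header; the header name, its format and the 2-second budget are assumptions.
    public final class QueueWaitFilter implements Filter {

        private static final long MAX_WAIT_MILLIS = 2000;

        @Override
        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            HttpServletRequest http = (HttpServletRequest) request;
            String arrival = http.getHeader("X-Request-Start");
            if (arrival != null) {
                long waited = System.currentTimeMillis() - Long.parseLong(arrival);
                // If the request has already waited longer than its budget, the
                // client has likely given up; shed it instead of doing wasted work.
                if (waited > MAX_WAIT_MILLIS) {
                    ((HttpServletResponse) response).setStatus(503);
                    return;
                }
            }
            chain.doFilter(request, response);
        }
        // init() and destroy() omitted; Servlet 4.0+ provides default implementations.
    }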

The term “goodput” refers to the amount of actual or useful work that a service is able to do, while “throughput” here refers to the amount of work that is demanded of the service. As the service takes on more and more work, the response time increases to the point that no request can be answered within the client’s timeout. At this point, although the service is doing a lot of work, its goodput is close to zero, as depicted in the following diagram. In the context of load shedding, increased queueing time is a strong indication that the system is under load and, if left unchecked, it can affect the quality of service severely.

When the demanded throughput of a system increases beyond a certain value, the server’s response time becomes too high and clients stop waiting for the response. This means that although the server is working on requests, it is not useful to the client, so the amount of actual useful work, i.e. the goodput, of the server falls. Source: https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/

#3 Disaster recovery by unblocking clients

If the service is too busy processing enqueued requests, then recovery from a surge load will not be possible until all requests in the queue are finished processing, which can take a long time depending on the length of the queue. It is important to preempt requests that have already spent too long in the request queue by responding to the client with a graceful error code in order to make room available for the service to process new incoming requests. This is similar to diverting vehicles away from major crossroads during a traffic jam in order to prevent a complete shutdown of every street in the city.

Moreover, clients may choose not to implement timeouts or have very high timeouts, which can hold up system resources until the request can be processed. A typical web-scale application backend doesn’t expose itself directly to the client. A request may have to traverse multiple load balancers and request routing proxies to finally reach the origin service which can process the request. The following diagram depicts Facebook’s end-to-end infrastructure of a similar nature.

End-to-end infrastructure at Facebook, taken from “Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website” by Usama Naseer et al.

During traffic congestion, each component in a system like this will be facing the problem of accumulation of requests. It is imperative to detect when downstream request processing is taking too long in order to prevent system resources being held for too long.

Irrespective of whether the client implements a timeout on its own, each layer in the cloud infrastructure should implement a means of closing client connections to free up system resources in order to make room for serving new requests.
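On the client side, a co-operative client complements this by never waiting indefinitely and by honouring the server’s backoff hint. Here is a minimal sketch using Java 11’s HttpClient, where the URL, the timeouts and the 10-second default backoff are illustrative values, and Retry-After is assumed to be given in seconds.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    // A sketch of a co-operative client: it never waits indefinitely, and when
    // the server sheds load it honours the Retry-After hint instead of hammering
    // the service. The URL and the timeouts are illustrative values.
    public final class CooperativeClient {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))
                    .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api"))
                    .timeout(Duration.ofSeconds(5)) // do not hold the connection open forever
                    .GET()
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 429 || response.statusCode() == 503) {
                // Back off for the duration the server asked for (assumed to be in
                // seconds here), defaulting to 10 seconds if the header is absent.
                long seconds = response.headers()
                        .firstValue("Retry-After")
                        .map(Long::parseLong)
                        .orElse(10L);
                Thread.sleep(seconds * 1000);
                // ... retry the request here (retry logic elided).
            }
        }
    }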

Conclusion

It is absolutely critical for live, scalable systems to think about high-load scenarios and to design for the busiest 1% of a service’s uptime, which often causes much bigger incidents. To sum up:

  1. Capacity planning is an important aspect of productionising services and a service should not take up load it is not designed to sustain.
  2. Increasing load can cause the goodput of a service to fall to zero.
  3. It is important to limit the number of requests that are concurrently being served and shed load when there are too many requests.
  4. Queueing time is a significant portion of the response time from a loaded server. During traffic congestion, truncating the queue is necessary to recover from degradation.
  5. Clients should be unblocked to free upstream connections.
  6. It is better to fail gracefully than become completely unresponsive or initiate a cascading effect of failure on dependent systems.

The theme that we touched upon in this post is that every system, however performant, has a limit on the amount of work that it can do at a time. While performance enhancements can stretch that limit, it is a fallacy to think that there is no limit, or that there is no chance of hitting it. Data centres today consume about 2% of electricity worldwide, and it is postulated that this could rise to 8% of the global total by 2030. Much of that energy bill comes from systems trying to push the frontiers of their current capability and capacity. Acknowledging this capacity limit and planning for when it is breached makes services robust and resilient.

Load shedding is a strategy to plan for the scenario when the capacity limit of a service is approached or exceeded. It is necessary, therefore, to know the capacity of a service before productionising it. Knowing a service’s capacity and defining its behaviour when that capacity is exceeded is not just good engineering practice in preventing a wider incident; it also honours the commitment users make to us by using our services.

When capacity limits are breached, well-built services should respond to clients with reasonable error messages, protect themselves and their dependencies, and have the ability to alert if needed.

This is a two-part series. In the next post, we implement our load shedding strategy and corroborate it with experiments.
