Adaptive Load Shedding

Sushmi Shrestha
Agoda Engineering & Design
Apr 25, 2022


This blog post explains why services need to be adaptive and discusses the different types of adaptive services.

TL;DR: Any production service must be protected from overload, especially if you cannot control your clients. At Agoda, we avoid overload by structuring our systems to scale proactively before they experience it. Protecting systems, however, requires multiple layers of defense. This includes automatic scaling as well as techniques for gracefully shedding excess load.

In this post, we will show you our approach to adaptive traffic shedding, which allows us to re-evaluate thresholds at runtime to protect high-load services at Agoda.

What is adaptive load shedding?

Adaptive load shedding is a way to protect a service by handling incoming requests preemptively and intelligently.

Let us start with the most common customer-facing software development scenario: “build it and ship it.” While this approach works for a while, the service’s performance starts degrading as additional features are added, and this model eventually results in production incidents.

Adding more servers is a simple solution to this issue. However, we need to ensure uptime as well. As a result, we built reactive features like timeouts, circuit breakers, thread pools, rate limiters, and more to make our services more stable and reliable.

To do so, we utilized load-testing tools like Gatling, as well as other in-house capacity measurement tools, to calculate the capacity and latency of our services and keep track of their performance.

Our main goal is to determine the maximum number of requests the service can handle while adhering to the service level agreement, as well as how the service behaves under overload.
This allows us to estimate how long the service can sustain the overloaded state and when it will crash.

We can now start creating various rules and thresholds based on these figures, such as blocking clients’ retries above a certain number. Once we hard-code these static numbers, however, we take on a maintenance overhead: we have to keep adjusting them repeatedly.
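For illustration, here is what such a static rule might look like; the threshold value and names are hypothetical, not taken from our codebase:

```kotlin
// Hypothetical static rule (illustrative only): reject a client's retries once
// they exceed a fixed, hand-tuned threshold. The number itself must be
// re-validated whenever hardware, traffic, or data volumes change.
const val MAX_RETRIES_PER_REQUEST = 3

fun shouldRejectRetry(retryAttempt: Int): Boolean =
    retryAttempt > MAX_RETRIES_PER_REQUEST
```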

So why are static capacity numbers brittle and need frequent adjustments?

At Agoda, the main reasons that we have noticed are:

  1. Different hardware used: Using the heat map in Figure 1 below, where the x-axis represents time and the y-axis indicates usage, we discovered that most servers run at 10–15% CPU utilization, while a few servers, shown by the dark box on top, run at 35–45% CPU usage. After studying this data, we confirmed that CPU utilization varies by ~50% across servers.
Figure 1. The heat map of CPU usage of a service

2. Minute-level changes in the traffic pattern: Traffic changes driven by client activity cause inconsistencies in the production environment. User activity on our sites is higher during the day, and short bursts of traffic can arrive at any moment, for example, from a full cache refresh or from bot activity.

3. Regular data growth: Business is constantly growing. As new suppliers get added, new information gets created.

4. Experiment permutations: We run several experiments that cause the static numbers to change every day, and the permutations of these experiments have an impact on the service’s performance.

5. Other factors include:

  • Round-robin load balancing per layer introduces additional variance in load (the classic N bins, M balls problem); see the sketch after this list.
  • Garbage collection (GC) pressure can lead to primary overload because requests queue up during GC pauses.
  • Slow dependencies increase internal concurrency effects such as contention and bottlenecks.
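To make the first point above concrete, here is a toy Kotlin simulation of the balls-into-bins effect. It is purely illustrative and assumes requests land on servers independently, which approximates what happens when several layers balance traffic independently of each other:

```kotlin
import kotlin.random.Random

fun main() {
    // Toy model: M requests ("balls") distributed over N servers ("bins").
    val servers = 50
    val requests = 5_000
    val load = IntArray(servers)
    repeat(requests) { load[Random.nextInt(servers)]++ }

    val mean = requests.toDouble() / servers
    println("mean load per server = $mean")
    // The busiest server typically ends up noticeably above the mean,
    // which is exactly the extra variance the bullet above refers to.
    println("max load observed    = ${load.maxOrNull()}")
}
```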

How can we fix the overloaded situation?

We have two options to fix the overloaded situation.

  1. Scale the service out elastically with Kubernetes or OpenStack. This fixes the capacity problem, but it is not the best solution for traffic spikes caused by bots, because spawning a new service instance takes time.
  2. Degrade the response of our service. This can be done in two ways: by implementing reactive solutions or through graceful degradation.

The following are a few examples of reactive and degradation procedures utilized in the industry:

The reactive solutions include:

  • Propagating TTL (time to live, request timeout) per request to avoid unnecessary work: stop processing requests that have already exceeded their TTL (see the sketch after this list).
  • Propagating Deadlines — e2e TTL.
  • Applying retry quota and backoff. Each client can retry up to X% of requests.
  • Applying rate limiters on cluster level or machine levels.
  • Applying circuit breakers.
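As a minimal sketch of the first two items (TTL and deadline propagation), the snippet below drops a request whose propagated deadline has already passed before doing any work. The names are hypothetical; this is not Agoda’s actual implementation:

```kotlin
// Hypothetical request context carrying an end-to-end deadline propagated
// from the original caller.
data class RequestContext(val deadlineEpochMs: Long)

fun <T> handleWithDeadline(ctx: RequestContext, process: () -> T): T? {
    if (System.currentTimeMillis() >= ctx.deadlineEpochMs) {
        // The caller has already timed out; doing the work now would be wasted effort.
        return null
    }
    return process()
}
```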

Graceful Degradation techniques include:

  • Traffic shedding (dropping requests — availability degradation).
  • Traffic prioritization (booking traffic/user traffic over bot).
  • Protection from retry-storm by dropping retries.
  • Back-pressure (final protection based on current buffers, input-queue busyness, or inflight requests); a minimal sketch follows this list.
  • Cut off functionality (feature degradation).
  • Returning stale results from the cache (stale data) — consistency degradation.
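As a minimal sketch of the back-pressure item above (illustrative only, not our production library), the class below rejects new work once the number of inflight requests crosses a hard cap:

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Shed requests once the inflight count exceeds a hard cap (availability degradation).
class InflightLimiter(private val maxInflight: Int) {
    private val inflight = AtomicInteger(0)

    fun <T> tryHandle(process: () -> T): T? {
        if (inflight.incrementAndGet() > maxInflight) {
            inflight.decrementAndGet()
            return null // reject: the caller can retry later or degrade gracefully
        }
        return try {
            process()
        } finally {
            inflight.decrementAndGet()
        }
    }
}
```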

How Agoda approaches graceful degradation via the gatekeeper

The gatekeeper is a traffic-shedding library that acts as ‘good cop, bad cop.’ It keeps track of current service utilization in real time, and when utilization exceeds specified limits, it starts rejecting incoming requests.

Figure 2: Bird-eye view of how service works with Success Rate (SR)

Let us suppose a realistic success rate under a standard load is 99.95%. The service success rate decreases as the load increases. Now, consider a reduction to 87.46%, as seen in the graph above. If we add additional loads to the service, it will become unresponsive and may even terminate. This is when we need to implement an adaptive strategy.

This library assesses the real-time service capacity and degrades the response based on specified rules. With the library in place, the service serves all traffic within the threshold limits; above those limits, it uses probabilistic rejection to shed the excess, making the service more adaptive to changes in traffic volume.
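To illustrate the idea of probabilistic rejection above a threshold, here is a small sketch; the linear ramp and the 80% threshold are our assumptions for illustration, not the gatekeeper’s actual logic:

```kotlin
import kotlin.random.Random

// Below the threshold, everything is served. Above it, requests are rejected
// with a probability that grows linearly with utilization, so shedding ramps
// up smoothly instead of cutting off all traffic at once.
fun shouldReject(utilization: Double, threshold: Double = 0.8): Boolean {
    if (utilization <= threshold) return false
    val rejectProbability = ((utilization - threshold) / (1.0 - threshold)).coerceAtMost(1.0)
    return Random.nextDouble() < rejectProbability
}
```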

How is adaptive load shedding done in different industries?

Different industries have different approaches to adaptive load shedding:

1. Google claims it mostly uses CPU to determine the cost of a request:

a) In platforms with garbage collection, memory pressure naturally translates into increased CPU consumption.

b) In other platforms, it is possible to provision the remaining resources in such a way that they are very unlikely to run out before the CPU runs out.

2. Facebook is focused on concurrent (inflight) requests.

3. Amazon (AWS lambda) predicts the number of resources per call with resource isolation.

4. Our pricing service (current backpressure) uses max in-flight requests as a hard cap.

Meta-principles that drive our adaptive logic in the traffic shedding library

We follow a checklist that drives the adaptive logic in the traffic shedding library. It can be summarized in the following points:

  1. Regular workloads should not be harmed. Serve as much traffic as possible, utilizing, say, 95% or more of capacity.
  2. Maintain prioritization among different traffic categories (Quality of Service). If there is sheddable traffic, do not reject vital or routine requests.
  3. Ad hoc solutions and static numbers should be avoided. Applications have different workloads, performance characteristics, traffic types, and volumes depending on the data center, so relying on predefined numbers is not a solution when it comes to dynamic traffic.
  4. Keep the internal success rate above the threshold while maximizing the external success rate.
  • The internal success rate is the service-level success rate driven by the SLO. Let us denote the success rate of the service as SRint; it can be calculated as,
Formula 1: Internal success rate
  • The external success rate is the success rate as perceived by the client. It is measured at the library level, which means the library needs to reject as little as possible to keep up with the SLO of the service. Let us denote it by SRext,
Formula 2: External success rate
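Based on the definitions above, one plausible way to write the two formulas is the following; this is an assumed form on our part, not necessarily the exact expressions behind Formula 1 and Formula 2:

```latex
% Assumed reconstruction of Formula 1 and Formula 2
SR_{\mathrm{int}} = \frac{\text{successful responses}}{\text{requests processed by the service}}
\qquad
SR_{\mathrm{ext}} = \frac{\text{successful responses}}{\text{requests processed by the service} + \text{requests rejected by the library}}
```

Under this reading, shedding requests keeps SRint healthy, while every rejection lowers SRext, which is why the library tries to reject as little as possible.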

Preliminary analysis and validations

Table 1: Traffic distribution of different traffic types

To show you how distribution differs across data centers, let us take a real example of Agoda’s traffic distribution on two data centers in a single day.

Notice that we need a dynamic number for each data center, as we cannot assign a single threshold number to all data centers. We will now examine, using load test results, how services perform around the edge.

After plotting the load test results for success rate over time (graph 2), we observed four distinct stages: warmup, perform, edge, and overload.

  • Warmup — the service performs as expected; the success rate in graph 2 is ~100%.
  • Perform — the performing stage, where the service is closer to the edge but still performs or recovers as expected.
  • Edge — the service cannot return to a healthy state (a high success rate) without a reduced load.
  • Overload — the service has a low success rate and availability.

Our goal is to keep our services stable, going from performing into edging and back to the performing stage.

If we follow the yellow line that indicates the beginning of the stressful edging zone and correlate it with the RPS graph (graph 1), the total requests per second at the edging point is ~125. In the CPU utilization plot (graph 3), CPU utilization is ~80%, with a +/- 10% deviation. When we correlate the overload point (red line) in the success rate graph (graph 2), we can also see that service performance degrades after 90% CPU utilization. This is the point at which we lose the service.

Plotting a graph that normalizes RPS measurements against CPU utilization (see graph 4) shows the RPS plateauing at 150, indicating that our service is CPU bound. Note that inflight requests (graph 5) are at 35–40 in the performing zone; when we normalize against CPU (graph 6), they plateau at about 40–45 requests.

We have learned from this preliminary investigation that we can use RPS/CPU as a capacity prediction and cap it with a maximum number of inflight requests. In addition, we also need to consider different utilization signals, queueing theory, concurrency, feedback, throughput, latency, and error rate.
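A rough sketch of how such a capacity prediction could be wired into an admission decision is shown below; the class name, the 40-request cap, and the 80% CPU ceiling are illustrative values read off the graphs above, not our production implementation:

```kotlin
// Estimate capacity from the observed RPS-per-CPU ratio and also enforce a
// hard cap on inflight requests, per the findings above (illustrative only).
class CapacityEstimator(
    private val maxInflight: Int = 40,      // inflight plateau seen in graphs 5-6
    private val cpuCeiling: Double = 0.80   // edge zone observed at ~80% CPU
) {
    fun canAdmit(currentRps: Double, cpuUtilization: Double, inflight: Int): Boolean {
        if (inflight >= maxInflight) return false
        if (cpuUtilization <= 0.0) return true
        // Requests served per unit of CPU, scaled to the CPU ceiling, gives a
        // rough estimate of how many RPS the service can sustain right now.
        val estimatedCapacityRps = (currentRps / cpuUtilization) * cpuCeiling
        return currentRps < estimatedCapacityRps
    }
}
```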

Conclusion

Overloading can occur for many reasons, including customers initiating several retry attempts, an under-scaled service, and so on. In this post, we have seen various parameters that affect overall service performance, and we have also covered our initial investigation. In a future article, we will cover the different approaches that we use for adaptive load shedding.

Authors: Sushmi Shrestha, Evgeniy Kalashnikov
