Be a good client: request prioritization

Júlio Zynger
Mar 23, 2020


The story is more common than it seems: a new feature is developed and tested extensively in-house, but only once the client applications are deployed does the engineering team discover the extra production load and the seasonal request rate. A hot-fix is needed! But now it’s too late: we cannot force our users to update their apps or refresh the page in their web browser.

As we’ve discussed in part 1 of this series, clients far outnumber our servers. Even if we have applied some of the previously described techniques to make our clients smarter, there is more we can do on the server side to make the entire system more resilient.

Photo by Clay Banks on Unsplash

Remote Tuning

When communicating with a backend, it is a good idea for clients to expose as much information as possible about their requests. That way, a server will be in a better position to qualify and prioritize the calls it receives.

Keep in mind that the recommendations here apply only to technical aspects of the requests, and only if you fully control the infrastructure. Do not include personally identifiable user information and, most importantly, do not share this data with third parties.

Client identification

By exposing a client name and its version, the backend can categorize requests and even work around possible client bugs. It is especially useful to agree on a structured form for that value, allowing for parsing, sorting and filtering. There is already a standard for web browsers (the User-Agent header), but the same idea proves very valuable on mobile apps as well.

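def serve_request(request):
    # Suppose there is a retry-loop bug in that app version.
    if request.client_id == "android_app" and request.client_version == 1234:
        return Response(400)  # Response stands for your framework's response type
    return handle(request)  # handle() stands for the regular processing path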
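To make the "structured form" idea concrete, here is a minimal sketch of parsing such a value on the server. The `name/major.minor.patch` format, the `ClientInfo` type and the helper names are assumptions made for illustration, not an established convention.

```python
import re
from typing import NamedTuple


class ClientInfo(NamedTuple):
    name: str
    version: tuple  # e.g. (5, 12, 0), so versions compare numerically


# Hypothetical format: "<name>/<major>.<minor>.<patch>", e.g. "android_app/5.12.0"
_CLIENT_RE = re.compile(r"^(?P<name>[a-z0-9_]+)/(?P<version>\d+(?:\.\d+)*)$")


def parse_client(header_value):
    """Return a ClientInfo, or None if the value does not follow the agreed format."""
    match = _CLIENT_RE.match(header_value.strip())
    if match is None:
        return None
    version = tuple(int(part) for part in match.group("version").split("."))
    return ClientInfo(match.group("name"), version)


def is_affected(client, name, first_bad, first_fixed):
    """True if the client is `name` with first_bad <= version < first_fixed."""
    return (
        client is not None
        and client.name == name
        and first_bad <= client.version < first_fixed
    )
```

Storing the version as a tuple of integers means a whole range of broken releases can be matched with a plain comparison, instead of listing every affected build by hand.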

Another aspect that can be useful from a backend perspective is bucketing the variability within a single client version. For example, if there are feature toggles or A/B tests in place, sharing variant identifiers will help to identify and isolate issues as they arise.

Here we can clearly identify a difference in metric collection for a specific feature rollout group.
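One way to get that kind of per-group view is to key server-side metrics by both client version and variant. The sketch below assumes the client sends a variant identifier with each request. The `VariantMetrics` class and the bucket names are hypothetical.

```python
from collections import defaultdict


class VariantMetrics:
    """Counts requests and errors per (client version, variant) bucket."""

    def __init__(self):
        self._requests = defaultdict(int)
        self._errors = defaultdict(int)

    def record(self, client_version, variant, is_error):
        key = (client_version, variant)
        self._requests[key] += 1
        if is_error:
            self._errors[key] += 1

    def error_rates(self):
        """Error rate per bucket, so a misbehaving rollout group stands out."""
        return {key: self._errors[key] / total for key, total in self._requests.items()}
```

In practice these counters would be tags or labels in whatever metrics system is already in place. The point is only that the variant becomes a dimension we can filter on.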

Serving order

In the event of a partial outage, servers can decide to drop requests to relieve load on the overall system. In fact, the faster a circuit-breaking mechanism triggers, the more resources will be saved. API gateways are good candidates for such logic, but in some cases it can also be applied throughout the network of services.

Introducing a company-agreed index of criticality for certain features and severity levels in case of degradation will support ranking which paths have serving priority. The server can then reject requests that fall under the lowest levels of the ranking. As exemplified by Google’s SRE book:

For example, when a system displays search results or suggestions while the user is typing a search query, the underlying requests are highly sheddable (if the system is overloaded, it’s acceptable to not display these results).

The criticality index can be composed of several sub-metrics: for example, whether a serving path goes through services that are usually overloaded or expensive to scale, or whether it is known to have caused retry storms in the past.

We can also take UX aspects into account to bump the criticality of a request path: for example, whether a request was triggered by a user or by an automated job, or whether the client application was in the foreground or background on a mobile device.
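A minimal sketch of how such a ranking could drive load shedding follows. The three levels, the load thresholds and the one-step bump for foreground user actions are all assumptions chosen for illustration. A real index would be agreed company-wide, as described above.

```python
from enum import IntEnum


class Criticality(IntEnum):
    SHEDDABLE = 0  # e.g. search suggestions while the user is typing
    DEFAULT = 1
    CRITICAL = 2


def effective_criticality(base, user_initiated, in_foreground):
    """Bump the base level by one step for requests a user is actively waiting on."""
    if user_initiated and in_foreground:
        return Criticality(min(base + 1, Criticality.CRITICAL))
    return base


def minimum_served(load):
    """Map the current load (0.0 to 1.0) to the lowest criticality still served."""
    if load >= 0.95:  # assumed threshold
        return Criticality.CRITICAL
    if load >= 0.80:  # assumed threshold
        return Criticality.DEFAULT
    return Criticality.SHEDDABLE


def should_serve(base, user_initiated, in_foreground, load):
    level = effective_criticality(base, user_initiated, in_foreground)
    return level >= minimum_served(load)
```

With these numbers, a background prefetch marked `SHEDDABLE` is dropped once load reaches 80%, while the same path triggered by a user in the foreground is still served until 95%.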

As an alternative to rejecting requests, one can also decide to ‘gracefully degrade’ the response, skipping the expensive paths that aren’t critical to the overall user experience. For example, when fetching the metadata of an audio track from a saturated backend, a server can decide to skip calculating how many likes the track has and use a sensible default instead.
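A sketch of that track metadata example could look like the following. `fetch_core` and `fetch_likes` are hypothetical callables standing in for the metadata and likes backends, and the `degraded` flag and the default of zero are assumptions.

```python
DEFAULT_LIKES = 0  # assumed default; the client can hide the counter when "degraded" is set


def track_metadata(track_id, fetch_core, fetch_likes, backend_saturated):
    """Build a track response, skipping the expensive like count under load."""
    response = dict(fetch_core(track_id))
    if backend_saturated:
        # Do not even call the likes backend: the goal is to take load off it.
        response["likes"] = DEFAULT_LIKES
        response["degraded"] = True
        return response
    try:
        response["likes"] = fetch_likes(track_id)
        response["degraded"] = False
    except TimeoutError:
        # The likes backend did not answer in time: degrade instead of failing the request.
        response["likes"] = DEFAULT_LIKES
        response["degraded"] = True
    return response
```

Flagging the response as degraded lets the client decide how to present the missing value, rather than showing a default as if it were real data.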


In addition to circuit-breaking limits for retries per request or per route, a client can also choose to apply a retry budget to limit the incoming load on the server and prevent request storms. Libraries like Twitter’s Finagle and Linkerd bundle that concept and allow for customization of the budgeting rules.

In general, this means that the application keeps track of the ratio between retries and regular requests, and enforces a configurable threshold on that ratio over a limited period of time.

Once a client goes over the threshold for the stipulated time frame, it has its retry requests cancelled. In other words, a client can retry as much as it wants to, as long as the ratio is maintained.
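The idea can be sketched with a sliding window of timestamps. This is a simplified illustration, not the API or the exact algorithm of Finagle or Linkerd. The class name, the 20% default ratio and the 10 second window are assumptions.

```python
import time
from collections import deque


class RetryBudget:
    """Allows retries only while retries / requests stays within `ratio` over a sliding window."""

    def __init__(self, ratio=0.2, window_seconds=10.0, clock=time.monotonic):
        self.ratio = ratio
        self.window = window_seconds
        self.clock = clock
        self._requests = deque()  # timestamps of first attempts
        self._retries = deque()   # timestamps of retries that were allowed

    def _expire(self, now):
        # Forget events that fell out of the window.
        for events in (self._requests, self._retries):
            while events and now - events[0] > self.window:
                events.popleft()

    def record_request(self):
        self._requests.append(self.clock())

    def can_retry(self):
        """Return True and spend budget if one more retry keeps the ratio within the limit."""
        now = self.clock()
        self._expire(now)
        if len(self._retries) + 1 > self.ratio * len(self._requests):
            return False
        self._retries.append(now)
        return True
```

A caller would invoke `record_request()` for every first attempt and check `can_retry()` before each retry. With a ratio of 0.2, ten requests in the window buy at most two retries.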

Since large applications tend to be a stack of services with dependencies on each other, it is key that requests are only retried at the layer immediately above the rejecting one. In the example pictured above, a rejection at the first service would prevent the clients (the phone icons in the image) from retrying further, sparing the database server down the line from all the potential combinatorial load.
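To see why that load is combinatorial, consider the worst case where every layer retries every failed call. The small sketch below just does the arithmetic. The function names are made up for illustration.

```python
def worst_case_attempts(layers, retries_per_layer):
    """Attempts reaching the deepest dependency when every layer retries each failed call."""
    return (1 + retries_per_layer) ** layers


def attempts_with_single_retry_layer(retries):
    """Attempts when only the layer directly above the failing one retries."""
    return 1 + retries
```

With three layers that each retry three times, a single user action can turn into 4 × 4 × 4 = 64 attempts against the failing dependency, compared to 4 when only the layer directly above it retries.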

A more general take on that concept is to establish a concurrency limit, which bounds the total number of requests a server handles concurrently and optionally provides a queue for waiting requests. Any incoming request beyond these limits is automatically rejected.
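A single-threaded sketch of such a limiter is shown below, tracking only the bookkeeping. The class and method names are hypothetical, and a real server would need locking or an async-aware equivalent around these counters.

```python
from collections import deque


class ConcurrencyLimiter:
    """Bounds in-flight requests and holds a limited number of waiting ones."""

    def __init__(self, max_in_flight, max_queued=0):
        self.max_in_flight = max_in_flight
        self.max_queued = max_queued
        self.in_flight = 0
        self.queue = deque()

    def submit(self, request_id):
        """Return 'started', 'queued' or 'rejected' for an incoming request."""
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            return "started"
        if len(self.queue) < self.max_queued:
            self.queue.append(request_id)
            return "queued"
        return "rejected"

    def finish(self):
        """Mark one request as done; return the queued request that takes its slot, if any."""
        if self.queue:
            return self.queue.popleft()  # the slot is handed over, in_flight is unchanged
        self.in_flight = max(0, self.in_flight - 1)
        return None
```

Rejecting as soon as both the slots and the queue are full keeps latency bounded for the requests that were accepted, instead of letting every caller wait behind an ever-growing backlog.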

What we’ve seen throughout this small series is that resilience is a characteristic of a system as a whole, and it builds on top of all of its individual pieces.

In part 1 we optimized the requests’ path from the clients’ perspective, in part 2 we improved the relationship between clients and servers while still providing a good user experience, and here in part 3 we regained control from the field by moving toggles, limits and flags to the backend.

In summary, we’ve collected a few techniques that help products scale in a way that lets us keep relying on them even in the event of a full or partial outage.