Service invocation optimization techniques

Dániel Imre
The Hotels.com Technology Blog
9 min read · Apr 15, 2019

Development teams working in a microservice architecture face a tricky problem: they have to improve performance while integrating with more and more services. Let’s explore some solutions for optimizing service invocations on the backend side.

An extremely complicated architecture

Background

A few years back we were working with only a few service calls per page request, and those were shallow invocation chains as well. Now, we might need to invoke 10+ services for a single page request, and some of those have really deep call chains with varying response times. Page performance matters more than ever, yet it is getting harder and harder to maintain a consistent response time with that many dependencies.

Also, while most services perform well up to the 95th percentile (p95), some of them struggle to keep an acceptable response time above p99. The effect of these spikes multiplies with the number of calls, so they affect much lower percentiles at aggregation points.

What can we do?

In general, caching can help. However, with a cold cache or an expired/evicted entry we still face the original problem. Also, several requests can hit this missing entry at once and end up making multiple downstream calls for the same element.

Blocking is also a problem we should address with async/reactive execution flows. I/O should be non-blocking and we should defer execution where we can.

We should also prefer shared remote caches to enable scaling out, but that introduces yet another layer of potentially flaky remote calls.

Let’s address these problems one by one with different techniques.

The sequence-diagrams use the following notation:
–> means a synchronous call
-x-> means an asynchronous/reactive call

Async service invocation with double timeout

Caching can happen at multiple service layers. These caches usually need warm-up, and response times heavily depend on whether the values are coming from the cache or not. Sometimes response times are just unacceptable (1+ seconds), yet we would still like the values to be cached for next time and returned if they arrive in time.

For such situations we can introduce a caching downstream service wrapper with primary and secondary timeouts. The primary timeout would be the maximum time we wait for the response no matter if it is coming from cache or not. The secondary timeout is the maximum time we wait for the downstream service to respond with the value we can then put into cache (asynchronously).

2 timeouts:

  1. first lower timeout (T1) in ServiceAdapter
  2. second higher timeout (T2) in Service

Request for uncached value with fast service call

In this case the downstream response time is less than T1, so we can get the value with the first call.

Request for uncached value with slow service call

If the service response time is between T1 and T2 then initially we will get an empty value, but the response will (most probably) be cached for the next invocations.

Request for uncached value with too slow service call

If the response time for the given request is beyond T2 then we get an empty result and nothing will ever be cached. To avoid always timing out at T1 in such cases, we can cache an empty result when T2 is reached, but we should be careful with this as the timeout can happen for different reasons (network congestion, a garbage collection spike). Still, once the empty result is evicted from the cache, the call will be retried later.

The benefit of this solution is that it enables cache warm-up for slow calls without significantly affecting upstream response times.
The drawback is that it can only be used if a temporary empty result is acceptable because the value is either optional or there’s a fallback.
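A minimal sketch of such a wrapper with Java’s CompletableFuture might look as follows. The downstream function, the key/value types and the local map standing in for what would typically be a shared remote cache are all illustrative assumptions, not the actual implementation.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Sketch of a caching service wrapper with a primary (T1) and secondary (T2) timeout.
class DoubleTimeoutServiceAdapter<K, V> {

    private final Duration t1;  // primary timeout: the longest the caller ever waits
    private final Duration t2;  // secondary timeout: the longest we wait to warm the cache
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();  // stand-in for a real (remote) cache
    private final Function<K, CompletableFuture<V>> downstream;

    DoubleTimeoutServiceAdapter(Duration t1, Duration t2, Function<K, CompletableFuture<V>> downstream) {
        this.t1 = t1;
        this.t2 = t2;
        this.downstream = downstream;
    }

    CompletableFuture<Optional<V>> get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return CompletableFuture.completedFuture(Optional.of(cached));
        }
        // Fire the downstream call bounded by T2 and cache whatever arrives,
        // even if the caller has already been released with an empty result at T1.
        CompletableFuture<Optional<V>> call = downstream.apply(key)
            .orTimeout(t2.toMillis(), TimeUnit.MILLISECONDS)
            .whenComplete((value, error) -> {
                if (error == null && value != null) {
                    cache.put(key, value);  // asynchronous cache warm-up
                }
            })
            .thenApply(Optional::of);
        // The caller only waits up to T1; a late value still lands in the cache above.
        return call
            .exceptionally(error -> Optional.empty())
            .completeOnTimeout(Optional.empty(), t1.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```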

Async reloading cache

Plain and simple caches are synchronous (thus blocking) and have a single TTL (time to live) defined. They evict elements when their TTL is reached or the cache is full. The problem with this solution is that it causes a response-time spike on every cache miss, even if the element was previously cached.

While on-heap caches are usually very limited in the number of elements they can hold, we are usually better off with off-heap caches anyway, which have significantly more room, so cache eviction caused by the size limit is negligible there.

To mitigate eviction due to expiry we can define a cache with both a TTL and a TTR (time to refresh). The TTL should be significantly larger than the TTR so elements are rarely evicted. Some cache solutions support this out-of-the-box (e.g. Guava or Caffeine), but for remote caches (Redis, ElastiCache) we can implement it easily by storing a timestamp next to the cached value. When an element is retrieved from the cache, this timestamp is checked against the TTR, and if the element needs a refresh then a request is sent to the downstream service and the cached value is updated asynchronously.
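For an in-process cache, a minimal sketch with Caffeine could look like the following, where refreshAfterWrite plays the role of the TTR and expireAfterWrite the role of the TTL; the asynchronous downstream function is a placeholder.

```java
import com.github.benmanes.caffeine.cache.AsyncCacheLoader;
import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Sketch: an asynchronously reloading cache where TTL >> TTR, so entries are
// refreshed in the background far more often than they are actually evicted.
class ReloadingCaches {

    static <K, V> AsyncLoadingCache<K, V> reloadingCache(
            Duration ttl, Duration ttr, Function<K, CompletableFuture<V>> downstream) {
        AsyncCacheLoader<K, V> loader = (key, executor) -> downstream.apply(key);
        return Caffeine.newBuilder()
            .expireAfterWrite(ttl)      // TTL: entries are rarely evicted
            .refreshAfterWrite(ttr)     // TTR: stale entries trigger an async reload
            .buildAsync(loader);
    }
}
```

Calling cache.get(key) then returns a CompletableFuture that is served from the cache and transparently triggers a background reload once the TTR has passed; the timestamp-next-to-the-value approach described above achieves the same for remote caches.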

If the result is optional we can also make the caching wrapper return an empty result if it encountered a cache miss. This will remove the initial spike as well.

The benefits are clear here:

  • fewer or no spikes (though for remote cache calls we might still encounter small ones)
  • added resiliency as there will be a cached value for some time even if the downstream service is down

Just like with the previous pattern, the drawback is that this can only be used with limitations if the result is mandatory. Also, because of the asynchronous execution, tracing the downstream service requests will be tricky for the value-refresh calls.

Deduplicating request collapsing

The previous caching mechanisms solved some aspects of the problem, but we can do more. As good citizens we should also care about the downstream service (and save some resources on our side) and fire as few requests towards it as we can. By deduplicating similar parallel requests we can decrease the number of calls, which helps with resources both in our application (threads, memory, CPU) and on the network (bandwidth).

Request collapsing with release when finished

We can choose to collapse ongoing requests and release them once the collapsed call finishes. The first request for a context fires a downstream service call. While the response is pending, every subsequent call for the same context is put on hold and gets notified once the result is ready. This can be done semi-asynchronously with a promise, but it is even better done in a reactive way, eliminating blocking at all levels.
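A minimal, promise-based (not fully reactive) sketch of such a collapser, assuming the downstream call already returns a CompletableFuture; all names are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch: concurrent callers asking for the same context share a single in-flight call.
class RequestCollapser<K, V> {

    private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, CompletableFuture<V>> downstream;

    RequestCollapser(Function<K, CompletableFuture<V>> downstream) {
        this.downstream = downstream;
    }

    CompletableFuture<V> get(K key) {
        CompletableFuture<V> promise = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, promise);
        if (existing != null) {
            return existing;  // join the call that is already in flight for this context
        }
        // First caller for this context: fire the downstream call and release the
        // slot once it finishes, so later requests start a fresh call again.
        downstream.apply(key).whenComplete((value, error) -> {
            inFlight.remove(key, promise);
            if (error != null) {
                promise.completeExceptionally(error);
            } else {
                promise.complete(value);
            }
        });
        return promise;
    }
}
```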

Request collapsing with release on TTL expiration

We can also store ongoing requests for a while after they have been initiated (a TTL of a few seconds). This acts like a short-term request-promise cache, potentially with only a few hundred elements, to mitigate call spikes under higher data-cache layers. The important difference compared to a usual data-caching solution is that you have the cached promise of the data even before the data is actually available.
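The TTL-based variant can be sketched with a small, short-lived promise cache instead of the in-flight map, for example with Caffeine; the TTL and size values below are made-up examples.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Sketch: a short-term request-promise cache; callers arriving within the TTL share
// the same promise even before the underlying data has arrived.
class TtlRequestCollapser<K, V> {

    private final Cache<K, CompletableFuture<V>> promises;
    private final Function<K, CompletableFuture<V>> downstream;

    TtlRequestCollapser(Function<K, CompletableFuture<V>> downstream) {
        this.downstream = downstream;
        this.promises = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofSeconds(5))  // promises live only for a few seconds
            .maximumSize(500)                         // a few hundred entries is usually enough
            .build();
    }

    CompletableFuture<V> get(K key) {
        return promises.get(key, downstream);
    }
}
```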

A benefit of this solution is that it will optimize the resource usage of downstream calls.

One can also notice that the above pattern speeds up parallel calls, as they can get a result from an earlier request sooner. For example, if a request needs 100 ms to complete, then a blocking data-caching solution would wait the full 100 ms to get the data even if there is already another request in flight for that data. With this solution, if there is an ongoing request started 60 ms earlier, a subsequent request only needs to wait 40 ms for the result.

It also comes in handy when your logic can be split into parallel subtasks, e.g. resolving placeholders in a template, where you need the same entity for multiple placeholders but you don’t know in advance which placeholder will require it and when, as the template can be arbitrary. You could add a pre-processing phase to determine this, but that would complicate things and would only work for your current task.

With such a request collapser your subtasks won’t need to know about each other. Each can call the service (through the collapser) to fetch the very same data in parallel. Still, the service won’t be stressed as the request collapser will protect it by deduplicating similar overlapping requests to a single one. This can lead to much simpler code design and architecture while also having the effect of this optimization shared for all higher level requests.

This is not a silver bullet though. A real effect can only be observed with high-frequency calls with lots of redundancy. If there aren’t enough parallel calls, we might only add extra overhead.

Fan-out request collapsing

Some services support bulk requests: getting results for multiple contexts at once. While this sounds great, it introduces another problem: if the contexts vary and overlap across different upstream calls, then the downstream service will unnecessarily get multiple requests for the same context.

To mitigate this we can introduce another layer of abstraction which acts as a middle-man: it exposes a single-context API upstream while sending bulk requests (multiple contexts at once) downstream, and once the result is ready it delivers the correct piece to each upstream caller. It might sound complicated, but let’s see the diagram.

The benefits should be clear: we optimize the calls by always batching the optimal number of contexts in one.

Please note that this solution won’t deduplicate calls on its own. The previous request collapser can be used for this.

The drawback is that the number of in-flight requests most probably won’t match the batch size; there will always be some partial batches. For these we have to introduce a timeout: the maximum time to wait for the next batch to fill up. This can add some latency to partial batches, but we can at least control how long we are willing to wait.
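One way to sketch such a fan-out collapser is with Reactor’s bufferTimeout, assuming a hypothetical bulk downstream call that takes a list of contexts and returns a map of results; the batch size and maximum wait are the two knobs discussed above.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;
import reactor.core.publisher.Mono;
import reactor.core.publisher.Sinks;

// Sketch: exposes a single-context API upstream while batching contexts into bulk
// calls downstream. Does not deduplicate on its own (combine with a request collapser).
class FanOutCollapser<K, V> {

    private static final class Pending<K, V> {
        final K key;
        final CompletableFuture<V> promise = new CompletableFuture<>();
        Pending(K key) { this.key = key; }
    }

    private final Sinks.Many<Pending<K, V>> sink = Sinks.many().unicast().onBackpressureBuffer();

    FanOutCollapser(int batchSize, Duration maxWait,
                    Function<List<K>, CompletableFuture<Map<K, V>>> bulkCall) {
        sink.asFlux()
            // Release a batch when it is full or when maxWait elapses, whichever comes first.
            .bufferTimeout(batchSize, maxWait)
            .flatMap(batch -> Mono.fromFuture(
                    bulkCall.apply(batch.stream().map(p -> p.key).collect(Collectors.toList())))
                .doOnNext(results -> batch.forEach(p -> p.promise.complete(results.get(p.key))))
                .doOnError(error -> batch.forEach(p -> p.promise.completeExceptionally(error)))
                // Swallow the error so a single failed batch does not kill the pipeline.
                .onErrorResume(error -> Mono.empty()))
            .subscribe();
    }

    CompletableFuture<V> get(K key) {
        Pending<K, V> pending = new Pending<>(key);
        // busyLooping serializes concurrent emissions from multiple caller threads.
        sink.emitNext(pending, Sinks.EmitFailureHandler.busyLooping(Duration.ofSeconds(1)));
        return pending.promise;
    }
}
```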

The pieces together

Now that we have solutions for most of the problematic parts, let’s see how they look when combined.

So this is a sequence diagram for an asynchronous reloading cache over a request collapser over a fan-out request collapser with batch size of 2.

As we can see they nicely work together to tackle the problem. We could also add the double timeout on top of this if still needed.
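Sticking with the hypothetical sketches above (and a made-up Key/Value domain, bulkService and someKey), the wiring could look roughly like this:

```java
// Hypothetical wiring: reloading cache -> deduplicating collapser -> fan-out collapser
// -> bulk downstream call. All names are illustrative.
FanOutCollapser<Key, Value> fanOut =
    new FanOutCollapser<>(2, Duration.ofMillis(20), bulkService::getAll);
RequestCollapser<Key, Value> collapser = new RequestCollapser<>(fanOut::get);
AsyncLoadingCache<Key, Value> cache = ReloadingCaches.reloadingCache(
    Duration.ofHours(1), Duration.ofMinutes(5), collapser::get);

CompletableFuture<Value> value = cache.get(someKey);  // what upstream code calls
```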

For a given case with the above techniques we saw a 75% reduction in call rates against the downstream service, while upstream response times improved by 400% and the same stability (no or minimal spikes) could be achieved at p99 instead of the previous p95. The result of course heavily depends on your use case and call patterns.

Implementation

Implementing the above patterns might not be trivial. Async flows are challenging on their own, and if you would also like to make them reactive, that’s another level of complexity. The good thing is that these can be made generic and can serve your architecture in the future without additional investment. Reactive libraries like RxJava and Reactor provide convenient building blocks for these patterns (e.g. buffer, timeout, cache, Subject/Processor).

Also, be sure to have some internal metrics in place so you can monitor these patterns in practice. This is also important when you need to tweak their parameters, e.g. how adjusting the maximum wait time affects overall completion time in a fan-out request collapser.

Next steps?

While we were talking about the application level here, some of these patterns can be moved to the system-integration level as well, e.g. a request-collapsing proxy. Whether we should build something like that is an open question, as it introduces additional maintenance complexity and network overhead, and might conflict with other operational and architectural decisions.
