Cut your microservice latency with adaptive double dispatch

Guy Kobrinsky
Outbrain Engineering
Feb 28, 2019 · 4 min read

Serving millions of requests per minute in a microservices environment is not an easy task. Every request is routed through many applications and may stall or fail at any step in the flow.
Needless to say, not every one of those hundreds of microservices was written by the strongest coder, which means that context switches, GC spikes, or just inefficient code can creep up on you at any moment.

We always try to find ways to reduce latency, but before getting to that, let's understand how our system behaves.

What does your service latency graph look like?

Regardless of what the microservice is doing, its latency graph always ends up with the same shape: low and flat through most percentiles, then a sharp spike at the tail.

Context switches, network spikes, garbage collection: all of these may lead to high latency. The numbers vary from one use case to another, of course, but high latency at the high percentiles is so common that it's practically a given.

Instead of trying to optimize the service itself (which can be hard, and is not a task that suits developers at all levels), let's focus on client-side optimization (remember: in our microservice environment, the client side is almost always another microservice).

Ok, let’s just set a timeout...

Yes, we have to have timeouts, but they won't improve our latency without degrading the success ratio. Let's assume our serviceA → serviceB latency graph is the same as above. With timeout = 200ms:

p(timeout) = 0.001

While it may be ok to throw away 1 out of every 1000 requests, later on we may add a serviceA → serviceC call with a similar timeout policy:

p(overall timeout) = 1 - (999/1000)² =~ 0.002

In our microservice architecture, we have serviceA → serviceB → serviceC chained calls. The probability of a timeout increases with the number of services called in a request sequence:

p(overall timeout, 5 services) = 1 - (999/1000)⁵ =~ 0.005
p(overall timeout, 10 services) = 1 - (999/1000)¹⁰ =~ 0.01

In reality, we may have hundreds of microservices, each of which may call more than one other. You don't actually have to do the math to realize that the problem becomes significant as the number of services grows.
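To see how quickly this compounds, here is a tiny sketch (illustrative, not part of the original post) that plugs the single-call timeout probability into a chain of n calls, assuming the calls time out independently:

import java.util.stream.IntStream;

public class TimeoutMath {

    // Probability that at least one call in a chain of n times out,
    // given that each call independently times out with probability p.
    static double overallTimeoutProbability(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 0.001; // p(timeout) for a single call with a 200ms timeout
        IntStream.of(1, 2, 5, 10, 100).forEach(n ->
                System.out.printf("services=%d  p(overall timeout)=%.4f%n",
                        n, overallTimeoutProbability(p, n)));
    }
}

Running it reproduces the numbers above and shows that at 100 chained calls you would already lose roughly 10% of requests to timeouts.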

Double-dispatch/Speculative-retry to the rescue!

The basic idea of double dispatch (aka speculative retry) is simple: when a request doesn't return within some time threshold, send another request and use whichever response comes back first.

Let's assume latency is 50ms at the 99th percentile, 150ms at 99.7%, and 200ms at 99.9%.

p(latency > 200, single request) = 0.001

Let's examine the chance of high latency when we fire a second request at the 50ms mark. For the total to exceed 200ms, the first request must still be pending and the second request, fired at 50ms, must itself take more than 150ms:

p(latency > 50, 1st request) = 0.01
p(latency > 150, 2nd request) = 0.003
p(latency > 200, double dispatch at 50) = 0.01*0.003 = 0.00003

The chance of bad latency is reduced by 97%, at the price of 1% extra load* on our service, and we didn't have to touch any server-side code!

*an accumulated effect has to be considered
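For concreteness, here is a minimal client-side sketch of the mechanics described above, written with Java's CompletableFuture. The names (doubleDispatch, the Supplier-based calls, thresholdMillis) are illustrative assumptions, not Outbrain's actual implementation:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class DoubleDispatch {
    private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Fires the primary request immediately. If no response has arrived after
    // thresholdMillis, fires a backup request (ideally against a different
    // instance) and completes with whichever response comes back first.
    static <T> CompletableFuture<T> doubleDispatch(Supplier<CompletableFuture<T>> primary,
                                                   Supplier<CompletableFuture<T>> backup,
                                                   long thresholdMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();
        primary.get().whenComplete((value, error) -> completeFirst(result, value, error));

        // Only pay for the extra request when the primary is still pending at the threshold.
        scheduler.schedule(() -> {
            if (!result.isDone()) {
                backup.get().whenComplete((value, error) -> completeFirst(result, value, error));
            }
        }, thresholdMillis, TimeUnit.MILLISECONDS);

        return result;
    }

    // Propagates whichever completion arrives first; a production version might
    // prefer the first successful response and only fail once both requests fail.
    private static <T> void completeFirst(CompletableFuture<T> result, T value, Throwable error) {
        if (error != null) {
            result.completeExceptionally(error);
        } else {
            result.complete(value);
        }
    }
}

You would call it roughly like doubleDispatch(() -> client.call(primaryInstance), () -> client.call(backupInstance), threshold), where client.call is a stand-in for your own async client.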

But won’t the 2nd request face the same spikes as the 1st?

If we always hit the same instance of a microservice, the double dispatch trick most likely won't work. The trick is to send the second request to a different node, taking advantage of the fact that in a server array, not all nodes suffer latency spikes at the same time. That's where the power of many helps.
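A small sketch of that idea, with made-up names (N stands in for whatever your service discovery returns): pick the backup instance so it differs from the one the primary request went to.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class BackupPicker<N> {
    private final List<N> instances;

    BackupPicker(List<N> instances) {
        this.instances = instances;
    }

    // Pick a node different from the one the primary request went to, so the
    // backup isn't stuck behind the same GC pause or request queue.
    N pickOther(N primary) {
        if (instances.size() < 2) {
            return primary; // nothing better available
        }
        int i = ThreadLocalRandom.current().nextInt(instances.size());
        N candidate = instances.get(i);
        // Step to the next node if we happened to land on the primary again.
        return candidate.equals(primary) ? instances.get((i + 1) % instances.size()) : candidate;
    }
}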

Great! Let's set our threshold to 50

Not so fast… at the time we measured, the 99th percentile latency may have been 50ms, but as our system gets more traffic at busy times, latency increases across the board, and it's possible that 50ms is now our 75th percentile.

With these numbers, setting a constant threshold for the double dispatch may lead to an extra 25% load!

Adaptivity is critical

Instead of setting a constant threshold, let's switch to a constant share of extra requests to send. We have to decide how much extra load our service is willing to take on in order to improve latency; let's say we decided that an extra 1% is ok.

We have to maintain a latency histogram of the last few seconds and update it with the request duration on every response. This gives us a decent estimate of the current 99th percentile of our latency, which we then use as the double dispatch threshold. Now we are at minimal risk of overloading our system.
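As a rough illustration of the idea (the class and method names below are made up, not Outbrain's code), a sliding window of recent latencies is enough to derive the current p99 and use it as the threshold:

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class AdaptiveThreshold {
    private final long[] samples;
    private final AtomicInteger index = new AtomicInteger();

    AdaptiveThreshold(int windowSize) {
        this.samples = new long[windowSize];
        Arrays.fill(samples, -1); // -1 marks "no sample yet"
    }

    // Record the duration of a completed request, overwriting the oldest sample.
    void record(long latencyMillis) {
        int i = Math.floorMod(index.getAndIncrement(), samples.length);
        samples[i] = latencyMillis;
    }

    // Current double dispatch threshold: the given percentile of the recent window.
    long threshold(double percentile) {
        long[] snapshot = Arrays.stream(samples).filter(v -> v >= 0).sorted().toArray();
        if (snapshot.length == 0) {
            return Long.MAX_VALUE; // no data yet: never double dispatch
        }
        int pos = (int) Math.ceil(percentile / 100.0 * snapshot.length) - 1;
        return snapshot[Math.max(0, pos)];
    }
}

Before each call, ask threshold(99) for the current cutoff and pass it to the double dispatch. This sketch isn't fully thread-safe; a production version would more likely use a time-bucketed histogram library (e.g., HdrHistogram) than a raw ring buffer.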

To sum up

The magic of double dispatch won't solve all your latency problems, especially if the whole latency graph rises. You will still have to optimize, and pay attention to the network, context switches, and GC.

However, used correctly and adaptively, it is bound to give you surprising results!
