Lessons learned about running Microservices

There are many reasons to opt for an architecture based on Microservices, and many benefits in doing so, but there is no free lunch: it also brings some difficult aspects to deal with.

At B2W, we started working with Microservices in 2014, and the motivation behind it was stability and, consequently, scalability. Since then, 3 years have passed, and today we have hundreds, maybe a thousand, Microservices running in production.

As we have reached a good level of maturity, the idea behind this series is to share some lessons learned along the way.

Below are some best practices related to Microservices communication that you will find useful.

In a Microservice architecture, a service must interact with others. This mechanism is called Microservices communication, or inter-process (inter-service) communication (IPC).

This communication occurs through a common protocol, generally HTTP. There are many styles of interaction between two services, such as request/reply, fire-and-forget, publish/subscribe, etc. Choosing the best style depends entirely on your technical requirements, architectural premises, and even functional aspects.

Below, regardless of your interaction style, are some techniques to consider when deciding how your Microservices communicate.

Retry policies

Failures are expected in a Microservice architecture, and the way to deal with them is through retry policies. A retry gives one service the chance to call another service again in case of a failure.
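
A naive retry can be sketched in a few lines of Java. This is an illustration of the idea, not a specific library's API; the class and method names are hypothetical:

```java
import java.util.concurrent.Callable;

public class Retry {
    // Calls the given operation up to maxAttempts times and rethrows the
    // last failure if every attempt fails. Illustrative names only.
    public static <T> T withRetries(Callable<T> operation, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

As the rest of this section argues, a loop like this is only safe when combined with the rules below: never retry blindly on timeouts, back off between attempts, and only retry errors that can actually be transient.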

Avoid retries in case of a service timeout

But your application doesn't know how many services your dependency calls for each request it receives. In practice, you don't know the dependencies of your dependency, and that's the problem with retrying on timeouts. On your first attempt (or request), even if the response didn't arrive in the expected time, that doesn't mean your request wasn't served. So if you make a new request, you will probably flood the whole service chain and create new problems to deal with.

If this situation happens frequently, you might consider changing some components of your architecture to work in a pub/sub manner, with asynchronous responses.

If you really need to retry on a timeout, choose the right service

One of these cases is when YOU KNOW that the service you're calling is the last one in the dependency chain.

I have this situation in one of the systems I'm responsible for. It normally happens when you are client and provider at the same time.

Use a circuit-breaker

This pattern is a good practice from the provider's point of view and also for its dependencies.

As a provider, you must have some "fallback flow" so you remain functional for your clients. As for your dependencies, you will protect them, and this behaviour will help them get back to their "normal" state.

Fortunately, there are good frameworks for working with this pattern in your application. One of them is Netflix's Hystrix.
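
For illustration, the core of the pattern can be sketched in plain Java. This is a minimal sketch of the idea, not Hystrix's actual API: after a number of consecutive failures the circuit "opens" and calls go straight to the fallback until a cool-down period elapses.

```java
import java.util.concurrent.Callable;

public class CircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long the circuit stays open
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the circuit is closed

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Callable<T> operation, Callable<T> fallback) throws Exception {
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis) {
            return fallback.call();        // open: fail fast, protect the dependency
        }
        try {
            T result = operation.call();   // closed (or half-open trial call)
            consecutiveFailures = 0;
            openedAt = -1;                 // success closes the circuit again
            return result;
        } catch (Exception e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();  // trip the circuit
            }
            return fallback.call();
        }
    }
}
```

Note how the fallback serves both goals from above: your clients still get a response, and your failing dependency stops receiving traffic while it recovers.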

Exponential Backoff

With exponential backoff, each retry waits longer than the previous one. The idea is that if the service you're calling is temporarily down, it is not overwhelmed by requests all hitting it at the same time when it comes back up.

If you are using Java with Spring Boot, as we are here at B2W, you can use the Exponential Backoff algorithm from the Spring Retry project. There are also many options in languages other than Java.
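
The algorithm itself is simple enough to sketch by hand. The version below is an illustration, not the Spring Retry API; the "full jitter" variant adds randomness so that many clients retrying at once don't wake up in lockstep:

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    // Delay before attempt n (1-based): base * 2^(n-1), capped at maxMillis.
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long exponential = baseMillis << Math.min(attempt - 1, 20); // cap the shift to avoid overflow
        return Math.min(exponential, maxMillis);
    }

    // "Full jitter": a random delay in [0, capped delay), spreading retries
    // from many clients over time instead of synchronizing them.
    public static long jitteredDelayMillis(int attempt, long baseMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(delayMillis(attempt, baseMillis, maxMillis));
    }
}
```

With a base of 100 ms, attempts wait roughly 100, 200, 400, 800 ms and so on, up to the cap.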

Only for HTTP 5** family errors

Errors in the 5** family are server-side errors and may be transient, so a retry can succeed. On the other hand, 4** errors represent errors caused by the client (your service), so in that case you must change your request and make a new one to obtain a valid response.

Considering those rules, only retry 5** errors, never 4** errors! Retrying a 4** error is Einstein's definition of insanity: doing the same thing over and over again and expecting different results.

You must combine retries for 5** errors with the other techniques, like exponential backoff and a circuit-breaker.
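
The rule reduces to a one-line predicate that you can plug into whatever retry mechanism you use (a hypothetical helper, shown here for clarity):

```java
public class RetryPolicy {
    // Retry only server-side (5**) errors. A 4** status means the request
    // itself is wrong, so repeating it unchanged can never succeed.
    public static boolean isRetryable(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }
}
```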

Connection Pool

A connection has a cost in terms of computational resources (memory and CPU) and also in terms of your application's performance. Pooling connections is a way to reuse connections that have already been created.

Define an HTTP connection pool

But take care with the connection pool configuration. You must ensure that you are not wasting resources. If you create too many connections, you'll probably consume all of your machine's resources. If you don't create enough connections, your application will waste time creating a connection for each request it receives, and you'll pay for this in terms of availability and performance.

There isn't a magic recipe for this. You must find the right balance.

You must know how your application works and how many threads (processes) can run in parallel for a single request. How many service calls can a thread make? How many hosts does your application call? What is the capacity of your dependency? What is your throughput? What is the profile of the machine your application runs on? These are some of the questions to answer in order to find the right balance.

A good way to start is by defining a small pool and observing your application's performance and resource consumption.
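
To make the mechanics concrete, here is a hand-rolled pool sketch in plain Java. In a real application you would rely on your HTTP client's own pool (for example, Apache HttpClient's PoolingHttpClientConnectionManager); this toy version just shows what "reuse instead of recreate" means, and how a too-small pool shows up as threads waiting rather than as resource exhaustion:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

public class SimplePool<T> {
    private final BlockingQueue<T> idle;

    // Pre-creates a fixed number of reusable resources (e.g. connections),
    // paying the creation cost once instead of once per request.
    public SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    // Blocks until a resource is free: contention here is the signal that
    // the pool is too small for your throughput.
    public T acquire() throws InterruptedException { return idle.take(); }

    // Returns the resource for the next caller to reuse.
    public void release(T resource) { idle.offer(resource); }
}
```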

Define a dedicated pool for each dependency

As every dependency is different from the others, mainly in terms of availability, capacity, and throughput, a dedicated connection pool for each one will be useful in your service's day-to-day operation.


Ensure your connections can be persistent

From HTTP's point of view, both client and server can close a connection through the "Connection" header, so take care with intermediate layers like load balancers and web proxies.

You can test whether the connection to your dependency can be persistent with a simple cURL command (using the verbose -v flag). If the server's response ends with a "left intact" message, it means the TCP connection can be reused.


Less is (almost) always more

Defining short timeouts avoids cascading failure problems. In a Microservice architecture, if a service fails, it's better to fail fast; otherwise the whole system may collapse. Just consider a scenario where one service calls another, which calls another, and so on through more and more services… What happens if the last service in this chain is slow to respond and all the services before it are configured with long timeouts?

My tip here: look at the 95th or 99th percentile of each of your dependencies' response times (individually) and configure a timeout based on what you find. Another option would be using 4x or 5x the average response time of your dependency.
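
As a concrete sketch using the JDK's built-in HTTP client (Java 11+), timeouts derived from your percentile analysis can be set explicitly. The numbers and method names below (clientFor, requestFor) are placeholders for illustration; substitute what you actually measured:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class Timeouts {
    // Connect timeout is usually much shorter than the request timeout:
    // establishing a TCP connection should be fast if the host is healthy.
    public static HttpClient clientFor(Duration connectTimeout) {
        return HttpClient.newBuilder().connectTimeout(connectTimeout).build();
    }

    // Per-request timeout based on the dependency's observed p99 latency,
    // with a small headroom so normal slow requests still succeed.
    public static HttpRequest requestFor(String url, Duration p99) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(p99.multipliedBy(2))
                .build();
    }
}
```

The point is that every call has an explicit, measured upper bound, instead of a generic 30-second default that lets slowness propagate up the chain.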

In the next article, I'll share some insights based on our experience with monitoring and operating Microservices.

Currently @OLX. Previously @Amazon Web Services and @B2W Digital.