Lessons learned about running Microservices

There are many reasons to opt for an architecture based on Microservices, and many benefits to doing so, but there is no free lunch: it also brings some hard problems to deal with.

At B2W, we started to work with Microservices in 2014, and the motivation was stability and, consequently, scalability. Three years have passed since then, and we currently have hundreds, maybe a thousand, Microservices running in production.

As we have reached a good level of maturity, the idea behind this series is to share some lessons learned along the way.

This series will be divided into four parts. First, I'll cover some best practices for Microservices communication.


In a Microservice architecture, a service must interact with others. This mechanism is called Microservices communication, or inter-process (service) communication (IPC).

This communication happens over a common protocol, generally HTTP. There are many styles of interaction between two services, such as request/reply, fire and forget, publish/subscribe, etc. Choosing the best style depends entirely on your technical requirements, your architectural premises and even functional aspects.

Below are some techniques to consider for communication between your Microservices, regardless of your interaction style.

Retry policies

In an architecture where communication among components is crucial, it is good to have alternatives for dealing with communication problems like network instability or component outages.

The way to achieve this in a Microservice architecture is through retry policies. A retry gives one service the chance to call another service again after a failure.

Avoid retries in case of a service timeout

A common scenario is when a dependency of a service becomes slow to respond. In most cases, this means the dependency is overloaded.

But your application doesn't know how many services your dependency calls for each request it receives. In practice, you don't know the dependencies of your dependency, and that's the problem with retrying on timeouts. Even if the response to your first attempt didn't arrive in the expected time, that doesn't mean your request wasn't served: it may still be running somewhere down the chain. So with a new request, you will probably flood the whole service chain and create new problems to deal with.

If this situation is frequent, you might consider changing some components in your architecture to work in a pub/sub manner, with async responses.

If you really need a retry in case of a timeout, choose the right service

There are some specific cases where retrying after a timeout is acceptable.

One of these cases is when you know for certain that the service you're calling is the last service in the dependency chain.

I have this situation in one of the systems I'm responsible for. It normally happens when you are client and provider at the same time.

Use a circuit-breaker

A circuit breaker is a technique to protect a service dependency that is out of service for a while. It makes no sense to keep sending requests to a dependency that is failing (errors, timeouts, etc.).

This pattern is good practice both from the provider's point of view and for its dependencies.

As a provider, you should use some "fallback flow" and remain functional for your clients. As for your dependencies, you protect them, and this behaviour helps them get back to their "normal" state.

Fortunately, there are good frameworks for applying this pattern in your application. One of them is Netflix's Hystrix.
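To make the idea concrete, here is a minimal sketch of the pattern in plain Java. It is not Hystrix's API: the class name SimpleCircuitBreaker, the "consecutive failures" rule and the fixed open period are assumptions chosen for illustration, and a real library does much more (metrics, thread isolation, configurable half-open behaviour).

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker (illustration only, not Hystrix).
class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openNanos;       // how long the circuit stays open
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtNanos = 0L;

    SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openNanos = openDuration.toNanos();
    }

    // Calls the dependency unless the circuit is open.
    // When the circuit is open or the call fails, the fallback flow is used.
    <T> T call(Supplier<T> dependencyCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // protect the failing dependency
        }
        try {
            T result = dependencyCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    // After the open period has elapsed, the circuit lets a trial call through.
    synchronized boolean isOpen() {
        if (open && System.nanoTime() - openedAtNanos >= openNanos) {
            open = false; // the failure count is kept, so one more failure re-opens
        }
        return open;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtNanos = System.nanoTime();
        }
    }
}
```

A caller would wrap each request to the dependency, for example breaker.call(() -> fetchPrice(id), () -> cachedPrice(id)), where fetchPrice and cachedPrice are hypothetical methods standing in for the real call and the fallback flow.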

Exponential Backoff

With an exponential backoff algorithm, you increase the time between retries, up to a threshold.

The idea is that if the service you're calling is temporarily down, it is not overwhelmed with requests all hitting at the same time when it comes back up.

If you are using Java with Spring Boot, as we do here at B2W, you can use the exponential backoff algorithm from the Spring Retry project. There are also many options for languages other than Java.
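The calculation itself is small. The sketch below shows it in plain Java, without Spring Retry; the class ExponentialBackoff and its parameter names are assumptions made for this example (Spring Retry's ExponentialBackOffPolicy exposes comparable settings for the initial interval, the multiplier and the maximum interval).

```java
import java.time.Duration;

// Hypothetical helper: computes how long to wait before each retry.
class ExponentialBackoff {
    private final long initialMillis;
    private final double multiplier;
    private final long maxMillis; // the threshold the delay never exceeds

    ExponentialBackoff(Duration initial, double multiplier, Duration max) {
        this.initialMillis = initial.toMillis();
        this.multiplier = multiplier;
        this.maxMillis = max.toMillis();
    }

    // Delay before retry number `attempt` (1-based):
    // initial * multiplier^(attempt - 1), capped at the maximum.
    Duration delayFor(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        double raw = initialMillis * Math.pow(multiplier, attempt - 1);
        long capped = (long) Math.min(raw, (double) maxMillis);
        return Duration.ofMillis(capped);
    }
}
```

With an initial delay of 100 ms, a multiplier of 2 and a maximum of 1 second, the waits are 100 ms, 200 ms, 400 ms, 800 ms, and then 1 second for every later retry.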

Only for HTTP 5** family errors

According to the HTTP standard, 5** family errors represent server errors, meaning that the server (your dependency) failed to fulfil an apparently valid request.

On the other hand, 4** errors represent errors caused by the client (your service), so in this case you must change your request and send a new one to obtain a valid response.

Considering those rules, only retry on 5** errors, never on 4** errors! Retrying on a 4** error is the definition of insanity often attributed to Einstein: doing the same thing again and expecting a different result.

You must combine retries for 5** errors with other techniques, like exponential backoff and a circuit breaker.
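As a rough illustration of that rule combined with a growing delay, the sketch below retries a call only while it returns a 5** status. The class RetryOn5xx, the IntSupplier standing in for "perform the HTTP call and return its status code", and the doubling delay are all assumptions for this example; a circuit breaker would wrap around it and is left out for brevity.

```java
import java.util.function.IntSupplier;

// Hypothetical retry helper: retries only on 5** responses.
class RetryOn5xx {

    // 5** means the server failed; anything else is not worth repeating as-is.
    static boolean isRetryable(int status) {
        return status >= 500 && status <= 599;
    }

    // Runs `request` up to maxAttempts times and returns the last status code.
    // The wait between attempts doubles each time, starting at initialDelayMillis.
    static int execute(IntSupplier request, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        int status = request.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && isRetryable(status); attempt++) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop retrying if interrupted
                return status;
            }
            delay *= 2;
            status = request.getAsInt();
        }
        return status;
    }
}
```

A 4** response leaves the loop after the first attempt, so the caller gets the error immediately and can fix the request instead of repeating it.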

Connection Pool

Using a connection pool is a good practice when you are dealing with external components, like a service dependency.

A connection has a cost in terms of computational resources (memory and CPU) and also in terms of your application's performance. Pooling connections is a way to reuse a connection that has already been created.

Define an HTTP connection pool

As we're talking about Microservices communication and, consequently, about HTTP calls, my advice is: use an HTTP connection pool.

But take care with the connection pool configuration. You must ensure that you are not wasting resources. If you create too many connections, you'll probably consume all of your machine's resources. If you don't create enough connections, your application will waste time creating a connection for each request it receives, and you'll pay for this in terms of availability and performance.

There is no magic recipe for this. You must find the right balance.

You must know how your application works: how many threads (or processes) can run in parallel for a single request? How many service calls can a thread make? How many hosts does your application call? What is the capacity of your dependency? What is your throughput? What is the profile of the machine your application runs on? These are some of the questions to answer in order to find the right balance.

A good way to start is to define a small pool and observe your application's performance and resource consumption.
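One rough way to pick that starting size, which is my assumption rather than a rule, is Little's Law: the average number of requests in flight equals throughput multiplied by average response time. The sketch below turns this into a first guess; the class PoolSizeEstimate and the headroom factor are hypothetical, and the result is only a starting point to validate by observation.

```java
// Hypothetical helper: first guess for a connection pool size.
class PoolSizeEstimate {

    // requestsPerSecond: calls your service makes to ONE dependency per second
    // avgResponseMillis: average response time of that dependency
    // headroom: safety factor for bursts (for example 1.5), an assumption to tune
    static int initialPoolSize(double requestsPerSecond, long avgResponseMillis, double headroom) {
        // Little's Law: requests in flight = throughput * time spent per request
        double inFlight = requestsPerSecond * avgResponseMillis / 1000.0;
        return Math.max(1, (int) Math.ceil(inFlight * headroom));
    }
}
```

For example, 200 calls per second to a dependency that answers in 50 ms on average keeps about 10 requests in flight, so with a headroom of 1.5 the first guess would be 15 connections.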

Define a dedicated pool for each dependency

This is a good way to isolate your service's dependencies from one another.

As each dependency is different from the others, mainly in terms of availability, capacity and throughput, a dedicated connection pool is useful in your service's day-to-day operation.
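A simple way to sketch this isolation with the JDK's own HTTP client (java.net.http, Java 11+) is to keep one HttpClient instance per dependency, since each instance manages its own connections. The class DependencyClients and the dependency names are assumptions for illustration; clients such as Apache HttpClient additionally let you set explicit pool limits, which is not shown here.

```java
import java.net.http.HttpClient;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: one HTTP client (and therefore one set of pooled
// connections) per dependency, so a slow dependency cannot exhaust the
// connections used to reach the others.
class DependencyClients {
    private final Map<String, HttpClient> clients = new ConcurrentHashMap<>();

    // Returns the dedicated client for a dependency, creating it on first use.
    HttpClient clientFor(String dependencyName) {
        return clients.computeIfAbsent(dependencyName, name ->
                HttpClient.newBuilder()
                        .connectTimeout(Duration.ofSeconds(1)) // illustrative value
                        .build());
    }
}
```

Calls to a hypothetical "catalog" service would then always go through clientFor("catalog"), and calls to "pricing" through clientFor("pricing"), each with its own connections and settings.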

Keep-alive

Connection reuse is the idea behind an HTTP connection pool, and in HTTP 1.* it depends on persistent (keep-alive) connections. In HTTP/1.0 the client has to ask for this with the "Connection: keep-alive" header, while in HTTP/1.1 connections are persistent by default.

From the HTTP point of view, either client or server can close a connection by sending the "Connection" header with the value "close", so watch out for intermediate layers like load balancers and web proxies.

You can test whether the connection to your dependency can be persistent with a simple cURL command in verbose mode (-v). If cURL reports the connection as "left intact" at the end of the output, it means this TCP connection can be reused.

Timeouts

A timeout is how long your application is willing to wait for an answer. It can apply to getting a new connection, to waiting for an HTTP call or to retrying a dependency that failed.
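As an illustration of the first two kinds, the sketch below sets a connection timeout and a response timeout with the JDK's HTTP client (java.net.http, Java 11+). The class TimeoutConfig is hypothetical, and the caller supplies the values and the URL.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper showing where each timeout is configured.
class TimeoutConfig {

    // Connection timeout: how long to wait while establishing a new connection.
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .build();
    }

    // Request timeout: how long to wait for the response of one HTTP call.
    static HttpRequest newRequest(String url, Duration responseTimeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(responseTimeout)
                .GET()
                .build();
    }
}
```

Sending the request with client.send(request, HttpResponse.BodyHandlers.ofString()) throws an HttpTimeoutException when the response timeout expires, which is the point where a fallback flow would take over.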

Less is (almost) always more

Waiting for a long time can harm your application by blocking resources such as threads. This can lead to application degradation.

Defining short timeouts helps avoid cascading failures. In a Microservice architecture, if a service fails, it is better to fail fast; otherwise the whole system can collapse. Just consider a scenario where one service calls another, which calls another, and then more and more services… What happens if the last service in this chain is slow to respond, and all the services before it are configured with long timeouts?

My tip here: look at the 95th or 99th percentile of the response time of each of your dependencies (individually) and configure a timeout based on what you find. Another option is to use 4x or 5x the average response time of your dependency.
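Both options are easy to compute from a sample of measured response times. The sketch below is one possible way to do it; the class TimeoutFromLatency is hypothetical, and the nearest-rank method used for the percentile is an assumption (monitoring tools may interpolate differently).

```java
import java.util.Arrays;

// Hypothetical helper: derives a timeout from measured response times (ms).
class TimeoutFromLatency {

    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, double percentile) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Alternative: a multiple (for example 4 or 5) of the average response time.
    static long fromAverage(long[] samplesMillis, int factor) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long sum = 0;
        for (long sample : samplesMillis) {
            sum += sample;
        }
        return factor * (sum / samplesMillis.length);
    }
}
```

Whichever option you pick, compute it per dependency and revisit it as the dependency's response times change.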


In the next article of this series, I'll share some insights from our experience monitoring and operating Microservices.