Error Handling in Microservices

Transient Errors Are Evil

D. Husni Fahri Rizal
The Legend
6 min read · May 21, 2021

--

Error handling is a technique for predicting, detecting, and correcting programming errors. Ideally, a successful application should avoid errors, recover from errors without having to stop the program, or, when a fatal and unrecoverable error occurs, stop gracefully and write the error details to persistent storage.

In a monolithic architecture, error management is simpler than in a distributed system or microservice architecture. Error handling in a monolith can be very straightforward, but we cannot rely on the approaches that are typical of monoliths to solve the error problems of microservices.

When a system is distributed as microservices, issues emerge that do not arise in a monolith: service connectivity, serialization and deserialization, dead downstream services, bugs, and, most commonly, higher latency and blocking.

The secret to successfully handling errors in microservices is quick, uncomplicated error handling applied consistently. The key is to simplify and break down uncertainty. Remember the KISS principle and Divide et Impera (divide and conquer). We can divide errors into two categories: transient errors and non-transient errors.

Transient Error

Transient errors are errors that occur for a short period of time because a resource is temporarily unavailable or overloaded. Examples include link issues, dead services, network routes that are unreachable for a few seconds or milliseconds, a service under heavy load returning sporadic HTTP 503 Service Unavailable responses, or a database being failed over to a different server and therefore unavailable for a few seconds.

This is a nasty class of error that is difficult to spot and can happen at any time, because it appears and disappears at random intervals (it is intermittent).

Non-Transient Error

The term "non-transient error" refers to an error that keeps occurring until it is corrected. Bugs and serialization and deserialization issues are examples. Since these errors are visible and easy to reproduce, dealing with them is straightforward: we should handle them the same way we would in a monolithic architecture.

Transient Error Handling

To handle this type of error, a retry mechanism with a back-off policy is generally applied to the failing operation. The back-off strategy can be implemented in the following ways.

  • Retry immediately: when a problem occurs, we retry the call right away, with no delay.
  • Retry at a fixed interval: when an error occurs, we wait a fixed, predetermined amount of time before retrying.
  • Retry with exponential back-off: when an error occurs, we wait an exponentially increasing amount of time before retrying, for example 2, 4, 8, 16, 32, … seconds (see the sketch after this list).
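
As an illustration of the third strategy, here is a minimal sketch in Go of a retry loop with exponential back-off, using only the standard library. The function callDownstream, the maximum number of attempts, and the base delay are hypothetical placeholders rather than the API of any particular framework.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callDownstream is a hypothetical operation that may fail transiently.
func callDownstream() error {
	return errors.New("503 service unavailable") // pretend transient failure
}

// retryWithBackoff retries op up to maxAttempts times, doubling the wait
// between attempts: baseDelay, 2*baseDelay, 4*baseDelay, ...
func retryWithBackoff(op func() error, maxAttempts int, baseDelay time.Duration) error {
	delay := baseDelay
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil // success, stop retrying
		}
		if attempt == maxAttempts {
			break // give up after the last attempt
		}
		fmt.Printf("attempt %d failed (%v), retrying in %s\n", attempt, err, delay)
		time.Sleep(delay)
		delay *= 2 // exponential back-off
	}
	return fmt.Errorf("all %d attempts failed: %w", maxAttempts, err)
}

func main() {
	if err := retryWithBackoff(callDownstream, 5, 2*time.Second); err != nil {
		fmt.Println(err) // hand the failure over to a dead letter mechanism, for example
	}
}
```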

The best retry strategy is exponential back-off, ideally with jitter added as well. Back-off matters because when a transient error occurs, every client is likely to retry immediately and simultaneously, which usually makes the situation worse; the resulting flood of requests can even amount to a self-inflicted Denial of Service (DoS).

When a service is in trouble, exponential back-off from its callers gives it enough breathing space to recover.

Some exponential back-off algorithms also add a randomly calculated delta to the back-off time. This ensures that when multiple clients use the same back-off policy, their retries are less likely to land at exactly the same moment. For example, instead of retrying at the raw exponential intervals of 2, 4, 8, 16 seconds, and so on, the formula adds a random +/- 20% delta so that the retries occur at, say, 1.7, 4.2, 8.5, and 15.4 seconds. This is what we mean by adding jitter to the back-off process.

Exponential back-off policy without Jitter
An exponential back-off policy with Jitter
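
A minimal sketch of how such a jitter could be added on top of the exponential delay; the +/- 20% range follows the example above, and withJitter is a hypothetical helper, not a library function.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// withJitter spreads a back-off delay by a random +/- 20% delta so that
// clients sharing the same policy do not all retry at the same instant.
func withJitter(delay time.Duration) time.Duration {
	factor := 0.8 + 0.4*rand.Float64() // random factor in [0.8, 1.2)
	return time.Duration(float64(delay) * factor)
}

func main() {
	base := 2 * time.Second
	for i := 0; i < 4; i++ {
		fmt.Printf("raw %s, with jitter %s\n", base, withJitter(base).Round(100*time.Millisecond))
		base *= 2 // 2, 4, 8, 16 seconds
	}
}
```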

Several frameworks can be used for retry and back-off, including Spring Retry and Resilience4j for Java, or https://github.com/avast/retry-go and https://github.com/gojek/heimdall for Go.

When to Do a Retry

We don’t have to retry every error. Only transient errors should be retried, per the definition above. Detection is easy and fast when the third-party provider or database service we call already exposes a specific error type such as a TransientErrorException. Most systems, however, do not yet expose such specific errors, so we must be careful when classifying them.
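
When the client library does expose a dedicated transient error type, the retry decision can key off it. Below is a minimal sketch in Go; TransientError and callDatabase are hypothetical names standing in for whatever the provider actually exposes (such as the TransientErrorException mentioned above).

```go
package main

import (
	"errors"
	"fmt"
)

// TransientError is a hypothetical marker type for intermittent failures,
// analogous to the TransientErrorException some providers expose.
type TransientError struct{ Reason string }

func (e *TransientError) Error() string { return "transient: " + e.Reason }

// callDatabase is a hypothetical operation that wraps a transient failure.
func callDatabase() error {
	return fmt.Errorf("query failed: %w", &TransientError{Reason: "failover in progress"})
}

func main() {
	err := callDatabase()
	var te *TransientError
	if errors.As(err, &te) {
		fmt.Println("transient error, safe to retry with back-off:", te.Reason)
	} else if err != nil {
		fmt.Println("non-transient error, do not retry:", err)
	}
}
```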

According to the HTTP specification, GET, HEAD, PUT, and DELETE are idempotent operations, so you may retry such requests unless the owner of the service you are calling advises otherwise. POST and PATCH, on the other hand, are not idempotent; unless idempotency is implemented explicitly, retrying them is not safe because it may cause side effects such as charging a customer multiple times.
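
A minimal sketch of that decision for HTTP calls: retry only when the method is idempotent and the response looks transient. Treating 429 and all 5xx status codes as transient here is an illustrative assumption, not a rule from the specification.

```go
package main

import (
	"fmt"
	"net/http"
)

// Idempotent methods per the HTTP specification.
var idempotentMethods = map[string]bool{
	http.MethodGet:    true,
	http.MethodHead:   true,
	http.MethodPut:    true,
	http.MethodDelete: true,
}

// shouldRetry returns true only for idempotent methods that failed with a
// status code we treat as transient (429 Too Many Requests or any 5xx).
func shouldRetry(method string, statusCode int) bool {
	if !idempotentMethods[method] {
		return false // e.g. POST or PATCH: retrying may charge a customer twice
	}
	return statusCode == http.StatusTooManyRequests || statusCode >= 500
}

func main() {
	fmt.Println(shouldRetry(http.MethodGet, http.StatusServiceUnavailable))  // true
	fmt.Println(shouldRetry(http.MethodPost, http.StatusServiceUnavailable)) // false
}
```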

Dead Letter Events

The retry-based handling described above can still lose data: if the system is still unhealthy after several retries, the data that arrived during the transient error is dropped.

One solution is to implement a service dedicated to managing data storage for situations where retries are unable to resolve the issue. When a problem arises, we can divert the data out of the pipeline and direct it to this service. This service will store the problematic data and retry execution once the system has fully recovered, ensuring that no data is lost during disruptions.

Dead letter events are usually implemented with a message broker so that the process is asynchronous and non-blocking. For example, we can use Kafka in combination with a NoSQL store as the storage medium, preferably together with a scheduler that drives the re-processing.

Examples of handling dead letter events
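
To make the flow concrete, here is a minimal sketch of the dead letter pattern in Go. The Publisher interface, the orders.dead-letter topic, and the stdout implementation are hypothetical stand-ins for a real Kafka producer; the NoSQL storage and the scheduler that re-drives the events are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Publisher is a hypothetical abstraction over a real message producer
// (for example a Kafka client); only the pattern matters here.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// DeadLetterEvent wraps the original payload with failure metadata so a
// dedicated service can store it and re-process it once the system recovers.
type DeadLetterEvent struct {
	SourceTopic string    `json:"sourceTopic"`
	Payload     []byte    `json:"payload"`
	Error       string    `json:"error"`
	FailedAt    time.Time `json:"failedAt"`
}

// sendToDeadLetter is called after all retries have been exhausted.
func sendToDeadLetter(pub Publisher, sourceTopic string, payload []byte, cause error) error {
	event := DeadLetterEvent{
		SourceTopic: sourceTopic,
		Payload:     payload,
		Error:       cause.Error(),
		FailedAt:    time.Now().UTC(),
	}
	body, err := json.Marshal(event)
	if err != nil {
		return fmt.Errorf("marshal dead letter event: %w", err)
	}
	// "orders.dead-letter" is a hypothetical topic consumed by the dead letter service.
	return pub.Publish("orders.dead-letter", body)
}

// stdoutPublisher is a stand-in Publisher that just prints the message.
type stdoutPublisher struct{}

func (stdoutPublisher) Publish(topic string, payload []byte) error {
	fmt.Printf("published to %s: %s\n", topic, payload)
	return nil
}

func main() {
	cause := fmt.Errorf("still failing after 5 retries")
	_ = sendToDeadLetter(stdoutPublisher{}, "orders", []byte(`{"orderId":"42"}`), cause)
}
```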

Apart from retries and dead letter events, there are other mechanisms for handling transient errors, such as the Circuit Breaker and the Rate Limiter, which we will try to cover together with Sidecar Services, Client-Side Load Balancing, and API Gateways.

In closing, error handling in microservices and distributed systems is generally not easy and needs to be well designed. Once it has been designed properly, don’t forget to test it by simulating the transient errors that occur in production. Simulating transient errors is not easy, especially in production; Chaos Monkey, as part of Chaos Engineering, is one way to do it. Maybe one day we will discuss how Netflix uses Chaos Monkey.

References

  1. https://netflix.github.io/chaosmonkey/
  2. https://en.wikipedia.org/wiki/Chaos_engineering
  3. https://www.gremlin.com/chaos-monkey/
  4. https://github.com/Netflix/chaosmonkey
  5. https://github.com/avast/retry-go
  6. https://github.com/gojek/heimdall
  7. https://docs.spring.io/spring-batch/docs/current/reference/html/retry.html
  8. https://resilience4j.readme.io/docs/getting-started

Sponsor

Need a t-shirt with a programming theme?

Kafka T-shirt

Elastic T-shirt

Contact The Legend.
