Event-based Microservices: Error Handling
Simple, Scalable, and Robust
If you‘re not familiar with the basics of an event-based microservice architecture, you should check out this Event-based Microservices: Overview post.
What’s the problem?
When a system is distributed (like a microservice-based system) along with the advantages comes the potential for errors. Examples include: connectivity issues, serialization/deserialization issues, downstream system outages, peer system outages, bugs, and the list goes on. The handling of these errors in a simple and repeatable way is important in the success of a project.
This sounds like a daunting problem, but the key to the solution is to simplify and break down the complexity. We can break down the errors into two types, transient and non-transient errors. Examples of these are listed below.
- Connectivity issues
- System outages
- Serialization/deserialization issues
These two types of errors can be handled in repeatable ways. Transient errors can be handled by retrying, as these should fix themselves given time. Non-transient errors should be handled by alerting and notifying a responsible team to fix the issue, as these require human intervention. We will look at how to handle both types of errors below.
Dead Letter Events
We will start by looking at non-transient errors. What should the system do when something unexpected and catastrophic happens?
As distributed systems handle a number of events simultaneously, the last thing we want to do is block the pipeline of data coming through our system. But we also don’t want to lose this unprocessable data. So once this happens we mark the error event as processed, pull it out of the pipeline, and bump it to the side in the form of a dead letter event. The dead letter event contains all the information of the input event so that no data is lost, and it can be reprocessed at a later point.
Once we have done this, the disaster has been averted. Normal data is still flowing and we have not lost any information. From this point, our goal is to notify someone and point them towards the dead letter event. As we have these dead letters in the event bus, this can be done by using a monitoring tool relevant to the message bus technology being used. All the dead letter queues should alert whenever they receive events. From this point, the cause of the error can be fixed and the relevant dead letter events can then be manually flushed back into the system—picking up where they left off.
What should the system do if the cause of an error will fix itself given time? In the case of transient errors, we should handle them using retry events.
The process for retry events starts off the same as dead letter events. We have a flow of data and when an event errors, we again mark it as processed and bump it to the side, maintaining the flow of valid events.
This is where the difference comes in, the error data is put into retry events which will automatically be reprocessed after a given amount of time. This amount of time may vary depending on the type of error and the number of times it has failed before. As a safety precaution once a retry event has failed too many times, it should be turned into a dead letter event to notify someone, as this is not as transient as expected.
Something to note—for simplicity—is to separate errors (as described above) from expected failures. Expected failures are failures that are allowed as part of the business logic or engineering logic. These are handled by the application-level logic and should have bespoke events which cause the correct downstream actions, no retries or dead letters should be created.
An example of an expected failure could be in a banking system. Like if a user wants to pay for something but doesn’t have enough funds in their account. This should result in a transaction failure, which should be handled by the application logic, potentially resulting in a payment declined notification to relevant users.
The similar terminology can easily become confusing and is why clearly separating expected errors and unexpected errors are so important and should be called out.
Implementation with Kafka
If we look at Kafka as a way to implement handling retries or dead letters there are two simple solutions. For both of these approaches the dead letter events and retry events will be stored in Kafka topics. A topic for dead letters, and multiple topics for each retry delay duration. The first solution is having the error logic in all microservices and the second is handling the errors in new specific retry and dead letter microservices. We will look at the advantages and disadvantages of both these approaches.
The first approach works great, although it comes with a complex implementation. If we can have multiple retries with varying delays, we will need multiple queues; one queue for each retry length. This is due to Kafka queues being ordered—forcing events to be processed in order. Along with the retry logic, we need additional logic to be able to flush dead letter queues on demand. Managing these queues requires logic in each individual microservice, increasing the complexity throughout the system.
The second approach works in the same way, although it removes the retry complexity from all microservices, and places it in a single retry microservice. The retry microservice’s job is to track and action all retries. This microservice receives an event, writing it to its own topics with both the event to retry and the timestamp to retry that event. It then pushes out these retry events once their timestamp has been reached. We can also have a single dead letter service that handles flushing dead letter queues on request.
On top of these two approaches, we also have the option of using a database, instead of Kafka topics, to store the retries. This simplifies the retry implementation (as we don’t have duplicated topics for different retry durations) and allows for more complex retry duration logic (we aren’t bound by creating a topic for each delay duration). This comes with two limitations. If the retry database is down, retries cannot be actioned for that time. And it does not support exactly once delivery of events, as there is potential event duplication through the interface with the database not being transactional.
Here is a table of the advantages and disadvantages of each listed out:
Distributed systems require seamless error handling to work well; a simple solution is to treat transient errors by retrying and non-transient errors by alerting.
If you found this interesting you can read more on event-based microservices here!