Exception Handling in Distributed Systems

Published in
7 min readDec 3, 2021

--

As a (Mendix) developer, we’ve all been there; integrating with systems to handle data or trigger actions. By using the Mendix platform, this has become easier than ever. Nevertheless, it can be hard to cope with complex systems where the other party might become unavailable or unpredictable. Distributed systems are weird because issues can occur in several ways.

For now, let’s say that there are three main types of distributed systems.

  • Offline — Mainly seen for batch processing and data analyses. Fewer use cases, but the most robust option. Highly scalable and self-sustainable. Fewer complex and critical fault scenarios.
  • Soft real-time — Often more critical compared to offline and further connected in the ongoing processes. Acting from the sideline with more time available for performing its tasks. Think about updating a search index. If the update process fails, the previous index is still there and the results will be up-to-date at the next interval.
  • Hard real-time — The type of architecture that is hard to manage. Think about request/reply and transactional systems. Speed is often important because of waiting customers or running processes.

Scenario

Imagine a Mendix application that you are using as an order management system. At a certain moment in time, an employee tries to update the status of an order. Out of nothing, the system becomes very slow and stops responding. At that same moment, a scheduled event caused the application to run out of memory; the application logs are showing that the app will be rebooted. Although the employee did not cause the issue, both processes shared the same system and thereby they share faith.

This single application setup rapidly becomes more complex when the out-of-memory issue was caused by an external request. Think about an external website that is sending new orders to that same application. Because of a bug in the website, the updates are overwhelming the order management system. The website will still be available, but random things will start to fail. The employee might report the issue and your monitoring solution will report an alert, but how to cope with the processes in the consumer-faced web application? The consumer application should be able to handle several scenarios. Has the ‘create order’ request been delivered successfully? If so, did the order management system process the request? Maybe the order has been processed, but delivering the response to the consumer app failed. And what if everything went right so far, but the consumer app is unable to update the order status when receiving the response?

Altogether, you can see that the complexity increases at a high pace. You will need to handle all potential errors one by one. Testing the architecture will be just as hard. Even within the consumer application itself, you might come up with a large number of test scenarios. When testing this in a distributed setup, the number of cases will grow exponentially because of the variety of network and processing issues that might occur. And what to do when an unknown error occurs? Has the order been processed by the order management system at all? If you have the overall picture, it will still be hard to execute these tests. Most often it is not easy to force a system of systems into a certain error state.

Distributed bugs can be hidden for years, but they can have a large impact on your production environment. Most often, a very specific combination (read; the perfect storm) could trigger several failures in an environment where it is hard to pinpoint the exact cause of the occurring issues.

Distributed Exception Handling

When thinking about handling failure in a distributed system, four categories come to mind first. In this blog, we will discuss why one of them isn’t the preferred option and how you can handle this differently.

In short, the categories are retry, parallel attempts, failover, and fallback. Where the first three options can make your environment more robust when implemented correctly, a fallback can cause more issues than it is trying to solve when it’s not fully thought out.

By using the Queue module from the Mendix Marketplace, it is easier to implement a retry policy. In Mendix 9 you can also use the internal Task Queue, but it is currently a bit more cumbersome to set up a retry mechanism here.

Scenario

To make this story more visual, we start with a scenario. You build a solution in which you need to send file documents from system A to system B. You are the developer of system A and delivered a solution in which you use the web service of system B to send the file documents whenever the process requires this. Unfortunately, system B is not always available. In addition to that, system B is not able to cope with larger files at an irregular interval. As the business reported that critical documents are now missing, you implemented a fallback scenario. Whenever system B throws an error of any kind, you send the documents via mail to the relevant business unit. Now they can use the documents directly and upload them to system B themselves.

A possible fallback scenario in which we send or queue an email in case the REST call fails.

Why this is not the preferred option

Although this might feel like a decent solution that will be accepted by the business, there are several reasons to not go with this option. In addition to the possible compliance objections (who can access the mailbox), there are also some technical considerations.

Tradeoff

Potentially, the fallback is not worth the risk. Eventually, you are trying to reach the same end goal with a tradeoff. Is this tradeoff worth it? May the application decide if this is worth it? Will this still be the case at a later moment in time? With the scenario in mind, you are introducing an alternative process that includes more manual action. Imagine that the company will grow over the years and a major outage of system B a few years later is causing the business unit to have flooded mailboxes and too many manual steps to cope with the extra work.

The situation above tells us that most often, it is not worth the risk. The impact of the fallback scenario (the need to upload potentially hundreds of documents from a mailbox manually) can be larger compared to the initial failure.

Predictability

A fallback scenario is likely to be triggered rarely or not at all. Because of this, the impact of the fallback logic is often underestimated. Think about our mail fallback scenario again. Because of the implemented solution, you might expect that the document will always be there. But what if the fallback mechanism fails as well? Your email server could be unavailable too. Additionally, a fallback mechanism can create an unpredictable load on your systems. Imagine that the large outage is causing the fallback mechanism to send all documents via mail. Because this has never happened before, you didn’t take into account that the mailbox was filling up due to a large number of attachments.

While this isn’t a clean example, it shows the idea behind a failover scenario. In a solid architecture, this role is fulfilled by a load balancer.

Testing

Testing a fallback scenario is hard. You need to force the system in a state in which you can predict that the fallback mechanism will be triggered. The fallback scenario is less likely to be triggered, so it might be considered as a harmless piece of logic in your application. The nature of such solutions increases the likelihood of latent bugs that might be hidden for years and are hard to solve when they occur.

With multiple fallback scenarios in a distributed fashion, testing will be even harder. Several pressure points and potential bottlenecks will arise. A potential overload in one of these components (or a combination of them) can be hard to simulate and is often discarded because of the low likeliness.

“At Amazon, we have found that spending engineering resources on making the primary (non-fallback) code more reliable usually raises our odds of success more than investing in an infrequently used fallback strategy.” - Jacob Gabrielson, Senior Principal Engineer at AWS

Alternatives

After reading through the potential pitfalls, you might be looking for alternative options for handling cases like these. Many blog posts can be spent on this alone. Hopefully, this section will give you a head start.

  • Design and build a more resilient and durable non-fallback scenario. For example, you can decide to only use services with high availability.
  • Make the source application responsible for handling errors. Keep things manageable.
  • My personal favorite: create a retry mechanism if possible. Most often, small things go wrong when interacting between systems. A retry mechanism with an exponential backoff can make your architecture more resilient.
  • Create a failover instead of a fallback. Make sure that the output of the failover process is just as reliable and the output of the process is the same without concessions. Test the failover by using both paths in production regularly.
  • Proactively push data to listening parties as soon as it is available. When the data is available already, you become less dependent on the availability of one another. This can be combined with decent retry logic.

Focus on the logical paths in your logic that occur regularly rather than rarely. Strive to make your main systems more robust and predictable. Use push instead of pull whenever you can. If fallback logic is a necessity; make sure that it is testable and guarantee that the alternative is just as stable and reliable as the regular flow.

Further reads

If you would ask me, having a good mentor alongside can improve your architectural way of thinking a lot. In addition to that, many non-Mendix-specific blogs can be very beneficial as well. Regarding this topic about distributed systems, I would like to give a shoutout to the Amazon Builders’ Library.

Read More

--

--

Mendix Expert Consultant at CLEVR | Mendix MVP — Exploring AI, ML and everything else around data on Azure, GCP and AWS