Microservices and Cascading Failures

Floyd May
Aug 18, 2018

When you’re writing software in a microservices architecture, you will, no doubt, encounter a situation where some bit of processing has become too big, or too complex, to be kept all in one place. That overly complex bit needs to be split up across two or more services. When this happens, avoid creating direct coupling between microservices as you refactor that logic out.

Keep it Asynchronous

Remember that the purpose of microservices is decoupling. Carving a piece of logic off into another microservice is not like a simple Extract Method refactoring. As soon as a remote call is involved, it behaves differently than a local function call: it can fail, time out, or land on a service that is down or in the middle of a deploy.

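Here is the tempting approach, and it is a bad idea: Service 1 calls Service 2 to compute B, waits for the answer, and only then computes C.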

If we simply treat the refactoring like an Extract Method and make a remote procedure call (RPC), waiting for Service 2’s computation of B to complete before computing C, then Service 1 is directly coupled to Service 2: whenever Service 2 fails, Service 1 is guaranteed to fail too. Rather than delivering on the promises of microservice architectures (failure resilience, zero-downtime deploys, decoupling), RPCs between microservices cause cascading failures.
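Here is a minimal in-process Python sketch that makes the coupling concrete. The classes `Service1`, `Service2`, and `ServiceUnavailable`, the `up` flag, and the arithmetic standing in for B and C are all hypothetical. In a real system `compute_b` would be an HTTP or gRPC call, not a method call. The sketch only shows that a synchronous call turns a Service 2 outage directly into a Service 1 error.

```python
class ServiceUnavailable(Exception):
    """Raised when a (simulated) remote service cannot be reached."""


class Service2:
    """Hypothetical stand-in for Service 2's remote 'compute B' endpoint."""

    def __init__(self) -> None:
        self.up = True  # flip to False to simulate a redeploy

    def compute_b(self, a: int) -> int:
        if not self.up:
            raise ServiceUnavailable("Service 2 is down")
        return a * 2  # placeholder for the real computation of B


class Service1:
    """Hypothetical Service 1 that calls Service 2 synchronously (RPC style)."""

    def __init__(self, service2: Service2) -> None:
        self.service2 = service2

    def handle_request(self, a: int) -> int:
        # Block on Service 2, then compute C from B.
        # Any failure in Service 2 propagates straight to our caller.
        b = self.service2.compute_b(a)
        return b + 1  # placeholder for computing C


s2 = Service2()
s1 = Service1(s2)
print(s1.handle_request(20))  # 41

s2.up = False  # Service 2 starts a redeploy
try:
    s1.handle_request(20)
except ServiceUnavailable as exc:
    print("client sees an error:", exc)  # client sees an error: Service 2 is down
```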

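Let’s consider a different approach. Instead of calling Service 2 directly, Service 1 sends it a message through a message queue. Service 2 picks up the message and does its work whenever it is able to.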

Yes, it’s more complicated. That’s the price you pay for resilience.

Let’s talk about why this is a smarter solution in a microservices architecture. Say that when Service 1 sends that message, Service 2 is down because it’s in the middle of a redeploy. In the RPC case, Service 1 fails and the client gets an error. No bueno. In the message-based case, the message just sits in the queue until Service 2 comes back up, and then Service 2 happily moves things right along.
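The queue-based version can be sketched the same way. This sketch uses Python's standard-library `queue.Queue` as a stand-in for a real message bus. `Service2Worker` and its `up` flag, which simulates a redeploy, are hypothetical. A real broker would also persist messages and deliver them across processes. Notice that publishing the message never raises, even while Service 2 is down.

```python
import queue


class Service2Worker:
    """Hypothetical consumer that drains messages from the queue when it is up."""

    def __init__(self, inbox: "queue.Queue[int]", outbox: "queue.Queue[int]") -> None:
        self.inbox = inbox
        self.outbox = outbox
        self.up = True  # False simulates Service 2 being mid-redeploy

    def poll(self) -> int:
        """Process every waiting message; return how many were handled."""
        if not self.up:
            return 0  # down: messages simply stay in the queue
        handled = 0
        while True:
            try:
                a = self.inbox.get_nowait()
            except queue.Empty:
                return handled
            self.outbox.put(a * 2)  # placeholder for computing B
            handled += 1


work_queue: "queue.Queue[int]" = queue.Queue()
result_queue: "queue.Queue[int]" = queue.Queue()
worker = Service2Worker(work_queue, result_queue)

worker.up = False   # Service 2 is redeploying
work_queue.put(20)  # Service 1 publishes and moves on; no error is raised
print(worker.poll(), work_queue.qsize())  # 0 1  (the message waits in the queue)

worker.up = True    # redeploy finished
print(worker.poll(), result_queue.get_nowait())  # 1 40
# Service 1 (not shown) would consume result_queue and compute C.
```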

The downsides? Well, in the simple case earlier, a client can treat calls to Service 1 as RPCs: ask a question, get an answer immediately. In this more resilient case, it doesn’t work quite so simply. Service 1 needs to tell the client that its work will be completed asynchronously. If Service 1 is a web app, for instance, the client might be a web browser. Receiving the results might then require polling the server or setting up some kind of notification mechanism (SignalR, Socket.IO, etc.). Additionally, your infrastructure needs a message bus if it doesn’t already have one.
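On the client side, one common pattern is for Service 1 to accept the request, hand back a job ID right away, and let the client poll for the outcome. In an HTTP API this is often a `202 Accepted` response plus a status endpoint. The sketch below uses a hypothetical `Service1Api` with `submit`, `complete`, and `poll` methods, and it keeps jobs in an in-memory dict. A real service would need durable storage and would publish to the message bus inside `submit`.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    status: str = "pending"  # stays "pending" until the async work finishes
    result: Optional[int] = None


class Service1Api:
    """Hypothetical asynchronous front end for Service 1 (in-memory only)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}

    def submit(self, a: int) -> str:
        """Accept the request and return a job ID immediately."""
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = Job()
        # A real service would publish a message carrying `a` to the bus here.
        return job_id

    def complete(self, job_id: str, result: int) -> None:
        """Record the final result (C), e.g. from a message-bus consumer."""
        job = self.jobs[job_id]
        job.status = "done"
        job.result = result

    def poll(self, job_id: str) -> Job:
        """What the client calls repeatedly until the job is done."""
        return self.jobs[job_id]


api = Service1Api()
job_id = api.submit(20)
print(api.poll(job_id).status)  # pending
api.complete(job_id, 41)
print(api.poll(job_id))  # Job(status='done', result=41)
```

Swapping polling for a push notification (SignalR, Socket.IO, etc.) changes only how the client learns that the job is done. The contract of accepting now and finishing later stays the same.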

TL;DR

Avoid RPCs between microservices where possible: they cause cascading failures. If you refactor an operation out into a separate service, redesign the containing operation to be fully asynchronous. Use a message bus to insulate services from one another so that temporary failures, redeploys, or downtime in one service don’t guarantee failures in another.

Happy coding!
