Distributed Systems are Hard!

Tushar Sappal
Published in Tech Bits
6 min read · Aug 12, 2018

“Anything That Can Go Wrong, Will Go Wrong”

Forget Conway’s Law; distributed systems at scale follow Murphy’s Law: “anything that can go wrong, will go wrong”.

At large scale, statistics are no longer your friend!

The more instances of anything you have in your production ecosystem, the higher the likelihood that one or more of them will break. Services will fall over before they’ve received your message, while they’re processing it, or after they’ve processed it but before they’ve told you they have. The network will lose packets, disks will fail, and virtual machines will terminate unexpectedly.

There are things a monolithic architecture guarantees that are no longer true once we’ve distributed our system. Components (now services) no longer start and stop together in a predictable order. Services may unexpectedly restart, changing their state or their version. The result is that no service can make assumptions about another; the system cannot rely on one-to-one communication to function.

A lot of the traditional mechanisms for recovering from failure can make things worse in a distributed environment. Brute-force retries may flood your network, and restores from backups are no longer straightforward. There are design patterns for addressing all of these issues, but they require thought and testing; one of the simplest is sketched below.
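One such pattern is retrying with exponential backoff and jitter, so a struggling dependency gets breathing room instead of a thundering herd. A minimal sketch in Python; the function name, defaults and limits here are illustrative, not prescriptions:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with exponential backoff plus jitter.

    `operation` is any zero-argument callable that may raise an exception;
    the limits are illustrative defaults, not recommendations.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Sleep for a random slice of an exponentially growing window,
            # so a crowd of retrying clients does not stampede in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```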

“What We’ve Got Here is Failure to Communicate”

There are traditionally two high-level approaches to application message passing in unreliable (i.e. distributed) systems:

  1. Reliable but slow: keep a saved copy of every message until you’ve had confirmation that the next process in the chain has taken full responsibility for it.
  2. Unreliable but fast: send multiple copies of messages to potentially multiple recipients and tolerate message loss and duplication.

The reliable and unreliable application-level comms we’re talking about here are not the same as network reliability (e.g. TCP vs UDP). Imagine two stateless services that send messages to one another directly over TCP. Even though TCP is a reliable network protocol, this isn’t reliable application-level comms: either service could fall over and lose a message it had successfully received but not yet processed, because stateless services don’t durably save the data they are handling.

We could make this setup reliable at the application level by putting stateful queues between the services to save each message until it has been completely processed. The downside is that it would be slower, but we may be happy to live with that if it makes life simpler, particularly if we use a managed stateful queue service so we don’t have to worry about its scale and resilience.

The reliable approach is predictable but involves delay (latency) and work: lots of confirmation messages and resiliently saving data (statefulness) until the next service in the chain has signed off that it has taken responsibility for the message. A reliable approach does not guarantee rapid delivery, but it does guarantee that all messages will be delivered eventually, at least once. In an environment where every message is critical and no loss can be tolerated (credit card transactions, for example) this is a good approach. AWS Simple Queue Service (Amazon’s managed queue service) and Azure Service Bus Queues are a couple of examples of stateful services that can be used in a reliable way.
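To make the reliable pattern concrete, here is a minimal sketch of an at-least-once SQS consumer using boto3. The queue URL and the `handle` function are placeholders; the key point is that the message is only deleted after it has been fully processed, so a crash midway means it reappears and is retried:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder

def handle(body: str) -> None:
    """Hypothetical business logic; must be safe to run more than once."""
    print("processing", body)

while True:
    # Long-poll for messages; each message stays on the queue (hidden by the
    # visibility timeout) until we explicitly delete it.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        handle(msg["Body"])
        # Delete only after processing succeeds: this is what makes the
        # handoff at-least-once rather than at-most-once.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Because the message can reappear after a crash, `handle` has to tolerate being run more than once, which is exactly the “at least once” trade-off described above.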

The second, unreliable, approach involves sending multiple messages and crossing your fingers. It’s faster end-to-end, but it means services have to expect duplicates and out-of-order messages, and that some messages will go missing. Unreliable service-to-service communication might be used when messages are time-sensitive (i.e. if they are not acted on quickly it is not worth acting on them at all, like video frames) or when later data simply overwrites earlier data (like the current price of a flight). For very large scale distributed systems, unreliable messaging may be used because it is faster with less overhead. However, microservices then need to be designed to cope with message loss and duplication, and to forget about order, as in the sketch below.
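With the unreliable approach the burden shifts onto the consumer, which has to drop duplicates and ignore stale, out-of-order updates. A minimal sketch, assuming (purely for illustration) that each message carries a key and a monotonically increasing version:

```python
# Latest version applied per key; in production this would live in a
# durable store, not an in-process dict.
latest_version: dict[str, int] = {}

def store_value(key: str, value: str) -> None:
    """Hypothetical write to the service's own state."""
    print(f"{key} -> {value}")

def apply_update(key: str, version: int, value: str) -> bool:
    """Apply a possibly duplicated, possibly out-of-order update.

    Returns True if the update was applied, False if it was ignored
    because this version (or a newer one) has already been seen.
    """
    if version <= latest_version.get(key, -1):
        return False  # duplicate or stale: later data already overwrote it
    latest_version[key] = version
    store_value(key, value)
    return True

# Usage: the duplicate and the out-of-order update are both ignored.
print(apply_update("flight-42", 3, "$120"))  # True
print(apply_update("flight-42", 3, "$120"))  # False (duplicate)
print(apply_update("flight-42", 2, "$150"))  # False (arrived out of order)
```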

What time is it?

There’s no such thing as common time, a global clock, in a distributed system. For example, in a group chat there’s usually no guaranteed order in which my comments and those sent by friends in different geographies will appear. There’s not even any guarantee we’re all seeing the same timeline, although one ordering will generally win out if we sit around long enough without saying anything new.

Fundamentally, in a distributed system every machine has its own clock and the system as a whole does not have one correct time. Machine clocks may get synchronized, but even then transmission times for the sync messages will vary and physical clocks run at different rates, so everything gets out of sync again almost immediately. On a single machine, one clock can provide a common time for all threads and processes; in a distributed system this is just not physically possible. In our new world, then, clock time no longer provides an incontrovertible definition of order. The monolithic concept of “what time is it?” does not exist in a microservice world and designs should not rely on it for inter-service messages.
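One common answer is to stop asking for wall-clock time altogether and track order with a logical clock instead. A minimal Lamport clock sketch, included here as an illustration rather than anything from the original text:

```python
class LamportClock:
    """Logical clock: counts events, not seconds, to give a causal order."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        """Tick for an event that happens on this node."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Tick and return the timestamp to attach to an outgoing message."""
        return self.local_event()

    def receive(self, message_time: int) -> int:
        """Merge the sender's timestamp so this node never 'goes backwards'."""
        self.time = max(self.time, message_time) + 1
        return self.time

# Usage: node B receives a message stamped 7 while its own clock reads 3;
# B's clock jumps to 8, so B's subsequent events are ordered after the send.
b = LamportClock()
b.time = 3
print(b.receive(7))  # 8
```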

The truth is out there?

In a distributed system there is no global shared memory and therefore no single version of the truth. Data will be scattered across physical machines. In addition, any given piece of data is more likely to be in relatively slow and inaccessible transit between machines, so decisions need to be based on current, local information. This means that answers will not always be consistent in different parts of the system. In theory they should eventually become consistent as information disseminates across the system, but if the data is constantly changing we may never reach a completely consistent state short of turning off all new inputs and waiting. Services therefore have to handle the fact that they may get “old” or simply inconsistent information in response to their questions.
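To see why “eventually consistent” is the best we can hope for, here is a toy simulation (not any real system’s protocol) of replicas gossiping pairwise until they all agree on the newest value:

```python
import random

# Five replicas, each holding a (version, value) pair; one has seen a newer write.
replicas = [(0, "old price")] * 5
replicas[2] = (1, "new price")

def anti_entropy_round(replicas):
    """One random pairwise sync: both replicas keep whichever value is newer."""
    a, b = random.sample(range(len(replicas)), 2)
    newest = max(replicas[a], replicas[b])  # tuples compare by version first
    replicas[a] = replicas[b] = newest

rounds = 0
while len(set(replicas)) > 1:  # not yet consistent across all replicas
    anti_entropy_round(replicas)
    rounds += 1

print(f"converged after {rounds} rounds of gossip: {replicas[0]}")
```

Until convergence, two replicas can give different answers to the same question, which is exactly the “old or inconsistent information” the prose above warns about.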

Talk Fast

In a monolithic application most of the important communications happen within a single process, between one component and another. Communications inside a process are very quick, so lots of internal messages being passed around is not a problem. However, once you split your monolithic components out into separate services, often running on different machines, things get trickier.

In the best case it takes about 100 times longer to send a message from one machine to another than it does to pass a message internally from one component to another. Many services use text-based RESTful messages to communicate. RESTful messages are cross-platform and easy to use, read and debug, but slow to transmit and receive. In contrast, Remote Procedure Call (RPC) messages paired with binary protocols are not human-readable, and are therefore harder to debug and use, but are much faster to transmit and receive. It might be 20 times faster to send a message via an RPC method such as gRPC, a popular example, than to send a RESTful message.
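To make the text-versus-binary point concrete, here is a tiny comparison of the same record encoded as JSON and as a packed binary struct; the struct is a stand-in for what a protobuf-style encoding achieves, not actual gRPC wire format:

```python
import json
import struct

# The same record: (flight id, price in cents, timestamp).
record = (184523, 42999, 1534032000)

as_json = json.dumps(
    {"flight_id": record[0], "price_cents": record[1], "ts": record[2]}
).encode("utf-8")

# Three unsigned integers packed back to back: no field names, no quotes,
# and no text parsing on the receiving side.
as_binary = struct.pack("!IIQ", *record)

print(len(as_json), "bytes as JSON")    # ~60 bytes
print(len(as_binary), "bytes packed")   # 16 bytes
```

The smaller, fixed-layout payload is cheaper to serialize, transmit and parse, which is where much of the RPC speed advantage comes from.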

Testing to Destruction

The only way to know if your distributed system works and will recover from unpredictable errors is to continually engineer those errors and continually repair your system. Netflix uses “Chaos Monkey” to randomly pull cables and crash instances. Any test tool needs to test your system for resilience and integrity and also, just as importantly, test your logging to make sure that if an error occurs you can diagnose and fix it retrospectively — i.e. after you have brought your system back online.
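In the spirit of Chaos Monkey, the core loop is small: pick a random instance from a designated group, kill it, and check that the system recovers and that your logs explain what happened. A heavily simplified boto3 sketch, assuming a hypothetical chaos-group tag; do not point anything like this at infrastructure you are not prepared to lose:

```python
import random
import boto3

ec2 = boto3.client("ec2")

def random_instance_in_group(tag_value: str):
    """Pick one running instance carrying a hypothetical 'chaos-group' tag."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-group", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    return random.choice(ids) if ids else None

def unleash_monkey(tag_value: str) -> None:
    """Terminate one random instance so the system has to prove it recovers."""
    victim = random_instance_in_group(tag_value)
    if victim:
        print("terminating", victim)
        ec2.terminate_instances(InstanceIds=[victim])
```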

Creating a distributed, scalable, resilient system is extremely tough, particularly for stateful services.

Cloud providers like AWS, Google and Azure are all developing and launching offerings that can take on increasingly large parts of this hard work, particularly resilient statefulness (managed queues and databases). These services can seem costly, but building and maintaining complex distributed services is expensive too. Any framework that constrains you but handles some of this complexity (like Linkerd, Istio or Azure’s Service Fabric) is well worth considering.

The key takeaway: don’t underestimate how hard it is to build a properly resilient and highly scalable service. Decide if you really need it all yet, educate everyone thoroughly, introduce useful constraints, start simple, use tools and services wherever possible, do everything gradually, and expect setbacks as well as successes.
