The Eight Fallacies of Distributed Computing

More than 20 years on, the eight fallacies of distributed computing still trip up developers

Backend developer
Geek Culture
7 min read · Sep 29, 2021



More than 20 years ago, Peter Deutsch and James Gosling defined the eight fallacies of distributed computing: false assumptions that many developers make about distributed systems. These assumptions are usually proven wrong in the long run, leading to bugs that are hard to fix. So, let us look at each of the eight fallacies in detail.

Fallacy 1: The network is reliable

The network is never reliable. Packets get dropped, connections get interrupted, and data gets corrupted in transit. Network outages, router restarts, and switch failures make matters worse. And if hardware failures weren’t enough, the software can fail as well, and it does. The situation is even more complicated if you collaborate with an external partner, such as an e-commerce application working with an external credit-card processing service, because their side of the connection is not under your direct control. A robust distributed system has to be designed with this unreliability in mind. Let’s take a simple example:

var creditCardProcessor = new CreditCardPaymentService();
creditCardProcessor.Charge(chargeRequest);

What happens if we receive an HTTP timeout exception? If the server did not process the request, we can simply retry. But if it did process the request, we need to make sure we are not double-charging the customer. You can do this by making the server idempotent: if you call it 10 times with the same charge request, the customer is charged only once. If you’re not handling these errors properly, your system is nondeterministic. Handling all these cases can get quite complex really fast.
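Server-side idempotency is often implemented with a deduplication table keyed by a client-generated idempotency key. The class and field names below are hypothetical, a minimal in-memory sketch in Python rather than a real payment API:

```python
import uuid

class CreditCardPaymentService:
    """Toy payment service (hypothetical) that deduplicates charges
    by idempotency key, so client retries never double-charge."""

    def __init__(self):
        self._processed = {}  # idempotency_key -> stored charge result

    def charge(self, idempotency_key, amount):
        # If this key was already processed, return the stored result
        # instead of charging the card again.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        result = {"charged": amount, "status": "ok"}
        self._processed[idempotency_key] = result
        return result

# A client that hits a timeout can safely retry with the same key.
service = CreditCardPaymentService()
key = str(uuid.uuid4())
first = service.charge(key, 100)
retry = service.charge(key, 100)  # simulated retry after a timeout
assert first is retry             # the customer was charged exactly once
```

The key must be generated by the client before the first attempt, so that the retry carries the same key as the request that may or may not have gone through.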

What does that mean for your design? Do we need to drop our current technology stack and use a messaging system? Probably not! We need to weigh the risk of failure against the investment required to mitigate it. You can minimize the chance of failure by investing in infrastructure and software, and in many cases failure is an acceptable option. But you do need to consider failure when designing distributed systems. To sum up: the network is unreliable, and as a software architect or designer you need to address that.

Fallacy 2: Latency is zero

Calls over the internet are not instant. Latency is how long it takes for data to move from one place to another (versus bandwidth, which is how much data we can transfer in that time). Latency can be relatively good on a LAN, but it deteriorates quickly in WAN and internet scenarios. Network latency is real, and we should not assume that everything happens instantaneously. Moving data over long distances, for example across a transatlantic cable, adds latency that no amount of engineering can remove.

  • Bring Back All the Data You Might Need: When you make a remote call, bring back all the data you might need. Network communication should not be chatty.
  • Move the Data Closer to the Clients: Another possible solution is to move the data closer to the clients. If you’re using the cloud, choose the region and availability zone carefully, depending on your clients’ location. Caching can also help minimize the number of network calls. For static content, Content Delivery Networks (CDNs) are another good option.
  • Invert the Flow of Data: Another option for removing remote calls is to invert the flow of data. Instead of querying other services, we can use a Pub/Sub model and store the data locally. This way, we already have the data when we need it. Of course, this introduces some complexity, but it can be a good tool in the toolbox.
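Moving data closer to the client can be as simple as a local cache in front of the remote call. Below is a minimal time-based cache sketch in Python, where `remote_lookup` stands in for a slow network request (all names are hypothetical):

```python
import time

class TTLCache:
    """Minimal time-based cache (a sketch): serve local copies of
    remote data for `ttl` seconds instead of paying network latency
    on every call."""

    def __init__(self, fetch, ttl=60.0):
        self._fetch = fetch   # the (expensive) remote call
        self._ttl = ttl
        self._store = {}      # key -> (value, expiry timestamp)

    def get(self, key):
        value, expires = self._store.get(key, (None, 0.0))
        if time.monotonic() < expires:
            return value              # cache hit: no network round trip
        value = self._fetch(key)      # cache miss: one remote call
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

calls = []
def remote_lookup(key):
    calls.append(key)                 # stands in for a network request
    return key.upper()

cache = TTLCache(remote_lookup, ttl=60.0)
cache.get("user:42")
cache.get("user:42")                  # second call is served locally
assert len(calls) == 1                # only one remote round trip
```

Real caches add invalidation and size limits, which is where the extra complexity mentioned above comes from.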

Fallacy 3: Bandwidth is infinite

Bandwidth is not infinite: not the client machine’s, not the server’s, and not that of the wire over which they communicate. Although bandwidth has improved over time, the amount of data we send has increased too. VoIP, video streaming, and IPTV are some of the newer applications that take up bandwidth. Downloads, richer UIs, and reliance on verbose formats such as XML are also increasing the load.

There is a tension between the second fallacy (latency is zero) and the third (bandwidth is infinite). To minimize the number of network round trips, you should transfer more data per call. To minimize bandwidth usage, you should transfer less. You need to balance these two forces and find the right amount of data to send over the wire.
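This tradeoff can be made concrete with a back-of-the-envelope cost model: each round trip pays the latency once, while the payload as a whole pays the bandwidth cost. A sketch in Python with purely illustrative numbers:

```python
def transfer_seconds(payload_bytes, round_trips, latency_s, bandwidth_bps):
    """Rough cost model: every round trip pays the latency once,
    and the whole payload pays the bandwidth cost."""
    return round_trips * latency_s + (payload_bytes * 8) / bandwidth_bps

# 100 small chatty calls vs. one batched call, same 1 MB of total data,
# on a 50 ms / 100 Mbit/s link (illustrative numbers):
chatty = transfer_seconds(1_000_000, round_trips=100,
                          latency_s=0.05, bandwidth_bps=100e6)
batched = transfer_seconds(1_000_000, round_trips=1,
                           latency_s=0.05, bandwidth_bps=100e6)
assert chatty > batched  # latency dominates the chatty version
```

With these numbers the chatty version spends about 5 seconds on latency alone, while the batched version pays it once; if the batch were to grow much larger, the bandwidth term would start to dominate instead.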

Although you might not hit the bandwidth limit often, thinking about the data you transfer is important. Less data is easier to understand, and less data means less coupling. So transfer only the data you actually need.

Fallacy 4: The network is secure

The network is not secure, yet many systems are designed as if it were. Malicious users constantly try to sniff packets on the wire and decode what is being communicated. So encrypt your data in transit and mitigate the security risks as much as possible.

  • Defense in Depth: Use a layered approach to secure your system. You need different security checks at the network, infrastructure, and application levels.
  • Security Mindset: Keep security in mind when designing your system. The top-ten vulnerabilities list has not changed much in the last five years. Follow best practices for secure software design and review code for common security flaws. Regularly scan third-party libraries for new vulnerabilities; the list of Common Vulnerabilities and Exposures (CVE) can help.
  • Threat Modeling: Threat modeling is a systematic approach to identifying possible security threats in a system. First identify all the assets in your system (user data in the database, files, etc.) and how they are accessed. Then identify possible attacks and plan how to mitigate them.
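For the encryption-in-transit part, most platforms make "secure by default" easy. As a sketch, Python’s standard `ssl` module builds a client-side TLS context with certificate verification and hostname checking already enabled:

```python
import ssl

# Build a client-side TLS context. The defaults verify the server's
# certificate chain and check that it matches the hostname, so traffic
# is both encrypted and sent to an authenticated peer.
context = ssl.create_default_context()
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname is True

# To talk to a server you would wrap a TCP socket, e.g.:
# with socket.create_connection((host, 443)) as sock:
#     with context.wrap_socket(sock, server_hostname=host) as tls:
#         ...  # all reads/writes on `tls` are encrypted
```

Disabling either check (a common shortcut in development code) silently reintroduces the fallacy, so resist the temptation to ship it.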

The truth is that security is hard and expensive. There are a lot of components and links in a distributed system and each one of them is a possible target for malicious users.

Fallacy 5: Topology doesn’t change

Network topology changes all the time. Sometimes it changes for accidental reasons, such as when an app server goes down and we need to replace it. But most of the time it is deliberate, for example adding new processes on a new server. Nowadays, with cloud and containers on the rise, this is even more visible. Elastic scaling, the ability to add or remove servers depending on the workload, requires a certain degree of network agility.

When the topology changes, we might see a sudden deviation in latency and packet transfer times. So, these metrics need to be monitored for any anomalous behavior, and our systems should be ready to embrace this change.

  • Stop hardcoding IPs: prefer hostnames. By using URIs, we rely on DNS to resolve the hostname to an IP.
  • When DNS is not enough (e.g. when you need to map both an IP and a port), use a discovery service.
  • Service bus frameworks can also provide location transparency.
  • Any server can fail (thus changing the topology), so automate as much as you can.
  • Stop a service or shut down a server and see if the system keeps running. Tools like Netflix’s Chaos Monkey take this up a notch by randomly shutting down VMs or containers in the production environment. By bringing the pain forward, we are more motivated to build a resilient system that can handle topology changes.
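The first point can be illustrated with Python’s standard library: resolving a hostname at call time instead of baking an IP into the code means a topology change behind the name is picked up automatically. `resolve` is a hypothetical helper:

```python
import socket

def resolve(hostname, port=443):
    """Look the host up at call time, so a topology change (a new IP
    behind the same name) is picked up on the next call."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry's last element is a (address, port, ...) sockaddr tuple.
    return [info[4][0] for info in infos]

addresses = resolve("localhost")
# localhost typically resolves to a loopback address.
assert any(a in ("127.0.0.1", "::1") for a in addresses)
```

Discovery services such as Consul or etcd apply the same idea when you also need ports, health status, or metadata that DNS alone cannot express.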

Fallacy 6: There is one administrator

In the past, it was common to have a single person responsible for maintaining environments, installing and upgrading applications, and so on. However, that approach has changed with the shift to modern cloud architectures and DevOps practices.

Modern cloud-native applications are composed of many services, working together but developed by different teams. It’s practically impossible for a single person to know and understand the whole application, let alone try to fix all the issues.

Put governance in place that makes it easy to troubleshoot any issues that arise. Concepts such as release management, decoupling, logging, and monitoring apply here.

Fallacy 7: Transport cost is zero

This fallacy is related to the second one, that latency is zero. Transporting data over the network has a price, in both time and resources. The second fallacy covers the time aspect; fallacy #7 tackles resource consumption. There is always a hidden cost of hardware, software, and maintenance that we all bear when using a distributed system. For example, if you use a public cloud like AWS, data transfer costs are real. They look near zero from a bird’s-eye view, but become significant at scale. There is also a cost to object serialization and deserialization; both operations can be expensive in terms of performance.

Be mindful of the transport cost and of how much serialization and deserialization an application is doing. This doesn’t mean you should optimize prematurely: benchmark, monitor resource consumption, and then decide whether transport cost is a problem for you.
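A quick benchmark of that kind needs nothing beyond the standard library. The sketch below times JSON serialization and deserialization of a moderately sized payload and measures its on-the-wire size (absolute numbers will vary by machine):

```python
import json
import timeit

# A hypothetical payload: 1000 small records.
payload = {"items": [{"id": i, "name": f"item-{i}"} for i in range(1000)]}

# Time 100 encode and 100 decode passes.
encode_s = timeit.timeit(lambda: json.dumps(payload), number=100)
blob = json.dumps(payload)
decode_s = timeit.timeit(lambda: json.loads(blob), number=100)

# How many bytes would actually cross the wire per message.
wire_bytes = len(blob.encode("utf-8"))
```

Comparing `wire_bytes` against your cloud provider’s per-GB egress price, and `encode_s`/`decode_s` against your request budget, tells you whether transport cost is worth optimizing at all.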

Fallacy 8: The network is homogeneous

Networks are not homogeneous; they are heterogeneous. A homogeneous network is a network of computers with similar configurations using the same communication protocol. Keeping all machines configured identically is hard to achieve in practice, and you can’t assume that the network hardware stays the same. For example, you have little control over which mobile devices connect to your app. Choose standard formats to avoid vendor lock-in; this might mean XML, JSON, or Protocol Buffers. There are plenty of options to choose from. The key point is to rely on standard protocols so that components can communicate regardless of the hardware.
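As a trivial sketch of a standard format at work, a JSON payload produced by one component can be consumed by any other component that speaks JSON, whatever its hardware or runtime (the field names here are made up for illustration):

```python
import json

# A payload produced by one component...
order = {"order_id": 42, "currency": "EUR", "total": "19.99"}
wire = json.dumps(order, sort_keys=True)

# ...can be parsed by any other component that understands JSON,
# regardless of its hardware, OS, or programming language.
decoded = json.loads(wire)
assert decoded == order
```

The same round trip works between a Python service and, say, a Java or Go one, which is exactly the point of choosing a standard format.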

In this post, we learned about the fallacies of distributed systems and how to avoid them. We must accept these as facts: the network is unreliable, insecure, and costs money; bandwidth is limited; the network’s topology will change; and its components are not configured the same way. Being aware of these limitations will help us design more robust and secure distributed systems. Want to learn more about distributed systems? Here’s a link to another article which I wrote about Distributed Systems. Have a look!
