8 Fallacies of Distributed Computing

Mahantesh Ambali
6 min read · Jul 16, 2024


Photo by Kirill Sh on Unsplash

Introduction

There was a time when a single server could comfortably serve a handful of users running simple to moderately complex applications. Then the inevitable happened: internet adoption exploded, people interacted with online services more and more, and this traffic could no longer be served by a single server.

Fast forward to 2024, and every business wants to cater to a user base spread across the globe, whether it is a social media app, e-commerce, an audio listening service, a video streaming service, and so on. These applications require a complex fleet of machines, possibly halfway across the world, interconnected by high-speed network cables running along the sea bed.

In this article, we will take a look at the 8 Fallacies of Distributed Computing, a set of assertions commonly attributed to L Peter Deutsch and his colleagues at Sun Microsystems. These are assumptions that developers tend to make about distributed systems. By understanding these fallacies, we can design and build more robust distributed systems.

Overview of Distributed Computing

Imagine a complex task that requires more processing power than a single machine can handle. Examples of such tasks include serving millions of users a homepage, placing orders on a grocery shopping app, or tracking your cab on a ride-sharing app. This is where distributed computing comes in.

We break a complex task into more manageable chunks and distribute them across multiple machines in a network. These machines will, in turn, communicate over the network to achieve a common goal. This helps us achieve scalability by adding more machines during high load and reducing machines otherwise. We can also achieve fault tolerance and parallel processing.
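The divide-and-combine idea above can be sketched in a few lines of Python. This is a toy illustration, not real distributed computing: the workers here are threads on one machine, and the word-counting task, chunk size, and worker count are all invented for the example. The shape of the solution (split, fan out, merge partial results) is what carries over to real clusters.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # Each worker handles one manageable piece of the larger task.
    return sum(len(sentence.split()) for sentence in chunk)

# A "big" job: 1000 sentences of 9 words each.
sentences = ["the quick brown fox jumps over the lazy dog"] * 1000

# Split the job into chunks, one per worker.
chunks = [sentences[i:i + 250] for i in range(0, len(sentences), 250)]

# Fan the chunks out to a pool of workers, then merge the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_words, chunks))

total = sum(partials)  # 4 partial counts combined into one answer
```

In a real system the "workers" would be separate machines and the fan-out would happen over the network, which is exactly where the eight fallacies below start to bite.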

The Eight Fallacies

1. The Network is Reliable
Explanation: As developers, we design our architecture and write our code with the assumption that the network is always available, which is never the case. There will be hardware issues (router restarts, switch failures, power failures, etc.) and software issues (network topology changes, packet drops, request blocking, etc.). These failures become much harder to diagnose when requests pass outside our company perimeter to a third-party service (e.g., credit card validation, a fraud detection service). There can be timeout exceptions, a server crash before processing the request, or a server crash after processing it. These factors need to be considered while designing the software.
Solution: Whenever a network call is made from a service (be it a DB call or a third-party service call), we need to have Connect Timeout and Read Timeout configurations enabled. We need Retry Logic, ideally with backoff, for reprocessing failed requests while maintaining the idempotency of the request during retries.
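The timeout-plus-retry advice can be sketched as follows. This is a minimal illustration, not a production client: `flaky_charge` and its failure pattern are a simulated stand-in for a real network call, and the idempotency key is generated once and reused across attempts so the server can safely deduplicate replays.

```python
import time
import uuid

def call_with_retries(fn, *, retries=3, backoff_s=0.1):
    """Retry a network call with exponential backoff.

    The same idempotency key is sent on every attempt, so if the
    server actually processed a request whose response we lost to a
    timeout, the retry is recognized as a duplicate, not re-executed.
    """
    idempotency_key = str(uuid.uuid4())
    last_exc = None
    for attempt in range(retries):
        try:
            return fn(idempotency_key=idempotency_key)
        except TimeoutError as exc:  # connect/read timeout surfaces here
            last_exc = exc
            time.sleep(backoff_s * (2 ** attempt))  # back off, then retry
    raise last_exc

# Simulated flaky service: times out twice, then succeeds.
attempts = {"n": 0}
def flaky_charge(idempotency_key):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("read timed out")
    return {"status": "charged", "key": idempotency_key}

result = call_with_retries(flaky_charge)
```

With a real HTTP client the timeouts themselves are set per call; for example, the popular `requests` library accepts `timeout=(connect_s, read_s)` as a tuple.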

2. Latency is Zero
Explanation: Latency is defined as the delay between sending a request and receiving a response back. In the real world, latency can never be zero because there is network latency, processing latency, queuing latency, database latency, etc. Communication over the network is not instantaneous.
Consequence: Ignoring latency can lead to poor user experience, lost revenue, and reputational damage. Imagine a website whose homepage loads slowly, a stockbroking app taking too long to load tickers, or cab tracking failing to update on a ride-sharing app.
Mitigation: Reduce latency by caching static data as much as possible, which saves the DB round trip. Move the data closer to the client using CDNs, choose a region or availability zone nearer to the user if utilizing a cloud service, and retrieve only the required information from a service (e.g., using GraphQL).
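A read-through cache with a time-to-live is one concrete way to apply the caching advice above. This is a simplified in-memory sketch; the `TTLCache` class and `load_from_db` loader are invented for illustration, and a production system would more likely reach for Redis or Memcached.

```python
import time

class TTLCache:
    """Tiny read-through cache: serve repeated reads from memory
    instead of paying the DB/network round trip every time."""

    def __init__(self, ttl_s=60.0):
        self.ttl_s = ttl_s
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        value, expires_at = self._store.get(key, (None, 0.0))
        if time.monotonic() < expires_at:
            return value                 # cache hit: no round trip
        value = loader(key)              # cache miss: take the slow path
        self._store[key] = (value, time.monotonic() + self.ttl_s)
        return value

db_calls = {"n": 0}
def load_from_db(key):
    db_calls["n"] += 1  # stands in for a slow database query
    return f"profile:{key}"

cache = TTLCache(ttl_s=60.0)
first = cache.get("user42", load_from_db)
second = cache.get("user42", load_from_db)  # served from memory
```

The TTL is the trade-off knob: longer TTLs save more round trips but serve staler data.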

3. Bandwidth is Infinite
Explanation: We always assume that we have unlimited bandwidth available for transmitting data over the network, but in reality, it’s never the case. Multiple applications compete for transferring information over the network. Even though we have increased bandwidth due to advancements in technology, we also have equally high-resolution videos, photos, and richer UIs being passed over the network.
Consequence: In day-to-day scenarios, a typical transfer between applications might not hit the ceiling of bandwidth limits, but this needs to be kept in mind while transferring data: send only the fields a consumer actually needs. For cases where payloads cannot be trimmed field by field, such as a video streaming platform, we might need to think of compression techniques and caching data in CDNs.
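To make the compression point concrete, here is a small sketch using Python's standard `gzip` module on a repetitive JSON payload. The payload shape is invented for the example; real savings depend entirely on how compressible the data is (already-compressed media gains little).

```python
import gzip
import json

# An invented, repetitive API payload: 500 similar product records.
payload = json.dumps({
    "items": [{"id": i, "name": f"product-{i}", "in_stock": True}
              for i in range(500)]
}).encode("utf-8")

# Compress before putting the bytes on the wire.
compressed = gzip.compress(payload)

raw_size = len(payload)
wire_size = len(compressed)
# The receiver reverses the step losslessly.
restored = gzip.decompress(compressed)
```

Over HTTP this is usually not hand-rolled: the client advertises `Accept-Encoding: gzip` and the server or reverse proxy compresses transparently.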

4. The Network is Secure
Explanation: Designing a software system with the assumption that the network is secure can become a threat to the company. A company might suffer a data breach if sensitive data (e.g., customer information, user IDs and passwords, email addresses) is transmitted over an unencrypted network. This can expose the company to compliance violations, cyber-attacks, and ultimately ransomware.
Mitigation: Implement robust authentication and authorization for users. Use encryption for data in transit. For services dependent on third-party libraries, scan for possible security vulnerabilities. Implement SAST and DAST scans for applications. Regularly conduct penetration tests for applications. Use HTTPS for communication in and out of the applications.
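On the encryption-in-transit point, Python's standard `ssl` module illustrates the client-side TLS defaults that should never be switched off. This sketch only builds the context; it makes no network connection.

```python
import ssl

# ssl.create_default_context() gives safe client defaults:
# certificate verification on and hostname checking on.
context = ssl.create_default_context()

# These are the two knobs people disable "to make the error go away";
# doing so silently permits man-in-the-middle attacks.
assert context.check_hostname is True
assert context.verify_mode == ssl.CERT_REQUIRED

# Refuse legacy protocol versions explicitly; TLS 1.2 is a
# sensible floor today.
context.minimum_version = ssl.TLSVersion.TLSv1_2
```

Higher-level HTTP clients build on the same machinery, which is why options like `verify=False` should be treated as a red flag in code review rather than a fix.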

5. Topology Doesn’t Change
Explanation: Network topology describes the physical and logical structure of the network. It maps how different nodes (switches, routers) are placed and interconnected, as well as how data flows. With the rise of microservices and containerized architectures, where application servers are auto-scaled based on load, network topologies keep changing.
Consequence: When there is a change in network topology, there might be a sudden rise in latency due to increased packet transfer times. These metrics need to be monitored and necessary changes made, such as:
a. Not hard coding IP addresses, instead relying on hostnames and letting DNS resolve the IP address of the host.
b. Using service discovery if IP: port is required, as DNS might not be enough.
c. Automating failed node startups as much as possible. By stopping a service or server while under load test, we can observe the behavior of the network during spikes. Netflix’s Chaos Monkey takes this to another level.
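Point (a) above can be sketched with the standard `socket` module: resolve the hostname at call time rather than baking an IP address into configuration, so a replaced node or reassigned IP is picked up on the next lookup. The `resolve` helper is invented for illustration, and `localhost` is used so the lookup needs no external DNS.

```python
import socket

def resolve(hostname, port):
    """Resolve a hostname at call time instead of hard-coding an IP.

    If the topology changes (node replaced, IP reassigned), the next
    lookup returns the current address with no config change on our side.
    """
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    return [info[4][0] for info in infos]

addresses = resolve("localhost", 8080)
```

Point (b) scales the same idea up: when DNS alone is not enough (e.g., you need `IP:port` pairs that change per deployment), a service discovery system such as Consul or Eureka plays the resolver's role.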

6. There is One Administrator
Explanation: This was true in the past when a single person was responsible for maintaining different environments, upgrading, patching, and so on to keep systems up to date. This is no longer the case with companies adopting cloud, multi-cloud, and hybrid (on-prem + cloud) approaches. It’s practically impossible for a single person to administer these components.
Mitigation: Implement centralized monitoring and management tooling so that potential security vulnerabilities are surfaced early and work can be coordinated across the different administrators. Changes need to be tested, reviewed, and approved by multiple administrators for smooth and effective implementation.

7. Transport Cost is Zero
Explanation: The second fallacy (latency is zero) discussed the time aspect of sending information over the network. This fallacy talks about the resource aspect of sending data. At a high level, we might feel that there is no real cost associated with transferring data over the network (at least for on-prem), which is not true. There are hardware costs (network cables, routers, switches), software costs (for connecting the nodes efficiently), power consumption costs, personnel maintaining the devices, upgrading the devices, etc.
Mitigation: Be mindful of transport costs and how much an application is serializing and deserializing. Periodic analysis of network resource consumption can reveal potential savings. Reduce unnecessary data transfers over the network.
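One quick way to act on this is to measure what a payload actually costs on the wire. This sketch (the order shape and field names are invented for the example) compares a fully serialized object against one trimmed to only the fields a particular consumer needs:

```python
import json

# An invented order record with fields many consumers never read.
order = {
    "id": 1234,
    "customer": {"id": 9, "name": "A. Shopper", "email": "a@example.com",
                 "address": "1 High St", "loyalty_tier": "gold"},
    "items": [{"sku": f"SKU-{i:03d}", "qty": 1, "price_cents": 999,
               "description": "x" * 80} for i in range(20)],
}

# Serializing everything vs. only what, say, a shipping service needs.
full = json.dumps(order).encode("utf-8")
trimmed = json.dumps({
    "id": order["id"],
    "items": [{"sku": it["sku"], "qty": it["qty"]} for it in order["items"]],
}).encode("utf-8")

savings = 1 - len(trimmed) / len(full)  # fraction of bytes not sent
```

Multiplied by request volume, per-payload savings like this translate directly into bandwidth, CPU (serialization/deserialization), and cloud egress costs.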

8. The Network is Homogeneous
Explanation: A homogeneous network is a network of machines with similar configurations using the same communication protocol. In reality, finding machines with similar configurations is hard. Additionally, applications with different functional needs cannot all stick to the same communication protocol.
Example: A company network might host a chat application that relies on WebSockets for effective two-way communication between users, alongside a video streaming application where MPEG-DASH/HLS is preferred. Serving these varying use cases is not possible on a strictly homogeneous network.

Conclusion

These 8 fallacies highlight common misconceptions that, if ignored, can significantly impact the design, performance, and security of distributed systems. It is crucial for developers, architects, and administrators to understand and address these fallacies to build reliable, scalable, and efficient systems.

The idea is to mitigate these fallacies as much as possible by designing applications around redundancy, caching, encryption, and dynamic routing, along with strategic planning and continuous monitoring. By doing so, we can leverage the full potential of distributed computing to build robust and scalable applications that meet the demands of today’s digital landscape.

References: Fallacies of distributed computing — Wikipedia

If you have gotten this far, please give a clap and consider following.


Mahantesh Ambali

I am working as a Principal Engineer at Landmark Group (Lifestyle).