Foraging for the Fallacies of Distributed Computing (Part 1)
So much of computing is based on assumptions. We design systems operating on a set of assumptions. We write programs and applications assuming certain parts of their systems will work a certain way. And we also assume that some things can potentially go wrong, and we (hopefully) attempt to account for them.
One big issue with building computer-y things is that, even though we’re often dealing with complex systems, we aren’t always capable of reasoning about on a big-picture level. Distributed systems are certainly a great example of this (you knew where I was going with this, didn’t you?). Even a “simple” distributed system isn’t so simple, because by definition it involves more than one node, and the nodes in the system have to communicate and talk to one another through a network. Even in a small distributed systems with just two nodes, we make certain assumptions about how that system is going to work.
Here’s the thing about making assumptions, though: they can be wrong! As it turns out, there are some common, incorrect assumptions that we often make about distributed computing. In fact they are so ubiquitous that they have a name! I’m talking, of course, about the eight fallacies of distributed computing. And what are they? Well, it’s time to find out.
Assumptions gone wrong
The eight fallacies as we know them today actually started off as just four fallacies, which are the ones that we’ll cover in this post. The first four fallacies were created in the early 90’s (sometime after 1991) by two Sun Microsystems engineers, Bill Joy and Dave Lyon.
James Gosling, who was a fellow at Sun Microsystems at the time — and would later go on to create the Java programming language — classified these first four fallacies as “The Fallacies of Networked Computing”. A bit later that decade, Peter Deutsch, another Fellow at Sun, added fallacies five, six, and seven. In 1997, Gosling added the eighth and final fallacy. So much of the work done to codify this list was inspired by work that was actually happening at Sun Microsystems at the time. And while Deutsch is the one who is often credited with having “created” the eight fallacies, clearly it was a combined, group effort (as is so often the case in computing!).
Alright, so now that we’ve got some history under our belt, let’s get into some of the practical aspects of this problem. Let’s first start with understanding what we’re talking about when we say a “fallacy” of distributed computing. What does that mean, exactly?
Well, a fallacy is a kind of assumed fact or belief; however, even though it is assumed to be true, that is usually not the case in practice. A fallacy is just a belief that is a misconception. But what does that mean in the context of computing? Well, a fallacy made about a distributed system is an assumption about the system made by the developers, which often ends up being inaccurate and wrong in the long run. In other words, they are bad assumptions to make.
We can think of the eight fallacies as eight common misconceptions that developers of distributed systems often fall prey to, which we want to avoid.
So, let’s take a look at these eight fallacies, shall we?
The eight fallacies of distributed computing are as follows:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogenous.
In this post, we’ll focus just on the first four fallacies, which we now know were the original “Fallacies of Networked Computing”! Conveniently, these first four fallacies are actually all anchored around one central, key character: the network. And, as we’re about to see, the network is a tricky little fellow. Of course, it’s kind of difficult to have a distributed system without a network so we have to get used to dealing with this tricky character.
Let’s learn about what makes the network hard to contend with, and why some of these fallacies are easy to assume.
The trouble with networks
Our distributed systems, no matter how complex or involved they are, all exist within a network. The nodes of our system have to talk to one another and send and receive messages to communicate and deliver data. Because the system is distributed, parts of it will live in different places, and the network is the anchoring point by which the components of the system will communicate.
This brings us to fallacy one: The network is reliable. Remember, these are fallacies that we’re dealing with, so the network being reliable is a true misconception; the network is not reliable!
So many things within a network can go wrong. Hardware can fail, the power can go out, the signal could get spotty, or the network could even become compromised! It might seem obvious that a network isn’t reliable, but as designers of a distributed system, we can sometimes forget this when we are writing code that is going to run on a distributed system. We write that code perhaps operating under this assumption, but when the network inevitably becomes unreliable at some point down the road, we realize that we made a bad assumption and it ends up affecting our system adversely. Thus, it becomes our job to keep this fallacy in mind and not forget that the network cannot be relied upon.
Now, before we get too much further into networks — not to mention the fallacies associated with them — there are some important terms for us to learn. When we talk about messages being sent through a network, we’re usually referring to data being sent across the network. But we can talk about data and the way it is sent in two different ways: latency and bandwidth.
Latency is the measure of how delayed some data is to arrive, based on wherever it was sent. We can think of latency as the speed at which our data travels from one place to another, or how long it takes for the data to arrive from node A to node B in our system. Latency is a concept that many developers may already be quite familiar with, because we often refer to the latency of data in the context of milliseconds. For example, if it took a request 400 milliseconds to be served from a server (node A) to a client (node B), we can say that the latency of that data being sent was delayed by 400 milliseconds, or that it took 0.4 seconds for the message travel from the server to the client. Many performance-minded developers are often thinking of latency when they consider what to optimize or improve in their applications or systems.
Bandwidth, on the other hand, is a measure of how much data can be sent from one point to another in a certain period of time. We often talk about bandwidth in the context of internet speed (for example, megabits per second or Mbps) because we’re referring to how much data can be sent over our specific network, or the capacity of our network to send some amount of data in a given span of time.
While latency is the actual speed at which data gets to its destination via the network, bandwidth is the capacity of network to send a certain amount of data to its destination. Understanding the difference between these two concepts is important to wrapping our heads around the second and third fallacies, so let’s learn about those two next!
Some fallacies hurt more than others
Now that we know what “latency” actually means, we are better equipped to understand fallacy two: Latency is zero. Unfortunately, this is also a misconception.
This is perhaps an easy fallacy to fall prey to since, in our local environments or in the development environment, we may experience a much lower latency; in other words, we likely experience a short amount of delay time when it comes to sending or receiving data. However, that low local latency doesn’t accurately reflect the delay of what other parts of the system might feel — especially given different networks.
We’ve already learned about local-area networks (LAN) and wide-area networks (WAN). We know that in a WAN, data has to travel further from one node to another, since the network may span large geographical distances. This is in contrast to a LAN, which is a network with devices in the same building or room, and thus the data being sent around has much less of a distance to travel.
We cannot assume that our system will always operate on a local-area network (in fact, most distributed systems operate on a WAN), and thus we know that our data has a physical distance to travel. By proxy, we should not assume that there will be no “delay” or zero latency between some data being sent and that data being received. For example, a request that has to travel across the world from Portland to Paris will have more latency than a request traveling from a machine in one room to another.
Some latency is a limitation of networks; we should be wise to not build a system that assumes that such a delay is nonexistent. We will certainly be bitten by such a false assumption in the future.
The third fallacy is similar to this false assumption as well: Bandwidth is infinite. Just as latency is not zero, the capacity of a network to send data is not unlimited. As computing technology has improved over the years — and certainly since the early 90’s when the Fallacies were coined — the bandwidth of our networks keeps getting better and better, and we’re able to send more data across our data in a given amount of time.
However, even with these improvements, networks still don’t have unlimited capacity and cannot send infinitely large amounts of data. When we design a distributed system, it would behoove us to not assume that any amount of data can be sent across the network. For example, we may have some user of our system that has a network with a limited bandwidth and a lower capacity; for that user’s experience, sending a large amount of data (such as large images or vendored, dependent files) would mean that we’d be sending far more data than their network was capable of sending over the network in a reasonable amount of time.
Finally, we come to fallacy four, which is personally just my favorite one: The network is secure. There are just simple so many ways that a network can be attacked or compromised, from unencrypted messages to vulnerabilities in dependencies to open source code, or even just bugs in third-party softwares that our system might depend on (not to mention bugs in our own systems, too!).
The misconception that the network is “secure” is a hard one to overcome, since, if we really start to think about it, no network is really secure. Indeed, the only way to truly protect our networks and be 100% secure is to just not connect to the network! Of course, because we are dealing with distributed systems…this isn’t really easy to do, since we know that a network is required for communication between nodes in our system.
So, what are we to do? Well, we can try to remember that true security is a fallacy, and that we can just do what is in our power to try to prevent breaches and attacks when we design, build, and test our systems. Hopefully, at least that’ll keep the security issues at bay for awhile.
In part two of this post, we’ll look at the remaining four fallacies. I promise that the sad little blue network blob will make a return appearance! We’re not done with it just yet.
The fallacies of distributed computing have been fairly well-written about, and you can find some good resources on different interpretations of them, for different technologies. If you’re curious to learn more, check out the resources below!
- Fallacies of Distributed Computing Explained, Arnon Rotem-Gal-Oz
- Understanding the 8 fallacies of Distributed Systems, Victor Chircu
- Debunking the 8 Fallacies of Distributed Systems, Ramil Alfonso
- Fallacies of Distributed Systems, Udi Dahan
- The Fallacies of Distributed Computing Reborn: The Cloud Era, Brian Doll
- Deutsch’s Fallacies, 10 Years After, Ingrid Van Den Hoogen