When most people talk about distributed systems, they can make certain assumptions about how the system functions. For example, we can usually (and safely) assume that a distributed system, no matter the intricacies of how it has been set up or architected, can be used by its users.
This might seem super obvious at first glance, but the deeper that we get into distributed systems, the more we start to come across situations where our system actually isn’t usable by our end users. And this makes a system a little shaky, uncertain, and unreliable. Of course, none of us want that; that just sounds bad. When we design and build a system, we hope (and maybe pray!) that it works and can stand up on its own two feet, so to speak. In the world of distributed systems, the reliability of a system and how self-sufficient it happens to be are closely tied to how it has been built and what situations it is able to handle.
A reliable system is one that can withstand obstacles that come in front of it, which is what all of us strive towards. But there’s a whole language behind reliability and the architecture of a system that we haven’t quite dug into yet, and these terms and concepts come up a lot in this field. So let’s dive into what makes for reliable systems, and how we can talk about them!
Ready, set, available to use
Fundamentally, a reliable distributed system is one that behaves in certain ways. Depending on the kind of system that we’re building and what we expect it to do, “reliability” could be defined in different ways (not to mention the fact that some systems might put more emphasis on certain aspects of reliability as compared to others).
But no matter the idiosyncrasies of what our system is going to do, or how it is built, one thing is for sure: we want our users to be able to access it! In the world of computing, if a system is able to be accessed by its end users (and behaves as expected when it is accessed), it is said to be available.
How often a system behaves correctly from a user’s perspective is what determines the system’s availability. Oftentimes, this might just be as simple as the user being able to access the system, or being able to retrieve some data and interact with it; ultimately, the availability of a system depends entirely on what exactly the system does. But, as a basic baseline, we can assume that a user has to be able to interact with the system on a fundamental level (for example, they should be able to access their account or a webpage or app) in order for the system to be considered “available”.
Now, as designers and creators of a system, it would make sense for us to take availability into serious consideration when building and maintaining a distributed system. Clearly, we’d want our users to be able to access our system, and most likely, we’d be the first to know if our (perhaps angry?) users weren’t able to access the system as they’d expect to.
Thankfully, there’s a metric that we might have already run into that helps us evaluate and quantify how well we’re doing when it comes to meeting our end users’ expectations and needs! Uptime is a way for us to express our system’s availability in a quantifiable way.
We can calculate the uptime of a system based on the percentage of total time in a year that the system was available to its users. For example, if a system had an uptime of 99% — also sometimes referred to as “two nines” — through the year, we can deduce that it was not available for 1% of the year, which is equivalent to 3.65 days (we know this because 365 multiplied by 0.01 is equal to 3.65 days). In this example, the quantifiable 1% of the year that the system was not available for users to access is referred to as downtime, or the opposite of uptime.
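To make that arithmetic concrete, here is a minimal sketch of the same calculation in Python. The function name `downtime_days` and the constant `DAYS_PER_YEAR` are made up for this illustration; the only thing taken from the text is the formula itself (days in a year multiplied by the fraction of time the system was unavailable).

```python
DAYS_PER_YEAR = 365  # the same non-leap year used in the example above


def downtime_days(uptime_percent: float) -> float:
    """Return how many days per year a system was unavailable.

    `uptime_percent` is the uptime as a percentage, e.g. 99 for "two nines".
    This helper is hypothetical and exists only to illustrate the arithmetic.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # Downtime is whatever share of the year is left over after uptime.
    return DAYS_PER_YEAR * (100 - uptime_percent) / 100


# Compare a few uptime levels; 99% works out to the 3.65 days from the text.
for uptime in (99, 99.9, 99.99):
    print(f"{uptime}% uptime -> {downtime_days(uptime):.4f} days of downtime")
```

Running this for a few different percentages shows how quickly the allowed downtime shrinks as we add more “nines” to our uptime.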
Downtime is an inevitable part of distributed systems, although we of course want to avoid it! An important part of trying to avoid downtime for our users is understanding what causes it, and then planning around it. So, let’s take a closer look at what factors stand in our way when it comes to achieving availability!
Resisting and tolerating faults
When it comes to creating distributed systems, we’re all aiming for systems that are always up and running, and accessible to our users. But full availability is hard to achieve, mostly because of some annoying but ever-present factors.
First off, sometimes the things that prevent us from 100% uptime and complete availability are…our own creations! More specifically, we may find ourselves taking a portion of our system offline in order to perform some maintenance. When we are the ones who are performing the maintenance, we almost always have control over when a piece of our system needs to become unavailable; this is known as scheduled maintenance.
However, as the internet (and its tooling!) becomes more and more distributed, many systems actually rely on other systems for some portion of their service; when third-party services and external services go offline for maintenance, we might not have any say over it (which makes it unscheduled maintenance), but it will certainly impact our uptime. But maintenance isn’t even the scariest factor standing in our way since, in most situations, we know about the maintenance and potential downtime in advance (and perhaps we are the very ones who are responsible for it!).
Instead, I’d say that it’s the network outages, or failures within the larger distributed system network, that are a bit more intimidating. When things go wrong within the network, we unfortunately have very little control over it, and very little notice about it! Oftentimes, this can be a hardware issue, which may even be beyond our ability to fix. For example, if a data center that houses a server running a process that we depend on happens to lose electricity…well, there’s not all that much that we can do to prevent that from happening.
Another cascading factor tied to network outages is the fact that even when the outage happens and (eventually) resolves itself, there may still be some time after the outage where the system is still restarting, and therefore isn’t fully available again. For example, if a process is killed or terminated because of a network outage, it may need to reboot itself or restart in some way once the network eventually does come back, and that is some additional time that will also end up adding to our downtime!
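Here is a rough sketch of how that extra recovery time adds up. Everything in it is an assumption made for illustration: the `Incident` class, the `uptime_percent` function, and the incident durations are invented, not measurements from a real system. The point is simply that each incident costs us the outage itself plus however long the system takes to restart afterwards.

```python
from dataclasses import dataclass
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # minutes in a non-leap year


@dataclass
class Incident:
    """One hypothetical outage, split into its two downtime components."""

    outage_minutes: float    # time the network or hardware was actually down
    recovery_minutes: float  # extra time spent rebooting or restarting afterwards

    @property
    def downtime_minutes(self) -> float:
        # Users are affected for the whole span, not just the outage itself.
        return self.outage_minutes + self.recovery_minutes


def uptime_percent(incidents: Iterable[Incident]) -> float:
    """Yearly uptime as a percentage, given every incident in that year."""
    total_downtime = sum(incident.downtime_minutes for incident in incidents)
    return 100 * (1 - total_downtime / MINUTES_PER_YEAR)


# Made-up numbers: two outages, each followed by some restart time.
incidents = [Incident(45, 15), Incident(120, 30)]
print(f"total downtime: {sum(i.downtime_minutes for i in incidents)} minutes")
print(f"uptime for the year: {uptime_percent(incidents):.3f}%")
```

If we only counted the outages themselves, we would be underestimating our downtime by the time spent recovering from each one.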
Clearly, there are many factors that could add to our downtime, and prevent us from achieving high availability. Sure, things could fail in the internals of the system from a software perspective. But things could also go wrong from a hardware perspective, too!
So, what can we do? Well, this is where fault-tolerant systems start to sound really good. A fault-tolerant system is one that is able to handle and account for failures from within our system; even when there is a hardware-related outage and something stops functioning, the system can continue to function as expected! For the purposes of this post, let’s think of faults as hardware-related failures (later in this series, we’ll talk more about different types of faults and failures, including ones related to software).
If we stop and really think about it, we might realize that this eliminates two major factors that were preventing us from achieving full availability earlier: no more hardware (network) outages, and no more time spent trying to recover from them. In other words, no downtime caused by hardware failures!
If this sounds too good to be true…well, that’s because sometimes it really is.
Striking an available, kind-of-tolerant balance
Fault tolerance is hard to achieve, and it usually comes at great cost. There are actually some modern-day services which provide hardware that is fully fault-tolerant. However, the internal mechanics of this can be pretty complex, and usually involve detecting whenever some piece of hardware has failed, and immediately swapping in a backup/replacement piece of hardware that is already installed and ready to hop into the failed portion’s place when it actually does fail. If that sounds exhausting to you, don’t worry; it sounds like a lot to do to me, too!
The truth is, creating a fault-tolerant system that is always available and has zero downtime can not just be expensive to implement, but also sometimes might not be worth it. Downtime itself doesn’t inherently have to be a bad thing, especially if our downtime is minimal compared to our uptime.
Instead of trying to be completely available, we can try to build a mostly available system, which is what is often referred to as a highly available system. We can aim to minimize our downtime as much as possible, without exhausting ourselves to the point of implementing a system that guarantees us zero downtime. Most systems don’t actually need “zero” downtime, especially if the amount of downtime is fairly small and the extra benefit of a fully fault-tolerant system isn’t worth its added cost over a highly available one.
When it comes to designing a highly available system with minimal downtime, the keystone that keeps it all together is the way that our system recovers from failures. When we design a system, we can aim for it to be as close to fault-tolerant as possible; if we know that something could fail, we can plan for what will happen within our system if that something does fail.
This might mean having a plan for one part of the system to step in if another part fails, or accounting for how the failed portion will fix itself and rejoin the rest of the system. If we design towards a fault-tolerant system, we’ll minimize the potential downtime and hopefully end up with a highly available system.
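As a minimal sketch of “one part of the system stepping in if another part fails”, here is a small failover helper in Python. The names `call_with_failover`, `AllReplicasFailed`, `primary`, and `backup` are all hypothetical, and using `ConnectionError` to stand in for a hardware or network fault is an assumption made for this example. Real systems usually detect failures with health checks or timeouts; this only shows the basic shape of the idea.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica we planned for has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response.

    A ConnectionError stands in for a fault we planned for. Any other
    exception is an unknown failure and is deliberately not handled here.
    """
    failures = []
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as error:
            # This replica is down; remember why and let the next one step in.
            failures.append(error)
    raise AllReplicasFailed(f"{len(failures)} replica(s) failed")


def primary() -> str:
    # Hypothetical primary that has lost its network connection.
    raise ConnectionError("primary is unreachable")


def backup() -> str:
    # Hypothetical backup that is healthy and ready to take over.
    return "response from backup"


print(call_with_failover([primary, backup]))  # prints "response from backup"
```

Notice that the helper only recovers from the one kind of failure it was written to expect; any other exception would still bubble up and take the request down with it.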
Of course, there’s one caveat to mention here: we can only plan to recover from faults that we know about in our system. If we don’t know that something can fail, then of course we can’t consider a way to recover from it. Unknown failures, and the recoveries that we haven’t accounted for as a result, add to the factors that can contribute to our downtime.
But hey, a little imperfection is okay! We’ll (eventually) figure out a way to recover from it. 😉
Because the concepts of availability and fault-tolerance are so fundamental to distributed systems, these terms come up a lot in academic writing as well as introductory material. Luckily, that just means that there’s a wealth of knowledge on the topic! Below are some great initial places to start if you’re looking for more resources to keep reading.
- Introduction to Distributed Systems, Google
- A Thorough Introduction to Distributed Systems, Stanislav Kozlovski
- Distributed Systems: Fault Tolerance, Professor Jussi Kangasharju
- Distributed Systems for Fun and Profit, Mikito Takada
- What is High Availability?, Erika Heidi
- High availability versus fault tolerance, IBM