Scalability Problems: Hidden Challenges of Growing a System
So far, this series has focused on the techniques and concepts behind distributing a system. We’ve also explored how to go about keeping a system functioning efficiently after we’ve distributed it and we’ve seen the different dimensions by which a system can grow.
But once we’ve understood why a system even needs to scale and once we’ve determined the ways in which our system should scale, how do we go about actually doing that? As it turns out, just as there are different dimensions by which a system can scale, there are different methods of actually going about scaling, too. But in the process of thinking about how we’ll go about scaling a system, it’s likely that some unexpected, hidden problems are going to peek their little heads out from under the rug (so typical, right?).
The trouble with scaling any system is that, once we actually start down the path of growing it, we inevitably run into some hidden complexities. Since we’re already familiar with the situations in which scalability can help us out, it’s time for us to dive deeper into what we might have to think about and deal with when we actually start scaling. Along the way, we’ll discover that scaling is a complicated process, with some bumps in the road! Eventually, in a later post, we’ll think about some solutions to these problems; but first, let’s see what we’re dealing with.
Problems with scaling
In a previous post, we learned that there are three main ways to measure the scalability of a system, which we refer to as the three dimensions of scalability: size scalability, geographical scalability, and administrative scalability. As a quick refresher, these three dimensions effectively ensure that, as the number of users and resources grows, as the physical distance between resources grows, and as the administrative overhead of a system changes in size, the system itself should not slow down or become less performant as a result.
However, what we haven’t yet dived into are some of the problems that these dimensions can present when trying to scale a system. Every system can have its own set of limitations that we have to deal with when we try to scale it, but for today, we’ll talk about two big ones: centralization and synchronous communication.
If we think back to what we know about size scalable systems, we’ll recall that, as the number of users or resources increases, our system needs to be able to handle that influx. But if our system is centralized, or located in/controlled from a single, central location, then it presents some limitations when we try to make it size scalable.
Similarly, if our system is one that uses synchronous communication, we’ll run into some issues in trying to make our system geographically scalable.
Synchronous communication is a type of communication between nodes or resources; in this situation, whenever one node requests something from another node, it will “block” (which means it waits around and doesn’t do anything else) until it has received a response back. Many of us have dealt with synchronous communication in the form of the client-server model, where the client is the “requester” of information, and waits until the “requestee”, or the server, responds back to it.
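To make the idea of “blocking” concrete, here is a minimal sketch in Python. It is only an illustration: `server_handle` and `synchronous_client` are hypothetical names, and the “server” is just a local function that sleeps to simulate the time it takes for a response to come back.

```python
import time


def server_handle(request: str, delay_s: float) -> str:
    """Hypothetical server: takes `delay_s` seconds to produce a response."""
    time.sleep(delay_s)  # stand-in for network travel time plus processing
    return f"response to {request}"


def synchronous_client(requests, delay_s):
    """Send each request and block until its response arrives."""
    responses = []
    start = time.monotonic()
    for request in requests:
        # The client does nothing else while this call is in progress.
        responses.append(server_handle(request, delay_s))
    elapsed = time.monotonic() - start
    return responses, elapsed


responses, elapsed = synchronous_client(["a", "b", "c"], delay_s=0.05)
print(responses)
print(f"client was blocked for about {elapsed:.2f}s")
```

Because each request has to finish before the next one starts, the total time the client spends waiting is roughly the number of requests multiplied by the delay of each one.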
We’ll recall that a geographically scalable system is one that performs even as the distance between its users and/or resources increases; however, if a system leans on synchronous communication, it can cause issues when trying to ensure that the system can scale geographically.
But why, exactly, do these two characteristics of a system cause so many bumps in the road when we’re trying to grow a system? Let’s dig a little deeper into the details of what kinds of problems centralization and synchronous communication present when trying to scale, and what makes them difficult to deal with.
When one node has too much control
A centralized system can look like different things: it may be a single server or machine that is responsible for running the system, or (more likely) it is a group of machines that are clustered together in the same physical location (think data center), working together as one, cohesive unit. In both cases, the main control and location of the system is in one “central” place, hence the idea of a centralized system.
But when a single server, or even a group of servers, is located in one place, there are some hidden problems that only start to come out of the woodwork when the system has to scale in size.
A centralized system is going to be limited by whatever the main source of control is capable of handling.
Imagine that our system is just a single server on a single machine. What would happen if our single-server, centralized system suddenly received an influx of requests? Let’s say that our single-server system can probably process 100 requests per second; if it suddenly began receiving 10,000 requests a second, there is only so much that it can really do. No matter what our setup, with enough requests, we’ll run into the computational limitation of our centralized system.
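The numbers above can be worked through with a little arithmetic. The sketch below is a simplification, not a model of any real server: it assumes constant arrival and processing rates and a queue that never drops requests, and `backlog_after` is a hypothetical helper name.

```python
def backlog_after(seconds: int, arrival_rate: int, capacity: int) -> int:
    """Requests still waiting after `seconds`.

    Assumes constant rates and an unbounded queue (a simplification).
    """
    surplus_per_second = max(0, arrival_rate - capacity)
    return surplus_per_second * seconds


ARRIVAL_RATE = 10_000  # requests arriving per second (from the example above)
CAPACITY = 100         # requests the single server can process per second

for t in (1, 10, 60):
    waiting = backlog_after(t, ARRIVAL_RATE, CAPACITY)
    print(f"after {t:>2}s: {waiting:,} requests waiting")
```

Under these assumptions, 9,900 requests are already waiting after a single second, and at 100 requests per second it would take 99 more seconds to clear them even if no new requests arrived. The backlog keeps growing for as long as the influx lasts.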
The same could be said of handling data; if our system had to deal with an influx of data to process (for example, having to make writes or updates to a single database, all at the same time), it would inevitably run into its storage limitations. Only so much data can actually be accessed or written at a given time, and if the load exceeds the bounds of our system’s storage limitations, we might not actually be able to save all of our data as we’d expect.
Finally, there’s the issue of the network. When our system is centralized and controlled by one single point, we’re even more reliant on the network between our users and our centralized server to be up and running efficiently all the time. However, networks fail and can be unreliable, sometimes through no fault of our own! The network limitation of a centralized system means that if something goes wrong between our server and our user, the entire system could become delayed — or worse, unavailable.
Given what we already know about scalability, we’d know that we’d like to make our system scale by size. However, even the most advanced and modern centralized systems have their limitations. If we find that we need to make our system size scalable, we might run into some other limitations if the system is designed to be centralized.
When you’re waiting for a node to text back
Synchronous communication is not entirely a terrible thing; in fact, there are times when synchronous communication works perfectly well. How well communication between parts of a system works really depends on what kind of network they are using.
For example, if the nodes of a system are all located in the same building, then the information that they’re passing amongst one another doesn’t need to travel that far. However, if the nodes of a system are further apart (for example, on different sides of the country), then it’s a different story entirely. The two kinds of networks we just described have technical terms we can refer to them with.
A network where all the nodes live within a building or inside of an equivalently small area is known as a local-area network, or LAN. On the other hand, a network where the nodes could be much more widespread and live across the country or on different continents is known as a wide-area network, or WAN. A LAN is ideal for transferring information between machines that are already physically close to one another; a WAN, however, connects machines that are physically very distant from one another, usually at the cost of higher latency.
So how do these two networks tie back to synchronous communication, exactly? Well, we’ll recall that synchronous communication means that a requesting node has to wait for the requestee node to respond to it in some way.
Ultimately, a system built around a local-area network will run into problems when its synchronous communication is used in a wide-area network.
If we think about this more deeply, it starts to make sense: when all our nodes are on a local network and communication between them doesn’t need to travel as far, having one node “block” while awaiting a response from another node might not be all that noticeable. When our resources are all functioning within the context of a LAN, then waiting for a process to synchronously complete can be pretty fast. However, the moment our system’s nodes become further apart (for example, if we add a node that is outside of the local network and use a wide-area network instead), the “blocking” time between request and response could become painfully (or even just noticeably!) slow.
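A rough calculation shows how quickly this blocking time adds up. The round-trip times below are illustrative assumptions rather than measurements (real values vary widely from network to network), and `total_blocking_ms` is a hypothetical helper that assumes every call is made one after another.

```python
def total_blocking_ms(calls: int, round_trip_ms: float, processing_ms: float) -> float:
    """Total time a client spends blocked across sequential synchronous calls."""
    return calls * (round_trip_ms + processing_ms)


# Assumed, illustrative numbers (not measurements):
LAN_ROUND_TRIP_MS = 0.5   # nodes in the same building
WAN_ROUND_TRIP_MS = 80.0  # nodes on different sides of the country
CALLS = 50                # sequential requests needed to finish one task
PROCESSING_MS = 1.0       # time the other node spends on each request

lan_total = total_blocking_ms(CALLS, LAN_ROUND_TRIP_MS, PROCESSING_MS)
wan_total = total_blocking_ms(CALLS, WAN_ROUND_TRIP_MS, PROCESSING_MS)

print(f"LAN: blocked for {lan_total:.0f} ms")
print(f"WAN: blocked for {wan_total:.0f} ms")
```

With these assumed numbers, the same 50 calls block for 75 ms on the LAN but 4,050 ms on the WAN. Nothing about the work changed; only the distance that each request and response has to travel.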
Another related issue that compounds the slowness of synchronous communication on a WAN is that wide-area networks are generally not as reliable as local-area networks, and may fail, experience interruptions, or have limited availability or bandwidth. If our system uses synchronous communication, then it’s likely that, at some point, the network itself could fail while one resource is blocking and awaiting a response from another.
Given what we already know about geographical scalability, we’d know that if we grow our system by adding another node, it should be able to perform similarly even if the nodes are physically further away from one another. However, if our system is designed to use synchronous communication, then it might not scale as we’d like it to from a geographical standpoint.
So, perhaps what we’ve learned is that maybe scaling isn’t as simple as “add another server” or “increase the size of our database node”. It turns out that we have to think a bit about how our system is set up, and whether or not we need to change something about the way that it was architected when it comes time to scale. Scaling isn’t an easy thing (we haven’t even covered all the issues with scaling — just two big ones!), but it is possible to do with some thoughtfulness and helpful techniques.
In the next post, we’ll learn about some of the tools at our disposal when it comes to scaling a distributed system efficiently and effectively, and how to manage even the most unruly system’s unforeseen growth spurts!
Scalability is a pretty well written-about and researched topic! You can learn a lot more about the problems with centralized systems and synchronous communication. If you’re curious, here are some great resources to start with.
- A brief introduction to distributed systems, Maarten van Steen & Andrew S. Tanenbaum
- On System Scalability, Charles Weinstock & John Goodenough
- Distributed Systems for Fun and Profit, Mikito Takada
- An Introduction to Distributed Systems (Lecture Notes), Dr. Tong Lai Yu