Stacking Theory for Systems Design
In recent years, I have adopted a method for system design which I think yields good results. For lack of a better word, I have overloaded “stack” yet again and use it as a metaphor for this design.
Like everything else, it isn’t a silver bullet. There are other designs, equally good, with other trade-offs. I come from an Erlang background, so of course my designs are going to be influenced by its design. I do think the methods are widely applicable, however, so you could easily use them in your preferred programming language. To make some things clear, first a little bit of Erlang terminology from a top-down perspective:
- An Erlang release comprises the Erlang runtime system (containing the interpreter, garbage collector and built-in parts), the standard libraries you need, and your own code modules. A release is completely self-contained and requires no outside libraries in order to run. Essentially, it is almost its own “Docker container”.
- Inside the Erlang release, we have applications which are either from the standard system: kernel, stdlib, crypto, …, or they are your own applications. When building a release, the release manager figures out what dependencies there are among the applications you wish to run, and it makes sure to include all of them.
- Each application consists of processes, where each process sits somewhere in a supervision tree. A process in Erlang terminology is a point of concurrent execution: processes operate independently of each other and communicate via message passing. If you have more than one physical core, they can execute in parallel in the system.
- The code a process runs resides in a module. A single process can run code from many modules, and the same module can be run from many processes. Modules are roughly “static” components insofar as they exist at the time the programmer writes them, and are included in applications. A process is “dynamic” in the sense that it exists at runtime, executing code in the modules.
- In Erlang systems, processes also define the boundary of isolation. The state of a process cannot be scrutinized outside that process, so the only way to interact with a process is to send it a message and wait for a reply.
Modes of operation
When you boot up your Erlang system, it goes through several modes of operation:
First, the runtime is executed. It starts booting an init-sequence. This sequence of initialization begins by loading all modules into the system. This avoids accidentally missing some module later on.
Once every module is loaded, we go through and configure each application and start them up. Configuration at this level usually stops the system if there is anything wrong. Configuration errors are human errors and aborting is often better than trying to cope with a mistake in configuration.
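A minimal sketch of this fail-fast configuration step, written in Python rather than Erlang purely as neutral notation. The names `ConfigError`, `DbPoolConfig` and `load_db_pool_config`, and the three settings, are illustrative assumptions and not part of any real system described here. Note that it only validates the settings; it does not try to reach the database.

```python
from dataclasses import dataclass


class ConfigError(Exception):
    """Raised when the configuration is unusable; the caller should abort the boot."""


@dataclass(frozen=True)
class DbPoolConfig:
    host: str
    port: int
    pool_size: int


def load_db_pool_config(raw: dict) -> DbPoolConfig:
    """Validate raw settings once, at boot, and refuse to continue on any mistake."""
    missing = [key for key in ("host", "port", "pool_size") if key not in raw]
    if missing:
        raise ConfigError(f"missing settings: {', '.join(missing)}")
    try:
        port = int(raw["port"])
        pool_size = int(raw["pool_size"])
    except (TypeError, ValueError) as exc:
        raise ConfigError(f"port and pool_size must be integers: {exc}") from exc
    if not 0 < port < 65536:
        raise ConfigError(f"port out of range: {port}")
    if pool_size < 1:
        raise ConfigError(f"pool_size must be at least 1, got {pool_size}")
    return DbPoolConfig(host=str(raw["host"]), port=port, pool_size=pool_size)
```

A boot script would call `load_db_pool_config` before starting anything else and let `ConfigError` terminate the program, so a human sees the mistake immediately.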
At this point, the system is operational.
However, there is more to it than that. The above is what I tend to call the baseline of the system. Nothing is wrong with the system, but rather than claim that it is operational, I like to cut operational behavior into a “stack”.
The baseline is level 0 in the stack, and now we try to move the system upwards in the stack by adding another level. It is important to stress that transitioning is a best effort method. We make an attempt at increasing the operational level of the system, but if we can’t we stay put at the current level.
We could, for instance, introduce a database connection pool into our system. At level 0, we start up the pool and we spawn proxy-processes for each connection worker. But we don’t try to connect at level 0. All we can guarantee in the baseline is that the system is up and running, not that it is able to carry out service for clients. Connecting to the database is the transition to level 1. We periodically try to connect, and once we have a connection, we proceed at a higher operational level. If we don’t get a database connection for a while, we log that fact and we might raise an alarm in the system in order to tell devops something is amiss.
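The proxy idea can be sketched roughly as follows, in Python for illustration. `ConnectionProxy`, `tick` and the injected `connect` function are hypothetical names; the sketch assumes `connect` returns a non-None connection object and raises `OSError` (or a subclass) when the database is unreachable.

```python
from typing import Callable, Optional


class ConnectionProxy:
    """Stands in for one pool worker. It starts at level 0: alive, but not connected."""

    def __init__(self, connect: Callable[[], object]) -> None:
        self._connect = connect            # hypothetical function that opens a connection
        self.conn: Optional[object] = None
        self.failed_attempts = 0           # something to log or alarm on

    @property
    def connected(self) -> bool:
        return self.conn is not None

    def tick(self) -> None:
        """Call periodically: one best-effort attempt to reach level 1."""
        if self.connected:
            return
        try:
            self.conn = self._connect()
            self.failed_attempts = 0
        except OSError:
            # Stay at level 0 and try again on the next tick.
            self.failed_attempts += 1

    def connection_lost(self) -> None:
        """Reset to the known good state: not connected at all."""
        self.conn = None
```

Creating the proxy never touches the network, so booting cannot fail because the database is down. Only `tick` talks to the outside world, and a failure there leaves the proxy where it was.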
The key is that had we made level 0 assume connectivity to the database, our system would be far less flexible. Should it suddenly experience an intermittent and transient network error, there is no level below the baseline, so we must terminate the system as a whole. By “stacking” service, we can go back to level 0 and start best-effort transitions to level 1 again. This structure is a ratchet-mechanism in Erlang systems: once at a higher level, we continue operating there. Errors lead to the fault-tolerance handling acting upon the system. It moves our operating level down — by resetting and restarting the affected processes — and continues operation at the lower level.
Note how this is safe: The state in which we have no database connectivity is a stable state from which we can try establishing one. If an error occurs, the internal state of a single database connection is very complex to model, and trying to recover from such a state is insanely complicated. So we terminate the connection and reset it to a known good state: “not connected at all”.
Our system, once at level 1, will then transition to level 2. At level 2 it may connect to our Message Queue broker. The same thing applies as with the database connection: we make a best-effort at getting there and an error will just reset us to level 1: We have the database connection, but need the broker connection.
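The levels described so far can be sketched as a small data structure. This is an illustration in Python under my own naming (`Level`, `OperationalStack`, `try_raise`, `drop_to` are all hypothetical); in an Erlang system the same effect comes from supervisors and process restarts rather than from one central object.

```python
from typing import Callable, List, NamedTuple


class Level(NamedTuple):
    name: str
    start: Callable[[], None]   # raises on failure
    stop: Callable[[], None]    # resets the level to its known good "off" state


class OperationalStack:
    """Level 0 is the baseline; levels[i] takes the system from level i to i + 1."""

    def __init__(self, levels: List[Level]) -> None:
        self._levels = levels
        self.current = 0

    def try_raise(self) -> bool:
        """Best-effort attempt to move one level up. On failure we stay put."""
        if self.current == len(self._levels):
            return False
        level = self._levels[self.current]
        try:
            level.start()
        except Exception:
            level.stop()  # leave nothing half-started behind
            return False
        self.current += 1
        return True

    def drop_to(self, target: int) -> None:
        """Fall back to `target`, stopping the higher levels top-down."""
        while self.current > target:
            self.current -= 1
            self._levels[self.current].stop()
```

A periodic timer would call `try_raise` until the top level is reached, and an error handler would call `drop_to` with the highest level that is still known to be good.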
And once at level 2, we try to go to level 3: enable the Cowboy web server’s listen port. At this point, we can assume levels 1 and 2, so the system’s underlying parts must be operational, and hence we can give service to the outside world. This is where we introduce a load-balancer callback into our service, and we tell the load-balancer we are able to give service. At this point, we are entered into the service pool in the load-balancer and the system operates nominally.
An error at level 3 in a single process shouldn’t give rise to a total system failure and reset. Many errors are transient and intermittent. Erlang’s fault tolerance principles define a policy for the threshold at which we deem operation at level 3 a failure due to too many errors in too short a timespan. The solution, of course, is to gradually try resetting more and more of the web server’s internals until the fault is removed. Or, in the worst case, reset to a lower operating level in the stack.
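The threshold policy can be illustrated with a sliding-window counter. This is a Python sketch of the general idea (at most N restarts within T seconds, otherwise escalate), not a copy of Erlang's supervisor code; `RestartLimiter` and its parameter names are assumptions made for the example.

```python
import time
from collections import deque
from typing import Callable


class RestartLimiter:
    """Allows at most `intensity` restarts within any `period` seconds."""

    def __init__(self, intensity: int, period: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._intensity = intensity
        self._period = period
        self._clock = clock
        self._restarts = deque()  # timestamps of recent restarts

    def record_restart(self) -> bool:
        """Return True to restart the worker locally, False to escalate."""
        now = self._clock()
        self._restarts.append(now)
        # Forget restarts that have fallen out of the window.
        while now - self._restarts[0] > self._period:
            self._restarts.popleft()
        return len(self._restarts) <= self._intensity
```

When `record_restart` returns False, the caller stops restarting the single worker and resets a larger part of the system instead, which in the stacked picture may mean dropping an operating level.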
Once you have a leveled stack of operation, stopping the service becomes easy: stop it by going down in levels. First, you remove the listen socket so no more requests enter the system. Then you drain the requests that are currently operating at level 3. Then you move to level 2. Then 1, and then 0 at which point you can terminate the system.
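The drain step is the part that is easiest to get wrong, so here is a rough Python sketch of it. `RequestGate` and its methods are hypothetical names; a real web server would hook `enter` and `leave` into its request handling.

```python
import threading


class RequestGate:
    """Tracks in-flight requests so level 3 can be drained before it is torn down."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._accepting = True
        self._in_flight = 0

    def enter(self) -> bool:
        """Called when a request arrives. Returns False once we have stopped accepting."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def leave(self) -> None:
        """Called when a request has been answered."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def close_and_drain(self, timeout: float) -> bool:
        """Stop accepting, then wait for running requests. True if fully drained."""
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)
```

Shutdown then reads top-down: call `close_and_drain`, and only when it returns move on to tearing down level 2, then level 1, then level 0. The timeout is there so that a stuck request cannot block termination forever.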
A common mistake is to botch the close-down procedure of a system. My test is usually to load the system and then try to terminate it while it is loaded. Often you see crashing and burning in the close-down phase, and the requests that are currently running fail in non-standard ways. This is a problem in the modern world where we use elastic computing. Machines are added and removed all the time automatically as load requires, so we can’t have this happening. In some situations, the load balancer can be coerced to participate in the close-down procedure, which helps since it can drain connections. But that isn’t always the case.
Development & Production
Another test: say I start the system in a development environment where it has no connectivity to any system it needs: databases, central logging, metrics, brokers, and so on. If the system fails to boot due to a network connectivity problem, chances are that it will fail to boot in production. A system assuming the presence of other systems has an unwritten dependency chain. You have to boot your production systems in a certain order or things will not work. This is often bound to cause trouble.
Furthermore, suppose you decide to move a database to another address. One advantage of a stacked design is that you can often add another database pool to the system without it being there yet. It allows you to deploy your system first, and then await the presence of the new database cluster. Once there, your system picks it up automatically and starts using it. By building systems where at least one of the systems is able to handle transitions in the infrastructure, you build systems that are far easier to manage. We often use this trick: add a new RESTful endpoint, deploy it to production and then start using it. This avoids us having to coordinate a client launch where the approval committee of Apple might prove to be a problem. Extending this trick to your infrastructure is nice. Support both versions of a protocol at the same time, so you don’t have to decide when to move from one to the other, but can put that onus on a later decision-maker.
In development, stacked designs are also very nice. You may not need to start a centralized logger or a metrics gatherer while you are developing the system. And if you need to test something on metrics or logging, a simple invocation of netcat (the nc(1) tool) suffices. Again, this helps production. Say you just lost your metrics server: it shouldn’t coordinate with your main service and take it down as well.
Cloud environments are notoriously flaky. We have weekly disconnects among services, and small disruptions are common. You need to build your systems such that they can tolerate a small amount of noise. Stacked designs are excellent at tolerating noise. If for instance you just lost a single database connection from the pool, you can just pick another and try replenishing the lost connection. If you just lost all of them, it is back to level 0.
Another advantage of tolerating small amounts of noise is that you can often tolerate larger amounts too! If the connection error rate is 1:1,000,000 and suddenly rises to 1:1,000, then your system can cope with it. Had you picked an absolute path where everything has to be entirely correct all the time, you just made a reboot and failure a thousand times more likely. It will hurt your service.
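To put numbers on this, here is a small worked example in Python. The workload of ten million connection operations per day is an assumption of mine, chosen only to make the arithmetic concrete; the two error rates are the ones from the paragraph above.

```python
from fractions import Fraction


def expected_errors(operations: int, error_rate: Fraction) -> Fraction:
    """Expected number of connection errors over `operations` attempts."""
    return operations * error_rate


OPS_PER_DAY = 10_000_000  # assumed workload, not a figure from the text

quiet = expected_errors(OPS_PER_DAY, Fraction(1, 1_000_000))
noisy = expected_errors(OPS_PER_DAY, Fraction(1, 1_000))

print(f"quiet network: {quiet} errors/day")
print(f"noisy network: {noisy} errors/day")
print(f"increase: {noisy / quiet}x")
```

If every one of those errors forces a full restart, the noisy day under this assumed workload means thousands of restarts instead of a handful. If each error only costs one connection that gets replenished, the same increase is mostly invisible to clients.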
Use an alarm handler! Erlang has one built in where you can set and clear alarms. If you’ve had no connection to the outside world for, say, 15 seconds, you can raise the alarm. In turn, your system is going to tell the world that there is a problem, rather than having some poor devops person figure out what is wrong by themselves. A system saying: I’m broken because I have no connection to the database and thus I can’t give service, is far better than one that is rolling around in a reboot/restart loop all the time.
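The set-and-clear idea can be sketched like this in Python. It is not Erlang's alarm handler; `AlarmHandler`, `ConnectivityWatch` and the alarm name used in the usage note are hypothetical, and the 15 second grace period is the example value from the paragraph above.

```python
import time
from typing import Callable, Optional, Set


class AlarmHandler:
    """Holds the set of currently active alarms so operators can query it."""

    def __init__(self) -> None:
        self.active: Set[str] = set()

    def set_alarm(self, alarm_id: str) -> None:
        self.active.add(alarm_id)

    def clear_alarm(self, alarm_id: str) -> None:
        self.active.discard(alarm_id)


class ConnectivityWatch:
    """Raises an alarm after `grace` seconds without a connection, clears it on reconnect."""

    def __init__(self, alarms: AlarmHandler, alarm_id: str, grace: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._alarms = alarms
        self._alarm_id = alarm_id
        self._grace = grace
        self._clock = clock
        self._down_since: Optional[float] = clock()  # the system boots disconnected

    def connected(self) -> None:
        self._down_since = None
        self._alarms.clear_alarm(self._alarm_id)

    def disconnected(self) -> None:
        if self._down_since is None:
            self._down_since = self._clock()

    def check(self) -> None:
        """Call periodically, for example once per second."""
        if self._down_since is None:
            return
        if self._clock() - self._down_since >= self._grace:
            self._alarms.set_alarm(self._alarm_id)
```

A monitoring endpoint can then report `alarms.active` directly, so an alarm named, say, `database_unreachable` tells the operator what is wrong without any log archaeology.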
Stacked design, one level up
Finally, the stacked design doesn’t stop with the service/system itself. Use it one level up in your architecture as well. It is far better to deploy machines by starting with an empty machine and configuring it, installing software and starting it. Contrast with mutation of an existing machine. By rebuilding your environment, you essentially build your whole data center from scratch every time. It makes sure you are safe even if most of the system gets hosed.
When errors occur, find what you can: core dumps, Erlang crashdump files, log files and so on. Ship it elsewhere. Then wipe the machine and build a new one to take over. This allows post-mortem analysis of the error, while keeping your system operational. Post-mortem analysis is paramount if you want stable systems. You need to distinguish between errors which are benign and due to a rare event chain, and errors which need programmers to fix them. This can only be done by analyzing errors as they occur. Note that some benign errors in a system must never be fixed. Fixing certain errors requires so much change in the code base that the changes pose a greater risk than the benign error itself! So unless you can find a more elegant approach to the problem, don’t fix them.
Also, don’t make ordering assumptions in your architecture. Most of us operate in environments where errors occur. When they do, they break our ordering assumptions about what is running and what is not. So build your systems to cope with this. In a stacked design, as long as there is always one stack that can increase in level, the system will eventually “un-tilt” itself. This is especially important in micro-service architectures, where the dependencies tend to be so complex nobody has actually tested all possible interactions.
Finally, stacked designs can cover for bad system designs. As long as some of your systems can cope with failure, you can often have them be the saving grace in the architecture. By deploying some systems able to cope with trouble, you can avoid total failure. Nygard’s “Circuit Breaker” pattern comes in handy here in your software.
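For readers who have not met the pattern, a circuit breaker can be sketched in a few lines of Python. This is a simplified illustration and not Nygard's reference design; `CircuitBreaker`, `CircuitOpen` and the default thresholds are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised instead of calling a dependency that is currently considered broken."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpen("dependency marked as down; failing fast")
            # Otherwise we are half-open: let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and forgets earlier failures.
        self._failures = 0
        self._opened_at = None
        return result
```

The breaker gives the calling system a cheap, stable answer ("not available right now") while the dependency is down, which is the same move as staying at a lower level in the stack instead of crashing.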
Note on process isolation: debugging calls are the exception. They can inspect a process’s state from outside the process.