Throughout all my books and postings one of the overarching themes is how to create highly available systems. In this quick post I want to clarify what I mean by high availability, because my focus is different from how we normally think about it.
High availability is a key driver for moving to cloud computing. We can easily create multiple instances so that if one dies the others can carry the load while the auto-scaling policies kick in and a new instance rises from the ashes. We spread these instances across multiple availability zones so that if one zone goes offline the others will keep the system running. We should even run our systems in multiple regions, so that if an entire region goes offline we are not dead in the water.
But these are only the infrastructure-related concerns. The remedies I just listed are well understood. If we take a serverless-first approach then we get most of this high availability for free. Even deploying multi-regional, active-active services is straightforward. We can hand these infrastructure concerns off to the cloud providers and focus on higher-order problems.
It is these higher-order concerns that I prefer to focus on. Take the highest-profile cloud disruptions as an example, the ones that make headline news. These disruptions are, by and large, the result of honest human error. Someone working under pressure made a mistake. The same thing can happen to the services we run. You can have unlimited infrastructure and all the right HA techniques and policies in place, and everything works great until someone makes a lethal mistake. Then the whole thing comes tumbling down because new instances cannot come up as fast as the others are falling down. It has happened before and it will happen again.
We have to take honest human error into account if we truly value high availability.
I have written, here and here, about how cloud native is lean and serverless-first, which enables self-sufficient full-stack teams to deliver new features quickly. At this pace the likelihood of honest human error increases and we need to limit the blast radius and recover even more quickly when we slip.
To this end I have written, here, about how microservices do not provide sufficient protection and we instead need to create autonomous services following an event-first approach to ensure high availability in the face of honest human error. In these additional posts, here and here, I go into further detail about creating autonomous services using system-wide event sourcing and system-wide CQRS to provide the bulkheads needed for protection.
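To make the bulkhead idea a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `EventStream` is an in-memory stand-in for a real event stream, and `CustomerService` and `OrderService` are made-up services. The point is that the downstream service answers queries from its own copy of the data, built from events, so a mistake upstream does not take it down.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    type: str
    payload: dict


@dataclass
class EventStream:
    """Hypothetical in-memory stand-in for a system-wide event stream."""
    log: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)

    def publish(self, event):
        self.log.append(event)  # event sourcing: the log is the record of facts
        for handler in self.subscribers:
            try:
                handler(event)
            except Exception:
                # Bulkhead: a failing consumer does not reach the publisher
                # or the other consumers. A real system would record the fault.
                pass


class CustomerService:
    """Upstream service: publishes facts and never calls downstream services."""

    def __init__(self, stream):
        self.stream = stream
        self.available = True

    def rename(self, customer_id, name):
        if not self.available:
            raise RuntimeError("customer service is down")
        self.stream.publish(
            Event("customer-renamed", {"id": customer_id, "name": name})
        )


class OrderService:
    """Downstream service: keeps its own read model (the CQRS query side)."""

    def __init__(self, stream):
        self.customer_names = {}
        stream.subscribers.append(self.on_event)

    def on_event(self, event):
        if event.type == "customer-renamed":
            self.customer_names[event.payload["id"]] = event.payload["name"]

    def order_summary(self, order_id, customer_id):
        # Answered from local data only; no synchronous call upstream.
        name = self.customer_names.get(customer_id, "unknown")
        return f"order {order_id} for {name}"


stream = EventStream()
customers = CustomerService(stream)
orders = OrderService(stream)

customers.rename("c1", "Ada")
customers.available = False  # simulate an honest mistake in an upstream deployment
summary = orders.order_summary("o1", "c1")
print(summary)
```

Even with the upstream service marked unavailable, the order service can still produce its summary because it never has to ask upstream for the customer's name. That is the blast radius being limited: the mistake stays inside the service where it was made.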