High Reliability Organization Through Operational Excellence
By Alexandre Agular, VP of Engineering
Socure’s mission is to verify every identity on the internet in real time and completely eliminate identity fraud.
We’ve previously mentioned how critical it is to be accurate and inclusive when verifying identities, and why these things are core to Socure’s mission. In this article, I would like to address what it takes to achieve the level of system reliability that is required to fulfill this goal.
As more and more of people’s activities move online, basic internet services such as hosting infrastructure and payment APIs have become part of the infrastructure of the global economy. Identity verification is now such a service: no one will be able to conduct business online without it. Successfully eliminating identity fraud means achieving a level of dependability that the economy can rely on to function.
Thus the question of how to build an organization that can repeatedly deliver complex software while maintaining a high level of reliability is an intriguing one. Looking at how other SaaS companies do this is certainly useful, but it has always left me somewhat unsatisfied. While one might argue this problem has been solved, companies often end up having their own specific challenges and constraints. Simply because a company opted for a particular practice that worked for them does not mean replicating that same idea will yield the same effect for a different organization.
What’s even more interesting to me is trying to understand what we have in common with organizations operating in a completely different environment than Socure’s, yet with similar objectives of high operational excellence and reliability. There is more wisdom to gain this way as the commonality must be the very essence of operational excellence.
During my research, I read about aircraft carriers, hospitals, and even nuclear power plants. As it turns out, these very different High Reliability Organizations face a common set of problems that has been thoroughly studied and documented, and from which SaaS companies can learn a lot.
Successful High Reliability Organizations tend to exhibit five characteristics:
- Preoccupation with failure
- Reluctance to simplify
- Sensitivity to operations
- Commitment to resilience
- Deference to expertise
Interestingly, while studies of High Reliability Organizations never seem to encompass SaaS companies, it is quite fascinating to see how intuitive and relevant the five characteristics are to the challenges faced in running internet service companies.
In the rest of this article, I will describe these five principles of High Reliability Organizations, but from the perspective of a SaaS company.
Preoccupation with Failure
While complex systems fail, they rarely fail without warning. A failure is often the result of a gradual buildup, such as a memory leak in a Java Virtual Machine or a supply chain delay for a critical piece of hospital equipment.
How do you prevent such an issue from turning into a serious incident? The team needs to be encouraged to be curious about odd events, investigate them and raise concerns.
In the software industry, companies invest a lot of resources into observability and encourage teams to review their metrics. Often the cause of the next outage is there to see, and it only takes a little curiosity to prevent it from happening.
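As a rough illustration of spotting a buildup before it becomes an outage, the sketch below fits a straight line to periodic memory readings and estimates how long remains before a limit is reached. The function names, the sample format, and the 1,000 MB limit are assumptions made for this example; they do not describe any particular monitoring tool.

```python
def growth_rate(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour.

    `samples` is a hypothetical list of (hours_since_start, heap_used_mb)
    tuples, such as a team might export from its metrics dashboard.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


def hours_until_limit(samples, limit_mb):
    """Estimate the hours left before memory reaches `limit_mb`.

    Returns None when memory is flat or shrinking, since a linear trend
    then never reaches the limit.
    """
    slope = growth_rate(samples)
    if slope <= 0:
        return None
    latest_mb = samples[-1][1]
    return max(0.0, (limit_mb - latest_mb) / slope)


if __name__ == "__main__":
    # Hypothetical hourly heap readings that creep upward.
    readings = [(0, 500.0), (1, 520.0), (2, 540.0), (3, 560.0)]
    print(growth_rate(readings))              # MB gained per hour
    print(hours_until_limit(readings, 1000))  # hours before the assumed limit
```

A straight-line fit is a deliberately simple choice. Real memory usage is noisier, but even a crude estimate like this gives a curious engineer a reason to investigate before the limit is reached.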
Reluctance to Simplify
In complex systems, systemic failures are rarely caused by a single error or mistake. Instead, they tend to be the consequence of a cascade of errors that were not caught in time.
In the context of software, let’s think about a defect that is deployed to production. There is the one defective line of code, and the engineer who wrote it is now in the spotlight. But for the postmortem to be thorough, a few more questions need to be asked:
- Did the requirements address the particular aspect of the change being proposed?
- Were the test coverage and integration tests sufficient for that particular function?
- Was the code review conducted properly, in full understanding of the change being proposed?
- And so on…
A fundamental aspect of a High Reliability Organization is its ability to look beyond the “root cause” and ensure that, for every incident, no stone is left unturned. Each process, practice, or tool that was involved in the failing activity needs to be reviewed and improved if necessary.
This is typically done through postmortems, in which the team reviews everything involved in operating the system: the software, tools, processes, and practices.
Sensitivity to Operations
Organizations in which multiple cross-functional groups build and run a platform share a common challenge: differences in discipline and perspective tend to isolate those groups from one another.
Successful High Reliability Organizations have overcome this challenge by turning a distributed group of isolated individuals performing tasks into what Karl E. Weick and Kathleen M. Sutcliffe call a collective mind.
In a collective mind, every individual contributes to part of a workflow and understands the actions and work of every other team member, the impact their individual performance has on downstream elements of the chain, as well as the impact upstream dependencies can have on their own performance.
This complete situational awareness of each individual relative to the team and the workflow they are a part of allows the team to act as one, thus forming a collective mind.
This collective mind, coupled with a preoccupation with failure mentioned earlier, is what allows these teams to interrupt a cascade of mistakes, or raise alerts before complete failure occurs.
Sensitivity to operations is certainly the hardest principle to implement at scale. Small teams are often naturally adept at it, but struggle to maintain it as they grow. This happens because corporations naturally tend to scale by hiring specialists, slotting them into a large and complex process, and then trusting the process.
In reality, however, hiring isolated experts and trusting the process is the exact opposite of what should happen when striving for a collective mind. Watching out for problems, performing thorough postmortems, and investing in training so that all team members fully understand their role in the success or failure of the organization are what allow High Reliability Organizations to operate without fault.
Commitment to Resilience
Failure will occur, but strong teams sustain high levels of operational excellence by committing to and incorporating these principles in their organizational culture.
By refusing to oversimplify the explanations of their failures, constantly seeking to understand anomalies and checking whether they are the first sign of a problem, building a collective mind, and refining their adoption of these practices incident after incident, teams learn to operate without fault.
Deference to Expertise
To successfully build and operate a complex system, timing of decisions is of the essence.
As discussed above, incidents typically don’t happen suddenly; they evolve from anomalies that occur over time. Organizations that can detect these anomalies early and react to them rapidly will be able to prevent failures.
Moreover, the magnitude of an incident is a function not only of what failed, but also of how long the failure lasted. Organizations that can recover from failure faster will generally do better at keeping their systems running.
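To make the relationship between scope and duration concrete, here is a minimal sketch under one simplifying assumption: that the magnitude of an incident can be approximated as the share of requests affected multiplied by the minutes it took to recover. The `Incident` class, the `magnitude` and `availability` functions, and the numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    affected_fraction: float  # what failed: share of requests failing, 0.0 to 1.0
    duration_minutes: float   # how long: minutes from onset to recovery


def magnitude(incident):
    """Impact expressed as equivalent minutes of total outage (an assumed metric)."""
    return incident.affected_fraction * incident.duration_minutes


def availability(incidents, period_minutes):
    """Fraction of the period during which the service was effectively up."""
    lost = sum(magnitude(i) for i in incidents)
    return 1.0 - lost / period_minutes


if __name__ == "__main__":
    month = 30 * 24 * 60  # minutes in a 30-day month

    # Same failure (half of all requests affected), different recovery times.
    slow_recovery = [Incident(affected_fraction=0.5, duration_minutes=60)]
    fast_recovery = [Incident(affected_fraction=0.5, duration_minutes=15)]

    print(availability(slow_recovery, month))
    print(availability(fast_recovery, month))
```

Under this assumption, recovering four times faster from the same failure makes the incident four times smaller, which is why recovery speed matters as much as prevention.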
High Reliability Organizations follow a traditional hierarchy to make decisions in normal times. But because speed matters in times of crisis, these organizations defer authority to where the expertise is when problems occur. This allows the people who are closest to the problem to rapidly self-organize, respond to incidents, and prevent or limit system failures.