Fault Tolerance and High Availability — New World

Fault tolerance and high availability mean different things in different environments. In many cloud environments, high availability is celebrated and used as a selling point to customers. For example, cloud providers often say of their managed services that there is no need to manage the infrastructure because it is highly available: if one piece of the infrastructure goes down, a redundant piece takes over.

Can fault tolerance and high availability be taken for granted then?

As infrastructure becomes software defined, a lot of implementation and management is done via code. Infrastructure is built via code, and far more of the work now happens on the software side than on the big, chunky servers we were used to a decade ago.

How do we redefine high availability and fault tolerance?

In this case, from my experience, we look at distributed and decoupled systems, and also at serverless environments where possible. In a highly complex environment, a distributed and decoupled design can help improve fault tolerance, because a section of the environment can easily be pulled out of the system and an alternative used in its place.

For example, let's say we had a CD system that used a SaaS product to deploy our code into the infrastructure, and it broke down; we could cut over to a simple serverless CD to deploy into our infrastructure. Of course, you may ask, why not just use the serverless CD as our primary system? In certain spaces, we might have inherited licenses that we need to make use of, and we can't just use a quick hack.

Did you just see what I did there 🙂? People have always associated high availability and fault tolerance with higher cost. In my opinion, it comes down to how we do it. If we had a primary service running on servers, could we utilise containers, or even serverless, as a secondary system? This gives infrastructure/platform teams a great impetus to build redundant services for less. Of course there is an initial cost and a learning curve for engineering teams; but in the long run, cost drops as we can run a hot-hot environment, be it a servers-container environment or a servers-serverless environment.
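To make the CD cut-over idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `saas_cd` and `serverless_cd` are hypothetical stand-ins for the primary and secondary deployment paths, and `DeployError` is a made-up exception. A real setup would call the actual tools instead of these stubs.

```python
from typing import Callable, List, Sequence, Tuple


class DeployError(Exception):
    """Hypothetical error raised when a deployment path cannot release."""


def deploy_with_fallback(
    artifact: str,
    backends: Sequence[Tuple[str, Callable[[str], str]]],
) -> str:
    """Try each deployment path in order; return the first success."""
    errors: List[str] = []
    for name, deploy in backends:
        try:
            return deploy(artifact)
        except DeployError as exc:
            # Remember why this path failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise DeployError("all deployment paths failed: " + "; ".join(errors))


def saas_cd(artifact: str) -> str:
    # Hypothetical primary path; here we simulate the SaaS product being down.
    raise DeployError("SaaS CD unavailable")


def serverless_cd(artifact: str) -> str:
    # Hypothetical secondary path, standing in for a simple serverless CD.
    return f"deployed {artifact} via serverless CD"


result = deploy_with_fallback(
    "app-v1",
    [("saas", saas_cd), ("serverless", serverless_cd)],
)
print(result)
```

The point of the sketch is the decoupling: the caller only knows about an ordered list of deployment paths, so swapping the secondary from serverless to containers means changing one entry in that list.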

Originally published at http://blog.vigsivapragasam.io on May 31, 2020.