“Fault” vs “Failure” and fault tolerant systems

Published in pyankit

2 min read · Sep 24, 2017

Fault vs. failure is one of those concepts that is well understood in theory classes but hard to grasp in the real world, especially when real-world teams are siloed into a horizontal and vertical matrix.

In simple terms: a fault is when a component of a system deviates from its spec; a failure is when the system as a whole doesn’t provide its service to the user, or provides an unacceptably degraded service.
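To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and only for illustration: `fetch_live_price` stands in for a component whose spec is "return a price", and `get_price` stands in for the service the user actually sees. When the component faults, the system can still serve a cached value (a fault, but no failure); only when there is nothing to fall back on does the fault turn into a failure.

```python
class PriceServiceFault(Exception):
    """Raised when the (hypothetical) price component deviates from its spec."""


def fetch_live_price(item, healthy=True):
    # Hypothetical component: its spec says it returns a price for the item.
    if not healthy:
        raise PriceServiceFault(f"no live price for {item}")
    return 100.0


def get_price(item, cache, healthy=True):
    """System-level service: the user should always get a usable price."""
    try:
        price = fetch_live_price(item, healthy)
        cache[item] = price
        return price, "live"
    except PriceServiceFault:
        if item in cache:
            # Fault contained: the user still gets an acceptable answer.
            return cache[item], "cached"
        # Nothing to fall back on: the fault has become a failure.
        raise RuntimeError(f"service unavailable for {item}")


cache = {}
print(get_price("book", cache))                 # component healthy
print(get_price("book", cache, healthy=False))  # fault, but no failure
```

Whether a slightly stale cached price counts as "acceptable" is a business decision, not an engineering one, which is exactly the classification problem discussed below.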

In an ideal world there are no faults. In the real world faults will occur, yet failure is still unacceptable. Imagine a railway switch not being triggered on time because the garbage collector had paused the thread*. Systems that can cope with a large number of possible faults are called fault tolerant systems. The name is somewhat misleading, though, because a system can only guarantee tolerating known faults, not all possible ones. In other terminology, a failure would be a bug and a fault would be just a defect.
To keep a system fault tolerant, today’s software teams face two hard problems: classifying which faults amount to a failure, and containing faults so that they do not become failures.

For the first problem, the easiest solution is to have someone on the engineering team who understands the business, or to have the product (business) and engineering teams communicate very closely. If there is a separate QA team, it becomes of the utmost importance for that team to distinguish a bug from a defect. Good QA teams get this classification right.
For the second problem, there is no easy fix. It involves careful deliberation on how the various components of the system interact with each other and what assumptions each component makes about the others, thorough testing of “fault lines” (i.e. areas of high coupling or areas where assumptions mismatch), and measuring and monitoring each component. Regular retrospective analysis and proactive audits are a few good ways to keep an eye on an evolving system.
Apart from this prevention-first approach, another important step in making systems fault tolerant is making each component recover quickly from faults. For this approach, every component has to support rollback (i.e. be able to return to its previous non-faulty state), and the system needs to detect faults quickly and trigger the rollback to correct itself.
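The detect-and-rollback idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `RollbackableComponent` and `timeout_is_valid` are hypothetical names, the "state" is a plain config dictionary, and the health check is a simple predicate supplied by the caller. A real system would typically detect faults through monitoring rather than an inline check.

```python
import copy


class RollbackableComponent:
    """Hypothetical component that remembers its last known-good state."""

    def __init__(self, state):
        self.state = state
        self._last_good = copy.deepcopy(state)

    def apply(self, new_state, is_healthy):
        """Apply new_state; roll back if the health check reports a fault."""
        self.state = new_state
        if is_healthy(self.state):
            # No fault detected: this becomes the new known-good state.
            self._last_good = copy.deepcopy(self.state)
            return True
        # Fault detected: restore the previous non-faulty state.
        self.state = copy.deepcopy(self._last_good)
        return False


def timeout_is_valid(config):
    # Assumed spec for this sketch: the timeout must be a positive number.
    return config.get("timeout_s", 0) > 0


component = RollbackableComponent({"timeout_s": 5})
component.apply({"timeout_s": 10}, timeout_is_valid)  # accepted
component.apply({"timeout_s": -1}, timeout_is_valid)  # faulty, rolled back
print(component.state)
```

The design choice here is that the component itself keeps the last good state, so recovery does not depend on any other component being healthy at the moment the fault is detected.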

What I have seen in recent years is that it is becoming harder to catch faults with automated tests (unit, integration, and E2E) as a system grows in number of components. Following these three strategies and developing processes around them helps in building fault tolerant systems.

*This is a hypothetical scenario. In the real world such switches will not be running common Linux and will have much stronger guarantees from the OS and the compilers.

Written by Ankit Mittal