The Importance of Software Resiliency (and best practices to improve user experience)

Harry Ware
Vodafone UK Engineering
Jan 12, 2024

Things will always go wrong. The chaos of the real world will always compromise our systems in new and frustrating ways, breaking something or other at the most inconvenient of times. Even the most robust system in the world is prone to the occasional failure, which is why we must also focus on the resiliency of our solutions.

Where robustness minimises the risk of failure, resiliency maximises the ability to bounce back. The key metrics to consider when measuring system resiliency are the RTO (Recovery Time Objective) and the MTTR (Mean Time To Repair/Recover). RTO is the target amount of time within which a feature must be restored to avoid unacceptable disruption. It varies widely between services and features depending on their importance, and can be a useful metric in determining what level of disruption classifies as an incident. MTTR is the average amount of time it takes for a service to recover from failure.

Using these metrics, we can better understand what the required resilience for a solution should be, and better measure the resiliency already in place.
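To make the two metrics concrete, here is a minimal Python sketch of how they might be calculated from an incident log. The incident timestamps, the 30 minute RTO, and the function names are all invented for illustration; a real RTO is agreed per service.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (time the failure began, time service was restored).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 20)),
    (datetime(2024, 1, 8, 14, 30), datetime(2024, 1, 8, 15, 30)),
    (datetime(2024, 1, 11, 22, 5), datetime(2024, 1, 11, 22, 15)),
]

# Assumed target for this feature, chosen only for the example.
RTO = timedelta(minutes=30)


def mean_time_to_recover(log):
    """Average outage duration across all recorded incidents."""
    durations = [restored - failed for failed, restored in log]
    return sum(durations, timedelta()) / len(durations)


def rto_breaches(log, rto):
    """Incidents whose recovery took longer than the RTO."""
    return [(failed, restored) for failed, restored in log if restored - failed > rto]


mttr = mean_time_to_recover(incidents)
breaches = rto_breaches(incidents, RTO)
print(f"MTTR: {mttr}, RTO breaches: {len(breaches)}")
```

In this made-up log the MTTR lands exactly on the 30 minute target, yet one incident still took an hour to resolve. An average can hide individual breaches, so it is worth tracking both numbers.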

THE IMPACT OF SOFTWARE RESILIENCY

Software resiliency is crucial for customer satisfaction. Setting an RTO and measuring MTTR helps ensure swift issue resolution, which safeguards customer retention. In a competitive market, rapid recovery from issues like long load times, hanging requests, and persisting error/holding pages is paramount, especially during peak activity days such as the yearly iPhone launch. Enhanced resilience for customers also benefits developers by minimising unplanned incidents, reducing on-call interventions, and streamlining performance testing. Integrating resiliency into the development process, rather than patching it in later, ensures smoother operation for our services.

BEST PRACTICES FOR INTRODUCING RESILIENCY TO CODE

There are a variety of tools and methods that are considered best practice when improving software resiliency. Each may suit different use cases and can be implemented in different ways, but the important starting point is adopting an attitude of baking resiliency into the development process. Once the mindset exists, the most appropriate tooling will soon become apparent. Still, here are a few good examples of practices you might adopt:

Redundancy — One of the simplest ways of maintaining the availability of a service is to introduce levels of redundancy. Let’s suppose we need two instances of a specific microservice to handle the expected load on a system. If we maintain only two instances, and one fails, then the user experience will start to deteriorate. However, if we maintain three instances, then one can fail before the user experience is impacted. This AWS Whitepaper uses a simple formula to demonstrate how redundancy can improve our resiliency and decrease our risk of failure with each new instance.

The more redundancy we have, the lower the chance of disruption. But redundancy is more effective, cost-efficient, and easily implemented when working with a loosely coupled application.
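As a rough illustration of the two-versus-three instance example, the sketch below estimates the chance that enough instances are healthy at any moment. It is not necessarily the formula from the whitepaper: it is a standard binomial calculation that assumes every instance fails independently, and the 99% per-instance figure is made up.

```python
from math import comb


def availability(instances, required, per_instance):
    """Probability that at least `required` of `instances` are healthy,
    assuming every instance fails independently with the same probability."""
    return sum(
        comb(instances, up) * per_instance**up * (1 - per_instance) ** (instances - up)
        for up in range(required, instances + 1)
    )


# Assumed figure for illustration: each instance is healthy 99% of the time.
P = 0.99

for n in (2, 3, 4):
    print(f"{n} instances, 2 required: {availability(n, 2, P):.6f}")
```

Under these assumptions, going from two instances to three lifts availability from roughly 98.0% to roughly 99.97%. Real failures are often correlated (a bad deployment or a shared dependency can take out every instance at once), so treat the result as an upper bound rather than a promise.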

Loosen Couplings — Separating your application into logically independent microservices, each responsible for a single process and connected through APIs, removes single points of failure from your application. Traditionally, software projects were built using a model where all functions of an application were contained within a single codebase and deployment, keeping all services within that application tightly and often deceptively connected. By moving away from monolithic architectures, you can protect your application from failing entirely if a single service goes down. Microservices and anti-monolith architectures have been widely accepted as best practice where possible, and there are loads of great articles covering how to take steps towards this, and the wider benefits, in greater detail, like this one. The use of containers and container orchestration solutions like Kubernetes then makes creating scalable redundancy inside the application far easier.

Once the software has been successfully decoupled, we must consider how to handle failures within independent microservices.

Fault Tolerance in Code — We all understand the importance of shifting left (if you don’t, you can read all about it here), and so we should also look left to our code for improving resiliency. There are various design patterns, such as the dead letter queue (DLQ), which can be used to improve resiliency. A DLQ shifts data that’s erroneous or corrupted out of the processing pipeline to avoid clogging up the application in the case of temporary failures from connected services, while also providing better visibility for debugging.
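The dead letter queue idea can be sketched in a few lines of Python. This is an in-memory toy rather than a real message broker: the `handle` function, the `order_id` field, and the retry budget of three attempts are all hypothetical.

```python
import json
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


def handle(payload):
    """Hypothetical business logic: parse the message and return an order id."""
    return json.loads(payload)["order_id"]


def drain(main_queue, dead_letters):
    """Process every message; park the ones that keep failing in the DLQ."""
    processed = []
    while main_queue:
        payload, attempts = main_queue.popleft()
        try:
            processed.append(handle(payload))
        except (ValueError, KeyError, TypeError) as error:
            if attempts + 1 >= MAX_ATTEMPTS:
                # Keep the payload and the reason so it can be debugged later.
                dead_letters.append((payload, repr(error)))
            else:
                # Put it back at the end so healthy messages are not blocked.
                main_queue.append((payload, attempts + 1))
    return processed


queue = deque([('{"order_id": 1}', 0), ("not json", 0), ('{"order_id": 2}', 0)])
dlq = []
print(drain(queue, dlq))
print(dlq)
```

The malformed message is retried, then moved aside with its error attached, while the two valid orders are processed as normal. Managed queueing services typically offer this behaviour as configuration, so in practice you would rarely write the loop yourself.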

Similarly, fault tolerance libraries like resilience4j and Netflix Hystrix (which is no longer in active development) allow you to decorate your code with circuit breakers, retries, rate limiters, and bulkheads with relative ease compared to implementing these patterns yourself. These design patterns all ensure that your code is resilient to failures both within the wider application and in upstream dependencies.
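To show what a circuit breaker actually does under the hood, here is a deliberately simplified, hand-rolled version in Python. It is not the resilience4j or Hystrix API, and the class name, thresholds, and injectable clock are assumptions made for the example; in production a maintained library is usually the better choice.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency recently failed; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    """Hypothetical upstream call that is currently down."""
    raise ConnectionError("upstream unavailable")


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])

outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # pretend the reset timeout has passed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

After two failures the breaker stops calling the struggling dependency and rejects the third call immediately, which gives the upstream service room to recover. Once the timeout passes, a single trial call is allowed through, and its success closes the circuit again.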

Chaos Engineering & War Games — It’s difficult to prepare for known unknowns and unknown knowns, and impossible to prepare for unknown unknowns. But one of the best ways to overcome this, and to learn more about your system and its possible failure conditions and weaknesses, is to adopt a chaos engineering mindset. War games, also referred to as ‘Game Days’ in some organisations, are designed specifically to test how resilient the application and team are to failure, and then to identify how resiliency can be improved based on the results of the day.

Within Vodafone Digital UK, we’ve been actively fostering our chaos engineering community. As well as hosting several war games, we develop and maintain our inner-sourced Khaos framework, which allows teams to run experiments on their microservices and thereby add a quality gate concerning system resiliency to their release pipeline.
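For a flavour of what a chaos experiment used as a quality gate might look like, here is a generic Python sketch. It does not use or represent the Khaos framework: the fault injection wrapper, the `fetch_price` call, and the 30% failure rate are all hypothetical.

```python
import random


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so a fraction of calls raise, simulating a flaky dependency."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(product_id):
    """Hypothetical upstream call."""
    return {"product_id": product_id, "price": 999}


def get_price_with_fallback(fetch, product_id):
    """Code under test: serve a degraded response instead of an error."""
    try:
        return fetch(product_id)
    except ConnectionError:
        return {"product_id": product_id, "price": None, "degraded": True}


def run_experiment(calls=1000, failure_rate=0.3, seed=42):
    """Return the fraction of requests that still received a response."""
    rng = random.Random(seed)
    chaotic_fetch = with_fault_injection(fetch_price, failure_rate, rng)
    answered = 0
    for product_id in range(calls):
        try:
            get_price_with_fallback(chaotic_fetch, product_id)
            answered += 1
        except Exception:
            pass  # an unhandled error counts against the success rate
    return answered / calls


# Quality gate: every request must get a response despite the injected faults.
success_rate = run_experiment()
assert success_rate == 1.0
print(f"success rate under chaos: {success_rate:.0%}")
```

If someone later removed the fallback, the assertion would fail and the release could be blocked. Real chaos tooling injects faults at the infrastructure or network level rather than inside the process, but the hypothesis-then-verify shape of the experiment is the same.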

What doesn’t kill you makes you s̶t̶r̶o̶n̶g̶e̶r̶ more resilient, and by adopting the above practices we can sleep easy knowing that customers are getting the best experience possible from our applications, regardless of whatever chaos the world tries to throw at them.
