Designing for Failure in Software Development and Testing

Peter Okorafor
5 min read · May 2, 2023


Photo by Nubelson Fernandes on Unsplash

As software becomes more complex, it is increasingly important to design systems with failure in mind. In the past, it was common to design software assuming that everything would work as intended and that any failures would be rare and easily recoverable. However, this approach is no longer sufficient, as even the smallest failure can have significant consequences.

This is one of the reasons DevOps measurements moved from “mean time to failure” to “mean time to recovery.” The goal is not to avoid failure at all costs, but to recover quickly when it happens.

Designing for failure means acknowledging that failures will happen, and building systems that are resilient to those failures. This approach involves identifying potential points of failure, designing systems to minimize the impact of those failures, and implementing testing strategies that simulate failures to ensure that the system can recover from them.

Building resilient applications is a critical aspect of software development. Resilient applications are designed to handle failures gracefully and recover quickly when things go wrong. This is important in today’s fast-paced world, where downtime can lead to significant financial losses and reputational damage. In this article, we will discuss four patterns for designing for failure and building resilient applications.

1. Circuit Breaker Pattern

The circuit breaker pattern is a well-known design pattern for preventing cascading failures in a distributed system. It works by wrapping calls to remote services with a circuit breaker object that monitors the number of failures. When the number of failures exceeds a certain threshold, the circuit breaker trips and redirects calls to a fallback service or a cached response instead of the failing service. This keeps one failing dependency from bringing down the entire system.

Image credits: @dineshchandgr
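As a minimal sketch of the circuit breaker described above: the class below counts consecutive failures and, once the threshold is reached, answers from a fallback without calling the remote service. The names `CircuitBreaker`, `failure_threshold`, and `reset_timeout` are hypothetical and chosen for illustration. The cool-down that lets a trial call through again is a common addition that this article does not cover, so treat it as an assumption.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Counts consecutive failures and switches to a fallback once tripped."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # assumed cool-down, in seconds
        self._clock = clock                 # injectable so tests need not wait
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cool-down has elapsed, the next call is let through as a trial.
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open:
            # Tripped: do not touch the remote service at all.
            return fallback()
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            return fallback()
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_prices, lambda: cached_prices)`, where `fetch_prices` and `cached_prices` stand in for your own remote call and cached response.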

2. Retry Pattern

The retry pattern is used to handle transient failures that occur when calling remote services. Transient failures are temporary failures that can be resolved by retrying the operation. The retry pattern involves retrying the operation a certain number of times before giving up and returning an error. This pattern can be used to handle network outages, timeouts, and other transient errors that may occur when calling remote services.

For example, I have heard developers say “You have to deploy the database first because my app expects it to be already provisioned and running.” This is not a good design. The app should instead retry the connection until the database becomes available.

The key is to back off exponentially: increase the delay between retries exponentially, so that a struggling service is not flooded with repeated requests.
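A retry loop with exponential backoff can be sketched as follows. The function name `retry_with_backoff` and its parameters are hypothetical, and treating `ConnectionError` and `TimeoutError` as the transient failures is an assumption; a real client would list whatever errors its driver or HTTP library raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying transient errors with a doubling delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):  # assumed to be transient
            attempt += 1
            if attempt >= max_attempts:
                raise  # give up and surface the last error to the caller
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, the waits between the five attempts are 0.5, 1, 2, and 4 seconds. In the database example, the app would wrap its connection call in `retry_with_backoff` rather than require the database to be deployed first. Production code often adds random jitter to each delay so that many clients do not retry in lockstep.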

3. Bulkhead Pattern

The bulkhead pattern is used to isolate failures in a system by dividing it into smaller, independent components or partitions. Each partition has its own thread pool, database connection pool, and other resources. This pattern helps prevent a single point of failure from bringing down an entire system by limiting the impact of failures to a specific partition. This pattern is commonly used in microservices architecture and is an effective way to improve resilience in distributed systems.

Imagine you are building an e-commerce website that consists of multiple services, such as a product catalog service, a shopping cart service, a payment service, and a shipping service. Each service runs in its own container or virtual machine, and each service has its own thread pool, database connection pool, and other resources.

If the website experiences a sudden surge in traffic, the product catalog service might become overwhelmed and start to fail. Without the bulkhead pattern, this failure could cause the entire website to crash or become unavailable, even if the other services are still functioning normally.

With the bulkhead pattern, however, the product catalog service is isolated from the other services, and its failure is contained within its own partition. The other services continue to operate normally, and the website remains available to users.
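The isolation described above can be sketched in a single process. This example uses a per-service concurrency cap built on a semaphore as a stand-in for the separate thread pools and connection pools mentioned earlier. `Bulkhead`, `BulkheadFullError`, and `max_concurrent` are hypothetical names used only for illustration.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when a partition has no free capacity left."""


class Bulkhead:
    """Caps how many calls may run inside one partition at the same time."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation: Callable[[], T]) -> T:
        # Reject immediately instead of queueing, so an overloaded
        # partition cannot tie up callers that serve other partitions.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()


# One bulkhead per service: a saturated catalog cannot use payment capacity.
catalog_bulkhead = Bulkhead("catalog", max_concurrent=10)
payment_bulkhead = Bulkhead("payment", max_concurrent=10)
```

If the product catalog is overwhelmed, `catalog_bulkhead.run(...)` starts raising `BulkheadFullError` while `payment_bulkhead.run(...)` keeps working. Rejecting rather than waiting is a design choice made for this sketch; a bounded queue per partition is another common option.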

4. Graceful Degradation Pattern

The graceful degradation pattern is used to handle situations where the system is under heavy load or experiencing failures. Instead of returning an error or crashing, the system degrades gracefully by disabling non-critical functionality or reducing the quality of service. This pattern helps maintain the availability of the system under adverse conditions and ensures that critical functionality remains available.

As another scenario, imagine you are building a ride-sharing app that allows users to request rides from nearby drivers. The app relies on a number of services, including a user authentication service, a ride-matching service, a payment service, and a driver tracking service.

If the ride-matching service experiences a heavy load or is otherwise unavailable, the app could become unresponsive or crash, preventing users from requesting rides and potentially leading to lost business.

To prevent this, you can implement the graceful degradation pattern. When the ride-matching service is under heavy load or experiencing failures, the app can gracefully degrade by disabling non-critical functionality, such as the ability to view driver ratings or request premium features, and by reducing the quality of service, such as accepting a longer estimated wait time for a ride.

By degrading gracefully, the app can continue to function and provide basic ride-hailing functionality, such as matching users with nearby drivers and allowing them to request rides.
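One way to sketch this for the ride-sharing scenario is to separate the critical step (matching a driver) from a non-critical one (showing the driver's rating). The function `build_ride_offer`, its parameters, and the shape of the returned dictionary are all assumptions made for illustration, not part of any real API.

```python
from typing import Callable, Dict


def build_ride_offer(match_driver: Callable[[], str],
                     fetch_driver_rating: Callable[[str], float],
                     overloaded: bool = False) -> Dict[str, object]:
    """Return a ride offer, dropping the rating if it cannot be provided."""
    # Critical path: if matching fails there is no ride to offer,
    # so this error is allowed to propagate to the caller.
    driver = match_driver()
    offer: Dict[str, object] = {"driver": driver, "rating": None, "degraded": False}

    if overloaded:
        # Under heavy load, skip the non-critical call entirely.
        offer["degraded"] = True
        return offer

    try:
        offer["rating"] = fetch_driver_rating(driver)
    except Exception:
        # Non-critical feature failed: serve the ride without a rating.
        offer["degraded"] = True
    return offer
```

The `degraded` flag lets the user interface hide the rating widget rather than show an error. How the app decides that it is `overloaded` (queue depth, latency, a feature flag) is left open here.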

Photo by Leon Liu on Unsplash

In today’s complex and rapidly changing software landscape, designing for failure is essential to building resilient applications that can handle unexpected challenges and recover quickly. The circuit breaker, retry, bulkhead, and graceful degradation patterns are all proven strategies for improving the resilience of modern software applications. By incorporating them into your software design and testing strategies, you can create applications that are better equipped to handle failure and that provide a better user experience for your customers. Remember, designing for failure isn’t just about preventing downtime. It’s about building applications that can withstand and recover from failures and unexpected events, so that your systems keep operating reliably in the face of adversity.


Peter Okorafor

Software Engineer | Development, Design, & Architecture. In love with pair-programming, JavaScript, and remote work.