Crafting Resilient Applications

Aman Arora
Sep 12, 2023 · 7 min read


Errors in software applications are both unavoidable and expected. However, it is our responses to these errors that ultimately determine the reliability of our systems. As the saying goes —

‘It’s not what happens to us, but how we respond that defines us.’

This article aims to equip you with the knowledge and skills necessary to handle errors effectively. An ideal moment to address error scenarios arises during the tech solutioning phase, as sometimes we require stakeholder input to finalise error-handling decisions.

Before we delve into the depths of this subject, let’s first explore some fundamental principles that will serve as valuable guides when designing scalable systems.

Errors & Exceptions

An error is a broader term used to describe any unexpected problem that occurs during the execution of a program. Errors can be categorised into two main types — Compile time and Runtime.

Runtime errors are also called Exceptions. They are unexpected events that disrupt the normal flow of the program.

Exception handling allows developers to gracefully respond to errors.

Why should we care?

  1. Proper exception handling ensures that the application can gracefully handle unexpected situations, preventing crashes and data corruption.
  2. It improves user experience by providing informative error messages and logs.
  3. It aids in debugging and maintaining the codebase.

Coupling & Cohesion

Microservices encourage a loosely coupled and highly cohesive architecture.

Coupling refers to the degree of interdependence between software modules, while cohesion indicates how closely related the elements within a single module are.

Strong & Weak Dependency

Strong dependencies are vital components that ensure our service functions optimally, delivering expected results.

Weak dependencies, on the other hand, are non-essential elements that allow our service to run, albeit with reduced functionality, when they are unavailable.

In architecture diagrams, we use solid lines for strong dependencies and dashed lines for weak dependencies.

Exercise: Try to think which of these are strong dependencies and which are weak

  1. Authentication Service
  2. OTP Service
  3. User Profile Pictures
  4. Analytics Service
  5. Recommendation Engine

Now that we know some basic concepts, let’s see how we can make our system more resilient.

Timeout

Whenever we make an external call (to any system), the probability of success decreases the longer we wait.

After a certain point, it no longer makes sense for the client to keep waiting for the server to respond while consuming its own resources.

Timeouts are the way to stop the client from waiting for the server indefinitely.
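
Here’s a minimal sketch in Python using the requests library. The endpoint and function names are made up for the example; the point is simply that every external call carries an explicit timeout instead of waiting forever.

```python
import requests

# Hypothetical endpoint, purely for illustration.
ORDER_SERVICE_URL = "https://orders.internal/api/v1/orders"

def fetch_order(order_id: str):
    try:
        # Fail fast instead of waiting indefinitely:
        # up to 2s to establish the connection, up to 5s to receive the response.
        response = requests.get(f"{ORDER_SERVICE_URL}/{order_id}", timeout=(2, 5))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # The server didn't respond in time; release our resources and
        # let the caller decide what to do (retry, fallback, surface an error).
        return None
```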

How do we decide the value of a timeout?

There are 2 aspects that come into the picture when deciding timeouts — Business and Technical.

Business Aspects

  1. Fund Loss
    If an error from the service will definitely result in “fund loss”, I usually wait a bit longer. Example: payment flows
  2. Whether the dependency is a strong one or a weak one
    For a strong dependency, the timeout is usually longer than for a weak dependency

Technical Aspects

  1. SLA of the service
    I consider the p95 and max latencies to decide the value of the timeout
  2. Current infra for my service
    Whenever we make an external call and wait for a response, some resources (threads/files) get blocked on our side, waiting for that response

Another thing worth mentioning, when designing a system with multiple services where you control the timeouts, is that

The timeouts should decrease as we move to each downstream component

Let’s understand the above statement in a little more detail.

Scenario

  1. The “App” calls the “Gateway” (I’m excluding some components in between, to make it simpler)
  2. The “Gateway” calls “Service A”
  3. “Service A” has a dependency on “Service B”
  4. “Service B” has a dependency on “Service C”

A timeout is usually configured at each layer in the above flow.

The rule says —

Timeout(App) > Timeout (Gateway) > Timeout (Service A) > Timeout (Service B) > Timeout (Service C)
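
A minimal sketch of what that rule might look like as configuration; the numbers are illustrative, not recommendations.

```python
# Illustrative timeout budget (in seconds) for each hop in the flow above.
# Each layer waits a little less than its caller, so a slow downstream call
# fails at the deepest layer first instead of tripping every timeout at once.
TIMEOUTS = {
    "app_to_gateway": 10.0,
    "gateway_to_service_a": 8.0,
    "service_a_to_service_b": 6.0,
    "service_b_to_service_c": 4.0,
}

assert (
    TIMEOUTS["app_to_gateway"]
    > TIMEOUTS["gateway_to_service_a"]
    > TIMEOUTS["service_a_to_service_b"]
    > TIMEOUTS["service_b_to_service_c"]
)
```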

Should you retry?

When an error surfaces from an external system, it’s an opportunity for a crucial evaluation: Is a retry attempt the right course of action?

The “external system” in the above statement can refer to any other service (internal or external), database, cache, message broker etc.

This decision shouldn’t be taken lightly, as it doesn’t just impact the stability of your own system; it affects the external system as well.

Retryable Errors?

Retryable errors are errors that occur in software or network systems and can be resolved by simply retrying the operation or request that initially failed.

Exercise: Try to think which of these are retryable errors and which ones are not.

  1. 503 Service Unavailable
  2. 429 Too Many Requests
  3. Connection Timeout
  4. Database Deadlock
  5. 404 Not Found
  6. 401 Unauthorised
  7. 400 Bad Request
  8. 500 Internal Server Error

Retry policies

Important things to consider:

  1. retry limits
  2. retry delays
  3. backoff strategies to avoid overwhelming the system with excessive retries in case of prolonged or persistent issues
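
To make these three concrete, here is a minimal Python sketch. The RetryableError type and the call_remote_service callable are stand-ins for whatever your client library actually raises and calls.

```python
import time

MAX_RETRIES = 3            # retry limit
BASE_DELAY_SECONDS = 1.0   # initial retry delay

class RetryableError(Exception):
    """Stand-in for errors worth retrying (e.g. 503, connection timeout)."""

def call_with_retries(call_remote_service):
    delay = BASE_DELAY_SECONDS
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return call_remote_service()
        except RetryableError:
            if attempt == MAX_RETRIES:
                raise          # limit reached, surface the error to the caller
            time.sleep(delay)  # retry delay
            delay *= 2         # back off so we don't overwhelm a struggling system
```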

Retry Backoff Strategies

  1. Fixed Interval: Retry after a constant delay, e.g. every 2 seconds (attempts at 0, 2, 4, 6, 8…)
  2. Exponential backoff: Increases the delay between retry attempts exponentially; 1, 2, 4, 8, 16…
  3. Randomised backoff: This approach helps prevent synchronisation issues that can occur when multiple clients retry simultaneously.
  4. Full Jitter: A variation of randomised backoff. Here, the retry interval is chosen randomly between zero and the current backoff value, for example between 0 and 60 seconds.
  5. De-correlated Jitter: A more advanced backoff strategy that aims to reduce congestion in retry patterns. It selects intervals randomly, but also avoids repeating the same delay value too often.
  6. Adaptive Backoff: Adjusts the retry strategy based on the observed behaviour of the errors. For example, if errors continue to occur, the backoff intervals may increase, and if errors decrease, the intervals may decrease as well.
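
As a rough sketch, most of these strategies boil down to how the next delay is computed. The full-jitter and de-correlated-jitter variants below follow the commonly cited AWS formulations; the base and cap values are just examples.

```python
import random

BASE = 1.0   # seconds
CAP = 60.0   # upper bound on any single delay

def fixed_interval(attempt: int) -> float:
    return 2.0  # same delay every time

def exponential(attempt: int) -> float:
    return min(CAP, BASE * (2 ** attempt))  # 1, 2, 4, 8, 16…

def full_jitter(attempt: int) -> float:
    # Chosen randomly between 0 and the current exponential value.
    return random.uniform(0, exponential(attempt))

def decorrelated_jitter(previous_delay: float) -> float:
    # Random, but derived from the previous delay so clients don't sync up
    # and the same value rarely repeats.
    return min(CAP, random.uniform(BASE, previous_delay * 3))
```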

Circuit Breaking

Sometimes, retry mechanisms are not the right way to recover from failure scenarios.

A few scenarios where “retry” might not work:

  1. Repeatedly calling a service that is down or has a low success rate wastes your own service’s resources
  2. It is possible that the remote service is unable to come back up because it keeps getting bombarded with requests. Maybe it needs some time to cool off

A circuit breaker can help you in these scenarios and help with some other problems along the way.

How does the circuit breaker pattern work?

When implementing the circuit breaker pattern, our service doesn’t call the remote service directly. Instead, we call a proxy, and that proxy in turn calls the remote service.

The proxy can be a different service, or it can be added as an SDK, or a sidecar. Each with their own pros and cons.

The circuit breaker works on the same principle as a circuit breaker in electrical systems.

The proxy keeps monitoring the failure rate of the downstream service, and if the number of failures reaches a configured threshold, it opens the circuit, allowing no more requests to go through.

I will not go into the implementation for the circuit breaker as there are multiple ways to do it. That being said, we usually don’t need to implement our own “Circuit Breaker” as there are many available open source solutions that “just work”.

Obvious next question — once the circuit opens, how do we close it again?

We don’t need to do anything manually. We just need to configure the circuit breaker we are using, and it does the job for us.

The circuit breaker slowly starts sending some requests to the remote service again and monitors the responses. It keeps increasing the traffic as it sees more and more successes.
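
To make the closed/open/half-open behaviour concrete, here is a deliberately simplified, in-process sketch. Real implementations (the open source ones mentioned above) add sliding windows, metrics and thread safety; the thresholds here are arbitrary.

```python
import time

class CircuitBreaker:
    """Minimal sketch: CLOSED -> OPEN on repeated failures -> HALF_OPEN after a cool-off."""

    def __init__(self, failure_threshold=5, cool_off_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cool_off_seconds = cool_off_seconds
        self.failure_count = 0
        self.opened_at = 0.0
        self.state = "CLOSED"

    def call(self, remote_call):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cool_off_seconds:
                self.state = "HALF_OPEN"   # let a trial request through
            else:
                raise RuntimeError("circuit open: failing fast without calling the remote service")
        try:
            result = remote_call()
        except Exception:
            self.failure_count += 1
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"        # trip the breaker
                self.opened_at = time.monotonic()
            raise
        # A success (including the trial request in HALF_OPEN) resets the breaker.
        self.failure_count = 0
        self.state = "CLOSED"
        return result
```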

Fallback Actions

Circuit breakers also allow us to define “fallback actions” for when the circuit has to be opened. This lets us handle service failures more appropriately. We could -

  1. Log the errors
  2. Call a secondary service
    For example: let’s say your app can use multiple payment gateways. If one of the PGs goes down, you can configure the secondary gateway as the primary, as a “fallback action”.
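
A minimal sketch of both fallback actions together, assuming hypothetical charge_primary_pg and charge_secondary_pg functions:

```python
import logging

logger = logging.getLogger(__name__)

def charge(amount_cents: int, charge_primary_pg, charge_secondary_pg):
    """Try the primary payment gateway; on failure, log the error and fall back to the secondary."""
    try:
        return charge_primary_pg(amount_cents)
    except Exception as exc:
        # Fallback action 1: log the error for debugging and alerting.
        logger.error("Primary payment gateway failed: %s", exc)
        # Fallback action 2: call the secondary service.
        return charge_secondary_pg(amount_cents)
```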

More things to make your system more resilient

I won’t delve further into these topics here, as I’ve already covered them extensively in previous articles:

  1. Logging
    https://medium.com/cloud-native-daily/logging-the-root-of-all-debugging-adventures-505322841732
  2. Caching
    https://medium.com/@_amanarora/catching-up-with-cache-3e01c6464678

Congratulations, and heartfelt thanks for staying until the end. I genuinely hope this article has contributed to enhancing your understanding.

If you found value in what you’ve read, you can express your appreciation by giving it a ‘clap’ (Medium’s equivalent of an upvote), helping this article reach a wider audience. Thank you!

For more insights, feel free to connect with me on LinkedIn here

Until next time, take care. Peace out!
