Efficiently Handling Transient Errors

Bilal Emre Gulsen
hepsiburadatech

--

Applications can encounter temporary faults when they call other services. When a fault is caused not by the external service itself but by the network or other environmental resources, it is called a transient failure. This article describes the causes of these failures and three patterns for handling them.

Reasons of Transient Failures

Transient faults can occur in any environment or any platform.

  • Traditionally, these failures are caused by database connections, network issues, or service calls.

Today, applications are often hosted in the cloud, which introduces additional causes.

  • Resources in the cloud are shared, and access to them is limited to protect the resource. Some services will refuse connections when a maximum throughput rate is reached.
  • A cloud environment is a distributed system that dynamically spreads load across infrastructure components. In such a dynamic environment, temporary connection failures may occasionally occur.
  • The cloud ecosystem uses load balancers and other network components to connect applications to resources. These can add latency.
  • The connection between resources and clients depends on the Internet, and heavy traffic can cause transient faults.

These are the causes of transient errors. To make an application more fault tolerant, three patterns can be applied.

Retry

An efficient retry mechanism is a good way to increase the fault tolerance of services. The idea is simple: when a request fails, wait a bit and try again. You must decide on the retry count and the waiting period between attempts. Keep trying until you reach the predefined failure limit or get a successful response.

As a best practice, do not make more than one immediate retry, and never retry endlessly.
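The retry loop described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `TransientError`, `retry`, and the parameters `max_attempts` and `base_delay` are hypothetical names, and doubling the delay after each failure is just one common way to choose the waiting period.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Assumption: wait longer after each failure
            # (base_delay, 2 * base_delay, 4 * base_delay, ...).
            sleep(base_delay * 2 ** (attempt - 1))
```

Only errors marked as transient are retried; any other exception propagates immediately, and the bounded `max_attempts` rules out an endless retry loop.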

Throttling

Throttling is concerned with resource consumption. First, define a throttling limit for resource-consuming applications. If an application exceeds the limit, throttle it. This pattern helps control resource usage and prevents overuse, and it allows the system to keep functioning when demand increases.

Use this pattern:

  • To prevent a single tenant from monopolizing the resources provided by an application.
  • To build a cost-optimized system.
  • To handle bursts of activity.
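As a rough sketch of the throttling limit described above, the class below counts requests per tenant in a fixed time window and rejects anything over the limit. The names `Throttle`, `allow`, `limit`, and `window_seconds` are hypothetical, and a fixed window is only one possible strategy, chosen here because it is short.

```python
import time


class Throttle:
    """Fixed-window request limiter, keyed by tenant (illustrative only)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._windows = {}  # tenant -> (window_start, request_count)

    def allow(self, tenant):
        """Return True if the tenant may proceed, False if it is throttled."""
        now = self.clock()
        start, count = self._windows.get(tenant, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window elapsed: start a fresh one
        if count >= self.limit:
            self._windows[tenant] = (start, count)
            return False  # limit exceeded: throttle this tenant
        self._windows[tenant] = (start, count + 1)
        return True
```

Because the counters are kept per tenant, one tenant hitting its limit does not affect the others.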

Circuit Breaker

The circuit breaker is the last defensive pattern; it improves the stability and resiliency of a system. It helps maintain the system's response time by quickly rejecting requests for an operation that is likely to fail. When client-side requests start to fail, increment a fault counter. Once the counter exceeds the fault limit, start a timer and wait until it expires. Then try calling the service again and check whether the response fails. If it fails, the application should exit; otherwise, the application can continue as if nothing had happened.

In addition, after the timer has expired and the server has returned a successful response, the application can move to a half-open state, in which only a limited number of requests are allowed to invoke the operation. If all of these requests succeed, the application can move to the closed state, on the assumption that the transient failure has been fixed. The half-open state is very useful for preventing a flood of requests from reaching a service that is still trying to recover.
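The closed, open, and half-open states can be sketched as a small state machine. This is an illustration under stated assumptions rather than a reference implementation: `CircuitBreaker`, `CircuitOpenError`, `fault_limit`, `open_seconds`, and `trial_calls` are hypothetical names, the counter tracks consecutive failures, and a failure during the half-open state reopens the circuit instead of stopping the application.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, fault_limit=3, open_seconds=30.0, trial_calls=2,
                 clock=time.monotonic):
        self.fault_limit = fault_limit    # failures before the circuit opens
        self.open_seconds = open_seconds  # how long the circuit stays open
        self.trial_calls = trial_calls    # successes needed while half-open
        self.clock = clock
        self.state = "closed"
        self.faults = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("rejected without calling the service")
            # Timer expired: let a limited number of trial requests through.
            self.state = "half-open"
            self.trial_successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.faults += 1
        # Assumption: any failure while half-open reopens the circuit.
        if self.state == "half-open" or self.faults >= self.fault_limit:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        if self.state == "half-open":
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = "closed"  # the fault is assumed to be fixed
                self.faults = 0
        else:
            self.faults = 0  # assumption: count consecutive failures only
```

While the circuit is open, callers get a `CircuitOpenError` immediately and the struggling service receives no traffic.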

In conclusion, the causes of errors must be identified precisely in order to decide whether a fault is transient. To follow best practices, the wait between attempts and the number of attempts need to be chosen carefully for the retry and circuit breaker patterns. Also, a well-monitored solution that combines these resiliency patterns with unit tests and integration tests can handle transient errors easily, or at least minimize their negative effects.
