Prevent Cascading Failure: Hystrix Thread Pool
Hello everyone. In this article, I will explain how we solved a cascading-failure problem at Dolap.
Dolap is a C2C (customer-to-customer) marketplace platform allowing users to sell or buy items even if they are not professional sellers. Users can sell or purchase both first-hand and second-hand items.
We have a service called Heimdall, which is implemented by the Moderation & Fraud team within Dolap. Heimdall is the API gateway for most of the Dolap services (24 services), and its daily peak is 425K rpm. It is built on Zuul, Ribbon, and Hystrix: Zuul handles request routing and filtering, Ribbon provides load-balancing capabilities for services, and Hystrix offers fault tolerance and resilience mechanisms. These components work together to enable scalable, reliable, and resilient microservice architectures. However, we needed to improve the overall resilience and fault tolerance of the system to solve the issues below.
Cascading failures: When one of the services behind the gateway failed or came under high load, the problem propagated to other components or services, resulting in cascading failures.
Difficulty in troubleshooting: When services failed, it was really challenging to identify which service was the culprit and which one was a victim.
Scalability limitations: In Dolap, every service carries a different load, so each one should be given resources according to its own needs.
So we needed an isolated thread structure, that is, a pattern such as thread isolation, to solve our problems.
Here is what we did to solve these problems:
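To make the idea of thread isolation concrete, here is a minimal sketch using only the JDK's ThreadPoolExecutor. It is not Heimdall's code and it does not use Hystrix; the class name, method names, and pool sizes are hypothetical. It compares one shared pool with one pool per service while a slow service hangs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of thread isolation (bulkheads); not Hystrix itself.
final class BulkheadSketch {

    // Fixed-size pool with no queue: a task is rejected when every thread is busy.
    static ThreadPoolExecutor pool(int size) {
        return new ThreadPoolExecutor(size, size, 1, TimeUnit.MINUTES, new SynchronousQueue<>());
    }

    static boolean trySubmit(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    /** Returns true if a call to the healthy service is accepted while the slow service hangs. */
    static boolean healthyServiceAccepted(boolean isolated) {
        CountDownLatch hang = new CountDownLatch(1);
        Runnable slowCall = () -> {
            try {
                hang.await(); // simulates a downstream service that does not answer
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        // Shared mode: one pool of 4 threads. Isolated mode: 2 threads per service.
        ThreadPoolExecutor slowPool = pool(isolated ? 2 : 4);
        ThreadPoolExecutor healthyPool = isolated ? pool(2) : slowPool;

        for (int i = 0; i < 4; i++) {
            trySubmit(slowPool, slowCall); // the slow service occupies every thread it can get
        }
        boolean accepted = trySubmit(healthyPool, () -> { });

        hang.countDown();
        slowPool.shutdown();
        healthyPool.shutdown();
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("shared pool: " + healthyServiceAccepted(false));   // prints "shared pool: false"
        System.out.println("isolated pools: " + healthyServiceAccepted(true)); // prints "isolated pools: true"
    }
}
```

With a shared pool, the hanging service uses up every thread and the healthy service is rejected as well. With separate pools, only the slow service's own requests are rejected.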
We set Zuul's ribbonIsolationStrategy to THREAD. With this setting, Zuul uses Hystrix thread isolation instead of the default semaphore isolation for the commands that wrap its Ribbon calls. On its own, THREAD runs all routes in one shared thread pool, so we also set useSeparateThreadPools to true. This way each route (service) gets its own dedicated thread pool for handling requests from the API gateway.
zuul:
  ribbonIsolationStrategy: THREAD
  threadPool:
    useSeparateThreadPools: true
We set thread pool properties for our services according to their needs.
coreSize: Sets the core thread-pool size.
The general principle is to keep the pool as small as possible: fewer threads running concurrently conserves system resources, and a small pool helps prevent the system from being overloaded with too many concurrent requests.
The basic formula for calculating the size is:
requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room
maximumSize: Sets the maximum thread-pool size. Choose it with peak traffic scenarios in mind, such as expected growth in workload or sudden spikes in requests. This setting only takes effect if you also set allowMaximumSizeToDivergeFromCoreSize to true.
maxQueueSize: Sets the maximum queue size. Choose a value that can accommodate temporary surges in traffic without overwhelming the system or causing excessive delays.
If you set this to -1, a SynchronousQueue is used; otherwise, a LinkedBlockingQueue with the given positive capacity is used.
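The coreSize formula above can be worked through with numbers. The sketch below is only an illustration: the class name, the traffic figures, and the amount of breathing room are hypothetical, and rounding the product up before adding the breathing room is my assumption, not something the formula states.

```java
// Hypothetical helper that applies the coreSize formula; not part of Hystrix.
final class CoreSizeSketch {

    /**
     * coreSize = ceil(peak requests per second x p99 latency in seconds) + breathing room.
     * Latency is taken in milliseconds so the arithmetic stays in integers.
     */
    static int coreSize(int peakRps, int p99Millis, int breathingRoom) {
        long busyThreads = ((long) peakRps * p99Millis + 999) / 1000; // ceiling division
        return (int) busyThreads + breathingRoom;
    }

    public static void main(String[] args) {
        // Assumed example: 30 requests/s at peak, p99 latency of 200 ms, 4 spare threads.
        System.out.println(coreSize(30, 200, 4)); // prints 10
    }
}
```

In this example, 30 requests per second × 0.2 seconds keeps about 6 threads busy at peak, and the breathing room brings the pool to 10 threads.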
hystrix:
  threadpool:
    service_name:
      coreSize: 2
      maximumSize: 5
      maxQueueSize: 5
      allowMaximumSizeToDivergeFromCoreSize: true
Figure: the thread pool pattern for the configuration above.
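To get a feel for how a core size, a maximum size, and a bounded queue interact, here is a small sketch with the JDK's ThreadPoolExecutor, the executor type that Hystrix thread pools are built on. It uses the same numbers as the configuration (2, 5, 5), but it is plain JDK code with a hypothetical class name. Hystrix applies additional checks of its own (for example queueSizeRejectionThreshold), so treat this as an approximation rather than Hystrix's exact behavior.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration with a plain JDK executor; not Hystrix itself.
final class PoolSaturationSketch {

    /** Submits `tasks` blocking jobs and returns {poolSize, queuedTasks, rejectedTasks}. */
    static int[] saturate(int coreSize, int maximumSize, int maxQueueSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                coreSize, maximumSize, 1, TimeUnit.MINUTES,
                new LinkedBlockingQueue<>(maxQueueSize));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the thread busy until the snapshot is taken
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // no free thread and no queue space left
            }
        }
        int[] snapshot = {pool.getPoolSize(), pool.getQueue().size(), rejected};
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return snapshot;
    }

    public static void main(String[] args) {
        int[] s = saturate(2, 5, 5, 12);
        // prints "threads=5 queued=5 rejected=2"
        System.out.println("threads=" + s[0] + " queued=" + s[1] + " rejected=" + s[2]);
    }
}
```

In the plain executor, the first 2 tasks start core threads, the next 5 wait in the queue, the next 3 start extra threads up to the maximum of 5, and anything beyond that is rejected.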
Also, it’s recommended to perform load testing and monitor the performance of your system to determine the appropriate values for your service.
Conclusion
With these configurations, the Dolap services are now isolated. When one service encounters an issue, it will no longer affect the other services.
We tested the solution by running a comprehensive load test on Dolap, simulating various traffic scenarios and observing the behavior of the thread pools. Previously, we would frequently encounter circuit-breaker exceptions even at the beginning of the test, but this time the test ran without any circuit-breaker problems. We also checked response times, throughput, and error rates to be sure the thread pools could handle the load without causing thread starvation.
Thanks for reading 🎭