Key patterns for resiliency in Microservices Architecture

vinay kumar
Techartifact-Technology learning
8 min read · Sep 9, 2024

A few years ago, when I was heading the application architecture group, I wrote a set of application architecture principles. I found them on my old Miro board.

If you look at the abovementioned principles, most are still valid and effective. Most organizations implement microservice architecture in some of their projects, primarily when they focus on modernizing their landscape from legacy systems.

Similarly, a few years back, I wrote patterns for microservices architecture and categorized them into different categories. Look at some of the significant microservices patterns below.

This needs to be refreshed now. I want to go into more detail on resiliency patterns, drawing on what I have learned from implementing microservices architectures.

Resiliency in microservices architecture refers to the system’s ability to handle failures and disruptions gracefully without affecting overall performance or availability. Because microservices are distributed, independent services communicating over a network, they are more susceptible to failures such as network outages, service crashes, or performance bottlenecks. A resilient microservices system can continue functioning or degrade gracefully when some components fail.

Critical Aspects of Resiliency in Microservices:

  • Fault Tolerance: The ability of a system to continue operating correctly in the event of failure of some of its components. In a resilient microservices architecture, the failure of one microservice should not bring down the entire system.
  • Graceful Degradation: When specific components fail, the system does not crash entirely but continues to provide limited functionality or simplified services.
  • Self-Healing: Resilient systems can recover from failures without human intervention. This might involve restarting failed services, rerouting requests, or retrying operations after a brief delay.
  • Isolation and Containment: Resilient architectures use patterns like bulkheads to isolate services so that failures in one service don’t propagate to others. This prevents cascading failures and allows unaffected parts of the system to continue operating.
  • Redundancy and Failover: Microservices systems often use redundancy (duplicate services or resources) to ensure that another can take over if one service instance fails. This provides failover support, ensuring continuous service.
  • Monitoring and Observability: A resilient system requires real-time tools to monitor health, performance, and failures. Distributed tracing, logging, and health checks help detect problems early and facilitate rapid recovery.

Importance of Resiliency in Microservices:

  • Minimizes Downtime: Resilient systems help avoid downtime, ensuring high availability and better user experience, even when parts of the system fail.
  • Handles Dynamic Traffic: Microservices need to handle traffic spikes and fluctuations. Resiliency patterns ensure that the system doesn’t crash even during heavy loads.
  • Mitigates Complex Failures: Distributed systems introduce complexity, and failures in one part of the system can have unforeseen impacts. Resiliency ensures that failures don’t cascade through the system.
  • Improves Scalability: A resilient architecture can more easily handle the scaling of individual microservices without compromising the entire system’s stability.

How to Achieve Resiliency:

Resiliency in microservices architecture is achieved through various patterns and practices, such as:

  • Circuit Breakers: Stop failing services from being overwhelmed by blocking further requests.
  • Retries: Automatically retry failed requests due to transient errors.
  • Timeouts: Set limits on how long to wait for a service response to avoid excessive delays.
  • Bulkheads: Isolate services to prevent one service’s failure from impacting the rest of the system.
  • Fallback Mechanisms: Provide alternative responses when a service fails.

1. Circuit Breaker Pattern

  • Detailed Purpose: In a distributed system, if one service is failing or under heavy load, continually sending requests can worsen the problem. The Circuit Breaker pattern helps avoid this by stopping unnecessary requests to a failing service and giving it time to recover.

How it works: The pattern has three states:

Closed: Requests flow normally, because the service is assumed to be healthy.

Open: After several consecutive failures, the circuit is opened, blocking requests for a set amount of time.

Half-Open: After the timeout, the system allows a few requests to check if the service is back to normal.
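
The three states above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, not how Hystrix or any other library implements it. The class name, the `failure_threshold` and `recovery_timeout` parameters, and the injectable `clock` are all assumptions made for the example, and the half-open state here lets through one trial request rather than "a few".

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker with closed, open, and half-open states."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.clock = clock                          # injectable so the timeout can be tested
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; request blocked")
            self.state = "half_open"  # timeout elapsed: allow a trial request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial request, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success resets the failure count and closes the circuit.
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each remote call, for example `breaker.call(fetch_recommendations, user_id)`, and catch `CircuitOpenError` to serve a default response. A production version would also need locking for concurrent callers.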

Real-life Example: Netflix’s Hystrix library (now in maintenance mode but still widely discussed) used the Circuit Breaker pattern extensively. For instance, if a video recommendation service was slow or down, Hystrix would stop routing requests to it, and users would be presented with a default list of trending content.

2. Retry Pattern

  • Detailed Purpose: Many failures in distributed systems are transient (e.g., network glitches, temporary unavailability). The Retry pattern mitigates these temporary issues by automatically reattempting failed requests after short delays.

How it works:

A configurable delay (or backoff strategy) is used between retries to avoid flooding the service with immediate retries.

The number of retries is usually capped to avoid overloading the service or the system itself.

Exponential backoff is a common strategy where the time between retries increases progressively.
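
Here is a small sketch of capped retries with exponential backoff. The function name, its parameters, and the choice of which exceptions count as transient are assumptions for illustration. I also add a little random jitter to each delay, which the list above does not mention, so that many clients do not retry at the same instant.

```python
import random
import time


def retry(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying transient errors with exponential backoff.

    The delay doubles after every failed attempt (0.1s, 0.2s, 0.4s, ...),
    is capped at max_delay, and gets up to 50% random jitter added on top.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Errors that are not listed in `retry_on` (for example a validation error) are raised immediately, because retrying them would not help.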

Real-life Example: In payment systems like Stripe, if a transaction processing service times out due to network instability, the system retries the request a few times before giving up. This helps in cases where temporary outages are expected but should not cause a permanent failure.

3. Timeout Pattern

  • Detailed Purpose: Long-running requests in microservices can lead to resource exhaustion (e.g., holding up threads or connections). The Timeout pattern ensures that requests that take too long are aborted.

How it works:

Each service call is assigned a maximum duration for how long it can take. If the service doesn’t respond within that window, the request is canceled, and the system can either retry or use a fallback.

Timeout values are typically tuned based on service-level agreements (SLAs) or expected response times.
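
As a rough sketch, a timeout around an asynchronous call can be written with Python's `asyncio.wait_for`, which cancels the awaited call when the deadline passes. The helper `call_with_timeout` and the stand-in `slow_payment_service` are hypothetical names for this example, and returning a fallback value on timeout is just one of the options described above.

```python
import asyncio


async def call_with_timeout(coro_func, timeout_s, fallback=None):
    """Await coro_func(), but give up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(coro_func(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The slow call has been cancelled; return the fallback instead.
        return fallback


async def slow_payment_service():
    """Stand-in for a remote call that takes too long."""
    await asyncio.sleep(1.0)
    return "payment-confirmed"


if __name__ == "__main__":
    result = asyncio.run(
        call_with_timeout(slow_payment_service, timeout_s=0.05, fallback="timed-out")
    )
    print(result)
```

Most HTTP and database clients also accept their own timeout settings, and those are usually the first thing to configure.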

Real-life Example: In an online ordering system like Grubhub, if a request to the payment service takes too long, the system times out the request and tries an alternative payment method or notifies the user about the delay. This prevents users from waiting endlessly.

4. Bulkhead Pattern

  • Detailed Purpose: In microservices, failures in one part of the system can lead to cascading failures. The Bulkhead pattern prevents this by isolating resources (like threads, memory, or database connections) between different services, ensuring that failure in one service doesn’t overload others.

How it works:

Resources are partitioned into separate “bulkheads.” For example, each service might be assigned a separate pool of threads or connections.

If a service’s bulkhead is overwhelmed (e.g., it runs out of threads), only that service is affected, while others remain operational.
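
One lightweight way to sketch a bulkhead is a per-dependency limit on concurrent calls. The example below uses a semaphore instead of a dedicated thread pool, which is a simplification I chose for brevity. The `Bulkhead` class and its names are assumptions for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slots left."""


class Bulkhead:
    """Caps the number of concurrent calls allowed into one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up every worker in the calling service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Each downstream dependency gets its own independent compartment.
availability_bulkhead = Bulkhead("availability", max_concurrent=10)
payment_bulkhead = Bulkhead("payment", max_concurrent=5)
```

If the availability compartment fills up, calls through `payment_bulkhead` are unaffected because they draw from a separate set of slots.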

Real-life Example: In a hotel booking system, a surge in requests to the room availability service (e.g., during holiday seasons) might overwhelm its resources. By applying the Bulkhead pattern, the hotel search and payment services won’t be affected, allowing users to continue searching or making payments even if availability checks are delayed.

5. Fallback Pattern

  • Detailed Purpose: In cases where a service fails, it’s often better to return a degraded response (like cached data or a default value) than to throw an error that impacts the user experience.

How it works:

When a service call fails, a pre-defined fallback response is returned.

Fallback responses can be default values, cached data, or other backup services.
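
A fallback chain can be sketched as "try the primary source, then each backup in order". The helper name `with_fallback` and the fare-related names in the usage note are hypothetical.

```python
def with_fallback(primary, fallbacks):
    """Return primary(), or the first fallback that succeeds.

    If every source fails, the last error is re-raised so the
    failure is not silently swallowed.
    """
    last_error = None
    for source in (primary, *fallbacks):
        try:
            return source()
        except Exception as exc:
            last_error = exc
    raise last_error
```

For example, `with_fallback(live_fare_estimate, [cached_fare, average_fare])` would return the live estimate when the service is up, then cached data, then a default value.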

Real-life Example: In a ride-hailing app like Uber, if the fare estimation service fails, the system might show an average fare for similar routes instead of leaving the user without any information.

6. Fail-Fast Pattern

  • Detailed Purpose: The Fail-Fast pattern aims to detect problems early and terminate processing immediately when a failure is inevitable. This prevents the system from wasting resources on doomed operations.

How it works:

When the system detects an issue, like invalid input or a misconfigured environment, it immediately fails the operation rather than continuing to process the request.

This is particularly useful for spotting configuration errors or service dependencies early in the execution chain.
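
The sketch below shows both flavours of fail-fast mentioned above: checking configuration at startup and validating input before doing any work. The setting names and the `place_order` function are made up for the example.

```python
import os


class ConfigError(Exception):
    """Raised at startup when required configuration is missing."""


REQUIRED_SETTINGS = ("DATABASE_URL", "PAYMENT_API_KEY")  # hypothetical setting names


def load_config(env=os.environ):
    """Fail at startup, not on the first request, if settings are missing."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ConfigError("missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}


def place_order(item_id, quantity, stock):
    """Validate up front so no work is wasted on a doomed order."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    if stock.get(item_id, 0) < quantity:
        raise ValueError(f"item {item_id} is out of stock")
    stock[item_id] -= quantity
    return {"item_id": item_id, "quantity": quantity}
```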

Real-life Example: In an e-commerce platform, if a user tries to purchase an out-of-stock item, the system immediately shows an error message instead of letting the user proceed with the transaction and fail later at checkout.

7. Service Mesh

  • Detailed Purpose: In microservices, managing communication between services, such as handling retries, failures, and routing, can be complex. A service mesh provides a dedicated infrastructure layer to manage inter-service communication.

How it works:

A service mesh, like Istio or Linkerd, intercepts all communication between microservices.

It provides centralized control over how services talk to each other and handles resilience mechanisms like retries, circuit breakers, and load balancing.

It also provides observability features like metrics, logging, and tracing for service communication.

Real-life Example: Lyft uses the Envoy proxy as part of its service mesh to route traffic and manage inter-service communication. This setup ensures that failed requests are automatically retried without developers needing to add retry logic manually.

8. Rate Limiting Pattern

  • Detailed Purpose: This pattern protects services from being overwhelmed by limiting the number of requests they accept within a certain time period.

How it works:

A service can only handle a defined number of requests (e.g., 100 per second). Requests beyond this threshold are either queued, rejected, or delayed.

Rate limiting can be applied per user, per API, or across the whole system.
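
There are several algorithms for rate limiting. The sketch below uses a token bucket, which is my choice for the example and not the only option. The class name, parameters, and injectable `clock` are assumptions, and requests over the limit are simply rejected rather than queued or delayed.

```python
import time


class TokenBucket:
    """Allows rate_per_sec requests on average, with bursts up to capacity."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock              # injectable so refills can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to the time elapsed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

To limit per user or per API, keep one bucket per key, for example in a dictionary keyed by user ID.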

Real-life Example: Twitter’s API applies rate limiting to protect its servers from being overwhelmed by high request volumes from individual users or applications, especially during high-traffic events.

9. Dead Letter Queue (DLQ)

  • Detailed Purpose: In event-driven architectures, some messages might fail to be processed. A Dead Letter Queue (DLQ) captures those messages for later analysis or retry.

How it works:

When a microservice repeatedly fails to process a message, it is moved to a dead letter queue.

The DLQ allows developers to analyze the failed message later or implement logic to reprocess it after the underlying issue is fixed.
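
The flow can be sketched with an in-memory queue. Real brokers handle the redelivery and the move to the DLQ for you, so treat this purely as an illustration. The `consume` function, the message shape, and `max_attempts` are assumptions.

```python
from collections import deque


def consume(queue, dead_letters, handler, max_attempts=3):
    """Process every message; park the ones that keep failing in the DLQ."""
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= max_attempts:
                # Keep the error alongside the message for later analysis.
                message["error"] = repr(exc)
                dead_letters.append(message)
            else:
                queue.append(message)  # requeue for another attempt
```

Once the underlying issue is fixed, the messages in `dead_letters` can be inspected and pushed back onto the main queue.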

Real-life Example: Amazon Simple Queue Service (SQS) supports dead letter queues: a message that has been received more than a configured number of times without being processed successfully is moved to the DLQ. This prevents message loss in systems that rely on messaging for critical communication.

10. Idempotency

  • Detailed Purpose: Ensures that repeated requests to a service have the same result, avoiding unintended effects like duplicate transactions.

How it works:

Idempotent operations ensure that calling a service multiple times with the same input will always result in the same outcome.

This is particularly important in systems with frequent retries, as it prevents side effects (like billing the same transaction twice).
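
A common way to get this behaviour is to have the client send a unique key with each logical request and have the server remember the result for that key. The sketch below is an in-memory illustration with made-up names. It is not how Stripe or any other provider implements it, and a real service would need durable storage and protection against concurrent duplicates.

```python
class PaymentService:
    """Toy payment service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored result
        self.charges = []    # every charge that was actually made

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        charge = {
            "id": len(self.charges) + 1,
            "customer_id": customer_id,
            "amount_cents": amount_cents,
        }
        self.charges.append(charge)
        self._results[idempotency_key] = charge
        return charge
```

With this in place, a client that times out and retries with the same key gets the original charge back instead of creating a second one.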

Real-life Example: Payment processors like Stripe implement idempotency keys to ensure that if a transaction request is accidentally sent multiple times, only one charge is made.

11. Distributed Tracing

  • Detailed Purpose: In complex microservice architectures, tracing how a request flows through various services is crucial for identifying bottlenecks or failures.

How it works:

Tools like Jaeger or Zipkin attach a unique trace ID to each request, which is propagated through all services that handle the request.

This allows teams to see the complete lifecycle of a request, including timing and failures, across multiple services.
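
To make the idea of propagating a trace ID concrete, here is a toy sketch in plain Python. It does not use the Jaeger or Zipkin client libraries. The `X-Trace-Id` header name, the `traced` wrapper, and the in-memory `SPANS` list are assumptions for illustration, and real tracers use standardized headers and send spans to a backend.

```python
import time
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name
SPANS = []                   # stand-in for a tracing backend


def traced(service_name, handler):
    """Wrap a handler so it reuses the incoming trace ID and records a span."""
    def wrapper(headers):
        # Reuse the caller's trace ID, or start a new trace at the edge.
        trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
        outgoing = {**headers, TRACE_HEADER: trace_id}
        start = time.perf_counter()
        try:
            return handler(outgoing)
        finally:
            SPANS.append({
                "trace_id": trace_id,
                "service": service_name,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


# Two toy "services": rides calls pricing and passes the headers along.
pricing = traced("pricing", lambda headers: 42)
rides = traced("rides", lambda headers: pricing(headers))
```

Calling `rides({})` records two spans that share one trace ID, which is what lets a tracing tool stitch them into a single request timeline.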

Real-life Example: Uber uses distributed tracing to monitor how ride requests are processed across its various microservices, enabling the engineering team to detect where performance bottlenecks occur.

12. Health Check Pattern

  • Detailed Purpose: Periodically check each service's health status to ensure that it is operating correctly in a microservice architecture.

How it works:

Each service exposes a health check endpoint that returns the status of its dependencies (e.g., databases, external services).

If a service is unhealthy, the load balancer can stop routing traffic to it until it recovers.
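
The logic behind a health check endpoint can be sketched as a function that runs one probe per dependency and reports an overall status. The function name, the response shape, and the use of HTTP status 503 for "unhealthy" are assumptions for this example. Wiring it to an actual HTTP route depends on your web framework.

```python
def health_check(dependency_checks):
    """Run each dependency probe and return (http_status, response_body)."""
    details = {}
    for name, check in dependency_checks.items():
        try:
            check()  # a probe signals failure by raising an exception
            details[name] = "up"
        except Exception as exc:
            details[name] = f"down: {exc}"
    healthy = all(status == "up" for status in details.values())
    body = {"status": "up" if healthy else "down", "dependencies": details}
    return (200 if healthy else 503), body
```

A load balancer or orchestrator polling this endpoint can then treat any non-200 response as a signal to stop sending traffic to the instance.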

Real-life Example: Kubernetes uses health checks (probes) to monitor the containers running in its clusters. If a readiness probe fails, Kubernetes stops routing traffic to that instance until it recovers, and if a liveness probe fails, it restarts the container.

When implemented effectively, these patterns can transform microservice architectures into highly resilient, fault-tolerant systems that minimize downtime and maintain service availability even when things go wrong.

In the following posts, I will cover these patterns in more detail with examples. Happy learning.
