FMEA & Resiliency Design Pattern for a tier-0 service with Third-Party Service Integration

Rajneesh Dubey
Walmart Global Tech Blog
9 min read · Aug 19, 2024
Resilience

High-volume systems are complex and often involve numerous dependencies, which can lead to potential failure points. If not managed effectively, these failures can cause service interruptions and result in a poor experience for end users.

In Walmart’s customer cart checkout process for cross-border orders, multiple services must work together to ensure a seamless experience. Given that cart checkout is a critical feature whose failure can prevent customers from completing their orders, it is classified as a tier-0 service. This classification highlights the importance of system reliability in this context. During development, we faced stringent SLA (Service Level Agreement) requirements for this feature.

Specifically, the duty provider service in the checkout process was required to meet a response time SLA of under 30 milliseconds (at p95), with an availability target of approximately 99.9%.

Walmart Checkout Service Interacts with Third-Party Provider over the Internet

In this setup, the duty-provider service relies on interactions with several components: an upstream service, a database, and an external third-party service. The database is used for meta-data lookups, while the external third-party service, accessed over the Internet, provides necessary information to calculate import duties for cross-border traded items.

Given the complexity of these interactions, each network connection and node introduces potential points of failure. Analysing and understanding the causes of these failures is crucial for developing a robust and resilient system. Identifying and addressing these failure points helps in building a system that can better withstand disruptions and maintain a high level of reliability.

Failure Mode and Effect Analysis (FMEA)

One of the techniques we employed to make our system resilient is Failure Mode and Effect Analysis (FMEA). FMEA is a methodical approach used to identify potential problems within a system, assess their impacts, and prioritise them based on severity and likelihood. This process involves examining various ways a component or process might fail, understanding the root causes and potential consequences of these failures, and evaluating their risk. By ranking failure modes according to their impact and probability, FMEA helps in addressing the most critical issues first. The goal is to proactively find and resolve potential problems before they escalate into serious issues, thereby improving system reliability and performance.

Failure Mode and Effect Analysis (courtesy: boardmix.com)

We analysed all major failure modes at the architectural level and assessed their effects. We then compiled a table outlining these key failure modes, including their potential causes and typical mitigation strategies.

By identifying and addressing these failure modes with appropriate mitigation strategies, distributed software architectures can be designed to be more resilient and reliable. We also analysed the Risk Priority Number (RPN) for each item to prioritise and manage potential issues effectively.
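For reference, the RPN of a failure mode is the product of its Severity, Occurrence, and Detection ratings, each typically scored from 1 to 10. As a purely illustrative example (these are not our actual scores), a failure mode such as “third-party duty API times out” rated Severity 8, Occurrence 6, and Detection 4 yields RPN = 8 × 6 × 4 = 192, and would therefore be addressed before a failure mode such as “stale metadata document” scored 5 × 3 × 2 = 30.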

Actions on FMEA

As a result of the analysis, we compiled a list of failure modes ranked by their Risk Priority Number (RPN). We categorised the challenges into two broad areas: response time (latency) and system availability for the duty provider service.

Response Time (Latency) Improvement

In addition to unit tests and JaCoCo for code coverage, we integrated automated performance and functional tests into the Continuous Integration/Continuous Deployment (CI/CD) pipeline for the duty provider service. This integration offers several key advantages:

  • Performance Optimisation: Regular performance testing fosters a culture of continuous improvement, encouraging developers to write more efficient code.
  • Benchmarking: Performance tests provide data for benchmarking, allowing teams to set performance goals and track progress over time.
  • Immediate Feedback: Developers receive prompt feedback on the performance impact of their changes, facilitating quick issue resolution before problems escalate.
  • Automated Testing: Automated performance tests ensure consistent and repeatable evaluations with every code change, deployment, or release.
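As an illustration of the kind of automated latency check such a pipeline can run, the sketch below fires a batch of requests at the service and asserts the p95 against the SLA. The endpoint URL, request volume, and class names are assumptions for illustration, not our actual pipeline configuration.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DutyProviderLatencyTest {

    // Hypothetical test endpoint for the duty provider service.
    private static final URI ENDPOINT =
            URI.create("https://duty-provider.test.internal/duty?item=sample-item&destination=US");

    private final HttpClient http = HttpClient.newHttpClient();

    @Test
    void p95LatencyStaysWithinSla() throws Exception {
        List<Long> samplesMs = new ArrayList<>();
        HttpRequest request = HttpRequest.newBuilder(ENDPOINT).GET().build();

        // Warm up once, then collect latency samples.
        http.send(request, HttpResponse.BodyHandlers.ofString());
        for (int i = 0; i < 1_000; i++) {
            long start = System.nanoTime();
            http.send(request, HttpResponse.BodyHandlers.ofString());
            samplesMs.add((System.nanoTime() - start) / 1_000_000);
        }

        Collections.sort(samplesMs);
        long p95 = samplesMs.get((int) Math.ceil(samplesMs.size() * 0.95) - 1);
        assertTrue(p95 < 30, "p95 latency of " + p95 + " ms breached the 30 ms SLA");
    }
}
```

In practice such checks run against a dedicated performance environment at representative concurrency rather than single-threaded from a test class, but the pass/fail contract is the same.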

Despite these measures, we observed that latency was still higher than required. We addressed this issue by dividing it into two areas: Internal Interactions (e.g., database) and External Interactions (e.g., internet calls).

Internal Interactions:

The duty provider service interacts with a Cosmos database for lookups. We observed that the response time for the database was 60 milliseconds (at p99) for an average document size of 4 KB, which did not meet our performance requirements. To address this issue, we implemented several techniques to optimise our database read calls:

  • Point Reads vs Queries: We replaced queries with point reads whenever possible. Point reads (using the partition key and item ID) are significantly more efficient than queries.
  • Network Considerations / Proximity to Data Centres: We deployed our application in regions close to our Azure Cosmos DB instance to minimise network latency.
  • Preferred Regions in Cosmos Client: Cross-region calls are very costly and must be avoided. In Azure Cosmos DB, “Preferred Regions” is a feature used in the Cosmos DB client configuration to specify the order of regions to be used for read/write operations. This allows you to optimise latency and ensure high availability by prioritising certain regions for your database operations.
  • Modes of connection: Azure Cosmos DB offers two modes of connection: Gateway Mode and Direct Mode. For most high-performance applications, Direct Mode is preferred, while Gateway Mode is suitable for simpler configurations and environments with network restrictions. We moved to Direct Mode as our system allowed us to leverage it (see the configuration sketch after this list).
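Taken together, a minimal Cosmos DB client configuration and point read in the Java SDK can look like the sketch below. The database name, container name, preferred regions, and document shape are placeholders, not our actual values.

```java
import com.azure.cosmos.ConsistencyLevel;
import com.azure.cosmos.CosmosAsyncClient;
import com.azure.cosmos.CosmosAsyncContainer;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.models.PartitionKey;

import java.util.Arrays;

public class DutyMetadataLookup {

    private final CosmosAsyncContainer container;

    public DutyMetadataLookup(String endpoint, String key) {
        CosmosAsyncClient client = new CosmosClientBuilder()
                .endpoint(endpoint)
                .key(key)
                .directMode()                                               // Direct (TCP) mode instead of Gateway mode
                .preferredRegions(Arrays.asList("West US 2", "East US 2"))  // read from the closest region first
                .consistencyLevel(ConsistencyLevel.SESSION)
                .buildAsyncClient();
        this.container = client.getDatabase("duty-db").getContainer("duty-metadata");
    }

    // Point read: partition key + item id, which bypasses the query engine entirely.
    public DutyMetadata readMetadata(String itemId, String countryCode) {
        return container
                .readItem(itemId, new PartitionKey(countryCode), DutyMetadata.class)
                .block()
                .getItem();
    }

    // Hypothetical shape of the metadata document used for duty calculation.
    public static class DutyMetadata {
        public String id;
        public String countryCode;
        public String hsCode;
    }
}
```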

After implementing these optimisations, the database response time improved significantly, reducing to approximately 15 milliseconds (at p99) for an average document size of 4 KB.

External Interactions:

The third-party service and, consequently, the duty provider service initially had a response time of around 300 milliseconds, which was unacceptable for the customer checkout experience.

  1. To address this, our first approach involved deploying the third-party service in the same regions as our duty provider service and using a geo-proximity-based load balancer. Although this proof of concept achieved a slight reduction in latency, it was not sufficient due to significant delays caused by internet transmission between the services.
  2. Our second approach was more effective: we deployed the third-party service within our local Walmart network, integrating it with our in-house database to eliminate external interactions. This setup significantly improved performance, reducing the p95 response time to approximately 30 milliseconds.
Third Party Provider moved inside Walmart network as a local deployment

With these changes, the duty provider became a system with a good response time of <30 milliseconds (p95). The next challenge was AVAILABILITY!

Availability Improvement

The availability of the duty-provider service is crucial since it provides essential import duty and charge information to customers, and any failure to respond can significantly impact the user experience.

To ensure high availability and maintain system stability, we implemented several key resiliency and fault-tolerance strategies (the first three are sketched in code after the list):

  • Retries: Automatic retries for network calls handle transient issues and random partial failures by resending requests.
  • Circuit Breakers: Circuit breakers monitor network calls and automatically halt them if a failure threshold is reached, preventing system overload from repeated failures.
  • Timeouts: Appropriate timeouts for network calls and operations help avoid indefinite waiting and allow for prompt failure detection.
  • Load Balancing: Distributing workloads across multiple servers prevents any single server from becoming a bottleneck or point of failure.
  • Auto-Scaling: Dynamically adjusting the number of instances based on demand ensures efficient handling of load variations.
  • Rate Limiting: Controlling the flow of requests prevents system components from being overloaded.
  • Bulkheads: Isolating system components to ensure that failures in one part do not affect others.
  • Failover Mechanisms: Automatically switching to standby systems or components when a primary one fails ensures continuous service availability.
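The retry, circuit-breaker, and timeout pieces can be composed with a library such as Resilience4j; the sketch below is illustrative (the names, thresholds, and choice of library are assumptions, not a description of our production configuration).

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class ResilientDutyCall {

    private final CircuitBreaker circuitBreaker = CircuitBreaker.of("duty-provider",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                         // open the circuit at 50% failures
                    .waitDurationInOpenState(Duration.ofSeconds(10))  // probe again after 10 seconds
                    .build());

    private final Retry retry = Retry.of("duty-provider",
            RetryConfig.custom()
                    .maxAttempts(2)                                   // a single retry for transient failures
                    .waitDuration(Duration.ofMillis(5))
                    .build());

    private final TimeLimiter timeLimiter = TimeLimiter.of(
            TimeLimiterConfig.custom()
                    .timeoutDuration(Duration.ofMillis(30))           // fail fast beyond the latency budget
                    .build());

    // Wraps a remote call with retry + circuit breaker + timeout, falling back to a default value.
    public <T> T call(Supplier<T> remoteCall, T fallback) {
        Supplier<T> decorated = Retry.decorateSupplier(retry,
                CircuitBreaker.decorateSupplier(circuitBreaker, remoteCall));
        try {
            return timeLimiter.executeFutureSupplier(
                    () -> CompletableFuture.supplyAsync(decorated));
        } catch (Exception e) {
            return fallback;                                          // keep the checkout flowing on failure
        }
    }
}
```

Bulkheads are typically realised the same way, by giving each dependency its own bounded thread pool or semaphore so that a slow dependency cannot exhaust shared resources.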

Despite these standard strategies, the major challenge was managing bulkheads and minimising downtime to avoid impacting upstream systems, such as the cart-checkout service. To address this, we implemented a robust failover mechanism to ensure seamless operation and minimise disruptions.

Designing For Failover Mechanism and Bulkheads:

To isolate service failures and their impact on upstream systems, we worked on improving our failover mechanism to ensure the high availability of the duty-provider-service.

  • The first step was to add in-memory fallback logic that can be invoked if there is an issue while calling the third-party service or the database.
  • The fallback logic works on the application context and cached information, with no interaction with external systems.
  • It returns default attributes for each request, which may not be as precise as the actual attributes from the third-party API (Application Programming Interface), but it solves the availability problem (see the sketch after this list).
  • In addition to the failover mechanism, we designed bulkheads to isolate service failures, ensuring that issues in one component do not cascade and affect upstream systems, such as the cart-checkout service. This isolation helps in managing failures more effectively and protects the overall system from widespread disruptions.
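A minimal sketch of what such in-memory fallback logic can look like is below. The attribute names, default rates, and countries are illustrative assumptions, not the actual business rules.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Pure in-memory fallback: no database or third-party calls, only cached/default data.
public class DutyFallback {

    // Conservative default duty rates per destination country (illustrative values only),
    // refreshed from the application context when available.
    private final Map<String, Double> defaultRatesByCountry = new ConcurrentHashMap<>(
            Map.of("US", 0.05, "CA", 0.07, "GB", 0.10));

    private static final double GLOBAL_DEFAULT_RATE = 0.10;

    public DutyEstimate estimate(String destinationCountry, double itemPrice) {
        double rate = defaultRatesByCountry.getOrDefault(destinationCountry, GLOBAL_DEFAULT_RATE);
        // Flag the result as an estimate so downstream services know it did not come from the third-party API.
        return new DutyEstimate(itemPrice * rate, true);
    }

    public record DutyEstimate(double dutyAmount, boolean estimated) {}
}
```

The key property is that the fallback never leaves the JVM: it reads only cached or default data, so it cannot itself become a new point of failure.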
Challenge: Failure points in the system. X indicates the failure points.

Challenge:

There is a challenge with this setup too: the fallback can only be invoked inside the duty provider service, but there are failure points beyond its reach.

How should we handle situations where:

  1. The duty provider service fails to connect to its database?
  2. The duty provider service itself is down?
  3. The cart-checkout service is unable to connect to the duty provider service?
  • If the duty-provider-service is down, the upstream cart-checkout-service will not receive the import duty, and the customer's cart checkout fails.
  • This is a poor customer experience and can lead to dissatisfaction and, potentially, losing customers to competitors.

Approach:

  • To address the issue of duty-provider-service unavailability, we extracted the fallback logic into a separate JAR file and shared it with the cart-checkout service.
  • This approach allows the cart-checkout service to invoke the fallback logic if it encounters connectivity issues with the duty-provider-service.
  • With this setup, the cart-checkout service can handle failures gracefully by using the fallback mechanism, ensuring that import duty information is still provided to the customer even when the duty-provider-service is down.
  • This solution helps prevent downstream unavailability issues and maintains a consistent and reliable customer experience.
Extracted in-memory fallback logic as a jar

This approach ensures that duty information is consistently returned to the cart-checkout service, even if the duty-provider-service encounters issues. As a result, the impact on customers and the potential blast radius of any service disruption are significantly reduced.
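On the cart-checkout side, consuming the shared JAR is essentially a try/catch around the remote call. The sketch below assumes the hypothetical DutyFallback class from the earlier sketch is what ships in the shared JAR; the client interface and method names are also placeholders.

```java
// Inside the cart-checkout service, with the shared fallback JAR on the classpath.
public class ImportDutyResolver {

    // Hypothetical remote client interface for the duty provider service.
    public interface DutyProviderClient {
        DutyFallback.DutyEstimate getDuty(String destinationCountry, double itemPrice);
    }

    private final DutyProviderClient dutyProviderClient;
    private final DutyFallback fallback = new DutyFallback();  // same logic the duty provider uses internally

    public ImportDutyResolver(DutyProviderClient dutyProviderClient) {
        this.dutyProviderClient = dutyProviderClient;
    }

    public DutyFallback.DutyEstimate resolveDuty(String destinationCountry, double itemPrice) {
        try {
            return dutyProviderClient.getDuty(destinationCountry, itemPrice);
        } catch (Exception e) {
            // Duty provider unreachable or timed out: fall back locally instead of failing the checkout.
            return fallback.estimate(destinationCountry, itemPrice);
        }
    }
}
```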

And with this, the duty provider became a system with availability of ~99.9%.

Performance Improvements before and after the implementation of the resiliency design patterns:

Performance Improvement Metrics (*p95, up to 1000 TPS)

Data Sync for the Local Deployment

Deploying services locally presents several challenges, one of the primary ones being maintaining up-to-date dynamic data, such as changing configurations. To address this, we developed a data synchronisation setup for local deployments. This system ensures that any updates or changes to configurations and other dynamic data are accurately and promptly reflected across all local instances, thereby maintaining consistency and reliability in the deployment environment.

Data Sync Process for Local Deployment

The process for keeping the local deployment of the third-party service up to date with dynamic data involves several steps:

  1. Data Update Feed: The third-party data source system sends update feeds to Walmart’s blob storage.
  2. Data Importer Service: This service polls the blob storage for updates and transfers the data to the database used by the local deployment of the third-party service provider.
  3. Data Sync Listeners: These listeners, running within the third-party local deployment, monitor the database for changes.
  4. Cache Refresh: Upon detecting updates, the data sync listeners build a new cache within the third-party local deployment service using the updated data.
  5. Switchover: The system then switches from the old cache to the newly built cache, ensuring that the local deployment uses the most current data (a minimal sketch of this switchover follows the list).
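The switchover in step 5 is effectively an atomic swap of the active cache reference, so lookups never observe a partially built cache. A minimal sketch (field and type names are illustrative) follows.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class DutyRateCache {

    // The active cache is swapped atomically, so readers never see a half-built cache.
    private final AtomicReference<Map<String, DutyRate>> active =
            new AtomicReference<>(new ConcurrentHashMap<>());

    public DutyRate lookup(String hsCode) {
        return active.get().get(hsCode);
    }

    // Invoked by the data-sync listener when it detects new data in the database.
    public void refresh(Iterable<DutyRate> updatedRates) {
        Map<String, DutyRate> rebuilt = new ConcurrentHashMap<>();
        for (DutyRate rate : updatedRates) {
            rebuilt.put(rate.hsCode(), rate);   // build the new cache off to the side
        }
        active.set(rebuilt);                    // switch over in a single atomic step
    }

    // Hypothetical record for an import-duty rate keyed by HS code.
    public record DutyRate(String hsCode, double ratePercent) {}
}
```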

This setup ensures that the local deployment of the third-party service remains synchronised with the latest data, maintaining accuracy and reliability.

Future Adoption Plan

The entire process will be documented as a guideline and published across the organisation, serving as a reference design pattern for addressing similar architectural challenges in software design.

This architectural design pattern will support teams in conducting Failure Modes and Effects Analysis (FMEA) with a structured approach, ensuring that potential issues are systematically identified and addressed. It will also facilitate necessary actions to sustainably improve the software development lifecycle by applying best practices and relevant design patterns.

By implementing this reference design pattern, teams will gain access to reusable solutions for common architectural problems. This will aid in creating robust, resilient, and maintainable systems. Ultimately, adopting this guideline will help engineering teams achieve operational excellence, enhancing their ability to manage complex challenges and improve overall software quality.
