Improve system resiliency using Failure Mode and Effects Analysis

Global Technology
McDonald’s Technical Blog
5 min read · Jun 27, 2023

McDonald’s is using technology for its operations more than ever before, and it's critical that we are proactively building our large-scale global systems to be reliable and resilient.

By Vamshi Komuravalli, Principal Architect

With McDonald’s size and scale, it should be no surprise that it takes large-scale systems to operate restaurants, run the business, and engage with customers and crew.

Large-scale systems can have many dependencies and points of failure, and if these failures are not handled well, they can cause service disruptions for end users.

For example, our Global Mobile App is backed by a robust platform, built to handle thousands of requests per second. It offers numerous APIs and leverages several cloud and third-party systems, adding to the potential points of failure across the tech stack. For McDonald’s, system reliability is a key focus area, and we look at different ways that help improve resilience and build fault-tolerance into our systems.

One of the techniques McDonald’s USA recently employed is ‘Failure Mode & Effects Analysis’ (FMEA). This is a systematic approach used in various industry domains for identifying possible failure points, understanding the effects of the failures, and then building mechanisms to be resilient.

In this post, I will walk through the analysis approach and how we applied the technique to our systems.

Effect of a failure
First, we gauged the impact of a workload’s failure from a customer perspective — what’s the disruption to our customer? We defined key business functions that our customers use, such as login, search for a restaurant, browse menu, view deals, place orders, redeem offers, earn rewards, etc. APIs supporting these business functions depend on many workloads — a workload could be a microservice, a batch job, a serverless function, etc. This categorization helped us in applying different weights to the failures based on the business function impacted.
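
To make that weighting concrete, here is a minimal sketch in Python; the business functions and weight values below are illustrative placeholders, not our actual scoring model.

```python
# Illustrative sketch: scale a failure's impact by the business function it affects.
# The function names and weights are placeholders, not actual values.

BUSINESS_FUNCTION_WEIGHTS = {
    "place_order": 5,
    "login": 4,
    "redeem_offer": 4,
    "browse_menu": 3,
    "search_restaurant": 3,
    "earn_rewards": 2,
}

def weighted_impact(business_function: str, failure_severity: int) -> int:
    """Scale a raw failure severity (say, 1-10) by how customer-critical
    the impacted business function is."""
    return BUSINESS_FUNCTION_WEIGHTS.get(business_function, 1) * failure_severity

# A severity-7 failure in the ordering path outweighs the same severity in rewards.
assert weighted_impact("place_order", 7) > weighted_impact("earn_rewards", 7)
```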

Failure modes
A failure mode is the manner by which a failure is observed; it describes the way the failure occurs and its impact on equipment operation (definition adapted from the U.S. Department of Defense).

In the context of distributed applications, a workload could fail for a multitude of reasons. To bring structure to our analysis, we categorized our points of failure broadly into four failure domains:

  1. Shared services: Failures in the shared services used by many workloads, such as configuration stores, secret management tools, etc.
  2. Intrinsic/internal APIs: Failures in the workloads themselves or internal APIs.
  3. Infrastructure services: Failures in the infrastructure or platform services provided by our underlying cloud providers.
  4. Partner APIs: Failures in the third-party APIs provided by our enterprise providers (payment gateways, delivery vendor APIs, etc.)

Resiliency best practices
Now, let’s look at the techniques and best practices at our disposal for resiliency and fault tolerance. Similar to failure domains, we broadly classified these best practices into four categories.

Below is a non-exhaustive list of patterns we looked for and applied.

Build fault-tolerant code

  1. Circuit Breakers: A technique used to fail fast and avoid a cascading impact. A circuit breaker detects when a downstream workload is failing repeatedly and temporarily stops making requests to it, preventing the system from becoming overwhelmed with too many requests that can cause it to crash (a combined sketch of the first three patterns follows this list).
  2. Retries: Incorporate retries to handle transient errors from downstream systems, with exponential backoff between retry attempts and jitter to spread out the retry arrival rate.
  3. Timeouts: Ensure resources do not wait indefinitely for a response and are released in a timely fashion.
  4. Idempotency: Ensure write operations are idempotent to avoid side-effects when retried.
  5. Failure Isolation: Use this pattern to isolate failures and minimize the failure impact on other components.
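
To illustrate how the first three patterns fit together, here is a minimal Python sketch that calls a hypothetical downstream HTTP API with a circuit breaker, retries with exponential backoff and jitter, and a request timeout. It is a simplified illustration of the patterns, not the implementation running in our platform.

```python
import random
import time

import requests  # any HTTP client with a timeout parameter works


class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    then allow a trial request only after a cool-down period."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: requests flow normally
        # Circuit open: only allow a trial request after the cool-down.
        return time.monotonic() - self.opened_at >= self.reset_timeout_s

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


breaker = CircuitBreaker()


def call_downstream(url: str, max_retries: int = 3) -> requests.Response:
    """Call a downstream API with a timeout, retries with exponential
    backoff plus jitter, and a circuit breaker to fail fast."""
    for attempt in range(max_retries + 1):
        if not breaker.allow_request():
            raise RuntimeError("circuit open: failing fast instead of calling downstream")
        try:
            # The timeout ensures we never wait indefinitely for a response.
            response = requests.get(url, timeout=2.0)
            response.raise_for_status()
            breaker.record_success()
            return response
        except requests.RequestException:
            breaker.record_failure()
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter spreads out retry arrivals.
            time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
```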

Design infrastructure for high availability

  1. Multi-availability zone (AZ): Ensure the cloud infrastructure services are configured to use multiple availability zones to be resilient to AZ-level failures.
  2. Automated failover: Have automated failover mechanisms so that when a failure occurs, the system can automatically switch over to the redundant resource without manual intervention.
  3. Read replicas: Use multiple read replicas for read-heavy workloads to improve availability and performance.
  4. Dead-letter queues: Employ dead-letter queues to avoid data loss due to failures in messaging (see the sketch after this list).
  5. Point-in-time recovery: Configure databases to have an ability to restore to a point in time in the past, helping to mitigate impact due to corrupted writes.
  6. Autoscaling: Configure infrastructure for autoscaling to efficiently handle changes in the request rate.
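
As one concrete example of these configurations, the sketch below attaches a dead-letter queue to a work queue using Amazon SQS via boto3, so messages that repeatedly fail processing are preserved instead of lost. The queue names are hypothetical, and in practice this configuration would typically be managed through infrastructure-as-code.

```python
import json

import boto3  # AWS SDK for Python

sqs = boto3.client("sqs")

# Create the dead-letter queue first; repeatedly failing messages will land here.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Create the main queue with a redrive policy: after 5 failed receives,
# SQS moves the message to the DLQ instead of dropping it.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```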

Test for reliability

  1. Chaos engineering: Ensure resiliency mechanisms work as expected by simulating faults and testing the ability of the workloads to recover from these failures (see the sketch after this list).
  2. Backup/restore: Document and test data backup/restore procedures.
  3. Fail-over/fail-back: Simulate failures and test fail-over and fail-back mechanisms.
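
As a small example of fault injection, the test below reuses the hypothetical call_downstream helper and circuit breaker from the earlier sketch (assumed to live in a module named resilience). It simulates a downstream dependency that times out on every call and verifies that the caller eventually fails fast with an open circuit rather than hanging.

```python
from unittest import mock

import pytest
import requests

# Hypothetical module containing the CircuitBreaker/call_downstream sketch above.
from resilience import breaker, call_downstream


def test_circuit_opens_after_repeated_downstream_timeouts():
    breaker.record_success()  # reset breaker state so the test is deterministic

    # Simulate a downstream dependency that times out on every call,
    # and skip the real backoff sleeps to keep the test fast.
    with mock.patch("resilience.requests.get", side_effect=requests.Timeout), \
         mock.patch("resilience.time.sleep"):
        with pytest.raises(requests.RequestException):
            call_downstream("https://example.internal/api/menu", max_retries=4)

    # Five consecutive failures should have opened the breaker: fail fast now.
    assert not breaker.allow_request()
```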

Detect failures early

  1. Proactive workload monitoring: Monitor application performance metrics, error trends, and service-level indicators/objectives (SLIs/SLOs), and raise alerts and notifications when anomalies are observed or thresholds are exceeded.
  2. Service-level health checks: Employ health checks in each of the workloads for early detection of failures (see the sketch after this list).
  3. Infrastructure monitoring: Monitor usage metrics and failures of infrastructure resources.
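
As a minimal illustration of a service-level health check, the FastAPI sketch below reports the workload's health along with the health of its critical dependencies; the framework choice and the dependency checks are hypothetical stand-ins.

```python
from fastapi import FastAPI, Response, status

app = FastAPI()


def database_reachable() -> bool:
    """Placeholder dependency check; a real check might run a trivial query."""
    return True


def cache_reachable() -> bool:
    """Placeholder dependency check; a real check might ping the cache cluster."""
    return True


@app.get("/health")
def health(response: Response) -> dict:
    """Report the health of this workload and its critical dependencies,
    so orchestrators and monitors can detect failures early."""
    checks = {"database": database_reachable(), "cache": cache_reachable()}
    healthy = all(checks.values())
    if not healthy:
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
    return {"status": "ok" if healthy else "degraded", "checks": checks}
```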

End-to-end approach
Finally, we brought together the failure modes and resiliency best practices in a structured manner and followed a multistep approach to analyze our workloads and remediate gaps:

  1. Identify failure points across the platform and the workloads deployed on it.
  2. Identify the impact of a workload’s failure on our key business functions.
  3. Assess how resilient the workloads and platform are to these failures.
  4. Prioritize gaps using a custom scoring mechanism (a sketch of FMEA-style scoring follows this list).
  5. Remediate and repeat the process periodically.
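
Classic FMEA prioritizes failure modes with a risk priority number (RPN), the product of severity, likelihood of occurrence, and difficulty of detection. The sketch below shows that style of scoring with illustrative values; it is not our actual scoring model or real findings.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    workload: str
    description: str
    severity: int    # 1-10: impact on the affected business function
    occurrence: int  # 1-10: how likely the failure is to happen
    detection: int   # 1-10: 10 = hard to detect before customers are impacted

    @property
    def risk_priority_number(self) -> int:
        # Classic FMEA scoring: higher RPN = remediate first.
        return self.severity * self.occurrence * self.detection


# Illustrative failure modes, not real findings.
failure_modes = [
    FailureMode("order-api", "payment gateway timeout", 9, 4, 3),
    FailureMode("menu-api", "cache cluster unavailable", 5, 3, 2),
    FailureMode("offers-job", "batch job silently fails", 6, 3, 8),
]

# Rank gaps so the riskiest failure modes are remediated first.
for fm in sorted(failure_modes, key=lambda f: f.risk_priority_number, reverse=True):
    print(f"{fm.risk_priority_number:4d}  {fm.workload}: {fm.description}")
```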

Tools used

  • AWS Well-Architected Custom Lenses for capturing information from the analysis
  • Amazon QuickSight for visualizing and drawing insights from the findings

Interested in reading more from this author? Check out Vamshi’s other blog posts.
