Intuit Runs Gameday Simulations to Test Resilience of Critical Business Systems and Apps at Scale

Venkatesh Rangarajan
Published in Intuit Engineering
6 min read · May 1, 2024

This blog is co-authored by Venkatesh Rangarajan, Group Product Manager, Emily Pike, Senior Product Manager, and Deepthi Panthula, Senior Product Manager, at Intuit.

Over the past several years, Intuit developers have tested the resilience of critical business systems and apps at scale through “gameday” exercises that assess how our people, processes and tools perform in response to a major incident or outage. Much like practicing safety procedures in a fire drill, participants respond to a simulated “failure injection” that mirrors a real-world scenario.

Like other global enterprise companies of our size and scale, Intuit strives for >99.99% availability so that our ~100 million consumer and small business customers can access their go-to business apps (TurboTax, Credit Karma, Mailchimp and QuickBooks) without fail. For example, ensuring a smooth experience for taxpayers in the U.S. and Canada on peak traffic days during tax season each year is of the utmost importance to our company.
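
To make that target concrete, here is a minimal sketch (illustrative arithmetic, not Intuit tooling) that converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for the example. At 99.99%, the budget works out to roughly 52.6 minutes per year, so exceeding that target leaves even less room.

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime allowed in the period for a given target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability -> {budget:.2f} minutes of downtime per year")
```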

In this post, we’ll share strategies, principles and key learnings from bringing gamedays to life here at Intuit.

Why are gamedays important?

The goal of the gameday program is to proactively test and improve resilience across our platform, with no team left behind, by:

  • Practicing our response to incidents in a low-risk setting
  • Assessing how well our monitoring, documentation, and on-call processes hold up
  • Identifying gaps in our incident response ahead of time
  • Continuously improving through practice and iteration

While our major technology shift to Kubernetes & microservices back in 2019 has accelerated development velocity, it’s also led to new levels of complexity and potential failure modes.

As we continue to scale, gamedays are critical for testing distributed infrastructure reliability and resiliency, and measuring improvements over time. Using historical incident data, we recreate incidents as gameday exercises (e.g., for a specific incident in 2021, we successfully reduced our time to recover by 94%!).

Key principles for an effective gameday program

The following core principles are worth keeping in mind when designing a gameday program. They are based on industry-leading practices (e.g., Amazon) and our first-hand experience here at Intuit:

  • Focus on company-wide resilience, leaving no team behind: The program should improve resilience for the entire company — not just one team or service. Often the highest priority services get the most redundancy. Gamedays ensure even lower priority services get testing opportunities, too.
  • Make participation simple and clear: Participants should understand expectations, performance metrics, and the value of the program. Define success criteria upfront based on automated assessment.
  • Select failure scenarios based on real incidents and data: Scenarios should come from past incidents, industry trends, and resilience requirements. The gameday team acts as an impartial third party when selecting events.
  • Develop automated monitoring and alerting: Build tooling to inject failures at scale and monitor response. Leverage existing observability tools to measure availability, recovery time, and other key metrics.
  • Use incident playbooks & standard procedures: A key benefit of gamedays is that they provide teams the opportunity to simulate executing their incident response playbooks. During a gameday, participants respond to the simulated incident by following the same playbook steps they would during a real incident.
      1. This allows teams to validate that their documented playbooks are complete, accurate, and workable during a stressful incident scenario. By going through the motions of executing the playbook in a simulated environment, any gaps or issues can be identified and addressed.
      2. For example, a team may realize a key troubleshooting step is missing from their runbook, or that an on-call contact is outdated. By surfacing these issues in a gameday, the playbook can be updated so it is reliable during an actual incident.
      3. Gamedays provide assurance that playbooks will work when needed most. The exercise of regularly executing and stress testing playbooks through gamedays helps ensure they remain living documents that enable effective incident response.
  • Celebrate successes: Post-event reporting should highlight wins and learnings without placing blame. Feedback surveys help teams improve.
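
The idea of defining success criteria upfront and assessing them automatically can be sketched as follows. This is a simplified illustration, not the gameday team's actual tooling: the `Sample` type, the function names, and the thresholds are all hypothetical, and it assumes a synthetic health check that probes the service at regular intervals during the exercise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    t: float        # seconds since the gameday started
    healthy: bool   # result of a (hypothetical) synthetic health check at time t


def availability(samples: List[Sample]) -> float:
    """Fraction of health checks that succeeded (0.0 to 1.0)."""
    if not samples:
        raise ValueError("no samples")
    return sum(s.healthy for s in samples) / len(samples)


def recovery_time(samples: List[Sample], injected_at: float) -> Optional[float]:
    """Seconds from injection to the first healthy sample that follows a failure.

    Returns 0.0 if the service never failed and None if it never recovered.
    """
    seen_failure = False
    for s in sorted(samples, key=lambda s: s.t):
        if s.t < injected_at:
            continue  # ignore probes taken before the failure was injected
        if not s.healthy:
            seen_failure = True
        elif seen_failure:
            return s.t - injected_at
    return None if seen_failure else 0.0


def passed(samples: List[Sample], injected_at: float,
           min_availability: float, max_recovery_s: float) -> bool:
    """Apply success criteria that were agreed on before the exercise."""
    rt = recovery_time(samples, injected_at)
    return (rt is not None
            and rt <= max_recovery_s
            and availability(samples) >= min_availability)


if __name__ == "__main__":
    # Made-up timeline: probes every 10s, failure injected at t=20,
    # probes fail at t=30 and t=40, service is healthy again from t=50.
    timeline = [Sample(t, t not in (30, 40)) for t in range(0, 100, 10)]
    print("availability:", availability(timeline))
    print("recovery time (s):", recovery_time(timeline, injected_at=20))
    print("passed:", passed(timeline, 20, min_availability=0.75, max_recovery_s=60))
```

In this made-up timeline the service fails two of ten probes and recovers 30 seconds after the injection, so it meets the example criteria.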

Key elements for quarterly, company-wide gamedays

At Intuit, we host company-wide events once per quarter, enabling us to test organizational preparedness for large-scale incidents and identify gaps in incident management, operations, and tooling. Critical success factors include:

  • Strong executive sponsorship to reinforce importance and drive participation
  • Cross-functional steering committee to help design scenarios and promote within their orgs
  • Relevant, risk-based scenarios such as past real incidents or major architecture changes
  • Detailed runbooks for injection scripts and participant instructions
  • Communication plan covering expectations before, updates during, and results after the event
  • Success criteria based on automated monitoring of key availability and performance metrics
  • Dry run with a small group before the company-wide event to test logistics
  • All-hands participation across engineering & operations
  • Zoom room setup with key stakeholders present to mirror real-life response
  • Clear action plans and owners to drive improvements across people, process, and tools
  • Feedback survey to participants on what worked well and what needs improvement
  • Executive readout after each event to evaluate progress on resilience

Self-serve gameday challenges

In addition to company-wide gameday events, enabling teams to run smaller-scale gameday challenges themselves is critical. These self-serve exercises provide several benefits:

  • Increased frequency — Teams can test resilience weekly or monthly rather than only quarterly.
  • Tailored scenarios — Teams can create gamedays specific to their services and risks.
  • Faster iteration — Teams can easily retry exercises if they fail the first time.
  • Developer autonomy — Engineers learn how to safely test distributed systems themselves.

To drive adoption of self-serve gameday challenges:

  • Provide Chaos Engineering tooling and guardrails that make it simple to simulate failures.
  • Develop reusable injection patterns for common scenarios like CPU hog, pod kill, or networking failure.
  • Capture learnings centrally so teams can learn from each other.
  • Incentivize with resilience metrics and gameday badges for active participants.
  • Highlight teams successfully completing self-serve gameday challenges to motivate others.
  • Incorporate self-serve gameday challenges into reliability objectives and expectations.
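
One way to picture reusable injection patterns with guardrails is a small registry that validates every request before anything runs. The sketch below is hypothetical throughout: the pattern handlers only describe the action they would take (no real cluster is touched), and names such as `Guardrails`, `InjectionRequest`, and `plan_injection` are invented for illustration rather than taken from Intuit's chaos tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Guardrails:
    allowed_namespaces: FrozenSet[str]  # namespaces where injection is permitted
    max_target_fraction: float          # blast-radius cap, e.g. 0.25 = 25% of pods


@dataclass(frozen=True)
class InjectionRequest:
    pattern: str          # "cpu_hog", "pod_kill", or "network_failure"
    namespace: str
    total_pods: int
    target_pods: int
    dry_run: bool = True  # plan only unless explicitly disabled


# Reusable patterns. Each handler returns a description of the action;
# a real handler would call the team's chaos tooling instead (assumption).
PATTERNS: Dict[str, Callable[[InjectionRequest], str]] = {
    "cpu_hog": lambda r: f"stress CPU on {r.target_pods} pod(s) in {r.namespace}",
    "pod_kill": lambda r: f"delete {r.target_pods} pod(s) in {r.namespace}",
    "network_failure": lambda r: f"drop traffic to {r.target_pods} pod(s) in {r.namespace}",
}


def check_guardrails(req: InjectionRequest, rails: Guardrails) -> List[str]:
    """Return a list of guardrail violations (empty means the request is safe)."""
    violations = []
    if req.pattern not in PATTERNS:
        violations.append(f"unknown pattern: {req.pattern}")
    if req.namespace not in rails.allowed_namespaces:
        violations.append(f"namespace not allowed: {req.namespace}")
    if req.total_pods <= 0 or not 0 < req.target_pods <= req.total_pods:
        violations.append("invalid pod counts")
    elif req.target_pods / req.total_pods > rails.max_target_fraction:
        violations.append("blast radius exceeds cap")
    return violations


def plan_injection(req: InjectionRequest, rails: Guardrails) -> str:
    """Refuse unsafe requests; otherwise describe the injection to perform."""
    violations = check_guardrails(req, rails)
    if violations:
        raise PermissionError("; ".join(violations))
    mode = "DRY RUN" if req.dry_run else "EXECUTE"
    return f"[{mode}] {PATTERNS[req.pattern](req)}"


if __name__ == "__main__":
    rails = Guardrails(frozenset({"payments-staging"}), max_target_fraction=0.25)
    request = InjectionRequest("pod_kill", "payments-staging", total_pods=8, target_pods=2)
    print(plan_injection(request, rails))
```

Defaulting to a dry run and capping the share of pods a single request may target are two simple guardrails that let engineers experiment without risking an outage of their own making.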

The screenshot below shows the self-service chaos tooling that we have built to empower teams to run self-serve gameday challenges.

[Screenshot: self-service chaos tooling]

Reporting & recognition

The gameday program is Intuit-wide and we expect all teams and services to participate, so reporting and recognition are key to the program’s success. Our reporting strategy includes the following:

  • Share a detailed report of all participants and their performance on each gameday.
  • Present summarized results and learnings to Intuit leadership.
  • Recognize and reward participants with exceptional performance in a gameday.
  • Track learnings and action items to completion.
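
As a rough sketch of how such reporting might be rolled up, the snippet below aggregates per-team results into a summary suitable for a leadership readout. The record types, field names, and the choice of "fastest recovery among passing teams" as a recognition signal are assumptions made for illustration, not a description of Intuit's reports.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class TeamResult:
    team: str
    passed: bool
    recovery_minutes: float
    action_items: List[ActionItem] = field(default_factory=list)


def summarize(results: List[TeamResult]) -> Dict[str, object]:
    """Roll per-team gameday results up into a leadership-level summary."""
    if not results:
        raise ValueError("no results to summarize")
    passing = [r for r in results if r.passed]
    open_items = [
        f"{r.team} / {a.owner}: {a.description}"
        for r in results for a in r.action_items if not a.done
    ]
    return {
        "teams": len(results),
        "pass_rate": len(passing) / len(results),
        # Candidate for recognition: fastest recovery among passing teams.
        "fastest_recovery": (min(passing, key=lambda r: r.recovery_minutes).team
                             if passing else None),
        # Items still to be tracked to completion.
        "open_action_items": open_items,
    }


if __name__ == "__main__":
    # Entirely made-up teams and results.
    report = summarize([
        TeamResult("team-a", True, 4.0, [ActionItem("Update on-call contact", "alex", done=True)]),
        TeamResult("team-b", False, 35.0, [ActionItem("Add missing runbook step", "sam")]),
    ])
    for key, value in report.items():
        print(f"{key}: {value}")
```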

Takeaways

Gamedays are a powerful way to proactively test and strengthen organizational resilience, and ultimately to give technology leaders confidence that their company can withstand and recover from crises.

By rehearsing incident responses in simulated environments, teams can improve the reliability of their systems and processes. By exercising critical incident playbooks, they can surface gaps and opportunities for improvement. While developing a gameday program requires a significant investment, it pays huge dividends: reduced customer impact, faster recovery times, and more robust services.

If this post has piqued your interest, please join our talent community to explore opportunities here at Intuit!

Venkatesh Rangarajan
Intuit Engineering

Product lead for operational excellence, observability, reliability engineering, performance engineering, data platform & AIOps @intuit’s core platform team.