Resiliency in Action: Managing Failovers in Peak Tax Season with a Content Delivery Network (CDN)

Venkatesan Murugesan
Published in Intuit Engineering
Apr 13, 2021

This blog post is co-authored by Girdhar Malhotra, Product Manager at Intuit.

At Intuit, we built an application infrastructure platform that hosts product offerings including QuickBooks®, TurboTax®, and Mint, serving millions of users. Availability is critical to our business, and it ensures that the TurboTax user experience isn’t hindered on peak traffic days during tax season. Even a few minutes of downtime can negatively impact customer productivity, retention, and overall satisfaction, while increasing demands on call center personnel.

That’s why more and more organizations like Intuit strive for resilience, with a goal of greater than 99.99% availability, so that customers can almost always access their offerings. Despite the clear importance of availability, 96% of organizations surveyed across the globe experienced at least one outage in the past year, revealing that downtime issues reach far and wide, regardless of company size, vertical, or location [1].
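
To make an availability target concrete, it helps to convert it into a yearly downtime budget. The short calculation below is only an illustrative sketch: the function name is ours, not part of any Intuit tooling, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year that an availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
# 99.9%   -> 525.6 min/year
# 99.99%  -> 52.6 min/year
# 99.999% -> 5.3 min/year
```

At 99.99%, the budget for the entire year is under an hour, so a single failing dependency on a peak day can consume most of it.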

Downtime means that the system is not accessible. A 2019 Server OS Reliability Survey found that one hour of downtime costs [2]:

  • At least $100,000 for 98% of companies
  • $300,000 or higher for 86% of businesses
  • $1 million to over $5 million for 34% of surveyed companies

[Figure: Average hourly cost of enterprise server downtime, worldwide, 2019–2020]

Solving for high availability at Intuit

Here’s how we devised a failover mechanism to prevent downtime caused by startup failures.

Dynamic single-page applications

Last month, we described our AppFabric platform in an Intuit Medium tech blog post: how we separated our monolithic application into a frontend and a backend, and split the frontend further into multiple micro frontends that various teams can release and roll back independently. The index page of a micro frontend single-page application (SPA) is usually composed dynamically by a backend service (the Node.js® dynamic application service in the diagram below).

The dynamic application service handles all the incoming web traffic for 200+ applications, including TurboTax. Its purpose is to compose the initial SPA HTML index page dynamically, based on application configurations such as analytics, logging, and experimentation. The composition also includes micro frontend artifacts (aka plugins) and other application-specific context obtained from dependent services.
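
As a rough illustration of what "composing the index page" can mean, the sketch below assembles a skeleton HTML page from an application configuration and a list of plugin script URLs. It is not the actual service (which runs on Node.js): the function name, the `window.__APP_CONFIG__` global, and the page layout are all hypothetical and are chosen only to show the idea.

```python
import json
from html import escape
from typing import Any, Mapping, Sequence


def compose_index_html(app_name: str, config: Mapping[str, Any],
                       plugin_urls: Sequence[str]) -> str:
    """Build a skeleton SPA index page that contains no user-specific data."""
    # Embed the configuration as JSON; escape "</" so the payload
    # cannot close the surrounding <script> element early.
    config_json = json.dumps(dict(config)).replace("</", "<\\/")
    # One deferred <script> tag per micro frontend artifact (plugin).
    scripts = "\n".join(
        f'    <script src="{escape(url, quote=True)}" defer></script>'
        for url in plugin_urls
    )
    return (
        "<!doctype html>\n"
        "<html>\n"
        "  <head>\n"
        f"    <title>{escape(app_name)}</title>\n"
        f"    <script>window.__APP_CONFIG__ = {config_json};</script>\n"
        "  </head>\n"
        "  <body>\n"
        '    <div id="root"></div>\n'
        f"{scripts}\n"
        "  </body>\n"
        "</html>\n"
    )


page = compose_index_html(
    "demo-app",                                  # hypothetical application
    {"analytics": True, "logging": "info"},      # hypothetical configuration
    ["https://cdn.example.com/plugin-a.js"],     # hypothetical plugin URL
)
```

Because the page carries only application-level configuration and plugin references, the same HTML is valid for every user of that application.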

Single point of failure

If the application service fails to compose the initial SPA HTML and respond to the browser’s request, users cannot access the application. Isolation and high availability and disaster recovery (HADR) measures that keep the service available in multiple regions are not sufficient on their own. Even efficient auto-scaling is not enough, because the service can still fail to respond due to downstream dependencies or bad code.

CDN fallback mechanism

What is our solution to ensure that the user is not impacted by any initial system misconfigurations? It is our CDN fallback mechanism, an architecture that behaves like a time machine: it gives the user the product experience they would have received before the failure began.

The CDN fallback mechanism has two parts:

  1. An automated job that periodically calls the Node.js application service for an application (e.g., https://myturbotax.intuit.com/), captures the dynamically generated time-lapse SPA index HTML page of the application, and uploads that snapshot to the CDN.
  2. A gateway check that, on each real user request, detects signs of failure at the application service by looking both for 5xx/429 errors and for a time threshold (e.g., 4 secs) within which the service should respond.
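
The first part, the snapshot job, could look roughly like the sketch below. The `fetch` and `upload` callables, the CDN key, and the rule "only upload a healthy 200 response" are our assumptions for illustration; the original job and its CDN client are not shown in this post.

```python
from typing import Callable, Dict, Tuple

FetchFn = Callable[[str], Tuple[int, str]]   # url -> (status code, html)
UploadFn = Callable[[str, str], None]        # (cdn key, html) -> None


def snapshot_once(app_url: str, cdn_key: str,
                  fetch: FetchFn, upload: UploadFn) -> bool:
    """Capture the current index page and upload it to the CDN.

    Returns True if a new snapshot was stored. Assumption: a non-200 or
    empty response is skipped so that the last good snapshot is never
    overwritten with an error page.
    """
    status, html = fetch(app_url)
    if status != 200 or not html.strip():
        return False
    upload(cdn_key, html)
    return True


# Usage with in-memory stand-ins for the HTTP client and the CDN.
cdn: Dict[str, str] = {}
stored = snapshot_once(
    "https://myturbotax.intuit.com/",
    "myturbotax/index.html",                 # hypothetical CDN key
    fetch=lambda url: (200, "<html>skeleton</html>"),
    upload=cdn.__setitem__,
)
```

A scheduler (cron or similar) would call `snapshot_once` at a fixed interval for each application, so the CDN always holds a recent copy taken while the service was healthy.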

In a normal scenario, the application service serves the SPA index page to the user directly. However, when the gateway detects a failure for incoming traffic, it automatically redirects the traffic to the matching application-specific CDN URL, where the time-lapse copy of the index HTML page is available. The SPA index HTML is structured so that the skeleton application can start without any user context; all the user-specific context data is fetched from the browser side. As a result, the SPA web app remains fully functional for the end user to navigate even when the application service is down.
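
The gateway's decision can be sketched as a small function: call the service, classify the result, and fall back to the CDN copy on failure. This is a simplified model under stated assumptions, not the real gateway: `call_service`, the `cdn_urls` mapping, the use of an HTTP 302 for the redirect, and treating an exception as a failure are all hypothetical choices. A production gateway would also abort a slow call at the threshold instead of measuring it afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple

TIMEOUT_SECONDS = 4.0  # example threshold from the text

# app id -> (status code, body, elapsed seconds)
ServiceCall = Callable[[str], Tuple[int, str, float]]


@dataclass(frozen=True)
class GatewayResponse:
    status: int
    body: str = ""
    location: str = ""  # set only when redirecting to the CDN


def is_failure(status: int, elapsed: float,
               timeout: float = TIMEOUT_SECONDS) -> bool:
    """A 429, any 5xx, or a response slower than the threshold is a failure."""
    return status == 429 or 500 <= status <= 599 or elapsed > timeout


def handle_request(app_id: str, call_service: ServiceCall,
                   cdn_urls: Mapping[str, str]) -> GatewayResponse:
    """Serve from the application service, or redirect to the CDN snapshot."""
    try:
        status, body, elapsed = call_service(app_id)
    except Exception:
        # Assumption: connection errors count as service failures.
        status, body, elapsed = 503, "", 0.0
    if is_failure(status, elapsed) and app_id in cdn_urls:
        return GatewayResponse(status=302, location=cdn_urls[app_id])
    return GatewayResponse(status=status, body=body)


cdn_urls = {"turbotax": "https://cdn.example.com/turbotax/index.html"}  # hypothetical
slow = handle_request("turbotax", lambda app: (200, "<html>live</html>", 6.5), cdn_urls)
# slow.status is 302 and slow.location points at the CDN snapshot
```

Each request is classified on its own, so no explicit recovery step is needed in this model: as soon as the service answers successfully within the threshold, requests pass straight through again.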

When the gateway starts receiving successful responses again (status code 200) within the threshold time (4 secs), the traffic will resume the normal flow automatically.

Designing for a graceful user experience

The CDN failover mechanism enabled 100% first-page availability for our TurboTax users, avoiding blank or unexpected server error screens. It also reinforced our design thinking: make sure the user sees a meaningful first page, and bootstrap all the necessary user-specific context and content progressively on the browser side.

Learnings

During the February 2021 tax filing start period, we released the failover infrastructure for TurboTax. It kept thousands of internal downstream service failures from impacting user access to TurboTax services, failures that carried the potential for tens of thousands of dollars in revenue loss.

We’re proud to have delivered on the promise of a resilient system that solves for high availability for such a critical business application during peak tax season. It’s a win-win for our customers and Intuit’s business.
