Automating Resiliency: How To Remain Calm In The Midst Of Chaos
By Shan Anwar and Balaji Arunachalam
The Case for Change at Intuit
When any company decides to migrate to the public cloud in order to better scale its product offerings, there will be challenges, including those involving manual testing. For Intuit, the proud maker of TurboTax, QuickBooks, and Mint, this meant breaking down the monolith into hundreds of micro-services and requiring everything to be automated and available via pipelines. To scale exponentially and support dozens of micro-services across multiple regions, the team needed a proof of concept for automating manual resiliency testing. During this proof of concept, the Intuit team created several homegrown tools to embed a resiliency culture and mindset among developers early in the Software Development Life Cycle (SDLC).
This blog describes one such resiliency tool, CloudRaider, which helped accelerate Intuit’s goal of becoming highly resilient and highly available during its journey to the cloud.
Resiliency Testing at Intuit
As Intuit moved from a single data center to dual data centers, HA/DR (High Availability and Disaster Recovery) testing became incredibly important. The team started with a well-structured process. It brought together a variety of engineers (developers, QE, App Ops, DBAs, network engineers, etc.) in long sessions to identify possible system failures and to document the expected system behaviors, including alerts and monitoring. After appropriate prioritization (based on severity, occurrence frequency and ease of detection), the team executed these failures in a pre-production environment to prepare the system for better resiliency.
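One common way to combine severity, occurrence and detection in failure mode and effects analysis (FMEA) is the risk priority number (RPN), the product of the three ratings. The Java sketch below illustrates that ranking. The class names, the 1 to 10 rating scale and the sample failure modes are assumptions made for illustration, not Intuit's actual worksheet.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical record of one failure mode; ratings use an assumed 1-10 scale.
final class FailureMode {
    final String description;
    final int severity;    // 10 = most severe customer impact
    final int occurrence;  // 10 = happens most often
    final int detection;   // 10 = hardest to detect

    FailureMode(String description, int severity, int occurrence, int detection) {
        this.description = description;
        this.severity = severity;
        this.occurrence = occurrence;
        this.detection = detection;
    }

    // Risk priority number: severity x occurrence x detection.
    int rpn() {
        return severity * occurrence * detection;
    }
}

final class FmeaPrioritizer {
    // Returns a new list ordered from highest to lowest RPN.
    static List<FailureMode> prioritize(List<FailureMode> modes) {
        List<FailureMode> sorted = new ArrayList<>(modes);
        sorted.sort(Comparator.comparingInt(FailureMode::rpn).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Sample data is invented for the example.
        List<FailureMode> modes = List.of(
            new FailureMode("web server instance terminated", 7, 4, 2),
            new FailureMode("database unreachable", 9, 2, 3),
            new FailureMode("disk full on backend host", 5, 3, 6));
        for (FailureMode m : prioritize(modes)) {
            System.out.println(m.rpn() + "  " + m.description);
        }
    }
}
```

The failure modes at the top of the sorted list would be the first candidates to execute in pre-production.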
This manual approach generally helped to identify system resiliency defects, but it still had significant restrictions and gaps. It was a time-consuming process that required multiple engineers’ time and could get very expensive, and it had to be repeated as regression testing for every future system change. The FMEA (Failure Mode and Effects Analysis) testing was also conducted after system implementation, so it worked counter to the shift-left model, which aims to uncover system resiliency issues early in the SDLC.
In moving to the cloud, the teams started adopting chaos testing in production. This did not close the gaps either, because chaos testing occurred post-production and could not be run as continuous regression testing. Chaos testing turned out to be a nice complement to FMEA testing, but not a replacement for it. As an ad-hoc testing methodology, it needed to be paired with a structured testing approach that prepared systems before chaos was introduced into production.
The resulting requirements are listed below:
- Resiliency testing had to become part of the system design, not an afterthought.
- Shift-left resiliency testing would let developers apply test-driven design and development to system resiliency.
- Tests (including pre- and post-validation) would need to be fully automated and available as regression tests in the release pipeline.
- Reverting failures would also need to be automated as part of testing.
- Test code would need to be written in natural language so that the same text could serve as the system resiliency design requirement document.
- A 100% pass on the automated resiliency test suite would be a prerequisite for chaos testing in production.
These requirements led to the creation of an in-house resiliency testing tool called “CloudRaider”.
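The automation requirements above (pre- and post-validation plus an automated revert) suggest a test shape like the following Java sketch. It is an illustration only and not CloudRaider code: `ResiliencyTest`, `FakeService` and their methods are hypothetical names, and the failure is simulated in memory rather than injected into real infrastructure.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a system under test; a real test would call cloud APIs.
final class FakeService {
    private boolean primaryUp = true;
    private final boolean standbyAvailable;

    FakeService(boolean standbyAvailable) {
        this.standbyAvailable = standbyAvailable;
    }

    void stopPrimary() { primaryUp = false; }
    void startPrimary() { primaryUp = true; }
    boolean primaryUp() { return primaryUp; }

    // The service answers if the primary is up or a standby can take over.
    boolean isHealthy() { return primaryUp || standbyAvailable; }
}

final class ResiliencyTest {
    // Runs one scenario: pre-validate, inject the failure, post-validate, always revert.
    static boolean run(BooleanSupplier healthy, Runnable inject, Runnable revert) {
        if (!healthy.getAsBoolean()) {
            throw new IllegalStateException("pre-validation failed: system unhealthy before test");
        }
        try {
            inject.run();
            return healthy.getAsBoolean();  // post-validation under failure
        } finally {
            revert.run();  // revert even if injection or validation throws
        }
    }

    public static void main(String[] args) {
        FakeService resilient = new FakeService(true);
        FakeService fragile = new FakeService(false);
        System.out.println(run(resilient::isHealthy, resilient::stopPrimary, resilient::startPrimary));
        System.out.println(run(fragile::isHealthy, fragile::stopPrimary, fragile::startPrimary));
    }
}
```

Putting the revert in a `finally` block means the environment is restored whether the scenario passes, fails or throws, which is what makes such a test safe to repeat in a pipeline.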
How Intuit Innovated with CloudRaider: D4D (Design 4 Delight)
During Intuit’s migration to the public cloud, the challenges of manual FMEA testing continued and a proof of concept to automate FMEA tests was created by applying Intuit’s Design for Delight principles.
Principle #1: Deep Customer Empathy
Our systems needed to be resilient; failures could not be allowed to impact customers.
Principle #2: Go Broad to Go Narrow
An ideal state was fully resilient systems with automated regression to validate.
Principle #3: Rapid Experiments with Customers
During our experimentation, we asked teams to use our automation. At first, we tried to automate a few specific scenarios to confirm the value of automation. That approach did not scale, so we had to go back and try new ideas for making scenarios easier to write and execute.
After further experimentation, we solved the problem by applying a behavior-driven development process in which the scenario is written first. This process helped us identify common scenarios and led us to develop a domain-specific language (DSL). The DSL provided a way to construct new scenarios dynamically and to use more general code definitions to invoke any failure.
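To make the idea concrete, here is a minimal Java sketch of how natural-language steps can be bound to general-purpose code. It does not use Cucumber or CloudRaider's actual step definitions; `StepRegistry`, the step wording and the recorded actions are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mini step registry: maps sentence patterns to handlers.
final class StepRegistry {
    private final Map<Pattern, Consumer<Matcher>> steps = new LinkedHashMap<>();

    void define(String regex, Consumer<Matcher> handler) {
        steps.put(Pattern.compile(regex), handler);
    }

    // Runs each line of a scenario against the first matching step definition.
    void run(List<String> scenario) {
        for (String line : scenario) {
            boolean matched = false;
            for (Map.Entry<Pattern, Consumer<Matcher>> e : steps.entrySet()) {
                Matcher m = e.getKey().matcher(line);
                if (m.matches()) {
                    e.getValue().accept(m);
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("No step definition for: " + line);
            }
        }
    }
}

final class DslDemo {
    // Builds a registry whose handlers only record what they would do.
    static List<String> execute(List<String> scenario) {
        List<String> actions = new ArrayList<>();
        StepRegistry registry = new StepRegistry();
        registry.define("Given a service named \"(.+)\"",
            m -> actions.add("target=" + m.group(1)));
        registry.define("When (\\d+) instances? (?:is|are) terminated",
            m -> actions.add("terminate=" + m.group(1)));
        registry.define("Then the service stays healthy",
            m -> actions.add("assert-healthy"));
        registry.run(scenario);
        return actions;
    }

    public static void main(String[] args) {
        System.out.println(execute(List.of(
            "Given a service named \"login-frontend\"",
            "When 1 instance is terminated",
            "Then the service stays healthy")));
    }
}
```

Because the service name and instance count are captured from the sentence, the same three step definitions cover a frontend or a backend and any number of instances without new code.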
The automation of failures reduced execution time significantly but the question about effectiveness remained. This opened up ideas about automating the process to verify the impact of the failures and to measure the effectiveness of system recovery (see end-to-end design diagram).
CloudRaider in Action
Example: Simple login service
Let’s look at an example of a very simple login micro-service that consists of a frontend (web server) and a backend service running in AWS. Even in this simple architecture there are multiple possible failures (see table):
All of the above scenarios are very general and can be applied to any service or system design. In our example, we could have the same failures executed for either frontend or backend. We created these scenarios via CloudRaider (see sample code).
In the scenario above, the implementation details were all abstracted away and the test was written in natural-language constructs. It was also data driven, so the same scenario could be executed under different criteria, which made it reusable.
A slightly modified scenario covers the case where the login service is unavailable due to very high CPU consumption (see code).
This high-CPU scenario differed from the first only in its failure condition, which made it easy to construct.
In reality the login service architecture would have many more complexities and critical dependencies. Let’s expand to include authorization of OAuth2 tokens and a risk screening service. Both are external (see diagram).
This expanded architecture introduced resiliency implications such as slow response times or the unavailability of a critical dependency. In CloudRaider, we could include scenarios to mimic such behaviors by injecting network latency or blocking domains (see code).
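The effect of a slow dependency can be explored without any infrastructure. The Java sketch below is an in-process simulation and not how CloudRaider injects latency: it wraps a hypothetical risk screening call with an artificial delay and checks that the caller falls back once its time budget is exceeded. The names, the timeouts and the "ALLOW"/"REVIEW" values are assumptions for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class LatencyInjection {
    // Wraps a dependency call so that every invocation is delayed by delayMillis.
    static <T> Supplier<T> withLatency(Supplier<T> call, long delayMillis) {
        return () -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return call.get();
        };
    }

    // Calls the dependency with a time budget; returns the fallback if it is
    // exceeded or if the call fails.
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Hypothetical risk screening dependency that normally answers at once.
        Supplier<String> riskScreening = () -> "ALLOW";
        Supplier<String> slowRiskScreening = withLatency(riskScreening, 500);

        System.out.println(callWithTimeout(riskScreening, 1000, "REVIEW"));
        System.out.println(callWithTimeout(slowRiskScreening, 100, "REVIEW"));
    }
}
```

Whether a login flow should fall back, retry or fail closed when a dependency is slow is a design decision for each service; the sketch only shows how injected latency makes that decision testable.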
So far we have discussed simple failure scenarios, but in reality, modern applications are more complex and run in multiple data centers and/or regions. Our previous example could be extended to a multi-region scenario (see diagram). Applications running in multiple regions could be highly available and still recover automatically if one of the regions went down.
In CloudRaider, we could write code to take down a region much as in the earlier scenarios, and we could also assert our region failover strategy with the help of the Amazon Route 53 service (see code).
CloudRaider is an open-source library written in Java that leverages Behavior Driven Development (BDD) via the Cucumber/Gherkin framework. The library is integrated with AWS to inject failures.
GitHub link: https://github.com/intuit/CloudRaider/
Benefits of an Automated and Structured Resiliency Testing Process
What used to take more than a week of heavy coordination and manual test execution by many engineers became a three-hour automated execution that required no manual effort. This process enabled us to test system resiliency on a regular basis and catch any regression issues. Having these automated tests in the release pipeline also gave us very high confidence in our product releases and caught resiliency issues before they turned into production incidents. It also gave us more confidence to execute ad-hoc chaos testing in production. The tool enabled developers to think about resiliency as part of design and implementation and to own the testing of their systems’ resiliency.
Product adoption suffers if a product is not highly available for customers to use. With increasing complexity and dependencies in the micro-service architecture world, it is impossible to avoid failures in a system’s ecosystem. We learned that our systems needed to be built to recover proactively from failures, with appropriate monitoring and alerts. Testing the systems’ resiliency in an automated and regular way was a must; the sooner the test happened in the SDLC, the less expensive it was to fix the problem. With well-structured, fully automated and regularly executed resiliency tests, our team gained more confidence to execute ad-hoc chaos testing in production.