Chaos Engineering: How We’re Building Resilience in Adobe Experience Platform
Authors: Soumya Lakshmi, Eddie Bernard, and Jenny Medeiros
This post describes how Adobe leverages chaos engineering to build resilience and confidence across Adobe Experience Platform services and applications.
With the growing complexity of large-scale, distributed systems, it has become increasingly difficult to predict how deployed services will behave under stress in production. In most cases, failures lead to unexpected outages, which are not only frustrating for customers but can cost organizations over $100,000 within a single hour.
As a result, organizations are challenged with building resilient systems that can withstand turbulent conditions in production. To meet this challenge head-on, at Adobe Experience Platform we turned to chaos engineering to uncover weaknesses in our systems and fix them before they affect our customers. Through this proactive approach to testing, we can continually strengthen our development process to ensure a higher level of confidence, service availability, and performance for our customers.
In this post, we detail our approach to chaos engineering and describe the learnings from our first GameDay. We also give a glimpse into what we’re working on next and what we’re planning to release to the open-source community.
Chaos engineering builds resiliency in our systems from inception
With Adobe Experience Platform, resiliency is at the heart of every production deployment. To achieve this, we promote a culture of learning and an in-depth understanding of how our systems behave in real-world conditions through the regular practice of chaos engineering.
Chaos engineering is a discipline in which teams run controlled experiments, hypothesize the results, and analyze the outcomes to reinforce the development process. This pre-emptive practice enables developers to build resilience into their systems and verify the effectiveness of existing fallback measures, increasing availability and reducing downtime in case of failure.
Integrating resiliency, however, works best when built into a system from the ground up. It must start at the infrastructure level and be woven into the entire application lifecycle — from the network and data layer to the application’s design.
To encourage this powerful practice across Adobe Experience Platform teams, we began conducting Chaos GameDays, an event where we induce a series of controlled failures to identify weaknesses in a team’s system.
Our approach to GameDays
To run a GameDay, we typically schedule four to five hours with a team to inject pre-defined stressors into their system. We then observe how the system reacts to each injection and work with the team to redefine how it should behave in the face of failure.
To ensure an in-depth analysis of a system during each GameDay, we adopted the following general structure:
1. Understand the system
The first step is to meet with the team to understand how their system is built and architected, identify its dependencies, and determine at which points and levels we can inject failures.
2. Define a “steady-state”
A steady-state is how the system behaves under normal conditions. For this, we define baseline technical and business metrics to monitor throughout the GameDay. The harder it is to disrupt the steady-state, the more confidence we have in the system’s behavior.
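As a rough illustration of what monitoring a steady-state can look like, the sketch below compares observed metrics against a baseline and flags any that drift beyond a tolerance band. The metric names, values, and the 10% tolerance are hypothetical, not our actual baselines.

```python
# Hypothetical sketch of a steady-state check; metric names, values,
# and the 10% tolerance are illustrative, not actual baselines.

def within_steady_state(baseline: dict, observed: dict, tolerance: float = 0.1) -> bool:
    """Return True if every observed metric stays within `tolerance`
    (fractional deviation) of its baseline value."""
    for metric, base in baseline.items():
        if base == 0:
            continue  # skip zero-valued baselines to avoid dividing by zero
        if abs(observed[metric] - base) / base > tolerance:
            return False
    return True

baseline = {"requests_per_sec": 1200, "p99_latency_ms": 250, "error_rate": 0.010}
healthy  = {"requests_per_sec": 1150, "p99_latency_ms": 260, "error_rate": 0.0105}
degraded = {"requests_per_sec": 400,  "p99_latency_ms": 900, "error_rate": 0.200}
```

The harder it is to push these metrics outside the band during an experiment, the stronger the evidence that the system holds its steady-state.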
3. Determine injected failures
In this step, we define the number and type of injections. Typically, we limit a GameDay to just four to five injections so we have enough time to properly observe and understand the system’s reaction to each one.
Currently, we consider two types of injections:
- Basic injections manipulate resources, state, and the network. For example, we may kill one or more processes, shut down a region, fill up disk space, or reduce available CPU.
- Custom injections target failures tailored to a specific system. In this case, we work with the team to identify and build these failures prior to GameDay.
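To give a sense of what a basic injection looks like at the command level, the sketch below maps injection types to standard Linux commands (`kill`, `tc`/netem, `fallocate`) that could implement them. The dry-run wrapper is our own illustration, not the tooling Adobe actually uses.

```python
# Illustrative mapping from basic injection types to Linux commands.
# The commands (kill, tc qdisc/netem, fallocate) are standard; this
# dry-run wrapper is a hypothetical sketch, not actual GameDay tooling.

def build_injection(kind: str, **params) -> list:
    """Return the shell command (as an argv list) for a basic injection."""
    if kind == "kill_process":
        return ["kill", "-9", str(params["pid"])]
    if kind == "add_latency":
        # netem adds artificial delay on a network interface
        return ["tc", "qdisc", "add", "dev", params["iface"],
                "root", "netem", "delay", f"{params['delay_ms']}ms"]
    if kind == "fill_disk":
        # allocate a large file to consume disk space
        return ["fallocate", "-l", params["size"], params["path"]]
    raise ValueError(f"unknown injection kind: {kind}")
```

In a real experiment, commands like these would run under supervision, with a matching rollback (e.g. `tc qdisc del`) prepared before each injection.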
4. Establish a hypothesis
We document what we anticipate will happen to the system after injecting each failure. During the GameDay, we compare this hypothesis to what’s happening in reality.
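One lightweight way to keep the hypothesis and the reality side by side is a simple record per injection. The structure below is a hypothetical sketch; the sample entries echo the GameDay described later in this post.

```python
# Hypothetical record for pairing each injection with its expected and
# observed behavior; the structure and field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class InjectionHypothesis:
    injection: str
    expected: str                      # what we anticipate will happen
    observations: list = field(default_factory=list)

    def record(self, note: str) -> None:
        """Log what actually happened during the GameDay."""
        self.observations.append(note)

h = InjectionHypothesis(
    injection="shut down 50% of containers",
    expected="remaining containers absorb traffic with no errors",
)
h.record("nine-second delay before database backup systems kicked in")
```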
5. Set the blast radius
An initial GameDay should start with a small blast radius, such as shutting down a container or increasing network latency. As the system builds resiliency, the blast radius should escalate to more “severe” injections, such as powering off an entire rack of servers or disabling a system’s primary network path. As the blast radius increases, so does the level of confidence in how the system will perform under stress in production.
6. Share key takeaways
The final step is about learnings. We analyze key takeaways from the GameDay that can be turned into best practices or actionable items, then share them with the team so they can strengthen their system before those unintended behaviors manifest in production at large.
Our first GameDay: Testing a provisioning service
To test the practice of chaos engineering, we conducted a GameDay to analyze an internal provisioning component within Adobe Experience Platform. Here’s how it went and what we learned from it.
In keeping with the structure outlined above, we began by meeting with the team to understand their system. Their architecture roughly comprised an API layer, a caching layer, and a database layer.
We planned which services we would be targeting, defined the participants and their responsibilities, and created an “abort plan” to halt an experiment if key business metrics dropped below a pre-defined threshold. We also notified the teams responsible for the system’s upstream and downstream dependencies so they could be prepared for any potential impacts during GameDay.
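An abort plan can be as simple as a threshold check evaluated on every monitoring interval. The sketch below assumes hypothetical metric names and floor values; it is not the check we actually ran.

```python
# Hedged sketch of an abort check: halt the experiment as soon as any
# key business metric drops below its pre-defined threshold. Metric
# names and floor values are hypothetical.

def should_abort(metrics: dict, floors: dict) -> bool:
    """Return True if any business metric has fallen below its floor."""
    return any(metrics[name] < floor for name, floor in floors.items())

floors = {"successful_provisions_per_min": 50, "api_success_rate": 0.99}
```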
When the day arrived, we held a conference call with the team members involved with the systems being tested. Together, we ran a load test to establish a steady-state and set a benchmark for comparison after injecting the failures. With our hypotheses in hand and the team on the call, we proceeded to inject three failures, one by one, and observe the system’s behavior through quantifiable metrics using a combination of commercial and in-house tooling.
For the first injection, we shut down 50% of the system’s containers. Immediately, we noticed changes in the system as well as in dependent systems, including Adobe Ethos and the database. Most notably, we observed an unexpected nine-second delay before our database’s backup systems kicked in (see image below). This significant failure was not anticipated in our hypothesis for this injection.
For the second injection, we added 500ms of network latency. This revealed, among other things, a few issues with the way the provisioning component’s engine and API worked in conjunction.
For the third and final injection, we shut down all the containers. At this point, the system spiraled into absolute mayhem.
Sharing our learnings
After reviewing the results in our monitoring system, we identified areas for improvement and shared them with the team. One recommendation, for example, was to build resiliency around the system’s database dependencies.
As a result of this first GameDay, the team created a second version of the provisioning component, in which they incorporated the learnings and successfully addressed major shortcomings found in the previous version.
As for the chaos engineering team itself, we came away with three lessons of our own:
- After testing the resilience of the main system, the next step should be to test dependent systems or third-party APIs, which can impact the system and its customers.
- Considering the unique needs of internal Adobe systems, we needed system-specific monitoring tools to accurately measure performance and adequately test for characteristic failures.
- To streamline the planning process, we can create an internal Wiki page outlining the required information for a GameDay, then share it with the team to minimize meetings and make discussions more efficient.
While this initial GameDay was far from perfect, it demonstrated beyond question the importance of leading proactive efforts to learn from failure. As we refine our own process for running GameDays, we also help Adobe technologists iteratively build immunity into our systems and increase our confidence in every deployment.
Through chaos engineering, we have enabled teams working on Adobe Experience Platform to strengthen our systems and build confidence in our ability to innovate faster while providing highly resilient, available, and performant applications to our customers.
Based on how our systems are architected, however, we determined that existing commercial and open-source solutions lack the specificity required to test our purpose-built services. As a result, we’re currently developing a platform-agnostic tool capable of automating these tailored experiments. Eventually, we will make this tool open-source and welcome contributions from the open-source and chaos engineering communities.
Thanks to the work put in by the Chaos Engineering team, David Davtian, Angad Patil, and Mehdi Laouichi. Special shout out to Aaron Bartlett, who was instrumental in moving this project forward.