Resiliency Doctor — A tool to achieve resiliency in hybrid cloud application ecosystems
We detailed our journey to achieving resiliency and some of our lessons learned during the transition in the previous post. But resiliency cannot be achieved by one-off efforts. It requires continuous checks on the system to ensure we do not drift from our goals. From scheduling and executing game days to defining multiple levels of resiliency, the application teams are constantly focusing on testing resilience to all sorts of failures and chaos in the system.
In the previous post, we spoke about a tool we built in-house to start diagnosing and providing the status of our applications deployed in the hybrid cloud — Resiliency Doctor (or DocRX for short). This tool provides the ability for teams to keep a check on their resiliency goals. It assists teams in measuring resiliency via non-invasive static checks and also by running active checks to verify if goals are being achieved as expected. It also monitors and informs users when their systems are in violation of the requirements, so teams can act to fix vulnerabilities.
For an e-commerce giant like Walmart, the holiday season is a critical opportunity. Traffic rises sharply during the holidays as customers rush to shop, and managing such high web traffic is a challenge for any online retailer. Applications serving user requests must be resilient: there need to be smart policies for disaster recovery and support for customers’ purchases 24/7. Failure to meet these expectations could result in degraded service and a huge revenue loss. So, before any holiday season starts, the engineering teams run checks to understand whether applications are “healthy” enough to withstand various kinds of failures. These check-ups serve as a qualifier and a gatekeeper before anyone attempts large-scale resiliency testing. To enforce this practice of performing “regular health check-ups” on applications, we designed the Resiliency Doctor.
Initially, DocRX was designed around OneOps, an open-source cloud lifecycle management platform by Walmart Labs. It was developed to save our engineers the time spent manually verifying the checkpoints for an application deployment. It started with some basic health checks and inferences, and as it grew popular its usefulness became clear. We then decided to make it platform-agnostic and use it as a diagnostic check to verify the health of our applications. With more and more applications being deployed to OneOps and public clouds, we had a use case to support hybrid cloud deployments.
DocRX does all the heavy lifting for us with a simple one-page dashboard that displays the stats for an application, along with a report, updated in near real time, that describes its state of resiliency across the cloud platforms where it could be deployed. This gives us a unified report for auditing the deployment strategies used for hybrid cloud applications.
Core aspects of Resiliency and DocRX
We began using DocRX to determine the resilience state of an application as a pre-check for performance and load testing activities. The two key deployment health checks that we implemented initially were the HADR and ECV checks.
HA (High Availability): checking the availability of applications during planned and unplanned outages. For example, during a system upgrade, an application must be able to withstand outages gracefully and provide continuous processing for business-critical functions. High Availability is therefore all about avoiding single points of failure and ensuring that the application continues to process requests. For an application to pass this check, it has to be deployed in multiple cloud regions so that it can keep processing requests when a region is down.
DR (Disaster Recovery): an entire data center can suffer a catastrophic interruption, as happened to Microsoft last year. Even then, the application should continue to process requests with minimal or zero business impact. To make that possible, the application footprint must exist across multiple data centers.
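A static HADR check of this kind could be sketched roughly as follows. This is not DocRX's actual implementation: the `Deployment` record, its field names, and the default thresholds of two regions and two data centers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One deployed footprint of an application (hypothetical model)."""
    region: str
    data_center: str
    healthy_instances: int


def hadr_status(deployments, min_regions=2, min_data_centers=2):
    """Report whether the HA and DR checks pass for a list of deployments."""
    # Only footprints with at least one healthy instance can serve traffic.
    live = [d for d in deployments if d.healthy_instances > 0]
    regions = {d.region for d in live}
    data_centers = {d.data_center for d in live}
    return {
        "ha": len(regions) >= min_regions,            # survives a region outage
        "dr": len(data_centers) >= min_data_centers,  # survives a data center loss
    }


# Example: two data centers, but both in the same (made-up) region.
example = [
    Deployment(region="region-a", data_center="dc-1", healthy_instances=4),
    Deployment(region="region-a", data_center="dc-2", healthy_instances=4),
]
print(hadr_status(example))  # {'ha': False, 'dr': True}
```

Counting only footprints with healthy instances is a design choice in this sketch: a deployment that exists on paper but has nothing running would not help during an outage.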
Why do we need ECV checks?
ECV stands for Enhanced Content Verification. ECV checks are crucial for determining whether an application is truly functional, not just answering pings. For a load balancer to know that a compute instance or service is available, an ECV check must be configured properly.
Consider a scenario where one cloud region is already overloaded with traffic, yet the load balancer keeps sending requests to that same region because, without an ECV check in place, it cannot find another region to route to. In this case, although the infrastructure can support high traffic, the load is not being balanced efficiently.
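To make the difference between “answering pings” and “truly functional” concrete, here is a minimal sketch of a content-verifying probe. It is an illustration only, not how our load balancers or DocRX implement ECV: the function names, the expected marker string, and the two-second timeout are assumptions.

```python
import http.client
import urllib.request


def ecv_passes(status, body, expected_marker):
    """Pass only if the response is a 200 AND contains the expected content."""
    return status == 200 and expected_marker in body


def probe(url, expected_marker, timeout=2.0):
    """Fetch a health URL and apply the content check; any error counts as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return ecv_passes(resp.status, body, expected_marker)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses.
        return False


def eligible_regions(probe_results):
    """Regions a load balancer could route to, given {region: ecv_ok}."""
    return sorted(region for region, ok in probe_results.items() if ok)


# A server that answers 200 with a maintenance page still fails the check.
print(ecv_passes(200, "Down for maintenance", "status: OK"))  # False
print(ecv_passes(200, "status: OK", "status: OK"))            # True
```

With results like `{"region-a": True, "region-b": False}`, `eligible_regions` returns only `region-a`, which is the information a load balancer lacks when no ECV check is configured.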
If the HADR and ECV checks aren’t in place, then the entire pipeline of the “user requests flowing in through the load balancer down to the compute which serves the request” would be broken. These checks are critical to ensure that the load balancer would redirect the requests to a stable data center during a catastrophic disruption. Failure to meet these minimum requirements could result in degraded service and a huge business loss during high-traffic times.
Active checks and Slack bot
There are many teams at Walmart, such as the Central ops, Cloud audit, and Site Reliability teams, that leverage the diagnostic report provided by DocRX. The visibility it provides into configuration data in both public and private clouds acts as a one-stop shop, so teams do not have to search for information all over the place during a service downtime. It also came in handy when teams wanted to proactively monitor and ensure application performance.
To encourage the use of this tool, a Slack bot was built that focuses on the active checks. An active check can be scheduled to “poll” a service for status information at regular intervals. It periodically checks the status of the application: whether the instances are healthy, the metrics are flowing in, the monitoring is in place, and so on. This turned out to be especially helpful for newly deployed applications and for applications recovering from downtime, providing them with a historical trend of their resilient state. It was useful not only as a stand-alone tool; we also started using it with other performance testing tools at Walmart, in scenarios where we could say, “don’t run a perf test when the system isn’t resilient or at least expose the inherent risks when doing so”.
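The polling and gating idea can be sketched as below. Everything here is hypothetical and only meant to illustrate the pattern: the `ActiveCheck` class, the check names, and the `perf_test_gate` helper are not part of DocRX or the Slack bot.

```python
import time
from collections import deque


class ActiveCheck:
    """Runs a set of named checks and keeps a bounded history of the results."""

    def __init__(self, checks, history_size=100):
        self.checks = checks  # {name: zero-argument callable returning bool}
        self.history = deque(maxlen=history_size)

    def run_once(self, now=None):
        results = {}
        for name, check in self.checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False  # a check that crashes counts as failing
        snapshot = {
            "timestamp": time.time() if now is None else now,
            "results": results,
            "resilient": all(results.values()),
        }
        self.history.append(snapshot)  # history gives the trend over time
        return snapshot

    def poll(self, interval_seconds, iterations):
        """Run the checks repeatedly, sleeping between runs."""
        for i in range(iterations):
            self.run_once()
            if i < iterations - 1:
                time.sleep(interval_seconds)


def perf_test_gate(snapshot, allow_risky=False):
    """Decide whether a perf test may start, and list the failing checks as risks."""
    failing = sorted(name for name, ok in snapshot["results"].items() if not ok)
    allowed = snapshot["resilient"] or allow_risky
    return allowed, failing


# Stand-in checks; real ones would query instances, metrics and monitoring.
checker = ActiveCheck({
    "instances_healthy": lambda: True,
    "metrics_flowing": lambda: False,
    "monitoring_in_place": lambda: True,
})
snapshot = checker.run_once()
print(perf_test_gate(snapshot))                    # (False, ['metrics_flowing'])
print(perf_test_gate(snapshot, allow_risky=True))  # (True, ['metrics_flowing'])
```

Returning the failing checks even when the test is allowed to proceed mirrors the second half of the rule above: if a team decides to run anyway, the inherent risks are still exposed.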
We built DocRX to raise awareness among all the development teams at Walmart of the weaknesses that may become potential failures in the deployment of applications. Diagnostic checks as prerequisites make our game days stronger and give us the confidence to move quickly in a very complex system. If this has piqued your interest and you have some thoughts to share on how to advance the state of the art in this field, feel free to leave a comment. If you are interested in working with us, please visit the careers page.