Part-2: Evaluating Application Resiliency with Keptn and LitmusChaos (use-case and demo)
A Quick Recap
In Part-1 of this blog series, we discussed the need for chaos engineering within continuous delivery pipelines and how a ready-made LitmusChaos integration for Keptn facilitates the implementation of chaos stages. In this blog, we illustrate how this integration works with the help of a real use-case involving the popular CNCF demo application “podtato-head”!
The content has been adapted from demonstrations made to the cloud-native community during the CNCF SIG-App-Delivery & Keptn Webinars. We will focus more on the “why” & “what” of this demonstration than on the “how” (steps, commands, and manifests involved). You can find details of the latter in this excellent tutorial from the Keptn team or in the GitHub repository for the litmus-service.
Before we proceed further on the use-case, here is a quick refresher on Keptn.
Keptn Application Lifecycle Orchestration
Keptn is an open-source application life-cycle orchestrator. The goal of Keptn is not to replace all your tooling, but to protect your investments by orchestrating your existing toolset without requiring you to write 1000+ lines of pipeline or integration code.
Communication to and from Keptn is done via CloudEvents, providing a clear and rich interface built on open standards. Tool integrations can be added to Keptn by connecting to the Keptn control-plane and subscribing to events, such as events that trigger deployments, tests, or remediation tasks that can go far beyond pure runbook automation. For the use case of this blog, we mainly focus on Keptn’s delivery and test orchestration capabilities, as well as the SLO-based quality gates of Keptn.
Use case: Examine the Resilience of a Hello-Service App Deployment
Automated deployment of an application via Keptn is typically chained with CI pipelines that generate the images or resource artifacts. In our use-case, the goal is to examine the resilience of one such application: the helloservice of the CNCF podtato-head. Our hypothesis as a developer or user of the helloservice is that (a) the app is nearly always available and (b) it can be accessed within a desired latency. We shall use a pod-kill chaos experiment to disrupt the state and verify whether our resilience hypothesis holds true (i.e., whether the service has been built/deployed to meet our expectations). The chaos is injected while the application is busy serving requests, since that reflects real-world conditions.
This process involves multiple steps: Deployment of the app (into a pre-prod/staging environment), Testing its resilience via chaos injection followed by Evaluation of SLOs that have been defined for this application. A successful evaluation is usually configured to trigger the promotion of the application into the “next” environment (say, production). Let us take a quick look at how this is implemented.
How to build a continuous evaluation workflow
In Keptn, environment definitions (stage) along with task sequences that have to be executed in these stages are defined declaratively in a so-called shipyard definition.
A simple shipyard for this use case defines a “chaos” stage with a “delivery” sequence that consists of a deployment, a test, and an evaluation task.
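To make the structure just described concrete, here is a minimal sketch that mirrors such a shipyard as a plain Python dictionary and lists the tasks of the “delivery” sequence in order. A real shipyard is a YAML file; the key layout below is an assumption modelled on the Keptn shipyard format, and the project name `litmus-demo` is hypothetical. Only the stage, sequence, and task names come from the description above.

```python
# Hypothetical in-memory mirror of a shipyard; a real one is a YAML file.
shipyard = {
    "metadata": {"name": "litmus-demo"},  # hypothetical project name
    "spec": {
        "stages": [
            {
                "name": "chaos",
                "sequences": [
                    {
                        "name": "delivery",
                        "tasks": [
                            {"name": "deployment"},
                            {"name": "test"},
                            {"name": "evaluation"},
                        ],
                    }
                ],
            }
        ]
    },
}


def task_order(shipyard, stage, sequence):
    """Return the task names of one sequence in the order they are listed."""
    for st in shipyard["spec"]["stages"]:
        if st["name"] != stage:
            continue
        for seq in st["sequences"]:
            if seq["name"] == sequence:
                return [task["name"] for task in seq["tasks"]]
    raise KeyError(f"no sequence {sequence!r} in stage {stage!r}")


print(task_order(shipyard, "chaos", "delivery"))
```

The point of the declarative form is that the order of tasks is data, not pipeline code: adding a stage or a task means editing the definition rather than writing new integration logic.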
The LitmusChaos integration with Keptn ties into the “test” task defined in the shipyard, meaning that it will be triggered via a CloudEvent when Keptn launches this task. The power of the event-based approach is that we can simultaneously trigger different tools, which is especially useful in the case of testing: we want to trigger chaos experiments while our service is undergoing performance tests. This way we can evaluate the resilience of our applications while simulating real-world scenarios.
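The fan-out idea can be illustrated with a tiny in-process event bus. This is only a sketch of the pattern, not Keptn's implementation: the `EventBus` class and both handlers are hypothetical, and the event type string is an assumption modelled on Keptn's “task triggered” naming.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process stand-in for an event-driven control plane."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, data):
        # Every subscriber of the event type receives the same payload.
        return [handler(data) for handler in self._subscribers[event_type]]


def jmeter_handler(data):
    return f"jmeter: load test started for {data['service']}"


def litmus_handler(data):
    return f"litmus: chaos experiment started for {data['service']}"


bus = EventBus()
# Assumed event type name for the "test" task being triggered.
TEST_TRIGGERED = "sh.keptn.event.test.triggered"
bus.subscribe(TEST_TRIGGERED, jmeter_handler)
bus.subscribe(TEST_TRIGGERED, litmus_handler)

for line in bus.publish(TEST_TRIGGERED, {"service": "helloservice"}):
    print(line)
```

In this sketch the handlers are called one after another inside a single process. In the real setup each tool integration is a separate service that reacts to the event on its own, which is what allows load test and chaos injection to overlap in time.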
How to evaluate the resilience
Once we have built the sequence that we want to run for each new version of our application to have its performance & chaos resilience tested, we need a way to measure the impact of chaos on our application under test (AUT). For this, we are going to use two built-in concepts of Keptn: Service-level indicators (SLI), and Service-level objectives (SLO).
Service-level indicators (SLIs) are metrics such as response time, failure rate, throughput, or any other metric that is relevant for our applications or even organizations. To judge them, we use Service-level objectives (SLOs), which are goals and thresholds set upon SLIs. An example can be as simple as “the error rate of login requests has to be less than 1% in the last hour”, but SLOs can get more complex in real-world situations. They are an excellent way to measure quality and to control whether new builds should be allowed to reach production. In our use case, we define SLOs for the resilience of our application and let Keptn evaluate them to learn whether the application is ready for production or not.
In our case, we define our objectives in an SLO.yaml file.
In the file, we define that the probe_success_percentage has to be higher than 95% (hypothesis (a): nearly always available) during an evaluation period for Keptn to give it a full pass. If that is not met but the value is still higher than 90%, Keptn gives half the points for this objective. We also evaluate the probe_duration_ms, which measures how fast the probe responds. This has to be faster than 200ms (hypothesis (b): accessible within the desired latency); otherwise no points are given by Keptn. In total, the quality evaluation of all objectives has to score 100% for Keptn to give a green light (pass), or higher than 75% to receive a warning; everything lower results in a failed quality evaluation.
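The scoring rules described above can be sketched in a few lines of Python. This is an illustration of the arithmetic, not Keptn's evaluation code: the function names are hypothetical, both objectives are assumed to carry equal weight, and the total-score thresholds are treated as inclusive (a score of exactly 75% counts as a warning), which is an assumption of this sketch.

```python
def score_objective(value, passes, warns=None, weight=1):
    """Full weight on pass, half the weight on warning, zero otherwise."""
    if passes(value):
        return weight
    if warns is not None and warns(value):
        return weight / 2
    return 0


def evaluate(sli_values, pass_at=100.0, warn_at=75.0):
    """Return (total score in percent, result) for the two objectives."""
    achieved = score_objective(
        sli_values["probe_success_percentage"],
        passes=lambda v: v > 95,
        warns=lambda v: v > 90,
    ) + score_objective(
        sli_values["probe_duration_ms"],
        passes=lambda v: v < 200,  # no warning band: full points or none
    )
    total = achieved / 2 * 100  # two objectives, assumed weight 1 each
    if total >= pass_at:
        return total, "pass"
    if total >= warn_at:
        return total, "warning"
    return total, "fail"


# Made-up SLI values for illustration only.
print(evaluate({"probe_success_percentage": 99, "probe_duration_ms": 120}))
print(evaluate({"probe_success_percentage": 93, "probe_duration_ms": 120}))
print(evaluate({"probe_success_percentage": 81, "probe_duration_ms": 250}))
```

With only two equally weighted objectives, a single miss on latency already caps the score at 50%, so the gate is strict by design: a green light requires both hypotheses to hold.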
Note that the actual data retrieval is abstracted away in this file and is therefore transparent to the user of the SLO file. This makes it easy to define and reuse SLOs even between different data providers. The data retrieval itself is managed by Keptn via SLI providers and defined in a dedicated file.
The Demo Environment
Now that we are familiar with the high-level steps involved in the continuous evaluation procedure, let us have a look at the various components in the demo environment, which are employed in the execution of this use-case.
Needless to say, the components described are all deployed within a standard Kubernetes environment that has publicly resolvable IP addresses/hostnames or a loadbalancer and that has Keptn installed already. A tutorial on how to replicate the setup of this blog article can be found in the Keptn tutorials hub.
Podtato-Head Helm Chart: The helloservice app is maintained in a CNCF GitHub repository. In our use case, it is deployed via Keptn using a Helm chart onto the cluster as part of the service onboarding process (via the Keptn CLI or API). Keptn uses a GitOps approach, with an internal git source, to manage the application on the cluster. The helloservice is initially deployed with a single replica (to highlight a resilience issue) and eventually updated to use multiple replicas (to verify the “fix”).
JMeter: JMeter is used to generate load on the helloservice app based on a predefined config (in this case, a performance profile). As mentioned earlier, Keptn allows for the execution of parallel tasks by triggering multiple tools simultaneously via its event-driven approach and dedicated control plane services. The tool is invoked via the JMeter Service. The intent is to simulate real-world traffic that puts the app under stress and provides standard conditions for SLO evaluation and relative benchmarking.
Litmus: The Litmus service triggers and monitors the chaos experiment, whose execution is carried out by the pre-installed platform components (chaos-operator, experiment custom resources & a suitable service account). The duration of the experiment is typically set to match the JMeter run to aid accurate evaluation; nevertheless, Keptn takes care of a correct evaluation even if the two are not aligned. The Litmus service packs a verdict/result into the test-finished event, which it generates based on the desired “checks” built into the experiment.
Prometheus: We are going to use Prometheus, a natively supported SLI source within Keptn (called SLI-provider in Keptn), which will hold the data exported by the blackbox exporter for the evaluation process. The Keptn control plane will then reach out to Prometheus and query the data for the app under test for the timeframe that needs to be evaluated. To ease setup and maintenance, Keptn creates necessary configurations such as scrape jobs or alerting rules automatically as part of the Prometheus service integration.
Blackbox Exporter: The blackbox exporter is a standard Prometheus exporter that allows probing of endpoints over http/https and generates useful metrics indicating service health and performance. In this use-case, the blackbox exporter is set up to probe the helloservice, with the metrics being used in the SLI configuration. The intention here is to not check for the response time or throughput of the application, but rather to probe if the application is available at all.
Putting Things Together
Bringing all pieces together results in the process outlined in the following image. The podtato-head helloservice application will be deployed by Keptn. Once deployment is finished, the Litmus chaos experiment as well as the JMeter load tests are triggered by Keptn. Prometheus constantly gathers data about the availability of the application under test from the probes run by the blackbox exporter. Keptn in turn queries this data to make use of it in the SLO-based quality evaluation.
To trigger the whole task sequence defined earlier, we can make use of either the Keptn CLI or API. Keptn will then apply the parameters passed to the Helm chart of the application that is managed by Keptn.
Before we execute a first test run, let’s have a look at the setup of our use case and the demo.
What Happens During the Chaos Experiment
The following image illustrates how the components for this use case interact with each other. JMeter test execution as well as LitmusChaos experiments are orchestrated by Keptn, while Prometheus is observing the state of the application under test using the blackbox exporter.
The blackbox exporter provides a couple of useful metrics that relate directly to our hypothesis requirements: probe_success, indicating the availability of the helloservice, and probe_duration_seconds, indicating the time taken to access it. As part of the Keptn SLI definition, Prometheus queries are executed to obtain their average as a percentage and in milliseconds, respectively.
The LitmusChaos Operator executes a job that identifies the helloservice pods (via namespace & label filters) and deletes a replica/pod. Because the pod belongs to a deployment, it is rescheduled. However, since the helloservice deployment is configured with a readiness probe (a general best practice that allows all startup routines to complete and prepare the app to serve requests), the endpoint corresponding to this pod only becomes available after a brief delay (30s in this case). The Litmus chaos experiment job waits for the successful rescheduling and readiness of the helloservice before completing its run.
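The unit handling behind these two SLIs is simple enough to sketch. The snippet below is not the actual Prometheus query; it only reproduces the arithmetic on made-up samples. The function names echo the SLI names used in the SLO file, and the conversion to milliseconds is inferred from the name probe_duration_ms and the 200ms threshold.

```python
from statistics import mean


def probe_success_percentage(samples):
    """Average of 0/1 probe_success samples, expressed in percent."""
    return mean(samples) * 100


def probe_duration_ms(samples_seconds):
    """Average of probe_duration_seconds samples, converted to milliseconds."""
    return mean(samples_seconds) * 1000


# Made-up samples for illustration only.
success_samples = [1, 1, 0, 1]
duration_samples = [0.125, 0.25, 0.125, 0.5]

print(probe_success_percentage(success_samples))
print(probe_duration_ms(duration_samples))
```

Because probe_success is either 0 or 1 per probe, its average over the evaluation window reads directly as the fraction of successful probes, which is why it works as an availability indicator.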
The extent to which the availability & latency metrics deviate during this process determines the success of the subsequent SLO evaluation process.
Results with a Single Replica Deployment
With a single pod configured for the helloservice deployment, the endpoint is not going to be available for 30s. During this time the probe_success metric drops and the probe_duration_seconds metric degrades, both staying that way until readiness, which causes the SLO checks to fail.
The whole sequence can be seen in the following image, highlighting the deployment of the version of our application under test, the start and finish of the tests, as well as the evaluation itself. Since the evaluation failed (indicated in red), let’s have a closer look at this.
As can be seen in the image below, Keptn evaluated that this run did not meet the SLOs we defined earlier. Both probe_duration and probe_success_percentage failed for this test run. The reason is that the application could not be reached for a couple of seconds, resulting in only ~81% availability (probe_success_percentage) within the evaluation period, and due to the cold start the probes took too long to finish (probe_duration).
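To get a feel for how an outage translates into an availability figure, the relationship can be written down as a small function. The length of the evaluation period is not stated in this article, so the 160s window used below is a made-up value chosen purely for illustration; only the 30s readiness delay comes from the setup described earlier.

```python
def availability_percentage(downtime_s, window_s):
    """Share of the evaluation window during which the endpoint is reachable."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # Clamp so that downtime can never be negative or exceed the window.
    downtime_s = min(max(downtime_s, 0), window_s)
    return (1 - downtime_s / window_s) * 100


# 30s outage inside a hypothetical 160s evaluation window.
print(availability_percentage(downtime_s=30, window_s=160))
```

The shorter the evaluation window, the harder a fixed outage hits the percentage, which is one reason to keep test duration and evaluation period in mind when choosing SLO thresholds.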
A suitable solution to increase the availability of our application if one instance of it crashes is to add more instances (i.e., replicas). Let’s give this a try.
Results with a Multi Replica Deployment
We now trigger a new deployment of our podtato-head application under test. Again, we can do this via the Keptn CLI or API. This time we trigger it via the API and add the information about the desired replica count to the CloudEvent that is sent to Keptn.
This time, if one instance of the application gets deleted by the chaos experiment, we still have two other instances available. With this multi-replica deployment and the same chaos (and the same JMeter load) injected, the probe traffic is served by the remaining instances, ensuring that the availability and latency requirements are met.
The following image confirms our assumptions. All probes finished successfully: first, we see that all probes finished within our desired criteria (probe_duration) and, second, we achieved a 100% success rate of probe_success_percentage, meaning that the application was available 100% of the evaluation period.
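As a rough idea of what such an event could carry, the sketch below builds a CloudEvent-style payload in Python and prints it as JSON. It does not send anything. The field layout, the event type naming scheme, the `replicaCount` key, the project name `litmus-demo`, and the image reference are all assumptions for illustration; consult the Keptn API documentation for the exact schema of your Keptn version.

```python
import json


def delivery_triggered_event(project, stage, service, image, replica_count):
    """Build a CloudEvent-style payload that asks for a delivery sequence."""
    return {
        "specversion": "1.0",
        "contenttype": "application/json",
        "source": "blog-example",  # hypothetical source identifier
        # Assumed naming scheme: sh.keptn.event.<stage>.<sequence>.triggered
        "type": f"sh.keptn.event.{stage}.delivery.triggered",
        "data": {
            "project": project,
            "stage": stage,
            "service": service,
            "configurationChange": {
                # Assumed to be handed on as values for the Helm chart.
                "values": {"image": image, "replicaCount": replica_count}
            },
        },
    }


event = delivery_triggered_event(
    project="litmus-demo",  # hypothetical project name
    stage="chaos",
    service="helloservice",
    image="example.org/podtato-head:latest",  # placeholder image reference
    replica_count=3,
)
print(json.dumps(event, indent=2))
```

Carrying the replica count in the event keeps the change declarative: the same delivery sequence runs again, and only the parameters applied to the Helm chart differ between the single-replica and multi-replica runs.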
But there is more: In addition to the SLO checks, the test tasks can provide useful insights into the overall success and behavior of the application. The Litmus Service generates a test_finished CloudEvent that includes information on whether the Litmus experiment was deemed successful or not. This information is gathered from the verdict flag in the ChaosResult custom resource. Without any explicit checks defined, the experiment only looks for the “Running” status of all the pods & the “Ready” state of all containers for a given AUT (Application Under Test) before and after executing the chaos. However, the verdict can be controlled by defining additional “Litmus Probes”. For example, the stability of the JMeter pod (which can be considered a downstream service or “consumer” of the helloservice) can be factored in as a mandatory requirement for a “Pass” verdict and thereby for a successful test_finished event.
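The mapping from a chaos verdict to a test result can be sketched as follows. This is not the code of the Litmus Service: the function name is hypothetical, the nested path to the verdict reflects our understanding of the ChaosResult resource, and the output fields (`status`, `result`) are assumptions about the shape of the event data.

```python
def build_test_finished_data(chaos_result):
    """Derive the result fields of a test-finished event from a ChaosResult."""
    verdict = (
        chaos_result.get("status", {})
        .get("experimentStatus", {})
        .get("verdict", "")
    )
    return {
        "status": "succeeded",  # the test task itself ran to completion
        # Anything other than an explicit "Pass" is treated as a failure.
        "result": "pass" if verdict == "Pass" else "fail",
        "verdict": verdict or "unknown",
    }


# Hand-written stand-in for a ChaosResult, for illustration only.
sample = {"status": {"experimentStatus": {"verdict": "Pass"}}}
print(build_test_finished_data(sample))
```

Treating a missing or unexpected verdict as a failure is a deliberately conservative choice in this sketch: a run whose outcome cannot be determined should not be able to pass a quality gate.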
This use-case was drawn up to provide insights into how you can build Continuous Evaluation Pipelines in Keptn & include Chaos experiments with the Litmus Service integration. Though the applications used (podtato-head helloservice) and SLOs configured (built off metrics from blackbox exporter) are illustrative in nature and the process highlights a simple deployment inefficiency (replicas in a deployment), the blueprint remains the same when constructing more complex scenarios. The ease of Keptn’s declarative pipeline definitions & unique Quality Gate feature combined with the diversity of chaos experiments offered by LitmusChaos ensures that you are confident about the applications shipped to production.
The Litmus integration within Keptn is set to get even better with the transition to LitmusChaos 2.0, which will allow users to run multi-fault chaos workflows over standalone experiments. It will also bring the ability to visualize chaos progress on a dedicated Portal, which one could leverage alongside the Keptn Bridge for increased observability. Also on the roadmap is the effort to leverage LitmusChaos experiments as a means to verify the success of policy-based auto-remediation actions triggered by Keptn on the deployment environments.
Go try this out and share your feedback on what you like about this integration and what you’d like improved. Feel free to create issues and engage in discussions on the Keptn Slack and in the Litmus GitHub repository.
Acknowledgments: This blog builds upon joint work between the Keptn & LitmusChaos teams and has been jointly written by their respective maintainers Karthik Satchitanand (LitmusChaos) and Jürgen Etzlstorfer (Keptn).