Closed-loop Remediation with custom Integrations

Johannes Bräuer
Published in keptn, Jun 19, 2020

Keptn 0.7 takes the automation of remediation workflows and the integration of custom remediation providers (also known as action providers) to the next level. You can now configure multiple remediation actions per problem type, and the effect of each remediation action is validated using the SLO/SLI validation Keptn offers. As a result, you get fast feedback on executed remediation actions and better visibility into entire remediation scenarios. Additionally, the event-based architecture of Keptn lets you plug in custom action providers to integrate the automation tool of your choice.

Remediation workflows for cloud-native architectures

Automating remediation workflows comes into play when you want to automatically react to a problem in order to keep a microservice online and to avoid any user impact. Although the goal is clear, multiple challenges occur in this regard:

  • There might be multiple versions of a microservice deployed, which makes it difficult to identify the root cause;
  • Highly dynamic environments require a constant adaptation of the remediation workflows to stay up-to-date;
  • Many remediation actions have never been tested before.

You should use Keptn for automating remediation tasks since:

  • You define remediation workflows not for an entire application but rather on microservice-level;
  • Remediation workflows are composed of re-usable atomic actions;
  • Actions allow integration of any automation tools/frameworks;
  • Actions accompany a microservice throughout the delivery workflow and should be part of integration tests in production-like environments;
  • Keptn provides feedback on the execution of each action. This allows adapting and optimizing your remediation workflows depending on changed environmental settings.

Core principles for closed-loop remediation

Before exploring the Keptn 0.7 enhancements for the use-case of operation automation, three underlying core principles are highlighted and then used to explain the new features. Please read the article, Micro operations — A new operations model for the micro services age for more details about the principles:

  • Declarative operations as code: Keptn follows a declarative approach for configuring remediation workflows as code on the level of individual microservices rather than on applications. Consequently, this declaration is versioned next to the operational config and deployed with each version of the microservice.
  • Atomic building blocks: In Keptn, a remediation action is implemented as a micro-operation that can be re-used for multiple microservices. Such a micro-operation is reduced to the bare minimum, meaning that it is designed to execute a single action. This action is implemented for a single microservice rather than an entire application.
  • Event-driven choreography: Keptn choreographs the remediation workflows by first acting on a problem event and then sending out events derived from the remediation configuration. The SLO/SLI-based validation of micro-operations is integrated into Keptn 0.7; hence, it provides automatic and fast feedback on executed actions. By validating the effect of an executed action, Keptn 0.7 closes the loop and can decide whether the remediation workflow needs to continue or whether the incident is already resolved.

Declarative operations as Code — How to define a Remediation Configuration?

The remediation config describes a remediation workflow in a declarative manner. It only defines what needs to be done and leaves all the details on how to achieve it to other components.

Below is an example of a remediation config understood by Keptn 0.7:

As shown by the example, you can define a list of remediations. Each remediation maps to a problemType, which triggers the corresponding actions specified by the actionsOnOpen property.

ProblemType

The problemType maps a problem to a remediation by a matching problem title. For the case of triggering a remediation based on an unknown problem, the problem type `default` is supported — think of the default case in a switch-statement.

The example shows remediations configured for the problem types Response time degradation and Failure rate increase, as well as for any unknown problem.
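To make the matching rule concrete, here is a minimal Python sketch that models a remediation config as plain data and looks up the remediation for a problem title. It is not the actual config file format: the action names, descriptions, and values are illustrative, the function `find_remediation` is hypothetical, and exact string matching on the title is an assumption.

```python
# Sketch only: a remediation config modeled as plain Python data.
# Action names, descriptions, and values are illustrative placeholders.
REMEDIATIONS = [
    {
        "problemType": "Response time degradation",
        "actionsOnOpen": [
            {"name": "Scale up", "description": "Add one replica",
             "action": "scaling", "value": 1},
        ],
    },
    {
        "problemType": "Failure rate increase",
        "actionsOnOpen": [
            {"name": "Toggle feature", "description": "Switch a feature flag off",
             "action": "featuretoggle", "value": {"EnablePromotion": "off"}},
        ],
    },
    {
        "problemType": "default",
        "actionsOnOpen": [
            {"name": "Escalate", "description": "Notify the on-call engineer",
             "action": "escalate"},
        ],
    },
]


def find_remediation(problem_title, remediations):
    """Return the remediation matching the title, else the `default` one, else None."""
    fallback = None
    for remediation in remediations:
        if remediation["problemType"] == problem_title:
            return remediation
        if remediation["problemType"] == "default":
            fallback = remediation  # used only if no title matches
    return fallback


match = find_remediation("Disk full", REMEDIATIONS)
print(match["problemType"])
```

An unknown title such as "Disk full" falls through to the `default` entry, just like the default case in a switch statement.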

ActionsOnOpen

The actionsOnOpen property declares a list of actions triggered in the course of the remediation workflow. An action itself is specified by the following four properties:

  • name: A name used for display purposes.
  • description: A description to provide more details about the action.
  • action: A unique name required by the action-provider that executes the action. Based on this information, the action-provider can decide whether to execute the action.
  • value: An optional property for adding an arbitrary value object to configure the action.

If multiple actions are declared, Keptn choreographs the remediation workflow by sending out action events sequentially, in the order of the list. For example, if the two actions scaling and featuretoggle are declared in that order for the problem type Response time degradation, the event for triggering scaling is sent out before the event for featuretoggle.
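The sequential behavior, combined with the validation step described under the core principles, can be sketched as a simple loop. This is only an illustration of the idea, not Keptn's implementation: `run_remediation`, `execute`, and `is_resolved` are hypothetical names, where `execute` stands in for the action-provider and `is_resolved` for the SLO/SLI-based evaluation.

```python
def run_remediation(actions, execute, is_resolved):
    """Trigger actions in list order; stop once the evaluation reports recovery.

    Returns the names of the executed actions and whether the problem was resolved.
    """
    executed = []
    for action in actions:
        execute(action)                    # hand the action to its provider
        executed.append(action["action"])
        if is_resolved():                  # validate the effect of this action
            return executed, True
    return executed, False                 # all actions used, problem persists


actions_on_open = [
    {"name": "Scale up", "action": "scaling", "value": 1},
    {"name": "Toggle feature", "action": "featuretoggle",
     "value": {"EnablePromotion": "off"}},
]

log = []
# For this demo, pretend the problem is resolved only after the second action.
executed, resolved = run_remediation(
    actions_on_open,
    execute=lambda action: log.append(action["action"]),
    is_resolved=lambda: len(log) >= 2,
)
print(executed, resolved)
```

If the evaluation had already reported recovery after scaling, the loop would have stopped there and featuretoggle would never have been triggered.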

Atomic building blocks — How to write an action-provider?

An action-provider is an implementation of a Keptn-service with a dedicated purpose. This type of service is responsible for executing a remediation action and therefore might even use another tool. An action-provider starts working when receiving a Keptn CloudEvent of type: sh.keptn.event.action.triggered.

When receiving such an event, a provider must perform the following tasks:

  1. Process the incoming event to retrieve metadata such as the project, stage, and service name, as well as the action and value properties.
  2. Decide based on the action property whether the action is supported. If the action is not supported, no further task is required.
  3. If the action is supported, send a start event of type: sh.keptn.event.action.started. This CloudEvent informs Keptn that a service takes care of executing the action.
  4. Execute the implemented functionality. At this step, the action-provider can make use of another automation tool.
  5. Send a finished event of type: sh.keptn.event.action.finished. This informs Keptn to proceed in the remediation workflow.
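The five tasks above can be sketched as a single handler function. This is a minimal Python sketch under stated assumptions, not the Keptn service template: the three event types come from the text, but the layout of the `data` payload, the `result` field, and the callbacks `send` and `execute` are hypothetical.

```python
TRIGGERED = "sh.keptn.event.action.triggered"
STARTED = "sh.keptn.event.action.started"
FINISHED = "sh.keptn.event.action.finished"

# Actions this hypothetical provider implements.
SUPPORTED_ACTIONS = {"featuretoggle"}


def handle_event(event, send, execute):
    """Process one incoming event; return True if the action was executed.

    `send(event_type, data)` emits an event and `execute(action, value)` runs
    the remediation; both are placeholders for real integrations.
    """
    if event.get("type") != TRIGGERED:
        return False
    data = event["data"]  # assumed payload layout
    # 1. Read the metadata plus the action and value properties.
    context = {key: data[key] for key in ("project", "stage", "service")}
    action = data["action"]["action"]
    value = data["action"].get("value")
    # 2. Ignore actions this provider does not support.
    if action not in SUPPORTED_ACTIONS:
        return False
    # 3. Announce that this provider takes care of the action.
    send(STARTED, context)
    # 4. Execute the functionality, possibly via another automation tool.
    result = execute(action, value)
    # 5. Report completion so the workflow can proceed.
    send(FINISHED, {**context, "result": result})
    return True


sent = []
handled = handle_event(
    {"type": TRIGGERED,
     "data": {"project": "sockshop", "stage": "hardening", "service": "carts",
              "action": {"action": "featuretoggle",
                         "value": {"EnablePromotion": "off"}}}},
    send=lambda event_type, data: sent.append(event_type),
    execute=lambda action, value: "pass",
)
print(handled, sent)
```

Returning early for unsupported actions keeps several providers independent: each one receives the same triggered event, and only the provider that knows the action responds.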

For the technical requirements on how to implement the event subscription mechanism and the Kubernetes manifests for an action-provider, please refer to the linked documentation. In addition to this documentation, a keptn-service template with pre-defined event handlers written in Go is provided. This template reduces the development of an action-provider to its main functionality, so you do not need to worry about receiving, building, or sending CloudEvents. Just check out the repo keptn-service-template-go and kick off the implementation of your custom action-provider.

Event-driven choreography — How to integrate a custom action-provider?

Once you have implemented your custom action-provider, it must be deployed in a Keptn deployment and the supported action must be declared in the remediation config of the microservice.

  • To deploy an action-provider in a Keptn deployment, apply the manifests for the Kubernetes Service and Deployment, as well as the Deployment for the CloudEvent distributor. See the linked documentation to learn more about these three resources:
kubectl apply -f action-provider-service.yaml
kubectl apply -f action-provider-deployment.yaml
kubectl apply -f action-provider-deployment-distributor.yaml
  • To declare an action, first add an action object (consisting of name, description, action, and an optional value object) to the actionsOnOpen property in the remediation config. For example, you could add the remediation action featuretoggle to the remediation config of the carts microservice. Afterwards, the new remediation configuration must be added to the config repository managed by Keptn. To do so, use the Keptn CLI command keptn add-resource, which replaces the old remediation config:
keptn add-resource --project=PROJECT --stage=STAGE --service=SERVICE --resource=FILEPATH --resourceUri=remediation.yaml

Do not forget about testing!

At this point, we have configured a remediation action for a microservice, implemented a custom action-provider, and integrated the action-provider into a remediation workflow. Now, providing a testing concept for the remediation workflow is important in order to verify that it works as expected.

As mentioned above, it is recommended to test a remediation workflow as part of an integration test in a pre-production environment (e.g., a staging or hardening stage). For an end-to-end test, you should simulate problem patterns that are detected by the monitoring solution, which then informs Keptn with a problem event. This event kicks off the closed-loop remediation workflow. If this pre-production environment is monitored and you can simulate the problems, you just need to set remediation_strategy: "automated" for that stage (for example, the hardening stage) in the shipyard file.

If you cannot simulate problems but still want to test the remediation workflow, you can send a problem event to Keptn as part of an integration test. For this step, either the Keptn REST API endpoint /event or the CLI command keptn send event can be used. The event payload can be kept to a minimal set of properties.
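As a rough illustration of such a minimal payload, the following Python sketch builds a problem event and serializes it to JSON. Treat every field as an assumption to verify against the Keptn 0.7 event specification: the event type `sh.keptn.event.problem.open`, the `specversion`, and the field names under `data` are written from memory, and the project, stage, and source values are placeholders.

```python
import json


def build_problem_event(project, stage, service, problem_title):
    """Build a minimal problem event; all field names are assumptions."""
    return {
        "type": "sh.keptn.event.problem.open",  # assumed event type
        "specversion": "0.2",                   # assumed CloudEvents version
        "source": "integration-test",           # placeholder source
        "contenttype": "application/json",
        "data": {
            "State": "OPEN",
            "ProblemTitle": problem_title,      # matched against problemType
            "project": project,
            "stage": stage,
            "service": service,
        },
    }


payload = json.dumps(
    build_problem_event("sockshop", "hardening", "carts",
                        "Response time degradation"),
    indent=2,
)
print(payload)
```

The resulting JSON can be saved to a file for the CLI or posted to the /event endpoint from the integration test. The problem title is the part that matters for the remediation config, since it selects the remediation to run.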

Conclusion

To summarize the enhancements for the remediation automation use-case, Keptn 0.7:

  • supports the integration of custom action-providers for plugging-in any automation tool of your choice,
  • allows declaring multiple remediation actions for one problem type,
  • and validates each action based on the SLOs defined for the microservice.

These enhancements are derived from community feedback; many thanks! To go a step further and explore your use-case around automated operations, the Keptn team would really appreciate your feedback: please fill in the attached questionnaire and leave a comment. Many thanks in advance!

Last but not least: If you have not used Keptn before, it is time to go to tutorials.keptn.sh, where you will find a set of tutorials. These tutorials provide a great starting point and guide you through the different use-cases.
