Intern Stories: Building an auto remediation system for Box

freniel zabala
Box Tech Blog
Oct 8, 2019

Box systems are deeply integrated into our customers’ business-critical workflows and demand strict SLA guarantees for performance and uptime. Any degradation to the user experience must be repaired as quickly as possible to stay within the SLA allowance. The incident management process at Box ensures that issues are detected and remediated quickly, usually with engineers running recovery actions by hand. Some of these actions are repetitive or predictable and highlight an opportunity for automation.

For my summer intern project at Box, I built an auto-remediation framework that allows apps to define custom signals that can be interpreted to run pre-defined recovery actions, eliminating the need for human intervention. This not only reduces the remediation time for these systems, but also reduces the toil of repetitive work for engineers.

Here’s a diagram that gives a visual overview of what the auto-remediation framework looks like:

The auto-remediation framework can be extended to support a variety of signal sources and execute IFTTT-like logic internally to run a predefined set of actions. The actions can vary from a simple shell command to a comprehensive Python script. For my intern project, I took up the challenge of developing the core logic of the framework, working with simple signals and actions.

A large number of Box’s production services already run on the Kubernetes platform, and more are being onboarded. We therefore agreed to implement this auto-remediation feature on top of Kubernetes in order to provide as much value as possible to engineers at Box. We identified use cases where certain kubectl commands were run to restore services when alerts were triggered on service degradation. At Box, most teams use metrics-based monitoring to detect service degradation, and that monitoring integrates with PagerDuty to alert the appropriate personnel. The auto-remediation system can consume these alerts from PagerDuty to execute configured actions.

Flow

There are two major components in the design:

  1. The workflow of engineers creating configurations where alerts and actions are specified, and
  2. The flow of executing the remediation action when configured alerts fire.

Developer Flow

As shown in the developer flow diagram above, whenever a developer wants to automate an action, they simply have to create an alert configuration and store it inside Kubernetes. The recovery controller captures the alert configuration and stores it inside an In-memory Alert Store.

Alert Firing Flow

The alert firing flow starts with the polling service collecting firing alerts from PagerDuty at a set interval. The polling service sends these fired alerts to a REST endpoint provided by the Recovery Controller, which runs the corresponding actions for the specified alert ID.
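
To make the handoff concrete, here is a minimal sketch of what that call might look like from the polling service’s side. The `/alerts/{alertID}` route, the function name, and the calling convention are assumptions for illustration, not the production contract.

```go
package poller

import (
	"fmt"
	"net/http"
	"strings"
)

// forwardAlert notifies the Recovery Controller that the alert with the
// given ID has fired, by POSTing to its REST endpoint.
func forwardAlert(controllerURL, alertID string) error {
	resp, err := http.Post(controllerURL+"/alerts/"+alertID, "application/json", strings.NewReader("{}"))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("recovery controller returned %d", resp.StatusCode)
	}
	return nil
}
```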

Implementation

We used Golang as the programming language for the system because of its efficient libraries that allow interaction with the Kubernetes APIs. This made implementing the developer flow, specifically watching for alert configuration changes, easy.

Polling Service

One of the core tasks of the Auto Remediation System is knowing when an alert fires and relaying that information to where it can be handled and resolved. This task of relaying information is delegated to the polling service. Ideally, the system would be notified directly the moment an alert fires. However, we built a polling service instead to mitigate the security concerns of an external system initiating requests to a Box-internal service. The polling service accomplishes this task by taking advantage of the PagerDuty REST API.
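
As a rough sketch, polling PagerDuty’s v2 REST API for triggered incidents could look like the following; the interval, token handling, and which incident fields get consumed are simplified assumptions here, not the production code.

```go
package poller

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// incidentList mirrors only the fields of PagerDuty's incidents response
// that this sketch cares about.
type incidentList struct {
	Incidents []struct {
		ID    string `json:"id"`
		Title string `json:"title"`
	} `json:"incidents"`
}

// pollOnce fetches currently triggered incidents from the PagerDuty REST API.
func pollOnce(apiToken string) (*incidentList, error) {
	req, err := http.NewRequest("GET", "https://api.pagerduty.com/incidents?statuses[]=triggered", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Token token="+apiToken)
	req.Header.Set("Accept", "application/vnd.pagerduty+json;version=2")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("pagerduty returned %d", resp.StatusCode)
	}

	var list incidentList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		return nil, err
	}
	return &list, nil
}

// Poll runs pollOnce at a fixed interval and hands each triggered incident
// to the supplied callback (for example, forwardAlert from the earlier sketch).
func Poll(apiToken string, interval time.Duration, handle func(alertID string)) {
	for range time.Tick(interval) {
		list, err := pollOnce(apiToken)
		if err != nil {
			continue // in practice: log the error and retry on the next tick
		}
		for _, inc := range list.Incidents {
			handle(inc.ID)
		}
	}
}
```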

Recovery Controller

The second core task of the Auto Remediation System is to consume alert signals and run configured actions based on those signals. To keep the system flexible, we kept the polling service separate from the recovery controller. The recovery controller consumes fired alert signals through a REST endpoint. The server side of the recovery controller was built using Gorilla Mux, a popular and reliable routing library for Golang. The action controller executes the kubectl commands given by the developer in a new process and returns an error code if an action is not successful.
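
Here is a minimal sketch of that endpoint using Gorilla Mux and os/exec, assuming the polling service POSTs to a route keyed by alert ID and that the configured action is a single kubectl command. The route shape and the lookup signature are illustrative, not the exact internal API.

```go
package controller

import (
	"net/http"
	"os/exec"
	"strings"

	"github.com/gorilla/mux"
)

// NewRouter wires up the REST endpoint that the polling service calls.
// lookup resolves an alert ID to its configured kubectl command, backed
// by the In-memory Alert Store described below.
func NewRouter(lookup func(alertID string) (action string, ok bool)) *mux.Router {
	r := mux.NewRouter()
	r.HandleFunc("/alerts/{alertID}", func(w http.ResponseWriter, req *http.Request) {
		alertID := mux.Vars(req)["alertID"]
		action, ok := lookup(alertID)
		if !ok {
			http.Error(w, "no action configured for alert "+alertID, http.StatusNotFound)
			return
		}
		// Run the developer-supplied kubectl command in a new process,
		// e.g. "kubectl rollout restart deployment/my-service".
		args := strings.Fields(action)
		if len(args) == 0 {
			http.Error(w, "empty action for alert "+alertID, http.StatusInternalServerError)
			return
		}
		if err := exec.Command(args[0], args[1:]...).Run(); err != nil {
			http.Error(w, "remediation action failed: "+err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}).Methods(http.MethodPost)
	return r
}
```

Keeping the lookup behind a plain function keeps the controller decoupled from how the polling service delivers alerts and makes the handler easy to test.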

Another task of the Auto Remediation System is to listen for alert configurations being created and stored in Kubernetes. The alert configurations utilize Kubernetes Custom Resource Definitions. The custom controller listens for custom resource creation, update, and deletion events. Golang has a library called client-go that allows you to programmatically interact with Kubernetes, including watching custom resources. As alert configurations are created, the custom controller picks up the updates and stores them in the In-memory Alert Store. The store is essentially a hash map that lives inside the recovery controller.
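
Sketched in Go, that store can be as simple as a mutex-guarded map keyed by alert ID; the type and field names below are illustrative rather than the production schema.

```go
package controller

import (
	"sync"
	"time"
)

// AlertConfig holds the remediation details parsed from an alert
// configuration custom resource.
type AlertConfig struct {
	Action   string        // kubectl command to run when the alert fires
	Cooldown time.Duration // wait before running the action again
}

// InMemoryAlertStore is the hash map behind the recovery controller.
// The custom controller writes to it as configurations change while the
// REST handler reads from it, so access is guarded by a mutex.
type InMemoryAlertStore struct {
	mu      sync.RWMutex
	configs map[string]AlertConfig
}

func NewInMemoryAlertStore() *InMemoryAlertStore {
	return &InMemoryAlertStore{configs: make(map[string]AlertConfig)}
}

// Put is called by the custom controller on create and update events.
func (s *InMemoryAlertStore) Put(alertID string, cfg AlertConfig) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.configs[alertID] = cfg
}

// Delete is called when an alert configuration is removed.
func (s *InMemoryAlertStore) Delete(alertID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.configs, alertID)
}

// Lookup is used by the recovery controller's REST handler.
func (s *InMemoryAlertStore) Lookup(alertID string) (AlertConfig, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	cfg, ok := s.configs[alertID]
	return cfg, ok
}
```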

Alert Configurations

The alert configurations contain the specifications required by the Auto Remediation System. Each one contains an ID that correlates to an alert or issue, the kubectl action command that will resolve the problem, and a cool-down time. The cool-down time allows the recovery action to run and take effect before the auto-remediation system runs it again for the same alert.
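
As a sketch, the Go types behind such a custom resource might look like the following; the kind, package, and field names are illustrative, not the exact internal schema.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AlertConfigurationSpec carries the fields described above: the alert or
// issue ID to match, the kubectl command that resolves the problem, and a
// cool-down window before the action may run again for the same alert.
type AlertConfigurationSpec struct {
	AlertID         string `json:"alertID"`
	Action          string `json:"action"`          // e.g. a kubectl command
	CooldownSeconds int64  `json:"cooldownSeconds"` // cool-down time in seconds
}

// AlertConfiguration is the custom resource created from the CRD and
// picked up by the custom controller.
type AlertConfiguration struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec AlertConfigurationSpec `json:"spec"`
}
```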

Building alert configurations as custom resources created from Custom Resource Definitions gave us two major benefits. First, by configuring the alerts inside our Kubernetes deployment configuration, we were able to keep the alert configuration alongside the rest of the service’s configuration. Second, we were able to listen for new alert configuration events through our custom controller without building our own event handling system.

Onwards

An auto remediation framework that restores systems and services based on configured signals and actions will help Boxers [what Box engineers call themselves :)] save time and focus on solving more meaningful problems. It will also help reduce toil from manual work and alert fatigue by automating recovery actions.

The POC focuses on a specific set of signals and actions, but the auto-remediation framework is built so that it can be extended to handle a variety of signal sources, a decision-making workflow, and a larger set of recovery actions in the future.
