Alerting Framework at Airbnb
At Airbnb, we do not have an engineering operations team (as of 2017), so individual teams are responsible for configuring monitoring and responding to problems for their service. We use Datadog to monitor our infrastructure and alert on its health. While Datadog works well and provides many features, we had some specific requirements around alerting:
- Generic alerts need to alert different people depending on the host or role.
- Alert definition changes are automated so that alerts stay up to date as our infrastructure changes.
- Teams have insight into which alerts they are receiving and are able to discuss creating, modifying, or deleting alerts using a standard code-review process.
- When an alert is triggered, we want to make it easy to determine whether it was caused by a recent update and why a threshold was set at a certain value.
These requirements meant taking alert configuration out of the Datadog UI and into a configuration repository. Fortunately, Datadog provides an API on which we could build the tools we needed to templatize our alerts based on our infrastructure and track changes to them.
Interferon is our solution to the alerting requirements we described. It uses a Ruby DSL to define alerts and interacts with an alerting system such as Datadog. Interferon reads host information from various pluggable sources and makes the data about the hosts accessible to the Ruby DSL. You can, therefore, write a host source that dynamically reads your infrastructure data (for example, by querying the AWS API) and create alerts based on the various attributes you have available.
In addition, you can configure your host sources to include metadata such as ownership information along with each host. This allows you to write a generic alert and create multiple instances of the alert which route to the owners of each host.
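To make the idea concrete, here is a minimal sketch in plain Ruby of how one generic alert can fan out into one instance per role, each routed to that role's owners. It is not Interferon code: the host records, the field names other than :role, :owners, and :owner_groups, and the alerts_per_role helper are all assumptions made for illustration.

```ruby
# Hypothetical host records, shaped like what a host source might return.
HOSTS = [
  { name: 'web-1', role: 'web',   owners: ['alice'], owner_groups: ['frontend'] },
  { name: 'web-2', role: 'web',   owners: ['alice'], owner_groups: ['frontend'] },
  { name: 'db-1',  role: 'mysql', owners: ['bob'],   owner_groups: ['storage'] }
].freeze

# Build one alert instance per distinct role, routed to that role's owners.
def alerts_per_role(hosts)
  hosts.group_by { |host| host[:role] }.map do |role, members|
    {
      name: "low memory on #{role}",
      people: members.flat_map { |host| host[:owners] }.uniq.sort,
      groups: members.flat_map { |host| host[:owner_groups] }.uniq.sort
    }
  end
end

alerts_per_role(HOSTS).each do |alert|
  puts "#{alert[:name]} -> #{(alert[:people] + alert[:groups]).join(', ')}"
end
```

Adding a host with a new role to the list produces a new alert instance without touching the alert template itself.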
For example, here is our standard memory alert:
Here, we have @hostinfo, which is a Ruby Hash that contains information dynamically generated by one of the host sources (in our case, Optica). The hash contains :role, which is the name of the Chef role of the host, and :owner_groups and :owners, which contain the ownership metadata.
The name attribute corresponds to the Datadog monitor name and is used as the primary key for Interferon. We can leverage the @hostinfo attribute here to create a memory alert for each different role.
The message for this particular alert is generic; however, it is helpful to be able to write alerts with descriptive messages and actionable steps. This helps first responders quickly begin triaging the issue when they are alerted, especially for obscure problems or edge cases.
With a little bit of Ruby code, we can filter out hosts that are exempt from this alert. The applies attribute tells Interferon to only include hosts where the expression evaluates to true. In this alert, we only want hosts that have roles attached to them, skipping test hosts.
Using the metadata provided in @hostinfo, we tag the users in notify.people in the Datadog message so they are notified when the alert goes off. Interferon also ships with a simple way to define groups using YAML. Those groups in notify.groups are expanded into people to also tag in the message.
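The group expansion can be pictured with a short sketch. This is an illustration only: the YAML layout, the group and user names, and the mentions helper are assumptions, and the real group file format and mention format may differ.

```ruby
require 'yaml'

# Hypothetical group file contents; the real file layout may differ.
GROUPS_YAML = <<~YAML
  frontend:
    - alice
    - carol
  storage:
    - bob
YAML

GROUPS = YAML.safe_load(GROUPS_YAML)

# Expand group names into their members, merge them with the people named
# directly, drop duplicates, and render each person as an @-mention.
def mentions(people, groups, group_table)
  expanded = groups.flat_map { |group| group_table.fetch(group, []) }
  (people + expanded).uniq.map { |person| "@#{person}" }.join(' ')
end

puts mentions(['alice'], %w[frontend storage], GROUPS)
```

Because duplicates are removed, a person who is both a direct owner and a group member is only tagged once.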
Finally, we have the metric.datadog_query attribute, which corresponds to the Datadog query syntax for defining the metric and alert parameters.
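To show how the attributes described above fit together, the following is a toy stand-in for an alert DSL, evaluated against a single @hostinfo hash. It is not Interferon's implementation and not our production alert: the Fields, Notify, Metric, and ToyAlert classes are hypothetical, and the message text, metric name, and threshold in the sample definition are placeholders chosen for illustration.

```ruby
# Base class for objects whose fields are set by calling them with an
# argument (`people ['alice']`) and read by calling them with none.
class Fields
  def self.field(*names)
    names.each do |field_name|
      define_method(field_name) do |*args|
        ivar = "@#{field_name}"
        args.empty? ? instance_variable_get(ivar) : instance_variable_set(ivar, args.first)
      end
    end
  end
end

class Notify < Fields
  field :people, :groups
end

class Metric < Fields
  field :datadog_query
end

# Toy alert: holds the host's metadata and collects the DSL attributes.
class ToyAlert < Fields
  field :name, :message
  attr_reader :notify, :metric

  def initialize(hostinfo)
    @hostinfo = hostinfo
    @notify = Notify.new
    @metric = Metric.new
  end

  # With a block, store whether the alert applies to this host.
  def applies(&block)
    @applies = block.call if block
    @applies
  end

  # Run a definition block with the alert as `self`, so @hostinfo is visible.
  def self.evaluate(hostinfo, &definition)
    alert = new(hostinfo)
    alert.instance_eval(&definition)
    alert
  end
end

# Illustrative definition; message, metric, and threshold are placeholders.
MEMORY_ALERT = proc do
  name "low memory on #{@hostinfo[:role]}"
  message 'Usable memory is low. Check for leaking processes.'
  applies { !@hostinfo[:role].nil? && !@hostinfo[:role].include?('test') }
  notify.people @hostinfo[:owners]
  notify.groups @hostinfo[:owner_groups]
  metric.datadog_query "min(last_5m):avg:system.mem.pct_usable{role:#{@hostinfo[:role]}} by {host} < 0.1"
end

web = ToyAlert.evaluate({ role: 'web', owners: ['alice'], owner_groups: ['frontend'] }, &MEMORY_ALERT)
puts web.name
puts web.applies
puts web.metric.datadog_query
```

The same definition evaluated against a host whose role contains "test" yields an alert whose applies value is false, which is how exempt hosts drop out.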
The alerts framework consists of the alerting gem, Interferon, and an alerts repository that holds the actual alert definitions along with custom host sources, group sources, and destinations. Teams can contribute to the alerts repository to modify alerts and to add custom sources.
The repository provides an audit trail for changes made to the alerts as well as a quick way to revert erroneous changes. On top of that, we have our repository configured to require peer review to help ensure new alerts have clear messages and reasonable settings. We added a Datadog syntax checker to the pre-commit check to shorten the alert development lifecycle by providing almost immediate feedback on malformed Datadog syntax.
In our infrastructure, commits to the alerts repository create a new build artifact containing the alert definitions and custom code. A deployment ships the artifact to an instance and invokes Interferon to synchronize the latest definitions with Datadog. Interferon downloads the current definitions from Datadog, then compares them with the ones it generated. To cut down on traffic, it modifies only the definitions that have changed.
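The compare-then-update step can be sketched as follows. This is an assumption-laden illustration rather than Interferon's code: it treats both sides as hashes keyed by alert name, the sync_plan helper is hypothetical, and it leaves out how alerts that no longer exist in the repository are handled.

```ruby
# Compare generated definitions with the stored ones, keyed by alert name,
# and decide which alerts need an API call.
def sync_plan(stored, generated)
  plan = { create: [], update: [], unchanged: [] }
  generated.each do |name, definition|
    if !stored.key?(name)
      plan[:create] << name
    elsif stored[name] != definition
      plan[:update] << name
    else
      plan[:unchanged] << name
    end
  end
  plan
end

stored = {
  'low memory on web'   => { query: 'avg:mem{role:web} < 0.1' },
  'low memory on mysql' => { query: 'avg:mem{role:mysql} < 0.1' }
}
generated = {
  'low memory on web'   => { query: 'avg:mem{role:web} < 0.1' },
  'low memory on mysql' => { query: 'avg:mem{role:mysql} < 0.2' },
  'low memory on redis' => { query: 'avg:mem{role:redis} < 0.1' }
}

plan = sync_plan(stored, generated)
puts "create: #{plan[:create].inspect}"
puts "update: #{plan[:update].inspect}"
puts "unchanged: #{plan[:unchanged].inspect}"
```

Only the names under create and update would result in requests, which is what keeps traffic low when most definitions are untouched.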
Interferon is also scheduled to run every hour to pick up infrastructure changes in order to keep Datadog in sync when new hosts and roles are introduced.
Interferon also has dry-run functionality, which allows teams to see what changes would be made. We encourage people to deploy their branch to an instance that runs Interferon in dry-run mode before deploying to production. The output from dry-run displays the delta between the branch and what is stored in Datadog.
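A per-alert delta of the kind a dry run reports might be computed along these lines. The field names, the output format, and the field_delta and dry_run_report helpers are assumptions for illustration; Interferon's actual dry-run output may look different.

```ruby
# Return only the fields whose values differ, as { field => [before, after] }.
def field_delta(stored, proposed)
  (stored.keys | proposed.keys).each_with_object({}) do |key, delta|
    delta[key] = [stored[key], proposed[key]] unless stored[key] == proposed[key]
  end
end

# Render one human-readable line per changed field; nothing is sent anywhere.
def dry_run_report(name, stored, proposed)
  field_delta(stored, proposed).map do |key, (before, after)|
    "#{name}: #{key} #{before.inspect} -> #{after.inspect}"
  end
end

stored   = { query: 'avg:mem{role:web} < 0.1', message: 'Memory is low.' }
proposed = { query: 'avg:mem{role:web} < 0.2', message: 'Memory is low.' }

puts dry_run_report('low memory on web', stored, proposed)
```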
Sharing the Love
We have been using Interferon in production since late 2014 and have been very satisfied with the impact it has made. Both Interferon and a sample alert repository are open source and available on GitHub. If you are interested in using Interferon, you can start by cloning the example repository. The example repository contains a few example alert definitions as well as a few custom host sources. It is especially convenient if you are running on AWS because Interferon includes built-in AWS host sources.
There is no need to submit PRs if you want to define custom host sources, group sources, or definitions because you can place them into your own custom alerts repository for Interferon to find. For example, our internal alerts repository includes Airbnb-specific host sources like the names of our Resque queues along with who owns them and how full they should be before we start alerting about them. When writing custom sources, we recommend starting with a static host source, then rewriting it to pull dynamic information as necessary.
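The static-first approach might look like the sketch below. Everything here is hypothetical: the class names, the list_hosts interface, the queue names, and the max_depth field are assumptions used to show the progression, so check the example repository for the interface Interferon actually expects.

```ruby
# Step 1: a static source that returns hard-coded queue records.
class StaticQueueSource
  QUEUES = [
    { name: 'mailers',         owners: ['alice'], max_depth: 1_000 },
    { name: 'search_indexing', owners: ['bob'],   max_depth: 50_000 }
  ].freeze

  def list_hosts
    QUEUES.map { |queue| queue.merge(source: 'resque_queue') }
  end
end

# Step 2: the same interface, but records come from a callable that could
# later query a live system. Alerts written against step 1 keep working.
class DynamicQueueSource
  def initialize(fetcher)
    @fetcher = fetcher
  end

  def list_hosts
    @fetcher.call.map { |queue| queue.merge(source: 'resque_queue') }
  end
end

static  = StaticQueueSource.new
dynamic = DynamicQueueSource.new(-> { [{ name: 'mailers', owners: ['alice'], max_depth: 2_000 }] })

puts static.list_hosts.map { |queue| queue[:name] }.inspect
puts dynamic.list_hosts.first[:max_depth]
```

Keeping the record shape identical between the two versions means the alert definitions do not need to change when the source becomes dynamic.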
If you have written interesting general-purpose host sources, group sources, or destinations, we welcome contributions back to the upstream project: https://github.com/airbnb/interferon.