At Airbnb, we do not have an engineering operations team (as of 2017), so each team is responsible for configuring monitoring for its own services and responding to problems with them. We use Datadog to monitor our infrastructure and alert on its health. Datadog works well and provides many features, but we had some specific requirements around alerting:

  • Generic alerts need to notify different people depending on the host or role.
  • Changes to alert definitions are automated so that alerts stay up to date as our infrastructure changes.
  • Teams have insight into which alerts they are receiving and are able to discuss creating, modifying, or deleting alerts using a standard code-review process. …
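
To make the first two requirements concrete, here is a minimal Python sketch of one way to expand a single generic alert template into per-host definitions with role-based recipients. It is an illustration only, not the tooling described in this post: the `ROLE_OWNERS` mapping, the team handles, the `Host` type, and the `build_alerts` function are all hypothetical, and the query string only imitates the general shape of a Datadog monitor query.

```python
from dataclasses import dataclass

# Hypothetical mapping from a host's role to the team that should be notified.
ROLE_OWNERS = {
    "web": "@team-frontend",
    "db": "@team-storage",
}
DEFAULT_OWNER = "@team-infra"  # hypothetical fallback for roles with no owner


@dataclass(frozen=True)
class Host:
    name: str
    role: str


def build_alerts(hosts, metric="system.disk.in_use", threshold=0.9):
    """Expand one generic alert template into one definition per host."""
    alerts = []
    # Sort by host name so the generated output is stable between runs,
    # which keeps diffs small when the definitions are code-reviewed.
    for host in sorted(hosts, key=lambda h: h.name):
        owner = ROLE_OWNERS.get(host.role, DEFAULT_OWNER)
        alerts.append({
            "name": f"High disk usage on {host.name}",
            "query": f"avg(last_5m):avg:{metric}{{host:{host.name}}} > {threshold}",
            "message": f"Disk usage is above {threshold:.0%} on {host.name}. {owner}",
            "notify": [owner],
        })
    return alerts


if __name__ == "__main__":
    inventory = [Host("web-1", "web"), Host("db-1", "db"), Host("cache-1", "cache")]
    for alert in build_alerts(inventory):
        print(alert["name"], "->", alert["notify"])
```

Because the definitions are generated from the current host inventory, rerunning the generator whenever the inventory changes would keep the alerts current, and keeping the template and owner mapping in version control lets changes go through ordinary code review.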

