At Airbnb, we do not have an engineering operations team (as of 2017), so individual teams are responsible for configuring monitoring and responding to problems with their own services. We use Datadog to monitor our infrastructure and alert on its health. While Datadog works well and provides many features, we had some specific requirements around alerting:

  • Generic alerts need to alert different people depending on the host or role.
  • Alert definition changes are automated so that alerts stay up to date as our infrastructure changes.
  • Teams have insight into which alerts they are receiving and are able to discuss creating, modifying, or deleting alerts using a standard code-review process. …
