How We Migrated Datadog Alerts to Sentry and Reduced Production Monitoring Overhead

Ran Mochary
Published in Gong Tech Blog · Jun 19, 2022 · 6 min read

At Gong, we develop an AI-based business solution that improves performance for sales teams. Our software captures and analyzes communication across multiple channels and provides insights to help users win more deals.

Sales teams rely on our platform day in and day out for their calls and deals, so it has to stay up and running, and we have to be able to respond to issues in a timely way.

The Challenge

Since we don’t have an infrastructure team (read more about that here), infrastructure duties are handled on a rotating basis by developers.

Every week, one of the developers serves as the “on-call duty owner,” in charge of monitoring production and making sure there aren’t any issues affecting end users.

In practice, being the on-call duty owner is a full-time job with tremendous overhead and fatigue, primarily because of the sheer number of alerts that have to be triaged.

We’ve traditionally managed our production alerts using Datadog, which then forwards them to OpsGenie. It’s up to the on-call developer to review system alerts in OpsGenie and delegate them to the appropriate team members to troubleshoot.

There were several issues with this setup:

  1. Too many non-urgent and/or recurring alerts were landing on the on-call duty owner’s plate. These alerts distracted the duty owner from the issues that actually needed immediate attention because they were affecting our customers and users in real time.
  2. All alerts were being funneled from Datadog into OpsGenie. We found that:
  • Many alerts are triggered by transient circumstances, such as a temporary load spike, and close automatically when those circumstances change. The duty owner often missed such alerts, especially when they occurred at night.
  • OpsGenie is a better fit for urgent alerts; it was difficult for us to manage non-urgent alerts with it.

To illustrate: we have 371 Datadog monitors configured to send alerts to OpsGenie. Most of them (316) are not urgent. In a single week, we received non-urgent alerts from 150 different monitors, some of them recurring.

Datadog alerts in OpsGenie

To reduce the overhead on the on-call duty owner and handle alerts better, we decided to explore different options for handling production alerts. In particular, we wanted to forward non-urgent alerts directly to the team members who could work on them at their own pace, rather than mix them in with urgent alerts. We ultimately chose Sentry, which we had already used successfully in other scenarios.

Options We Explored

First, we tried the Datadog integration with Amazon SNS, but the notifications contained only partial data, which required additional calls to the Datadog API to fill in. The SNS integration also didn’t work when server-side encryption was enabled on the AWS side.

Next, we tried the Datadog webhooks integration, which required a webhooks server. We considered using an existing webhooks server, but we decided we didn’t want to mix concerns with critical loads like Zoom native recordings.

We also considered creating a dedicated webhooks server but opted not to as it didn’t offer any added value over a generic pipeline using AWS API Gateway and Simple Queue Service (SQS), which already provides authentication, rate limiting, encryption, and so on out of the box.

The Solution

We found that OpsGenie works best for urgent alerts so we still use it to handle those. However, we also needed a way to deal with non-urgent alerts that clog up the system, and OpsGenie wasn’t the answer.

We’ve had a good experience working with Sentry for managing developer workflows for production error logs, so we opted to use Sentry’s application monitoring and error-tracking software for non-urgent Datadog alerts as well.

With Sentry, we can assign alerts to specific individuals or groups, which frees up the duty owner to deal with more pressing needs.

A big advantage of Sentry is that we can group alerts into issues based on a shared “fingerprint,” so they can all be handled at once, unlike OpsGenie, which creates a separate event for each alert. This is especially helpful for recurring alerts: they are grouped together rather than each one appearing individually.
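To make the grouping idea concrete, here is a minimal sketch of how a fingerprint could be derived from an alert. The payload fields (`monitor_id`, `title`) are assumptions for illustration, not Gong’s actual schema; in practice the fingerprint would be set on the Sentry event before sending it with the Sentry SDK.

```python
# Sketch: derive a grouping "fingerprint" from a Datadog alert so that
# Sentry folds recurring alerts from the same monitor into one issue.
# Field names here are hypothetical, not the real payload schema.

def fingerprint_for(alert: dict) -> list:
    # Every alert from the same monitor maps to the same fingerprint,
    # so Sentry shows one issue with N occurrences instead of N events.
    return ["datadog", str(alert["monitor_id"])]

a1 = {"monitor_id": 42, "title": "CPU > 90% on host-a"}
a2 = {"monitor_id": 42, "title": "CPU > 90% on host-b"}
assert fingerprint_for(a1) == fingerprint_for(a2)  # grouped together
```

Because the fingerprint ignores everything except the monitor identity, two alerts that differ only in host or timing collapse into the same Sentry issue.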

Datadog alerts in Sentry

However, while Sentry provides an integration for sending events from Sentry to Datadog, we needed one in the opposite direction: from Datadog to Sentry. To do so, we built a custom Datadog-to-Sentry bridge.

We configured Datadog to use the Datadog webhooks integration to send events to an AWS API Gateway endpoint, protected by an AWS Web Application Firewall (WAF). We then configured the API gateway to forward the Datadog events to Amazon SQS. Finally, a periodic task on the Gong server polls the relevant SQS queue, consumes the Datadog events, converts them to Sentry events, and sends them to Sentry using the Sentry SDK client.
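The Datadog webhooks integration lets you define a custom JSON payload using its built-in template variables. A minimal payload for a bridge like this might look as follows; the field names on the left are our own hypothetical choices, while the `$…` variables are Datadog’s:

```json
{
  "alert_id": "$ALERT_ID",
  "event_title": "$EVENT_TITLE",
  "event_msg": "$EVENT_MSG",
  "alert_transition": "$ALERT_TRANSITION",
  "link": "$LINK",
  "date": "$DATE"
}
```

Whatever shape you choose here is the shape the consumer on the other end of the queue has to parse.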

We could have used AWS Lambda instead of a periodic task on the Gong server to consume events from the queue and forward them to Sentry, but it was easier to use our existing Gong frameworks.

Datadog-to-Sentry bridge
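The consuming side of the bridge can be sketched as a small polling pass. The `receive` and `send` callables are injected here to keep the sketch self-contained and testable; in production they would wrap the SQS client (e.g. boto3’s `receive_message`) and the Sentry SDK. The webhook field names are assumptions, matching nothing more than a hypothetical payload.

```python
# Sketch of the bridge's consumer loop: read raw webhook payloads from
# the queue, convert each one into a Sentry event dict, and send it.
# `receive`/`send` are injected stand-ins for SQS and the Sentry SDK.
import json

def to_sentry_event(dd: dict) -> dict:
    """Convert one (hypothetical) Datadog webhook payload into a
    Sentry event dict with a fingerprint for grouping."""
    alert_id = str(dd.get("alert_id", ""))
    return {
        "message": dd.get("event_title", "Datadog alert"),
        "level": "warning",
        "tags": {"datadog_alert_id": alert_id},
        # Same monitor -> same fingerprint -> one grouped Sentry issue.
        "fingerprint": ["datadog", alert_id],
    }

def drain(receive, send) -> int:
    """One polling pass: returns the number of events forwarded."""
    handled = 0
    for body in receive():  # raw JSON message bodies from the queue
        send(to_sentry_event(json.loads(body)))
        handled += 1
    return handled
```

Keeping the conversion pure makes it trivial to unit-test the mapping without touching AWS or Sentry at all.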

Once we put the Datadog-to-Sentry bridge in place, we needed a watchdog to make sure the alert system was indeed working properly.

To do this, we set up an integration beacon, which triggers a specific Datadog monitor that sends out alerts every 10 minutes.

On the other side, a periodic task calls the Sentry API to make sure the beacon events are coming through in Sentry. If they stop arriving, something is wrong, and an alert goes out to OpsGenie so we can get the monitoring system back up and running as soon as possible.

Configuration and Logging

If you want to configure the AWS components of this pipeline, you can do so by following the instructions in Webhook Processing with API Gateway and SQS.

When events go through the pipeline, there’s a chance that some won’t reach SQS.

CloudWatch logs are useful for debugging these issues, but they’re turned off by default. Instructions for turning them on can be found at How do I turn on CloudWatch Logs for troubleshooting my API Gateway REST API or WebSocket API?

After the events are sent through the pipeline, you’ll find their logs in two AWS CloudWatch log groups:

1. Execution logs, with a name like API-Gateway-Execution-Logs_<apiId>/<stageName>

2. Access logs, with the name you configured in the “Access Log Destination ARN” field in the API Gateway Logs/Tracing configuration

The Results

We’ve still got some kinks to work out, but by using OpsGenie for urgent events and letting Sentry handle the others, we’ve been able to automate more and reduce production management overhead.

With Sentry, alert management is distributed. We can automatically assign non-urgent alerts (or groups of alerts) to certain teams, team members, or work owners, and they can handle them at their convenience as part of their regular workflow.

As a result, the on-call duty owner can focus exclusively on the urgent alerts that need to be resolved immediately in order to keep our product up and running for our customers, without being bogged down by non-urgent alerts and without having to invest time or effort into funneling them to other team members.

Do you want to be part of a team that creates solutions and finds better ways to work together? We’re pretty unique here at Gong and we’re hiring!
