Manage New Relic Alerts Using the REST API

Yagizhan Kanbur
Published in Trendyol Tech · Jan 12, 2023

As the Pudo (Pick Up Drop Off Points) team, we had around 50 microservices that we needed to set up alerts for. If you have ever used New Relic's interface, you know that creating and managing alerts there is a long process, especially when you are trying to create lots of alerts at once. So we designed a different way to set up and manage alerts using the New Relic REST API.

New Relic REST API

The New Relic REST API has lots of endpoints for different uses, but here we are interested in the alert conditions endpoints. We used these endpoints to create or update our alerts.

Both the create and update request payloads have the same fields; the only difference is the endpoint:

PUT 'https://api.newrelic.com/v2/alerts_conditions/{id}.json' // Update endpoint
POST 'https://api.newrelic.com/v2/alerts_conditions/policies/{policy_id}.json' // Create endpoint

{
  "condition": {
    "type": "apm_jvm_metric",
    "name": "Foo Service - CPU Utilization Time",
    "enabled": true,
    "entities": [
      "123456789"
    ],
    "metric": "cpu_utilization_time",
    "violation_close_timer": 1,
    "terms": [
      {
        "duration": "5",
        "operator": "above",
        "priority": "critical",
        "threshold": "90",
        "time_function": "all"
      },
      {
        "duration": "5",
        "operator": "above",
        "priority": "warning",
        "threshold": "80",
        "time_function": "all"
      }
    ]
  }
}

The payload above looks complicated at first, but it is easy to understand once you learn what each field does. So let's look at these fields one by one.

Condition: describes a monitored data source and the behavior of that data source that will be considered a violation.

Type: defines the type of metric that will be used for the alert, such as "apm_jvm_metric" or "apm_app_metric".

Name: is up to you; it lets you identify the condition.

Enabled: is for enabling or disabling a condition. Either true or false.

Entities: are application ids identifying the objects that will be monitored with your condition.

Metric: is the exact parameter to monitor, based on the type, such as "response_time" or "cpu_utilization_time".

Violation Close Timer: is used to automatically close instance-based violations after the specified number of hours.

Duration: is the time (in minutes) for the condition to persist before triggering a violation.

Operator: determines what comparison will be used between the metric and the threshold value to trigger a violation.

Priority: corresponds to the severity level selected, either critical or warning.

Threshold: is the threshold that the metric must be compared to using the operator for a violation to be triggered.

Time Function: is either all or any, corresponding to "for at least" and "at least once in" respectively.

You can find more about these fields in the New Relic documentation.
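To actually send this payload you also need your REST API key in a request header. Here is a minimal sketch in Python; the file name, policy id, and the Api-Key header name are illustrative and may differ for your account and key type:

import json
import os

import requests

# Load the payload shown above and create the condition under a policy.
headers = {"Api-Key": os.environ["NEW_RELIC_API_KEY"]}
with open("cpu-utilization.json") as f:
    payload = json.load(f)

resp = requests.post(
    "https://api.newrelic.com/v2/alerts_conditions/policies/987654.json",
    headers=headers,
    json=payload,
)
resp.raise_for_status()  # fail loudly if New Relic rejects the request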

GET 'https://api.newrelic.com/v2/alerts_conditions.json' // Get List

The list endpoint takes two parameters in its request, policy_id and page. It returns all the conditions created inside a policy, page by page.
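For example, with both parameters passed as query parameters:

GET 'https://api.newrelic.com/v2/alerts_conditions.json?policy_id={policy_id}&page={page}'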

Design

We stored these payloads in a GitLab repository, separated into directories according to which alert policy they reside in and which application they belong to (Foo Policy / Foo Service / cpu-utilization.json).
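The layout looks roughly like this (the second file name is just an illustration):

Foo Policy/
    Foo Service/
        cpu-utilization.json
        response-time.json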

*The payload from earlier also represents cpu-utilization.json.

When a change is made in the GitLab repository, a script sends the changed file names to our observability API's syncAlerts endpoint. Our service then fetches these documents from the GitLab repository, fetches the alert list from New Relic (the list endpoint mentioned above), and decides whether each alert already exists. If the alert exists on New Relic, it sends an update request with the payload fetched from GitLab; if it does not exist, it sends a create request.
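A rough sketch of that create-or-update decision in Python follows. The requests library, matching conditions by name, and the helper names are my illustration of the idea, not our actual implementation:

import os

import requests

API_KEY = os.environ["NEW_RELIC_API_KEY"]  # REST API key; header name may vary with key type
HEADERS = {"Api-Key": API_KEY, "Content-Type": "application/json"}
BASE = "https://api.newrelic.com/v2"

def list_conditions(policy_id):
    """Fetch every alert condition of a policy, page by page (the list endpoint above)."""
    conditions, page = [], 1
    while True:
        resp = requests.get(f"{BASE}/alerts_conditions.json",
                            headers=HEADERS,
                            params={"policy_id": policy_id, "page": page})
        resp.raise_for_status()
        batch = resp.json().get("conditions", [])
        if not batch:
            return conditions
        conditions.extend(batch)
        page += 1

def sync_alert(policy_id, payload):
    """Create the condition if it is missing on New Relic, otherwise update it."""
    existing = {c["name"]: c["id"] for c in list_conditions(policy_id)}
    name = payload["condition"]["name"]
    if name in existing:
        url = f"{BASE}/alerts_conditions/{existing[name]}.json"          # update endpoint
        requests.put(url, headers=HEADERS, json=payload).raise_for_status()
    else:
        url = f"{BASE}/alerts_conditions/policies/{policy_id}.json"      # create endpoint
        requests.post(url, headers=HEADERS, json=payload).raise_for_status()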

Apart from the standard creation and update of alerts, a scheduler runs every night to compare the alerts in our repository with the alerts that exist on New Relic and synchronizes the alerts on New Relic based on the definitions in the GitLab repository. That way we can ensure that any change made through the New Relic interface is only temporary and that the alert definitions in the GitLab repository remain the source of truth.
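Conceptually, the nightly job is just a loop over every definition in the repository checkout, reusing the sync logic from the sketch above; again, only an illustration of the idea:

import glob
import json

def reconcile_all(repo_root, policy_id_for):
    """Re-apply every alert definition found in the repository checkout.

    policy_id_for is a hypothetical helper that maps a file path like
    'Foo Policy/Foo Service/cpu-utilization.json' to the New Relic policy id.
    sync_alert is the function from the sketch above.
    """
    for path in glob.glob(f"{repo_root}/**/*.json", recursive=True):
        with open(path) as f:
            payload = json.load(f)
        sync_alert(policy_id_for(path), payload)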

Iteration

Having lots of alerts is tricky. Alerts need to be precise and should only trigger when there is a problem that needs to be solved. It is nearly impossible to set up perfect alerts from the get-go, and even if you could, services evolve and change every day, which changes the behavior of the alerts as well. So we've been using New Relic's dashboard feature to support our alerting system and give us valuable insight into alerting patterns. With it, we've been able to spot certain repetitive conditions that kept triggering alerts and either solve the underlying problems or adjust the alerts to prevent false positives.

We use dashboards to see which alerts triggered in a given week or month, and to see statistics about how many times each condition triggered in a certain period. This helps us see which alerts are repetitive or poorly defined and therefore trigger lots of false positives.

I will not dive deep into dashboards in this post; it would make the post longer than necessary, and it is a bit outside the main subject.

Conclusion

As our services evolve and new projects and features are developed every day, we need to actively change and improve existing alerts and add new ones.

We've been using this system for a while now, and it has made it easier for us to create new alerts and update existing ones by cutting the time we spend in the New Relic interface. It also acts as a single place where we can easily see which alerts we have.

If you have lots of alerts that you are constantly trying to improve, or you are planning to set up lots of alerts, a similar solution may help you.

Thank you for your time and feel free to contact me if you have any questions.
