How to Implement Service Level Objectives in New Relic APM

Use NRQL query language to build service level objective (SLO) monitoring and unlock insights into the reliability of your apps and services

Published in

Slalom Build

7 min readMay 11, 2021

APM platforms like DataDog and Prometheus include native SRE features that support SLO based monitoring of system reliability. These features are unavailable in New Relic but similar SRE monitoring behaviors can be crafted using New Relic Query Language (NRQL).

Overview

In this article, we will explore design considerations and implementation details of latency and availability (5xx) SLOs in New Relic. Right-sizing reliability is beyond the scope of this article but readers may find this point-of-view useful for thinking about what level of reliability makes sense for your apps.

ColeSLAw

Before jumping in, let’s do a quick review of service level indicators (SLIs), service level objectives (SLOs) and service level agreements (SLAs). These terms are often used interchangeably when talking about systems reliability. I fondly call these collective concepts “Coleslaw”. I have a bias towards defining these concepts from a user’s perspective. After all, users expect products and services to work on their terms!

An SLI approximates a user experience and is best expressed as a latency, availability, or quality (degraded response) metric.
An SLO measures a user experience metric based on a reliability threshold over time.
An SLA is a guarantee of reliability and may have punitive repercussions when a SaaS provider fails to meet its contractual obligations for its customers. Organizations who publish SLAs must operate SLOs at higher reliability thresholds and allocate engineering dollars accordingly.

Photo by Sebastian Coman Photography on Unsplash

Implementation Guide

The first implementation step starts with a reliability formula definition based on a service level indicator (SLI) we want to measure. In this article, we are focusing on latency and availability metrics.

Typically, we express the SLO formula as a proportion of:

 invalid events / valid events

Next, create an NRQL template and parameterize the inputs based on the reliability formula. Depending on the infrastructure as code (IaC) provider used by your DevOps team, use the templates to churn out and deploy SLOs with different thresholds. Otherwise, plug and chug the parameter values and then execute within the NRQL query window.

Latency SLO

Reliability Formula

This reliability formula defines the number of events that “can” exceed a latency threshold.

/*
Sum the residual latency times of all events > latency {threshold}
Divided by 
The sum of all transaction times multiplied by a reliability threshold (1 - NINES/100)
*/100 * (sum(((duration - {threshold}) > 0) 
/ 
sum(duration) * (1 - NINES/100))

NRQL Reliability Template

Create the latency SLO query based on the reliability formula definition from above. After testing the queries, replace hard-coded values with parameter tokens that align with your IaC provider and pipes automation.

SELECT 
  100 *     
  (filter(sum(duration - {threshold}), WHERE (duration - {threshold}) > 0))      
  /     
  (sum(duration) * (1 - {NINES}/100)) as `Error Budget Burn %`   
FROM 
  {PageView || Transaction}    
SINCE {period} DAYS AGO  
WHERE appName = {app}

NRQL Reliability Example

Measure the residual duration of all events whose duration is greater than a 3 second threshold.
Calculate an error budget for 2 nines of reliability (99) — the amount of budgeted time that events “can” take longer than 3 seconds.
Calculate the user experience near the load balancer— use the Transaction event table.
Monitor over a 7 day rolling window.
Express the error budget burn, what has been spent, as a percentage.

SELECT 
  100 *     
  (filter(sum(duration - 3.0), WHERE (duration - 3.0) > 0))      
  /     
  (sum(duration) * (1 - 99/100)) as `Error Budget Burn %`   
FROM 
  Transaction    
SINCE 7 DAYS AGO 
WHERE appName = 'myApp'

NRQL Reliability Trend Example

This NRQL query is similar to the example above but shows how to visualize a latency SLO over time.

SELECT 
   100 *     
   (filter(sum(duration - 3.0), where (duration - 3.0) > 0 ))     
   /     
   (sum(duration) * 0.01 ) as `Burn Trend %`  
  ,99 as `SLO`  
  ,100 as `Max`  
FROM 
  Transaction  
SINCE 7 DAYS AGO TIMESERIES AUTO  
WHERE appName = 'myApp'

Availability SLO

Reliability Formula

This reliability formula defines the number of events that “can” return 5xx HTTP responses.

/*
Count all 'valid' events multiplied by a reliability threshold {NINES} less the count off all 'invalid' events
Divided by 
All 'valid' events times multiplied by a reliability threshold {NINES}
*/100 * 
(count(valid-events) * (1 - {NINES}/100)) - count(invalid-events))
/ 
count(valid-events) * (1 - {NINES}/100))

NRQL Reliability Template

Create the availability SLO query based on the reliability formula definition from above. After testing the queries, replace hard-coded values with parameter tokens that align with your IaC provider and pipes automation.

The concept of valid and invalid events map to Transaction and TransactionError event tables respectively. NRQL doesn’t support table joins but multiple event tables can be referenced in the FROM clause. Use the filter() function to perform inline calculations. Otherwise, fields common to both event tables will distort calculations especially with wildcards.

SELECT 100 - (100 * 
   (((filter (count(*), WHERE EventType() = 'Transaction') * 1 - {NINES}/100)                 -  (filter (count(*), WHERE ((EventType() = 'TransactionError' and (httpResponseCode like '5%' ))))))     
  /
(filter (count(*), WHERE EventType() = 'Transaction') * 1- {NINES}/100))) 
    as `Error Budget Burn %` 
FROM 
  Transaction, TransactionError   
WHERE appName = {appName}
SINCE {period} DAYS AGO

NRQL Reliability Example

Calculate an error budget for one nine of reliability — the percentage of transactions that “can” return 5xx HTTP responses.
Calculate the user experience near the load balancer — use Transaction event table.
Monitor over a 7 day rolling window.
Express the error budget burn, what has been spent, as a percentage.

SELECT 100 - (100 * 
   (((filter (count(*), WHERE EventType() = 'Transaction') * 0.10)                 -  (filter (count(*), WHERE ( (EventType() = 'TransactionError' and (httpResponseCode like '5%' ))))))     
  /
(filter (count(*), WHERE EventType() = 'Transaction') * 0.10))) 
    as `Error Budget Burn %` 
FROM 
  Transaction, TransactionError   
WHERE appName = 'myApp'  
SINCE 7 DAYS AGO

NRQL Reliability Trend Example

This NRQL query is similar to the example above but shows how to visualize an availability SLO over time.

SELECT 
  100 - 
    (100 * 
    (((filter (count(*), WHERE EventType() = 'Transaction') * 0.10)                      - 
    (filter (count(*), WHERE ( (EventType() = 'TransactionError' and  (httpResponseCode like '5%' ))))))     
/     
    (filter (count(*), WHERE EventType() = 'Transaction') * 0.10)))  as `Error Budget Burn %`   
,100 as `Max`   
,90 as `SLO` 
FROM 
  Transaction, TransactionError   
WHERE appName = 'myApp'  
SINCE 14 DAYS AGO TIME SERIES AUTO

General Usage and Implementation Notes

SLOs generate reliability data that open new opportunities for improving products and operations efficiency. Experimentation and right-sizing reliability based on the unique characteristics of the systems architecture is a good first step for identifying actionable work items.

Experiment with “nines” and thresholds when implementing SLOs on existing systems — establish SLO baselines.
Combine multiple services for broader, top-level views of reliability.
NRQL PageEvents table provides round-trip request/response data including time to first byte (TTFB) however, ISP network latency may distort reliability.
Alert when error budgets approach warning and critical thresholds— an indication of something going wrong, though not right away. Consider automating ticket generation for the team and investigate warning SLOs.
Review SLOs over time and look for trends. Sprint retrospectives and weekly Ops meetings are good places to bring engineering and business stakeholders together. Is the team shipping too many features at the expense of reliability?
The NRQL parsing engine is sensitive to order of operations and parentheses with complex inline calculations — incrementally test each part of the query.
A quality or degraded response SLO requires customization of the Transaction event table to track additional data for event filtering. For those apps that incorporate circuit breakers, consider modifying your APM monitoring application code and write additional state to the Transaction event table. Degraded response trends provide additional reliability insights.

Reliability Dashboard

The redacted dashboard below shows our SLOs in action.

The first column illustrates SLO error budgets —transactions that can fail or can take longer. The middle column provides summary data of the various calculations used in the SLO. And finally, the last column illustrates how the SLO error budget is being spent over time.

Reliability Insights

Does the shape and volume of your app’s traffic follow a pattern over a 24hr, weekday, or weekend periods? If so, observe these traffic patterns and watch how error budgets are impacted as volume changes.

Other insights to look for:

Observe how SLOs change during peak traffic periods. Fluctuations may indicate a need for capacity planning and auto-scaling policy reviews.
Observe how SLOs are impacted during deployments
Observe how SLOs change during peak events such as Black Friday or Super Bowl Sunday
Observe how SLOs change on Monday mornings — cold cache starts impact latency

A Note about PromQL

New Relic includes a Prometheus based query language called PromQL, click here for more details, however, there is no support for SLO monitoring. In the meantime, consider using NRQL as a strategy for surfacing reliability insights until PromQL matures.

Wrapping Up

Start measuring the reliability of your systems by first focusing on availability and latency metrics. Use the NRQL templates to plug in different thresholds and experiment with different “nines” of reliability over various time periods. After tuning and settling on SLO baselines, stand up your reliability dashboard and observe how your apps burn their error budgets over 24hr, 3 day and 7 day windows. Use this data to identify periods of sluggish and non-responsive behavior and add backlog items to address these reliability issues.

Hope you enjoyed the coleSLAw and thanks for reading!