Taking a Dive Into New Relic APM

Keith Smith
Imagine Learning Engineering
14 min read · Jul 13, 2020

This is part 2 of a walkthrough series on using New Relic with containerized Docker applications. Part 1 describes how to modify an existing .NET Core Docker application to run New Relic APM and can be found here. Part 2 will take place in New Relic itself. In this walkthrough, we’re going to discuss monitoring and alerting in depth to understand what should be monitored and what metrics are important to keep an application healthy. We’ll then dive into some of the key components of New Relic and configuring New Relic to create monitoring and alerting that integrates with incident response lifecycles.

For this tutorial, you will need the following access:

  • Access to manage an Application in New Relic.
  • Access to OpsGenie and rights to create integrations.
  • Access to Slack and rights to create integrations.

A free trial of each of these applications is sufficient for this walkthrough. Note, however, that creating alerts in New Relic requires a Pro license.

Before we start a deep dive into New Relic APM, we will start with a quick primer on monitoring and alerting applications. Understanding monitoring and alerting well provides three big wins for your organization:

  • Transparency — I know how services are doing.
  • Understanding — I know how services work and what needs to happen / change for my services to be healthy.
  • Confidence — I am confident in the reliability of my services.

The Four Golden Signals

Every application has four major metrics, signals, or indicators of service health. The four “Golden Signals” are:

  • Latency
  • Traffic
  • Errors
  • Saturation

Effectively measuring and understanding these signals allows for proactively identifying issues and resolving problems as they occur, sometimes before a customer will even tell you that there is a problem.

Latency

Latency is a measure of customer experience. It describes how long it takes to get anything done. This metric has a direct impact on customers, and can be measured in many different places within an application:

  • The client itself.
  • Application load balancer.
  • Server code.
  • Database code.
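Wherever latency is captured, it usually gets summarized as an average plus a tail percentile. Here is a minimal sketch of that summary in Python, assuming you already have a list of request durations in milliseconds. The function name and the nearest-rank percentile method are illustrative choices, not something New Relic prescribes.

```python
from statistics import mean


def latency_summary(durations_ms):
    """Return the average and 95th-percentile latency for a list of durations."""
    if not durations_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(durations_ms)
    # Nearest-rank percentile: the smallest sample that has at least 95% of
    # all samples at or below it. Integer math avoids float rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return {"average_ms": mean(ordered), "p95_ms": ordered[rank - 1]}


summary = latency_summary([100] * 18 + [2000] * 2)
print(summary)
```

With 18 fast requests and 2 slow ones, the average stays fairly low while the 95th percentile lands on a slow request, which is why tail latency is worth watching alongside the average.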

New Relic APM measures latency at the server code level. It’s the first metric you see when you log in: Web transactions time.

New Relic APM Latency

New Relic can be used to measure Latency throughout your application with New Relic browser and application traces in New Relic APM. We’ll take a deeper dive into this later in the tutorial. New Relic also goes a step further and measures latency of individual endpoints and transactions through traces.

New Relic transactions

Traffic

Traffic is a measure of the total volume of work being attempted at a given time. This can be measured in transactions per minute, request rates, etc. Traffic directly correlates with business value. The more traffic an application can handle, the more value it can provide to the business.
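To make "transactions per minute" concrete, here is a small Python sketch that counts requests per one-minute bucket. It assumes request timestamps are available as epoch seconds. The function name is hypothetical and the code is only an illustration of the metric, not how New Relic computes throughput.

```python
from collections import Counter


def requests_per_minute(timestamps_s):
    """Count requests in each one-minute bucket, keyed by the bucket start."""
    # Floor every timestamp to the start of its minute, then count per bucket.
    buckets = Counter((int(ts) // 60) * 60 for ts in timestamps_s)
    return dict(sorted(buckets.items()))


print(requests_per_minute([0, 10, 59, 60, 125]))
```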

Once again New Relic highlights this signal right on the first screen of New Relic APM in the Throughput graph:

New Relic throughput

Errors

If I had to choose a favorite golden signal, Errors would be it.

Errors provide a nice, defined target to aim towards. Error rate measures the ratio of failed requests to total requests, which is a direct measure of impact on customer experience. The goal of any application should be to get down to what I call “Error Rate Zero”. Once the baseline is zero errors, any new error that crops up is a lot easier to find.

Another very cool aspect of error rate is that its complement (100% minus the error rate) is your application success rate, or the rate of successful requests to your application.
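The relationship between the two rates is simple arithmetic. A minimal sketch, assuming you have a total request count and a failed request count for the same time window (the function name is illustrative):

```python
def error_and_success_rate(total_requests, failed_requests):
    """Return (error_rate, success_rate) as fractions between 0 and 1."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    error_rate = failed_requests / total_requests
    # Success rate is one minus the error rate.
    return error_rate, 1.0 - error_rate


error_rate, success_rate = error_and_success_rate(200, 5)
print(f"errors: {error_rate:.1%}, success: {success_rate:.1%}")
```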

New Relic’s primary purpose is to aid in finding and resolving errors, and Errors are featured prominently on the front page again:

New Relic Error rate

Saturation

Of all the golden signals, saturation is the most difficult to understand and the most difficult to measure. That is because saturation can be different for different applications.

Saturation is a measure of how close you are to reaching capacity. It provides a direct measure of scalability and is essential for capacity planning. In a nutshell, saturation is the metric that “fails first” when an application is under extreme load. This can be memory, CPU, and in some cases storage capacity. Knowing what the saturation metric is for an application takes careful observation of how an application reacts under load. Saturation is identified by performing application load testing or reviewing what happened to an application that experienced higher-than-expected load.
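One way to think about "closest to capacity" is to express each resource as a utilization fraction and look for the highest one. The sketch below assumes you have usage and capacity numbers for each resource. The resource names and values are made up for illustration.

```python
def most_saturated(usage, capacity):
    """Return (resource, utilization) for the resource closest to capacity."""
    # Utilization is usage divided by capacity, so 1.0 means fully saturated.
    utilization = {name: usage[name] / capacity[name] for name in capacity}
    name = max(utilization, key=utilization.get)
    return name, utilization[name]


usage = {"cpu_cores": 3.0, "memory_gb": 7.5, "disk_gb": 40.0}
capacity = {"cpu_cores": 4.0, "memory_gb": 8.0, "disk_gb": 100.0}
print(most_saturated(usage, capacity))
```

In this made-up example memory is the resource nearest its limit, so it would be the first candidate for the application's saturation metric. Load testing is still needed to confirm which resource actually fails first.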

New Relic features some common saturation metrics at the bottom of the application page in APM, broken out by host in either table or graph form:

Host-level metrics in New Relic APM (table). CPU and Memory are the most common Saturation metrics.
Host-level metrics in New Relic APM (graph).

Creating Alerts on the Four Golden Signals

Now that you understand what to monitor and why each of these Four Golden Signals is important, let’s dig into creating alerts on these metrics in New Relic.

To get started with this, select Alerts from the top-right navigation menu.

Alerts are broken down into two sections, Alert policies and Notification channels. Alert policies are where alert thresholds are defined. Notification channels are where notifications are sent when an alert policy is violated. Before an alert policy is set up, notification channels should be created. Select Notification channels from the top navigation.

At the top right, select the + New notification channel to create an integration. For this tutorial, we’ll create two, one for Slack and one for OpsGenie. The reason we’re creating two is that OpsGenie is great for production workloads and Incident Response Management. Slack is great for general notifications and non-prod workloads.

Adding Slack Integration

Open up a new browser tab for this next section. We’ll be coming back to New Relic in a moment. To add a Slack integration, the first step is to find New Relic Alerts in the Slack App Directory and add it to Slack. The Slack app directory can be found at <organization>.slack.com/apps. Once the app is approved for your workspace, select Add to Slack to begin the process of creating the integration.

New Relic Alerts Slack App

Adding the New Relic integration is simple. Choose a channel and select Add New Relic Integration.

You can follow the instructions in the New Relic Alerts [Beta] section to configure a new Notification channel. For some reason New Relic hasn’t updated these instructions since New Relic Alerts came out of Beta several years ago, but they are accurate if you ignore the word “Beta”.

Go back to the browser tab with New Relic Notification Channels to configure the Slack channel integration.

  • Select Slack from the dropdown of Notification Channel types.
  • Name the channel.
  • Add the webhook URL from Step 2 of the configuration instructions.
  • Add the channel name (optional).
  • Select Create channel.

Once the integration is created, select Send a test notification to validate that the integration is working. You should get a 200 response back from New Relic and see an incident created in your Slack app.

New Relic Slack incident.
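If the test notification does not show up, it can help to check the webhook URL on its own. The sketch below assumes the URL from the configuration instructions behaves like a standard Slack incoming webhook, which accepts a JSON body with a text field. The function names are hypothetical and the URL in the usage comment is a placeholder, not a real webhook.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build a POST request carrying a simple Slack message payload."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Usage (placeholder URL, replace with your own webhook before running):
# request = build_slack_request("https://hooks.slack.com/services/<your-webhook>", "Test")
# print(send(request))
```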

You’re all set up and ready to create an alert policy. Before we do that, let’s set up an OpsGenie integration.

Adding OpsGenie Integration

To create an OpsGenie integration, start by creating an OpsGenie integration in the OpsGenie app. This is found by selecting Settings from the top navigation and selecting Integration list from the left navigation. Search for New Relic and select New Relic Alerts (New).

New Relic OpsGenie integration.

Name the integration, copy the API Key, and Save the integration.

Navigate back to New Relic and create a new Notification Channel in the same way that was done for the Slack integration with the following:

  • OpsGenie channel type.
  • Name the channel.
  • Paste the API key from OpsGenie.
  • Create the Channel.

Selecting what teams are alerted in OpsGenie is outside the scope of this tutorial, but this can be done on either the OpsGenie side or directly in the New Relic integration.

Send a test notification to validate that this integration is working correctly. A 200 response should be returned and an alert should be created in OpsGenie.

New Relic Test violation in OpsGenie.

Creating Alert Policies in New Relic

Now the real fun begins. Navigate to Alert policies in the Alerts section of New Relic. Select + New Alert Policy.

Follow the prompts to create an Alert Policy.

  • Name the Policy.
  • Select Incident Preference. For this tutorial we’re creating an alert policy for a single app, so we’ll aggregate all alerts By Policy.
  • Create Policy.

In practice, an OpsGenie notification channel with its own alert policy for each app makes sense, as the on-call team for each app may change independently of other apps.

Next let’s create some conditions for alerts based on the Four Golden Signals described above. Select Create a condition.

For Categorize, no changes are required. We’re looking at Application Metrics in APM. Select Next to select the entities (app) we’re evaluating alerts against.

Select your app and choose Next to define the metrics to alert on.

Let’s start with Error Rate. Under define thresholds, select Error percentage from the dropdown. If you’ve got an Error rate baselined at 0%, choose 0% for at least 5 minutes (the minimum threshold allowed from New Relic). Create the condition.

OK let’s break down what we just did. For this condition, we set up an alert on Error Rate that will trigger when the Error percentage is greater than 0% for at least 5 minutes. This means that in any 5 minute window if the average Errors go above 0%, an alert will trigger.
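To make that reading concrete, here is a small Python sketch of the windowed check as described above: it flags a violation when any run of five consecutive one-minute error percentages averages above the threshold. This only models the description in this walkthrough. New Relic’s actual evaluation logic may differ, and the function name is hypothetical.

```python
def violates(error_percentages, threshold=0.0, window=5):
    """Return True if any `window` consecutive samples average above threshold."""
    # Slide a fixed-size window across the one-minute samples.
    for start in range(len(error_percentages) - window + 1):
        chunk = error_percentages[start:start + window]
        if sum(chunk) / window > threshold:
            return True
    return False


print(violates([0, 0, 0, 0, 0, 0]))       # no errors in any window
print(violates([0, 0, 0, 2.5, 0, 0, 0]))  # a single bad minute raises the average
```

With a 0% threshold, a single minute containing errors is enough to push a window’s average above the threshold, which is why this setting only makes sense once the baseline really is zero.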

But the alert won’t go anywhere yet! Let’s add the notification channels we created earlier to this policy. Select the notification channels tab at the top of the alert policy. Add notification channels.

Search for both the Slack integration and the OpsGenie integration, select them, then update the policy.

Add additional Alert Conditions for Throughput (web) and Response time (web). Throughput can be used to detect massive increases in traffic (3–10x), such as DDoS attacks or traffic growing beyond expectations. It can also be used to detect when throughput drops to zero, which can be an indicator that the application is unavailable or not working as expected.
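The two throughput cases can be sketched as a simple classification against a known baseline. This is only an illustration of the idea. The function name, the labels, and the default factor of 3 (the low end of the 3–10x range) are assumptions, and in practice the thresholds are set in the alert condition itself.

```python
def classify_throughput(current_rpm, baseline_rpm, spike_factor=3.0):
    """Classify current requests per minute against a normal baseline."""
    if current_rpm == 0:
        # No traffic at all can mean the application is unavailable.
        return "no_traffic"
    if baseline_rpm > 0 and current_rpm >= spike_factor * baseline_rpm:
        # Traffic is several times higher than normal.
        return "spike"
    return "normal"


for rpm in (0, 700, 1600):
    print(rpm, classify_throughput(rpm, baseline_rpm=500))
```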

For Response time, select a threshold that aligns with agreed-upon expectations for your app. If your app has an SLA, you can tie alerts to SLA violations or internal objectives for response time.

When you are done, the conditions should look similar to the screenshot below:

Alert policy for sample application.

Creating alerts on Saturation metrics requires a bit more work. It isn’t done directly in APM, but requires leveraging NRQL and New Relic Infrastructure. This will be described in future posts.

Alerts are now configured for this application, but what do I do now? What happens when I get an alert?

Troubleshooting Application Issues

Let’s head back to the APM section to dive into a few areas that provide details on where errors are being thrown in an application and to investigate where they are coming from.

The first place I often go to when investigating issues is the Error analytics section. This can be found by selecting the title of the Error rate graph or selecting Error analytics in the left navigation.

Error analytics allows you to filter errors by transaction name and error class, and to quickly see error messages as they occur.

From here you can drill down into individual transactions by selecting them and see sample traces including stack traces if they are available. This often helps identify the root of issues related to errors being thrown by applications down to the line of code.

To troubleshoot performance issues, the Transactions section of APM can be used to identify endpoints that have slow response times.

Select Transactions from the left navigation menu or select the Transactions heading from the main screen.

In the dropdown, select Slowest average response time.

Identify slow transactions with APM.

New Relic will also automatically take sample traces for transactions that are performing slowly. You can dig directly into these by selecting them from the table at the bottom of the page.

Visualizing APM Data with Insights

With New Relic, all of the data and metrics that we’ve looked at so far are available to be queried and sliced using New Relic Insights. This can help tremendously with understanding how incoming requests are affecting the application. To start querying this data, select Insights from the top navigation menu.

To query data in Insights, New Relic leverages a query language called New Relic Query Language (NRQL).

NRQL is fairly simple and easy to get started with, but has a tremendous level of depth as you get deeper into understanding it. New Relic makes it easy to get started with tab completion. Let’s run a simple query to see the transactions from a typical application running in production.

SELECT * FROM Transaction WHERE appName = '<app-name>'

This query (SELECT * FROM Transaction) pulls all of the details for each matching transaction into a table, including Response Status Code, Duration (response time), Database Duration (time spent talking to the DB), Transaction name, URL, etc.

These can be further filtered down to individual transactions by filtering on the transaction name:

SELECT * FROM Transaction WHERE appName = '<app-name>' AND name = '<transaction-name>'
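If you find yourself writing these queries for many apps, the two forms above are easy to generate. Here is a minimal Python sketch that only builds the query string. The helper name is hypothetical, and it deliberately rejects values containing single quotes rather than guessing at escaping rules.

```python
def build_transaction_query(app_name, transaction_name=None):
    """Build the SELECT * query shown above for an app and optional transaction."""
    for value in (app_name, transaction_name):
        if value is not None and "'" in value:
            raise ValueError("single quotes are not handled in this sketch")
    query = f"SELECT * FROM Transaction WHERE appName = '{app_name}'"
    if transaction_name is not None:
        # Narrow the results to a single named transaction.
        query += f" AND name = '{transaction_name}'"
    return query


print(build_transaction_query("my-app"))
print(build_transaction_query("my-app", "my-transaction"))
```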

Tables are great, but let’s visualize some data over time. For this we’ll modify the existing query to get average response time for a specific transaction.

SELECT average(duration) FROM Transaction WHERE appName = '<app-name>' AND name = '<transaction-name>' TIMESERIES 1 minute

In this query we are selecting average(duration) to query average response time instead of the transaction details and using TIMESERIES 1 minute to view this metric over time with a 1 minute granularity.
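Conceptually, the query groups transactions into one-minute buckets and averages the duration within each bucket. The Python sketch below shows that idea on made-up data, assuming each event is a (timestamp in seconds, duration in seconds) pair. It illustrates the concept and is not how Insights is implemented.

```python
from collections import defaultdict


def average_duration_per_minute(events):
    """Average durations per one-minute bucket, keyed by the bucket start."""
    buckets = defaultdict(list)
    for timestamp_s, duration_s in events:
        # Floor the timestamp to the start of its minute.
        buckets[(int(timestamp_s) // 60) * 60].append(duration_s)
    return {
        minute: sum(values) / len(values)
        for minute, values in sorted(buckets.items())
    }


print(average_duration_per_minute([(0, 0.25), (30, 0.75), (61, 1.0)]))
```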

Let’s make this chart even more interesting and look at the response codes associated with each of these transactions to identify any trends with average response time. We can do this with FACET.

SELECT average(duration) FROM Transaction WHERE appName = 'Galileo Prod' AND name = 'WebTransaction/ASP/aspx/admin/addstudent.aspx' TIMESERIES 1 minute FACET response.status

This query slices each of the transactions by response code and assigns a color to each response code.
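In other words, FACET produces one time series per distinct value of the attribute. A rough Python sketch of that grouping, assuming made-up events shaped as (timestamp in seconds, status code, duration in seconds); the function name is illustrative:

```python
from collections import defaultdict


def series_by_status(events):
    """Return {status: {minute_start: average_duration}}, one series per status."""
    grouped = defaultdict(lambda: defaultdict(list))
    for timestamp_s, status, duration_s in events:
        # Group first by status code, then by one-minute bucket.
        grouped[status][(int(timestamp_s) // 60) * 60].append(duration_s)
    return {
        status: {
            minute: sum(values) / len(values)
            for minute, values in sorted(minutes.items())
        }
        for status, minutes in sorted(grouped.items())
    }


events = [(0, 200, 0.25), (30, 200, 0.75), (45, 500, 2.0), (70, 200, 1.0)]
print(series_by_status(events))
```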

Creating Custom Dashboards

Insights queries can be added to a custom dashboard at any time simply by adding a title and selecting Add to a dashboard.

From here you can create a new Dashboard or add the chart to an existing dashboard.

Open up the Dashboard by selecting the green bar that appears when you create it or searching for it by selecting All dashboards from the left navigation.

Take some time and create a few more queries and add them to a new dashboard to create a customized view of what is most interesting to you. Some useful samples to get you started:

  • SELECT count(*) FROM Transaction — View the count of transactions, filterable by WHERE and easily combined with TIMESERIES or FACET.
  • SELECT count(*) FROM TransactionError — View the count of transaction errors, filterable by WHERE and easily combined with TIMESERIES or FACET.
  • COMPARE WITH 1 week ago — Compare any query with the same results from a week earlier.

Basic queries not including FACET or TIMESERIES also support the ability to add thresholds to dashboards. For example:

SELECT count(*) FROM TransactionError SINCE 1 day ago

This query allows for a Warning / Critical threshold to be set. This will make the metric go red in the dashboard. Alerting via NRQL is also supported but is different from this function.
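The threshold behaviour amounts to mapping a single number onto a state. A minimal sketch, assuming higher values are worse and that a value equal to a threshold counts as crossing it (New Relic’s exact comparison is not confirmed here, and the names are hypothetical):

```python
def threshold_state(value, warning, critical):
    """Map a metric value to 'ok', 'warning', or 'critical'."""
    if critical < warning:
        raise ValueError("critical must not be below warning")
    # Check the most severe level first.
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


for errors_today in (3, 25, 120):
    print(errors_today, threshold_state(errors_today, warning=10, critical=100))
```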

You now have a functional dashboard. Now that you know some NRQL, let’s go back to the Alerts and add some new more powerful alerts to the alert policy.

Adding Insight Queries to Alert Policies

To add a NRQL Query to New Relic Alerts, go back to APM and your alert policy and add a new condition. This time, select NRQL from the Categorize section and select define thresholds.

What is possible with NRQL Alerting? New Relic states the following:

You can alert on basic NRQL queries that return a numerical value. A basic query is structured like this: SELECT function(attribute) FROM event [WHERE attribute [comparison] [AND|OR ...]]

The following functions and clauses are available in NRQL alerting: apdex, average, count, latest, max, min, percentage, percentile, sum, uniqueCount, =, !=, <, <=, >, >=, AND, OR, IS NULL, IS NOT NULL, IN, LIKE, NOT LIKE, FACET, IS, IS NOT

The full documentation on NRQL Alerting is available here. It is by far the most powerful alerting capability within New Relic.

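Because alert queries must return a numerical value, we’ll reuse the average response time query from above, without the TIMESERIES clause, and add a Baseline Threshold to spot and alert on any outliers in that query.

SELECT average(duration) FROM Transaction WHERE appName = '<app-name>' AND name = '<transaction-name>'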

The violations slider will tighten how far the metric will deviate before alerting. You can adjust that and even set it up to automatically close alerts when the baseline returns to expected levels.

NRQL Baseline Alerting
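To build some intuition for what "deviating from the baseline" means, here is a simple stand-in written in Python. It treats the historical mean as the baseline and flags values more than a chosen number of standard deviations away. This is not New Relic’s baseline algorithm, which is not described in this walkthrough. The deviations parameter plays a role loosely similar to the violations slider.

```python
from statistics import mean, pstdev


def outliers(history, recent, deviations=3.0):
    """Return recent values that fall outside mean +/- deviations * stddev."""
    center = mean(history)
    spread = pstdev(history)
    # A smaller `deviations` value tightens the band and flags more values.
    return [value for value in recent if abs(value - center) > deviations * spread]


history = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
print(outliers(history, [2.5, 5.5, -1.5]))
print(outliers(history, [2.5, 5.5], deviations=4.0))
```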

Name the condition and select Create condition.

In this tutorial, we dove deep into what is important to alert on for any application. We created basic alerting. We then dove deep into querying our data using New Relic Insights. We created custom dashboards to visualize our data. And last, we created custom alerting on key parts of our application.
