Scaling Tailored Observability: Dynamic dashboard creation using NewRelic NerdGraph

Published in Ancestry Product & Technology, Jan 3, 2024

If we compare developing software to building an airplane, then running the software without good observability and monitoring is like flying the airplane blind. Needless to say, flying blind is dangerous and potentially destructive. Why would anyone not want dashboards and alerts to help them diagnose whatever issue mysteriously wakes them up in the middle of the night?

At Ancestry, we use NewRelic as our observability tool and monitor our applications’ health and performance with NewRelic APM. We have numerous alerts and dashboards set up for quickly identifying and drilling down into issues. Until now, we created NewRelic dashboards either through the UI or with Terraform. Recently, we automated dynamic dashboard creation for our business events using NewRelic’s NerdGraph, so that observability scales with the number of events. This has boosted our confidence in handling mishaps with our business events.

Brainstorm and design the solution:

The dashboards need to update dynamically based on schema registry changes. However, not every schema change requires a dashboard change; only changes to certain attributes do. So we let the analysts tag the attributes of interest as monitored attributes. We also needed a way to mark some common attributes as monitored by default. Any change to a monitored attribute in the schema registry needs to result in a corresponding dashboard change.
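The rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the schema shape (an `attributes` mapping with a `monitored` tag), the `DEFAULT_MONITORED` set, and both function names are hypothetical.

```python
# Hypothetical defaults: attributes treated as monitored whenever a schema has them.
DEFAULT_MONITORED = frozenset({"stackId", "platformType", "schemaVersion"})


def monitored_attributes(schema: dict) -> set:
    """Return attributes tagged as monitored plus any default ones in the schema."""
    attrs = schema.get("attributes", {})
    tagged = {name for name, meta in attrs.items() if meta.get("monitored")}
    defaults = {name for name in attrs if name in DEFAULT_MONITORED}
    return tagged | defaults


def needs_dashboard_update(old_schema: dict, new_schema: dict) -> bool:
    """Only a change in the monitored set should trigger a dashboard change."""
    return monitored_attributes(old_schema) != monitored_attributes(new_schema)
```

Adding an untagged attribute leaves the monitored set unchanged, so no dashboard update is needed; tagging a new attribute as monitored changes the set and triggers one.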

Implementation Details:

As all our resources are in AWS Cloud, we publish schema create, schema delete and monitored attribute change events to an AWS SQS queue. NerdGraph is NewRelic’s GraphQL API, and for our use case it can be leveraged to create and modify NewRelic widgets, pages and dashboards.

Dashboard creation/update:

SQS listeners in our monitoring microservice check whether a dashboard for the given schema exists, using the entitySearch query under NerdGraph’s actor. If no dashboard exists, the listener creates a new dashboard with its pages and widgets using the dashboardCreate mutation. If the dashboard exists, updates are performed using the dashboardUpdate mutation.
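The create-or-update decision can be sketched as follows. This is only an outline under assumptions: `find_guid`, `create` and `update` are hypothetical callables standing in for the entitySearch lookup and the two dashboard mutations, injected so the flow can be exercised without calling NerdGraph.

```python
from typing import Callable, Optional


def sync_dashboard(
    schema_name: str,
    dashboard: dict,
    find_guid: Callable[[str], Optional[str]],
    create: Callable[[dict], None],
    update: Callable[[str, dict], None],
) -> str:
    """Create the schema's dashboard if none exists, otherwise update it."""
    guid = find_guid(schema_name)  # stand-in for the entitySearch lookup
    if guid is None:
        create(dashboard)          # stand-in for the create mutation
        return "created"
    update(guid, dashboard)        # stand-in for the dashboardUpdate mutation
    return "updated"
```

Keeping the NerdGraph calls behind plain callables makes the branch easy to unit test with in-memory fakes.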

We designed several pages for each schema’s dashboard. The first page gives overall statistics for the schema (count, validation errors, schema versions triggered, etc.). The second page drills down by faceting the counts on each of the monitor fields, as those are of interest to our analysts. Widgets are added by looping through our comprehensive set of overall stats queries and the monitor fields.
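A minimal sketch of that loop, assuming the widget shape used in the create mutation example in this post. The helper names `widget` and `build_pages` are hypothetical, and the list of overall stats queries is trimmed to two for brevity.

```python
def widget(title: str, viz: str, nrql: str, account_id: int) -> dict:
    """Build one widget in the shape used by the dashboard input."""
    return {
        "id": None,
        "title": title,
        "visualization": {"id": viz},
        "rawConfiguration": {
            "nrqlQueries": [{"accountIds": [account_id], "query": nrql}]
        },
        "linkedEntityGuids": [],
    }


def build_pages(metric: str, monitor_fields: list, account_id: int) -> list:
    """Return the overall stats page and the per-attribute drill-down page."""
    # Trimmed, illustrative subset of the overall stats queries.
    overall_stats = [
        ("Successful events triggered", "viz.line",
         f"SELECT sum({metric}) FROM Metric TIMESERIES"),
        ("Schema versions that the event is triggered with", "viz.pie",
         f"SELECT sum({metric}) FROM Metric FACET schemaVersion"),
    ]
    stats_widgets = [widget(t, v, q, account_id) for t, v, q in overall_stats]
    # One faceted widget per monitor field.
    attribute_widgets = [
        widget(f"Monitor {field}", "viz.table",
               f"SELECT sum({metric}) FROM Metric FACET {field}", account_id)
        for field in monitor_fields
    ]
    return [
        {"guid": None, "name": "Event Stats", "widgets": stats_widgets},
        {"guid": None, "name": "Event Attributes Stats", "widgets": attribute_widgets},
    ]
```

Because the pages are generated from the list of monitor fields, adding or removing a monitored attribute changes the second page without any hand editing.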

Example entitySearch request made to NerdGraph:

{
  actor {
    account(id: $NEWRELIC_ACCOUNT_ID) {
      id
    }
    entitySearch(query: "domainType IN ('VIZ-DASHBOARD') AND name LIKE '$SCHEMA_NAME'") {
      results {
        entities {
          guid
          name
        }
      }
    }
  }
}
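For readers who want to send this search from code, here is a hedged sketch using only the Python standard library. It assumes the US NerdGraph endpoint and the `API-Key` header; the function name, the GraphQL variable `$search` and the omission of the account lookup are choices made for this illustration. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: US region endpoint; EU accounts use a different host.
NERDGRAPH_URL = "https://api.newrelic.com/graphql"

SEARCH_QUERY = """
query ($search: String!) {
  actor {
    entitySearch(query: $search) {
      results { entities { guid name } }
    }
  }
}
"""


def build_search_request(schema_name: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the entitySearch request for a schema's dashboard."""
    # schema_name is interpolated as-is; this sketch assumes it has no single quotes.
    search = f"domainType IN ('VIZ-DASHBOARD') AND name LIKE '{schema_name}'"
    body = json.dumps({"query": SEARCH_QUERY, "variables": {"search": search}})
    return urllib.request.Request(
        NERDGRAPH_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "API-Key": api_key},
        method="POST",
    )
```

Passing the request to `urllib.request.urlopen` would execute it; the response JSON carries the matching entities under `data.actor.entitySearch.results.entities`.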

Example create dashboard mutation:

{"query":"mutation CreateDashboard($accountId: Int!, $dashboard: DashboardInput!) { dashboardCreate(accountId: $accountId, dashboard: $dashboard) { entityResult { guid name description accountId createdAt updatedAt owner { email userId __typename } permissions pages { guid name description createdAt updatedAt owner { email userId __typename } widgets { id visualization { id __typename } layout { column row height width __typename } title linkedEntities { guid __typename } rawConfiguration __typename } __typename } variables { name items { title value __typename } defaultValues { value { string __typename } __typename } nrqlQuery { accountIds query __typename } title type isMultiSelection replacementStrategy __typename } __typename } errors { description type __typename } __typename } }","variables":{"accountId":$NEWRELIC_ACCOUNT_ID,"dashboard":{"name":"$SCHEMA_DASHBOARD_NAME","permissions":"PUBLIC_READ_WRITE",

// Event Overall Stats page and its widgets
"pages":[{"widgets":[{"id":null,"visualization":{"id":"viz.line"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric TIMESERIES COMPARE WITH 1 day ago"}]},"title":"Successful events triggered","linkedEntityGuids":[]},
{"id":null,"visualization":{"id":"viz.billboard"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric FACET stack"}]},"title":"Events count across various microservices","linkedEntityGuids":[]},

{"id":null,"visualization":{"id":"viz.line"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT count(*) FROM $SCHEMA_ERRORS_CUSTOM_EVENT WHERE eventName = '$SCHEMA_NAME' TIMESERIES EXTRAPOLATE"}]},"title":"Total validation errors","linkedEntityGuids":[]},

{"id":null,"visualization":{"id":"viz.table"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT count(*) FROM $SCHEMA_ERRORS_CUSTOM_EVENT WHERE eventName = '$SCHEMA_NAME' FACET errorMessage LIMIT MAX EXTRAPOLATE"}]},"title":"Validation error messages","linkedEntityGuids":[]},

{"id":null,"visualization":{"id":"viz.line"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric FACET stackId TIMESERIES"}]},"title":"Various stacks triggering event","linkedEntityGuids":[]},

{"id":null,"visualization":{"id":"viz.line"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric FACET platformType TIMESERIES"}]},"title":"Various Platform types","linkedEntityGuids":[]},

{"id":null,"visualization":{"id":"viz.pie"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric FACET schemaVersion"}]},"title":"Schema versions that the event is triggered with","linkedEntityGuids":[]}
],"guid":null,"name":"Event Stats"},

// Event Attributes stats page drilling down into the monitor fields in the schema
{"widgets":[{"id":null,"visualization":{"id":"viz.table"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric FACET $MONITOR_FIELD2"}]},"title":"Monitor $MONITOR_FIELD2","linkedEntityGuids":[]},

{"id":null,"visualization":{"id":"viz.line"},"rawConfiguration":{"nrqlQueries":[{"accountIds":[$NEWRELIC_ACCOUNT_ID],"query":"SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric FACET $MONITOR_FIELD4 LIMIT MAX TIMESERIES"}]},"title":"Monitor $MONITOR_FIELD4","linkedEntityGuids":[]}],"guid":null,"name":"Event Attributes Stats"}],"variables":null},"guid":"NotFound"}}

We also added a third page, “Trace Events Stats”, that shows counts of the NewRelic synthetic tests triggered every 15 minutes for each of these schemas across the various microservices in our pipeline. Why and how these synthetic tests were automated will be covered in detail in the next post. Stay tuned :)

Dashboard deletion:

When a schema is deleted, we look up the associated dashboard with the entitySearch query and fire the dashboardDelete mutation to clean up the dashboard associated with the deleted schema.

Example dashboardDelete mutation:

mutation {
  dashboardDelete(guid: "$GUID") {
    status
    errors {
      description
      type
    }
  }
}
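The lookup-then-delete step can be sketched like this. It is an illustration under assumptions: the response shape follows the entitySearch example earlier in this post, the helper names are hypothetical, and the exact-name filter is an extra safeguard added here because a LIKE search may return more than one dashboard.

```python
DELETE_MUTATION = """
mutation ($guid: EntityGuid!) {
  dashboardDelete(guid: $guid) {
    status
    errors { description type }
  }
}
"""


def matching_guids(response: dict, dashboard_name: str) -> list:
    """Pick GUIDs of dashboards whose name matches exactly from a search response."""
    node = response
    for key in ("data", "actor", "entitySearch", "results"):
        node = (node or {}).get(key)  # tolerate missing or null levels
    entities = (node or {}).get("entities") or []
    return [e["guid"] for e in entities if e.get("name") == dashboard_name]


def build_delete_payload(guid: str) -> dict:
    """Build the GraphQL request body for deleting one dashboard."""
    return {"query": DELETE_MUTATION, "variables": {"guid": guid}}
```

Each payload would be posted to NerdGraph in the same way as the search request; checking `errors` in the mutation result tells whether the delete succeeded.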

NewRelic Dimensional metrics vs custom events:

We used NewRelic dimensional metrics for keeping counts, as NewRelic custom events are sampled and could not give us accurate counts. Dimensional metrics come with cardinality limits per account and per metric. When these cardinality limits are reached, aggregated data is turned off for the rest of the UTC day. So we have to be careful to keep only the dimensions we actually need. We used the monitor fields as dimensions.
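To see why each extra dimension matters, note that the number of unique dimension combinations for a metric is bounded by the product of the distinct values of each dimension. The sketch below computes that worst case; the function name and the counts are made up for illustration, and no specific NewRelic limit is implied.

```python
from math import prod


def worst_case_cardinality(distinct_values: dict) -> int:
    """Upper bound on unique dimension combinations for a single metric."""
    return prod(distinct_values.values())


# Illustrative counts only: a few low-cardinality dimensions stay small...
small = worst_case_cardinality({"stackId": 5, "platformType": 3, "schemaVersion": 4})
# ...while one high-cardinality dimension multiplies the bound.
large = worst_case_cardinality(
    {"stackId": 5, "platformType": 3, "schemaVersion": 4, "userId": 100_000}
)
print(small, large)
```

With these example numbers the bound grows from 60 to 6,000,000 combinations, which is why identifiers with many distinct values make poor dimensions.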

Example of a dimensional metric query:

SELECT sum($SCHEMA_DIMENSIONAL_METRIC) FROM Metric FACET schemaVersion

However, for validation errors we use NewRelic custom events with all the event attributes, as having all the attributes helps with debugging errors. Adding EXTRAPOLATE to the end of a query that uses aggregate functions helps compensate for event sampling, though it does not eliminate its effects completely. Both data types are helpful depending on the use case.

Example of a custom event query:

SELECT count(*) FROM $SCHEMA_ERRORS_CUSTOM_EVENT WHERE eventName = '$SCHEMA_NAME' FACET errorMessage LIMIT MAX EXTRAPOLATE

Here are some sample screenshots of the dashboards that are created.

Events Stats Page showing overall stats:

Event Attribute Stats Page drilling down on various monitor attributes:

Conclusion:

We have close to 500 dashboards created already, and as schemas evolve every day, the dashboards stay up to date. This approach has helped us scale up the observability of our business events.

If you’re interested in joining Ancestry, we’re hiring! Feel free to check out our careers page for more info. Also, please see our Medium page to see what Ancestry is up to and read more articles like this.