Observability-Driven Development

Prioritize product monitoring features and build operations-focused engineering teams on day one.

Paul Kirk
Slalom Build
7 min read · Jan 26, 2023



Engineering teams who build and run the systems they own are accountable for a level of service availability that their customers and stakeholders depend upon. The pressures of deadlines, market demands, and business initiatives more often than not push operations work down the priority list. But perhaps you should consider flipping the script by engineering observability solutions earlier and building an operations mindset from day one. Your on-call teams will thank you.

What is observability-driven development?

Observability-driven development is a code-first approach that balances product monitoring with product feature development. With test-driven development (TDD), the expected behaviors of features are first codified in unit tests, which then pass as the production code is written. Similarly, observability-driven development (ODD) codifies the monitoring signals sooner rather than later: endpoint telemetry surfaces as soon as product features start shipping.

Scenario

Let’s suppose we have finished a discovery on an exciting new business concept: LaundryDash. During the requirements gathering, the team identifies a number of services needed to support the product rollout such as Order, Payment, Pickup, Delivery, and Billing API endpoints. Many of the details and workflows still need to be worked out but there’s enough information to start shipping the code.

The basic operating behavior of an endpoint is its health check. The team starts scaffolding the endpoints, provisions its APM tool of choice, New Relic, and starts designing the cloud infrastructure. The team agrees three environments are needed: devtest, preview, and production.

The New Relic instance has been stood up quickly, but there’s no integration yet with the endpoints or the cloud infrastructure, as they are still under development. That’s okay. The monitoring dashboards can be delivered today for all endpoints in all environments. Although the dashboards have no telemetry to report on, they will light up once the endpoints start shipping. This is an example of observability-driven development and shifting operations “left” in the development lifecycle.

Endpoint telemetry surfaces when product features start shipping.


ODD design principles

The team prioritizes the monitoring work and adheres to these dashboard design principles:

  • Endpoint monitoring is easy to configure and onboard.
  • Endpoint monitoring is tunable.
  • Dashboard layout is scalable, repeatable, and convention-based.

ODD golden signals

The golden signals of endpoint monitoring provide the simplest and fastest approach for standing up dashboards. When used together, these signals form a repeatable convention for rapidly instrumenting new API endpoints. The team focuses on:

  • Errors: A measurement of failing requests over a time period.
  • Latency: A measurement of the time taken to serve web requests over a time period.
  • Transactions: A measurement of the request rate faced by an endpoint over a time period.
  • Saturation: A measurement of capacity and the ability to meet increased demand over a time period.

Saturation may require some invention and complex queries to understand the capacity constraints of a system. Apdex, an industry-defined user-satisfaction score, can be used as a reasonable substitute.


New Relic working example

Let’s explore a code-first approach for designing and building monitoring solutions earlier in the development process. The following working example is built on top of New Relic’s Terraform provider. We’ll dive into the design principles, but first, a few points before we start:

Requirements & qualifiers

  • A New Relic account is available. If not, create a free account here.
  • The example uses local Terraform state in lieu of remote state management.
  • Install the Terraform CLI and validate it from a bash shell:
terraform --version
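
Before the dashboards, the provider itself needs wiring up. Here is a minimal sketch of what that configuration might look like, assuming the account ID and user API key are supplied through hypothetical newrelic_account_id and newrelic_api_key variables (they can also come from the NEW_RELIC_ACCOUNT_ID and NEW_RELIC_API_KEY environment variables); the version pin is illustrative:

terraform {
  required_providers {
    newrelic = {
      source  = "newrelic/newrelic"
      version = "~> 3.0" // illustrative pin, not from the working example
    }
  }
}

provider "newrelic" {
  account_id = var.newrelic_account_id // hypothetical variable holding your account ID
  api_key    = var.newrelic_api_key    // hypothetical variable holding a user API key (NRAK-...)
  region     = "US"                    // or "EU"
}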

Design principle #1

Endpoint monitoring is easy to configure and onboard.

We’ll use a data-driven approach for describing each of the API’s golden signals with Terraform’s map(any) variable object type. This snippet includes the specification for the LaundryDash Order API latency signal:

variable "api_golden_signals" {
type = map(any)
description = "Describes golden signals configuration for endpoints supporting laundryDash "
}

api_golden_signals = {
order_api = [
{
"metric" = "Latency",
"endpoint_name" = "Order API",
"row" = "1",
"col" = "1",
"thresh_warn" = "3",
"thresh_error" = "5",
"nrql_bill" = "select percentile(duration, 90) as `P90 Latency` from Transaction where appName = 'laundry-dash-order-prod' since 3 hours ago",
"nrql_trend" = "select percentile(duration, 90) as `P90 Latency` from Transaction where appName = 'laundry-dash-order-prod' since 3 hours ago compare with 1 day ago timeseries auto"
},
{...}]
}
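
Circling back to the Apdex substitution mentioned under the golden signals, a Saturation entry can be described in the same shape as the elided specs above. This is a hypothetical sketch: the 0.5-second Apdex target, the placement, and the threshold values are assumptions, and because a higher Apdex score is better, the billboard warning/critical semantics may need to be adjusted or omitted.

{
  "metric"        = "Saturation",
  "endpoint_name" = "Order API",
  "row"           = "1",
  "col"           = "10",
  "thresh_warn"   = "0.85", // assumption: Apdex is higher-is-better, so tune or drop these
  "thresh_error"  = "0.7",
  "nrql_bill"     = "select apdex(duration, t: 0.5) from Transaction where appName = 'laundry-dash-order-prod' since 3 hours ago",
  "nrql_trend"    = "select apdex(duration, t: 0.5) from Transaction where appName = 'laundry-dash-order-prod' since 3 hours ago compare with 1 day ago timeseries auto"
}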

Design principle #2

Endpoint monitoring is tunable.

The baseline operating thresholds may not be known right away, so having the ability to quickly change these values is important for faster operations feedback. Lower environments may have orders of magnitude less traffic than production, so threshold tuning is trickier.

The code snippet above includes two threshold parameters: thresh_warn and thresh_error. These parameters configure traffic-light “billboards” that display green (okay), yellow (warning), and red (critical). The same thresholds are suitable for building alerting policies after the baseline thresholds have soaked for a while in production.

The NRQL queries may need further tweaking, so keeping them in their raw, native form within each signal specification makes them easier to change. The signal specification is intentionally not DRY in favor of testing, feedback, and time to first signal.
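
To show how those soaked-in thresholds might eventually feed alerting, here is a hedged sketch of a NRQL alert condition for the Order API latency signal. The policy name, durations, and operator choices are assumptions and not part of the working example:

resource "newrelic_alert_policy" "laundry_dash_prod" {
  name = "LaundryDash API (PROD)" // hypothetical policy name
}

resource "newrelic_nrql_alert_condition" "order_api_latency" {
  policy_id = newrelic_alert_policy.laundry_dash_prod.id
  name      = "Order API P90 Latency"
  type      = "static"

  nrql {
    query = "select percentile(duration, 90) from Transaction where appName = 'laundry-dash-order-prod'"
  }

  // reuse the billboard thresholds once they have proven themselves in production
  warning {
    operator              = "above"
    threshold             = 3 // mirrors thresh_warn
    threshold_duration    = 300
    threshold_occurrences = "all"
  }

  critical {
    operator              = "above"
    threshold             = 5 // mirrors thresh_error
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}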


Design principle #3

Dashboard layout is scalable, repeatable, and convention-based.

One of the nice things about New Relic’s dashboard layout designer is its use of constrained width. This allows us to programmatically align golden signals left to right for each API endpoint and scale the growth of each new API vertically on the dashboard. The New Relic Terraform provider exposes a page object which supports programmatic grouping of other widgets under their own tabs.

Golden Signals Dashboard Wireframe

The row and col attributes from the variable snippet under Design principle #1 are configured to lay out the billboard widgets on odd rows. Shortly, we’ll see how the odd-row offset is used programmatically to create the trend/line graphs for each billboard widget on the even rows.

Terraform’s for_each provides the means of building a repeatable, convention-based dashboard that requires no further code changes as new specifications are added to the api_golden_signals variable. First, however, we’ll need to project, or flatten, our configuration into a usable form. In this snippet, we walk the key-value objects and project them into a flattened, ordered list.

locals {
  widget_width  = 3
  widget_height = 3

  // flatten the per-API signal configs defined in prod.tfvars into a single list
  // so they can be used with a declarative for_each
  api_signals = flatten([
    for k, signals in var.api_golden_signals : [
      for s in signals : {
        metric        = s.metric
        endpoint_name = s.endpoint_name
        row           = s.row
        col           = s.col
        warn          = s.thresh_warn
        error         = s.thresh_error
        nrql_bill     = s.nrql_bill
        nrql_trend    = s.nrql_trend
      }
    ]
  ])
}

The flattened api_signals list can now be used with a declarative for_each. Using dynamic blocks, each billboard widget is created on an odd row, followed by a line widget on the even row beneath it. Note how the even-row calculation increases the current row value by one. Inside a dynamic block, a fully qualified reference must be used, for example widget_billboard.value.error: value is the iterator attribute that exposes the current list element through dot syntax.

In the code snippet below, we are creating four billboard signals, left to right, on odd rows followed by supporting trend widgets, left to right, on even rows. The code dynamically picks up new configurations found in the api_golden_signals variable.

resource "newrelic_one_dashboard" "dash_api_prod" {
name = "Laundry Dash Monitoring (PROD)"
page {
name = "Laundry Dash API Monitors (PROD)"

// billboards
dynamic "widget_billboard" {
for_each = local.api_signals
content {
title = "${widget_billboard.value.endpoint_name} (${widget_billboard.value.metric})"
row = widget_billboard.value.row // odd rows for billboards
column = widget_billboard.value.col
width = local.widget_width
height = local.widget_height
warning = widget_billboard.value.warn
critical = widget_billboard.value.error

nrql_query {
query = widget_billboard.value.nrql_bill
}
}
}

// line trends
dynamic "widget_line" {
for_each = local.api_signals
content {
title = "${widget_line.value.endpoint_name} Trend (${widget_line.value.metric})"
row = (1 + widget_line.value.row) // even rows for trends
column = widget_line.value.col
width = local.widget_width
height = local.widget_height

nrql_query {
query = widget_line.value.nrql_trend
}
}
}
} // page
}
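
One small, optional addition: the newrelic_one_dashboard resource exports a permalink attribute in current provider versions, so the dashboard link can be surfaced after apply (this output is not part of the original example):

// prints a direct link to the dashboard after terraform apply
output "laundry_dash_dashboard_url" {
  description = "Permalink to the LaundryDash golden signals dashboard"
  value       = newrelic_one_dashboard.dash_api_prod.permalink
}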

The baseline operating thresholds may not be known right away, so having the ability to quickly change these values is important for faster operations feedback.


Putting it all together

The full implementation of this example can be found here. Your secrets will need to be configured here before executing the Terraform steps.

Terraform steps

cd {pwd}/observability/terraform/stacks/dash-endpoints

# initialize tf providers
terraform init

# create and select the prod workspace
terraform workspace new prod
terraform workspace select prod

# validate selected workspace
terraform workspace list

# plan
terraform plan -var-file=./configurations/prod.tfvars -out=laundry-dash.tfplan

# apply
# skip this step if you don't have a New Relic account - the plan output
# is reviewable in the console
terraform apply laundry-dash.tfplan

Result

Here is our dashboard. Nicely formatted and ready to surface our telemetry after the applications and services ship.

LaundryDash Golden Signals API Dashboard (widgets reporting no data)

Conclusion

Observability-driven development builds an operations mindset from day one by codifying operations work as early as possible in the development process. Early-career engineers who have little exposure to running production systems will gain valuable insight into how operations teams run and monitor deployed code. Teams who explicitly recognize operations work, along with the non-functional requirements needed to deliver it, will improve their delivery estimates. And your teams will be proactively armed with the monitoring tools they need to successfully operate products and services in production.


Paul Kirk
Slalom Build

Paul Kirk is a product software veteran from Seattle, WA. He is a music lover, vinyl enthusiast, sci-fi reader, and a huge fan of UW football.