Creating monitors in a generic, dynamic, and automated way

Datadog plus Terraform: a winning combination

Deepak Shivaji Patil
Globant
8 min read · Mar 21, 2023


Photo by Luke Chesser on Unsplash

In this article, we will learn how to create Datadog monitors/alerts in a generic, dynamic, and automated way using a CSV file and Terraform.

We will also give an overview of the tools used in the implementation. We will use the infrastructure as code (IaC) tool Terraform to codify our monitor creation infrastructure, and we will look at the need for and benefits of an IaC tool. The article is hands-on, with code snippets that we refer to throughout the implementation.

So, let’s get started.

Background

Monitoring has become an important pillar of software architecture design. Broadly, it refers to the process of becoming aware of the state of a system. This is done in two ways: proactive and reactive. The former involves watching visual indicators, such as time series and dashboards. The latter involves automated ways of delivering notifications that bring a significant change in the system's state to operators' attention; these are usually referred to as alerts.

System downtimes are painful for any business in terms of user retention, credibility, and revenue.

At the end of the day, the data from infrastructure monitoring helps a business plan for client demands, fulfill Service-Level Agreement (SLA) requirements, and meet client expectations.

Overview

Let's briefly go over the basics of each tool we will use.

Datadog

Image source — https://www.datadoghq.com/

Datadog is an observability service for cloud-scale applications, providing monitoring of servers, databases, tools, and services through a SaaS-based data analytics platform.

Datadog is a monitoring, logging, and security platform for cloud applications. It provides end-to-end traces, metrics, and logs to make infrastructure, applications, and third-party services entirely observable. These capabilities help businesses avoid downtime, secure their systems, and ensure customers get the best user experience. It's a licensed tool.

Datadog monitors

Image — Datadog monitor icon

Monitors allow us to watch a metric or check that we care about and notify our team when a defined threshold has been exceeded. When it comes to infrastructure monitoring, it's of the utmost importance to know when critical changes are occurring.

Datadog monitors provide the ability to check metrics, service health, and HTTP endpoints.

Via monitors, we can configure checks, notify our teams, and manage alerts at a glance on Datadog's alerting platform.
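For example, a metric monitor can alert when the average usable memory on a host drops below a threshold. The query below is the one used later in this article's example CSV:

```
avg(last_5m):avg:system.mem.pct_usable{*} by {host} < 0.2
```

It evaluates the average of `system.mem.pct_usable` over the last 5 minutes, grouped per host, and triggers when the value falls below 0.2 (i.e., less than 20% of memory is usable).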

Available options to create Datadog monitors

There are multiple types of monitors available. The easiest way to create them is via the Datadog web UI.

Refer to the Datadog documentation for creating monitors via the web UI.

This approach is manual, and it becomes tedious when there are hundreds of monitors to create and manage.

So, let’s explore now why and how we should codify our alerting infrastructure.

Terraform

Image source — https://www.terraform.io/

Terraform, from the HashiCorp family, is an infrastructure as code (IaC) tool. It lets you define cloud and on-prem resources in human-readable configuration files that you can reuse, share, and version control, which helps maintain consistency across the infrastructure. Terraform can manage components like compute, storage, and networking resources, as well as DNS entries and SaaS features.

Why use Terraform?

We already know that Terraform works great as an IaC tool for creating infrastructure on cloud platforms like AWS, Azure, etc., in an automated way.

Terraform can also help us create Datadog components in an automated way.

A few advantages of using Terraform are:

  • Create and manage Datadog monitors at scale
  • Reuse code by keeping it generic via modules
  • Codify the alerting infrastructure and version control it
  • Roll out monitor changes across environments securely, safely, and in a controlled way
  • Recreate the alerting infrastructure quickly in case of a complete disaster

Problem

We will specifically talk about the problem of managing lots of Datadog monitors at scale. As infrastructure grows, the scope of monitoring grows in proportion as well.

So creating, modifying, maintaining, and deploying alerts on the data ingested from various infrastructure sources into centralized systems like Datadog becomes paramount.

Moreover, there needs to be a way to involve non-technical business SMEs, such as business analysts, who do not need to know how the alerting setup is actually implemented, but whose expert input on the system's SLAs, SLOs, and SLIs should be taken into account when setting up the alerts.

Solution

We will address the above concerns with the help of Terraform and a commonly used file format: the comma-separated values (CSV) file.

We will store all the information required to create Datadog monitors in a CSV file. The columns represent the actual fields of a Datadog monitor, such as monitor name, query, threshold levels, re-notification settings, etc., and we can add as many rows as we need for different monitors.

We will use Terraform to create the monitors in Datadog. We will write Terraform scripts that read the CSV file row by row and create a monitor in Datadog for each row.
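The core of this approach is Terraform's built-in csvdecode function combined with for_each. Here is a minimal sketch (the file name monitors.csv and the trimmed-down attribute list are for illustration only; the full code follows in the implementation steps):

```hcl
locals {
  # csvdecode turns each CSV row into a map keyed by column name
  monitors = csvdecode(file("${path.module}/monitors.csv"))
}

resource "datadog_monitor" "example" {
  # One monitor per row, keyed by the row's id column
  for_each = { for m in local.monitors : m.id => m }

  name    = each.value.name
  type    = each.value.type
  query   = each.value.query
  message = each.value.message
}
```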

Prerequisite

  1. Access to an active Datadog account.
  2. A workstation with the latest version of Terraform installed.
  3. Understanding of Terraform.
  4. Any text editor to edit the CSV file.

Implementation Steps

We will implement the solution in three steps.

  1. Prepare the CSV template file
  2. Create Terraform configuration scripts
  3. Run Terraform scripts

The basic folder structure looks like this:

RootFolder
├── application
│   └── application_name
│       ├── stub_files
│       │   └── csv_generic_template_stub.csv
│       ├── variables.tf
│       ├── create_monitors.tf
│       └── csv_generic_template.tpl
└── modules
    ├── variables.tf
    └── main.tf

1. Prepare the CSV template file

1.1 Create a CSV template file named csv_generic_template.tpl with the column names below:

id,name,type,message,escalation_message,query,monitor_thresholds_critical,monitor_thresholds_warning,notify_no_data,priority,notify_audit,timeout_h,include_tags,tags,

2. Create Terraform configuration scripts

2.1 Define variables.tf file

variable "monitors" {
  default = ""
}

variable "datadog_api_key" {
  default = ""
}

variable "datadog_app_key" {
  default = ""
}

2.2 Define create_monitors.tf file

# Define a data source for the template file

data "template_file" "csv" {
  template = file("${path.module}/csv_generic_template.tpl")
}

# Render the template into a stub CSV file using the local_file resource

resource "local_file" "generate_csv" {
  content  = data.template_file.csv.rendered
  filename = "${path.module}/stub_files/csv_generic_template_stub.csv"
}

# Decode the generated CSV file into a local value named 'monitors'.
# Note: file() is evaluated at plan time, so the stub file must already
# exist; this is why the very first run needs two applies (see step 3).

locals {
  monitors = csvdecode(file("${path.module}/stub_files/csv_generic_template_stub.csv"))
}

# Call the generic module and pass the required variables along with local.monitors

module "create_monitor" {
  source          = "../../modules"
  datadog_api_key = var.datadog_api_key
  datadog_app_key = var.datadog_app_key
  monitors        = local.monitors
}
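A brief aside on supplying the API and application keys: rather than hard-coding values in variables.tf, one option is to export them as environment variables, which Terraform automatically maps to input variables via the TF_VAR_ prefix. The placeholder values below are illustrative; substitute your real keys:

```shell
# Terraform maps TF_VAR_-prefixed environment variables to the input
# variables var.datadog_api_key and var.datadog_app_key.
# The placeholder values below are illustrative; use your real keys.
export TF_VAR_datadog_api_key="<your-datadog-api-key>"
export TF_VAR_datadog_app_key="<your-datadog-app-key>"
```

This keeps secrets out of version control, which matters once the alerting code is checked into a repository.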

2.3 Create the generic module containing the files below

  • variables.tf

variable "monitors" {
  default = ""
}

variable "datadog_api_key" {
  default = ""
}

variable "datadog_app_key" {
  default = ""
}
  • main.tf
# Define the required providers block

terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

# Initialize the provider block.
# Note: the Datadog provider's arguments are named api_key and app_key.

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}

# Create a new Datadog monitor.
# We loop through the variable named 'monitors' to create as many monitors as are defined in the CSV file.

resource "datadog_monitor" "generic_monitor" {
  for_each = { for monitor in var.monitors : monitor.id => monitor }

  name               = each.value.name
  type               = each.value.type
  query              = each.value.query
  message            = each.value.message
  escalation_message = each.value.escalation_message

  monitor_thresholds {
    critical = each.value.monitor_thresholds_critical
    warning  = each.value.monitor_thresholds_warning
  }

  notify_no_data = each.value.notify_no_data
  priority       = each.value.priority
  notify_audit   = each.value.notify_audit
  timeout_h      = each.value.timeout_h
  include_tags   = each.value.include_tags
  tags           = split(",", each.value.tags)
}

3. Run Terraform scripts

Once we have all the above script files ready, it’s time to run the scripts and apply the changes.

Before we run the scripts, let's update the csv_generic_template.tpl file to add a high memory usage alert definition.

After adding the row, the file will look like this:

id,name,type,message,escalation_message,query,monitor_thresholds_critical,monitor_thresholds_warning,notify_no_data,priority,notify_audit,timeout_h,include_tags,tags,
1,High Memory Usage for the host {{host.name}},query alert,High Memory Usage for the host {{host.name}},,"avg(last_5m):avg:system.mem.pct_usable{*} by {host} < 0.2",0.2,0.3,false,1,false,0,true,"application:infra,severity:S1,type:high_memory_usage_alert",
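To illustrate how this scales, adding a CPU usage monitor would be just one more row. The metric name, thresholds, and tags below are illustrative assumptions, not part of the original example (system.cpu.user is reported as a percentage, so the thresholds are 90/80 rather than fractions):

```
2,High CPU Usage for the host {{host.name}},query alert,High CPU Usage for the host {{host.name}},,"avg(last_5m):avg:system.cpu.user{*} by {host} > 90",90,80,false,1,false,0,true,"application:infra,severity:S1,type:high_cpu_usage_alert",
```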

3.1 Run Terraform plan

Open a terminal and navigate to the application folder containing the create_monitors.tf file. First, run terraform init to initialize the working directory.

Image 3.1.1 terraform init output

Then run the terraform plan command. It will show output like the following.

Image 3.1.2 terraform plan output

The above output shows a plan to create the CSV file. Go ahead and run the terraform apply command to create it.

Image 3.1.3 terraform apply output

3.2 Run Terraform apply

Now, run the terraform apply command again.

This time, Terraform will refer to the CSV file generated in the first terraform apply and will create a monitor in Datadog.

Note: We need to run terraform apply twice only on the very first run. As stated, the first apply creates the CSV file based on the template file, and the second apply uses the CSV file to create/deploy the monitors in Datadog. (Alternatively, the first run can target just the stub file: terraform apply -target=local_file.generate_csv.)

The output of the second terraform apply looks like this.

Image 3.2 terraform apply output

Validation

Now, head to the Datadog UI and navigate to the ‘Monitors’ section.

We should see a monitor created like the one below.

Image: Datadog monitor

We have now codified our monitor creation infrastructure. We can use the same set of scripts to modify deployed monitors and add new ones as we like: we just add more rows to the template file; that's it.

No need to write additional code.

We can create monitors for CPU usage, memory usage, disk usage, or any other metric available in Datadog, to name just a few.

Summary

In this article, we addressed the problem of creating, modifying, maintaining, and deploying lots of Datadog monitors at scale. We learned how to automate monitor creation in Datadog using Terraform and CSV-based templates, and we covered the basics of the tools used, such as Datadog and Terraform, while implementing the solution. We leveraged a comma-separated values file to define our monitors in a text-based format. Using Terraform, we automated the deployment of Datadog monitors, achieving IaC by codifying the monitor creation infrastructure. We also validated the results of the scripts by navigating to the Datadog UI. The generic code we implemented helps us avoid duplication, keeping the code clean.

Hope this article has helped you by adding value to your knowledge.

Happy learning.


Deepak Shivaji Patil
Globant

Working as a DevOps Tech Lead. More than 11 years of experience in architecting, building, and deploying enterprise-grade applications on public cloud, mainly AWS.