Managing Datadog by hand is an incredibly tedious and error-prone process.
Writing your own automated tool against Datadog’s API would pay dividends, but building and maintaining one is a hefty price to pay. Terraform gives you that automation out of the box.
Ways Terraform can help you manage Datadog
- Make reliable, sweeping changes across all your Datadog components in seconds. Whether you have one, ten, or a hundred monitors, a Terraform run takes about ten seconds.
- Track and review changes to your Datadog setup in GitHub pull requests, just as you would a code change. And edit your whole Datadog setup with your favorite text-editing tools.
- Template your alert messages and reuse queries across monitors and timeboards. For example, give your monitors a consistent footer (shown in code below) with links to your Datadog dashboards and your Kibana logs, so your on-call person can jump straight into fixing the issue.
- Set up a new Datadog account for a new environment, with all your monitors and timeboards, in seconds. You might start with a single Datadog account for dev, for example, then create a separate one for prod that needs the same provisioning.
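The footer templating mentioned above can be sketched like this in HCL. The monitor name, query, threshold, and URLs are all placeholder assumptions; swap in your own dashboard and Kibana links:

```hcl
# A shared footer defined once and interpolated into every monitor's message.
# The URLs below are placeholders, not real links.
variable "alert_footer" {
  type    = string
  default = <<-EOT
    Dashboards: https://app.datadoghq.com/dashboard/your-dashboard-id
    Logs: https://kibana.example.com/your-index
  EOT
}

resource "datadog_monitor" "cpu_usage" {
  name    = "cpu_usage"
  type    = "metric alert"
  query   = "avg(last_5m):avg:system.cpu.user{*} by {host} > 0.9"
  message = "CPU usage is above 90% on {{host.name}}.\n${var.alert_footer}"
}
```

Change the footer once and every monitor's alert message picks it up on the next apply.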
How do you use Terraform?
You write your configuration in HCL (HashiCorp Configuration Language). You describe the components you want, for example: “I want a Datadog monitor named disk_usage, on the metric system.disk.in_use, alerting at 85% and above.” Terraform then figures out and runs whatever changes are needed on your Datadog account to reach that desired state.
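The monitor described in that sentence might look like the following sketch. The alert message and notification handle are illustrative assumptions:

```hcl
# "A monitor named disk_usage, on system.disk.in_use, alerting at 85% and above."
# The threshold lives in the query itself; the @pagerduty handle is a placeholder.
resource "datadog_monitor" "disk_usage" {
  name    = "disk_usage"
  type    = "metric alert"
  query   = "avg(last_5m):avg:system.disk.in_use{*} by {host} > 0.85"
  message = "Disk usage is above 85% on {{host.name}}. @pagerduty"
}
```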
And while HCL is a declarative language, it’s dynamic. It provides variables; counts; conditionals; types such as lists and maps; built-in functions like format (to format strings), join (to join list items with a delimiter), and element (to access list items); and modules. So you don’t need a templating language on top of your configuration language to make up for the latter’s shortcomings. With a tool configured by YAML, for example, you often need a templating language like Jinja2, with its conditionals and loops, to generate the working YAML. HCL is all you need.
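A quick sketch of those dynamic features, with made-up variable names (this uses Terraform 0.12+ syntax):

```hcl
# A list-typed variable; the host names are illustrative.
variable "hosts" {
  type    = list(string)
  default = ["web-1", "web-2", "web-3"]
}

# element() indexes into a list, format() builds a string from it.
output "first_host" {
  value = format("first host: %s", element(var.hosts, 0))
}

# join() flattens a list with a delimiter.
output "all_hosts" {
  value = join(", ", var.hosts)
}
```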
You run Terraform with an easy-to-use CLI. Day to day you run a couple of shell commands: plan and apply. Plan has Terraform work out the difference between what you want and what you’ve got in Datadog, and plan the changes needed to reach your desired state. Apply runs those changes against Datadog to get you there.
Example using TF to manage DD
Here’s an example showing how to create monitors and timeboards, using variables to keep alert messages, queries, and thresholds consistent. You can follow along with this example repo: clone it and run it against a trial Datadog account (your own account is fine too, as long as you don’t have monitors or timeboards with the same names).
What’s going on in the script/gist below:
- The script defines two monitors, disk usage and CPU usage. They share a footer for PagerDuty and Slack alerts. The queries used for the monitors are also used in the timeboard graphs, so you can visually confirm your monitors are correct and check each metric’s history against its alert threshold.
- Then the script defines a timeboard with two graphs, one for disk usage and one for CPU usage.
- The script also defines variables so constant values live in one place, rather than being littered all over and having to be dug up when they change.
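The query-sharing idea in the steps above can be sketched as follows. This uses the datadog_timeboard resource from older versions of the Terraform Datadog provider (newer versions replace it with datadog_dashboard), and the names are illustrative, not the actual repo’s code:

```hcl
# One query variable drives both the monitor and the timeboard graph,
# so the graph shows exactly what the monitor alerts on.
variable "disk_query" {
  type    = string
  default = "avg:system.disk.in_use{*} by {host}"
}

# A monitor would interpolate the same variable into its query, e.g.:
#   query = "avg(last_5m):${var.disk_query} > 0.85"

resource "datadog_timeboard" "system" {
  title       = "System"
  description = "Disk and CPU usage"

  graph {
    title = "Disk usage"
    viz   = "timeseries"
    request {
      q = var.disk_query
    }
  }
}
```

When the query changes, the monitor and the graph stay in sync because both read the one variable.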
You can read Terraform’s Datadog provider docs to learn the API’s details.
Use Terraform! The sooner the better
Terraform is a tool that pays compounding dividends. You start with a small investment, such as managing Datadog, then add Kubernetes and AWS. While each can be managed separately, by connecting them together you can compose your service’s infrastructure in a module. Then you can easily spin up all your infrastructure per environment (dev, staging, prod), even per IaaS (AWS and GCP).
Terraform is one of the best ops tools ever made. If you know it’s out there and still don’t use it, you’re a masochist. Even for Datadog alone it’s a big boon.
Please say hi at @travisjeffery.
Hit the 👏 if you found this useful, and feel free to share.
Thanks for reading.