Puppet, Datadog, Google Cloud Platform — recipe for a small outage

Sorin Tudor
METRO SYSTEMS Romania
Mar 27, 2020

The majority of infrastructure engineers are currently living in the age of Cloud and DevOps. New open source software is developed and used on a daily basis, and agile has been transformed from “nice to have” to “must have”.

In order to achieve this level of agility and flexibility in cloud deployments, continuous delivery and deployment pipelines were also a must. What tools are integrated in this process? It depends; let me give you some examples: Gitlab for the code, Puppet and r10k for automation code deployment and, of course, good old Jenkins.
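
As a rough sketch of how that wiring works (the repository URL and module name below are placeholders, not our actual setup): r10k reads a Puppetfile and deploys every module at the Git reference pinned there, so whichever branch or tag ends up in that file is exactly the code that lands on the Puppet agents.

# Puppetfile (sketch) - r10k checks out each module at the pinned Git ref
mod 'datadog_agent',
  :git => 'https://gitlab.example.com/infra/puppet-datadog-agent.git',
  :ref => 'v2.1.0-ourproject'   # branch or tag name; this decides what the nodes receive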

There are also difficulties: besides delivering new functionality, the pipeline needs to stay up and running through all the changes that happen, and in a few cases things go wrong.

The other day we were trying to fix a “bug” that consisted of the apt key being installed on every Puppet agent run.

The main suspect was something in the Datadog module for Puppet, or at least that is what we thought at that moment. Since the code was also used by other teams and we couldn’t make changes on the master branch, I needed to create a separate branch for our project. Unfortunately, I did not pay attention to the fact that we were not using the latest version of the code but a certain older tag, and I ended up branching from master, which carried the latest version of the code.
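
In git terms, this is roughly what went wrong (branch and tag names below are only illustrative, not our real ones):

# What I should have done: branch off the older tag we were actually running
git checkout -b datadog-fix-ourproject v1.9.0

# What I actually did: branch off master, which carried the latest module code
git checkout -b datadog-fix-ourproject origin/master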

After everything was done and a new tag was created and added to the Puppetfile, we started receiving alerts (lots of them) that Zookeeper process monitoring was not working anymore.

The check filter from the Datadog monitor:

"zookeeper.ruok".over("type:kafka","type:zookeeper").exclude("env:prod").by("host","port").last(2).count_by_status()

So it was linked to our error, but how?

It turns out that one of the changes in the latest code of the Datadog module made our datadog.yaml look like this:

Before

#
# MANAGED BY PUPPET
#
---
hostname: [hostname]
api_key: [key]
dd_url: https://app.datadoghq.com
cmd_port: 5001
conf_path: "/etc/datadog-agent/conf.d"
enable_metadata_collection: true
dogstatsd_port: 8125
dogstatsd_socket: ''
dogstatsd_non_local_traffic: false
log_file: "/var/log/datadog/agent.log"
log_level: info
tags:
- vertical:kafka
- env:dev
- type:kafka
apm_config:
  apm_enabled: false
process_config:
  process_enabled: disabled

After

#
# MANAGED BY PUPPET
#
---
api_key: [key]
dd_url: https://app.datadoghq.com
cmd_port: 5001
conf_path: "/etc/datadog-agent/conf.d"
enable_metadata_collection: true
dogstatsd_port: 8125
dogstatsd_socket: ''
dogstatsd_non_local_traffic: false
log_file: "/var/log/datadog/agent.log"
log_level: info
tags:
- vertical:kafka
- env:dev
- type:kafka
apm_config:
  apm_enabled: false
process_config:
  process_enabled: disabled

No hostname in the agent configuration file anymore, but this shouldn’t be a problem after all, right? Wrong!

For internal management and operations purposes, one of the GCP daemons updates each machine with lines in /etc/hosts:

[ip] [instance-name].c.[project_name].internal [instance_hostname] # Added by Google
[metadata_ip] metadata.google.internal # Added by Google

And in /etc/resolv.conf:

domain c.[project_name].internal
search c.[project_name].internal. google.internal.
nameserver [metadata_ip]

Even though our hosts file had an entry with the correct server hostname, since the .internal record was still present, the wrong hostname was selected when the Datadog agent restarted.

The new “hostname” looked like [machine name from GCP].internal instead of [machine name from GCP].[domain].
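
A quick way to see which name wins (the datadog-agent subcommand exists on recent Agent 6/7 versions; in our case it would have printed the .internal name):

# What the OS itself reports
hostname -f

# What /etc/hosts resolves for this machine - the Google-added .internal entry shows up here
getent hosts $(hostname)

# What the Datadog Agent has decided to use as its hostname
sudo datadog-agent hostname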

The outcome was that all the metrics were published under wrong .internal hostnames in Datadog, and it turned out that all of our monitoring was affected.

We are running our pipeline on the continuous integration principle, and even though there are tests that make sure services are running, it’s pretty hard to think of such scenarios in advance. This caused a small outage of one hour for us (and it was that short only because we figured it out pretty fast), time that was used to redeploy the reverted code on all of the affected instances.
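
One belt-and-braces option (just a sketch, not necessarily how the module should be configured) is to keep the hostname pinned explicitly in datadog.yaml, exactly as the Before file above did, so the agent never has to guess it from /etc/hosts:

# /etc/datadog-agent/datadog.yaml (sketch)
hostname: [machine name from GCP].[domain]   # set explicitly so the agent does not auto-detect it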

There are two pieces of advice to be taken from this unfortunate event:

  • Pay attention to the branch you use as a base for another one; errors happen, and they get ugly quickly at the rate at which everything is integrated nowadays.
  • Create a separate testing setup that is not directly integrated with production. You cannot have 100% test coverage for all the scenarios that appear in an ever-changing open source landscape of infrastructure solutions. Stay safe!


Sorin Tudor
METRO SYSTEMS Romania

DevOps Engineer, Technical blogger (log-it.tech) and amateur photographer.