Puppet, Datadog, Google Cloud Platform — recipe for a small outage

Sorin Tudor
Mar 27, 2020 · 3 min read

The majority of infrastructure engineers are currently living in the age of Cloud and Devops. New opensource is developed and used on daily basis, agile no longer was transformed from “nice to have” to “must have”.

In order to achieve this level of agility and flexibility on cloud deployment, continuous delivery and deployment pipelines were also a must. What tools are integrated in this process? It depends, let me give you some examples: Gitlab for the code, Puppet and r10k for automation code deployment, and of course, good old Jenkins.

There are also difficulties, beside new functionalities, the pipeline needs to be up and running with all the changes that happen, and it goes wrong in few cases.

We were trying the other day to fix a “bug” that consists of installing the apt key on each puppet agent run.

The main direction pointed to something in the Datadog module for Puppet, or at least that is what we thought at that moment. Since the code was also used by other teams and we couldn’t make changes on the master branch, I need to create a separate branch for our project. Unfortunately, I did not pay attention to the fact that we are not using the latest version of code and we have a certain older tag and ended by cloning the master branch with the latest version of code.

After everything was done and a tag was created and added to Puppetfile, we started receiving alerts that Zookeeper process monitoring was not working anymore(lots of them).

Check filter from Datadog:

"zookeeper.ruok".over("type:kafka","type:zookeeper").exclude("env:prod").by("host","port").last(2).count_by_status()

So it was linked with our error, but how?

Turns out that one of the change that is done in the latest code from Datadog module made our datadog.yaml look like:

Before

#
# MANAGED BY PUPPET
#
---
hostname: [hostname]
api_key: [key]
dd_url: https://app.datadoghq.com
cmd_port: 5001
conf_path: "/etc/datadog-agent/conf.d"
enable_metadata_collection: true
dogstatsd_port: 8125
dogstatsd_socket: ''
dogstatsd_non_local_traffic: false
log_file: "/var/log/datadog/agent.log"
log_level: info
tags:
- vertical:kafka
- env:dev
- type:kafka
apm_config:
apm_enabled: false
process_config:
process_enabled: disabled

After

#
# MANAGED BY PUPPET
#
---
api_key: [key]
dd_url: https://app.datadoghq.com
cmd_port: 5001
conf_path: "/etc/datadog-agent/conf.d"
enable_metadata_collection: true
dogstatsd_port: 8125
dogstatsd_socket: ''
dogstatsd_non_local_traffic: false
log_file: "/var/log/datadog/agent.log"
log_level: info
tags:
- vertical:kafka
- env:dev
- type:kafka
apm_config:
apm_enabled: false
process_config:
process_enabled: disabled

No hostname in agent configuration file anymore, but this shoudn’t be a problem after all, right? Wrong!

For internal management and operations purpose, one of the GCP daemon updates each machine with lines in /etc/hosts

[ip] [instance-name].c.[project_name].internal [instance_hostname] # Added by Google
[metadata_ip] metadata.google.internal # Added by Google

And /etc/resolv.conf

domain c.[project_name].internal
search c.[project_name].internal. google.internal.
nameserver [metadata_ip]

Even if our hosts file have one entry with the correct server hostname, since the .internal record was still present, on a Datadog agent restart, wrong hostname was selected.

New “hostname” looked like [machine name from GCP].internal instead of [machine name from GCP].[domain]

The outcome was that all the metrics were published on wrong .internal hostnames in Datadog, and it turn out that all of monitoring was affected.

There are two advises to be taken from this unfortunate event:

  • Create a separate testing setup without direct integration. You can not have 100% test coverage for all the scenarios that appear on an ever changing opensource landscape of infrastructure solutions. Stay safe!

METRO SYSTEMS Romania

METRO SYSTEMS Romania Engineering

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store