Screenshot of the GCP Operations Dashboard created from the accompanying GitHub repo.

Monitoring and Logging for Terraform Enterprise — GCP Operations

Peyton Casper
5 min read · Sep 1, 2020


Introduction

In the first post of this series, we explored Terraform Enterprise (TFE) and presented a starting point for monitoring it. Today we’re going to focus solely on GCP Operations, including how to set up the BindPlane Universal Agent, Logging queries, Uptime Checks, and the Monitoring Dashboard featured above. All of the corresponding configuration, along with the Terraform code to set up TFE on GCP, can be found in the accompanying GitHub repo.

Metrics and Log Collection

GCP has three different agent-based approaches for collecting metrics and logs from an environment. We’re going to discuss the trade-offs of the three patterns and show how to set each one up to work with TFE.

  1. Stackdriver Agent — GCP Operations used to be called Stackdriver, and as such, the native collection agent is still the Stackdriver agent. While the Stackdriver agent can successfully capture metrics from a Docker environment, it is far from the easiest solution. It effectively functions as a collectd endpoint that forwards metrics, which means that we have to set up a custom collector that monitors Docker stats and forwards them along. Given the lack of native integration with Docker, this represents another piece that must be created and maintained.
  2. BindPlane Collector — GCP is officially moving towards BlueMedora’s BindPlane platform, which seems to be the most promising option given the number of deprecated integrations for the Stackdriver agent. In terms of simplicity, it provides several integrations out of the box, including the ability to monitor logs and collect metrics. The only aspect that initially turned me off was the need to run a separate collector instance within GCP. Thankfully, that requirement is removed by our last option.
  3. BindPlane Universal Agent — The BindPlane Universal Agent is BlueMedora’s next-generation agent and is currently in beta. In addition to removing the need for a separate collector instance, the Universal Agent provides a single binary that can collect both logs and metrics from underlying services. Compared to the two previous options, the Universal Agent simplifies the deployment process immensely and is the option we explore in this post.

GCP Logging

Screenshot of the GCP Logging interface and custom metric editor panel.

GCP Operations provides a unified and simple interface for filtering log entries using a mixture of Boolean operators, regex, and a few functions. In combination with the BindPlane Universal Agent streaming our Docker log entries to GCP, this will be the primary interface we use to derive metrics from TFE. In the example above, we are using a regex pattern to extract all the error logs from the various TFE containers. The next step is to turn that filter into a custom metric, which is explained below.
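The exact queries live in the repo, but as a rough sketch, a filter along these lines isolates error entries. The resource type and payload field used here are assumptions; they depend on how the BindPlane agent structures each Docker log entry:

    resource.type="gce_instance"
    (severity>=ERROR OR textPayload=~"(?i)error")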

Custom Metrics

Once we have successfully isolated a subset of log entries, the next question becomes: how do we turn those entries into a metric, such as the number of errors over time? Custom metrics are the answer. They allow us to count the matching entries or build a distribution over them, and they also give us the option to create labels that further segment the matching logs. An interesting example, if possible, would be to extract the container ID from each log entry; we could then use that label to track the number of errors over time not just for TFE as a whole, but per container.
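As a hedged Terraform sketch of what such a metric could look like (the filter and the jsonPayload.container_id extractor are assumptions and need to match what the agent actually writes):

    resource "google_logging_metric" "tfe_errors_over_time" {
      # Count log entries matching the error filter from the Logging section.
      name   = "tfe-errors-over-time"
      filter = "resource.type=\"gce_instance\" AND severity>=ERROR"

      metric_descriptor {
        metric_kind = "DELTA"
        value_type  = "INT64"

        labels {
          key         = "container_id"
          value_type  = "STRING"
          description = "Docker container that emitted the log entry"
        }
      }

      # Hypothetical field name; point this at wherever the agent records the container ID.
      label_extractors = {
        "container_id" = "EXTRACT(jsonPayload.container_id)"
      }
    }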

Given that we explored Azure Log Analytics in the previous post, an interesting difference between it and GCP Operations is the separation of concerns. While Log Analytics uses KQL (Kusto Query Language) to layer aggregation and grouping on top of filtering, GCP Operations splits this across two components: Logging handles filtering, while Monitoring handles grouping and aggregation. It’s an interesting separation that turns the final step of data manipulation into a UI-driven exercise.

Metrics Explorer

Screenshot showing the Metrics Explorer interface with the RAM usage per container chart configured.

The Metrics Explorer interface provides a simple search box for selecting from the various metrics being collected. After configuring the BindPlane Universal Agent, all of the metrics it collects natively are prefixed with external.googleapis.com/bluemedora, and the standard container metrics fall under this domain. The custom metrics we define from log entries, however, are found by searching for the metric name itself, such as tfe-errors-over-time (log-based metrics surface under the logging.googleapis.com/user/ prefix).

Once you’ve created a chart from either an existing metric or a custom metric, you can utilize the “Save Chart” button in the top right to save this to a GCP Operations Dashboard.
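If you would rather keep the dashboard in code next to the rest of the Terraform, the same chart can also be expressed as a google_monitoring_dashboard resource. The following is only a minimal sketch built around the custom metric defined earlier; the full dashboard lives in the repo:

    resource "google_monitoring_dashboard" "tfe" {
      dashboard_json = jsonencode({
        displayName = "TFE Operations"
        gridLayout = {
          widgets = [
            {
              title = "TFE errors over time"
              xyChart = {
                dataSets = [{
                  timeSeriesQuery = {
                    timeSeriesFilter = {
                      # Log-based custom metrics surface under the user/ prefix.
                      filter = "metric.type=\"logging.googleapis.com/user/tfe-errors-over-time\""
                      aggregation = {
                        alignmentPeriod  = "60s"
                        perSeriesAligner = "ALIGN_RATE"
                      }
                    }
                  }
                }]
              }
            }
          ]
        }
      })
    }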

Healthchecks

A healthcheck typically refers to some form of continuous poll that queries a given interface for the status of the underlying service. Terraform Enterprise exposes such an interface via an API endpoint that returns a standard 200 OK status over HTTP(S). Unfortunately, while Cloud SQL provides a similar availability metric that can be easily charted, Google Cloud Storage does not.

Terraform Enterprise

Screenshot of the GCP Operations Uptime Check configuration panel.

GCP Operations provides a simple mechanism called Uptime Checks for polling HTTP(S) endpoints continuously. An Uptime Check targets a specific IP address, Compute Engine instance, or load balancer; that target, combined with the protocol, API path, and check frequency, gives us a metric that we can use within the Metrics Explorer.

Terraform Enterprise’s health check endpoint is documented in the TFE docs and should look something like this: http://tfe.company.com/_health_check.
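As a Terraform sketch of the same check (tfe.company.com is the placeholder hostname from above, and var.project_id is an assumed variable in your configuration):

    resource "google_monitoring_uptime_check_config" "tfe_health" {
      display_name = "tfe-health-check"
      timeout      = "10s"
      period       = "60s"

      http_check {
        path    = "/_health_check"
        port    = 443
        use_ssl = true # match however your TFE instance is actually exposed
      }

      monitored_resource {
        type = "uptime_url"
        labels = {
          project_id = var.project_id    # assumed variable
          host       = "tfe.company.com" # placeholder hostname from above
        }
      }
    }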

Cloud SQL

Screenshot of the Metrics Explorer interface with the CloudSQL Availability metric displayed.

As mentioned above, Cloud SQL provides a built-in availability metric, which makes it pretty straightforward to configure. Head over to the Metrics Explorer, filter by the “Cloud SQL Database” resource type, and then select the “Server up” metric. This provides a simple up/down metric that reports whether the service is currently available.
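For reference, that selection corresponds to the following Monitoring filter, which can be reused if you define the chart outside the UI:

    metric.type = "cloudsql.googleapis.com/database/up" AND resource.type = "cloudsql_database"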

Conclusion

This post set out to present the specific choices I made and their trade-offs as I explored monitoring TFE with GCP Operations. All of the queries and setup steps can be found in the repo below. That said, this is simply a place to start, and I’ll be the first to admit that it doesn’t cover every metric one could track.

Did something trip you up? Have additional questions? Drop a comment below or send me a message on Twitter, and I’ll gladly offer help.
