Monitoring and Logging for Terraform Enterprise
Introduction
Today we’re going to explore the various subsystems of Terraform Enterprise (TFE) and how they interact, and present a starting point for monitoring the metrics of a production TFE deployment. We’ll also look at how to set up dashboards that display those metrics in Azure Monitor and Google Cloud Operations.
Terraform Enterprise Components
Terraform Enterprise has a microservices architecture made up of several Docker containers. These containers implement various API backends, caching and queueing layers such as Redis and RabbitMQ, and (depending on the exact configuration) can communicate with external services including PostgreSQL, blob storage, or Vault.
Application Layer
The application layer contains almost 30 containers. The ptfe_atlas and ptfe_nginx containers serve as entry points for TFE and route traffic to the various backend components.
ptfe_build_manager — Responsible for spawning the worker processes that handle TFE Runs, Cost Estimation, and Sentinel Checks.
retraced* — Retraced handles audit logging and is itself made up of several services, including an API, PostgreSQL, NSQD, and more. You’ll notice several containers with the retraced prefix whose names identify their function, such as retraced-postgres, retraced-cron, etc.
replicated* — Represents a few different containers belonging to Replicated, the third-party framework that handles the installation, deployment, and management of TFE.
Coordination Layer
The coordination layer is simpler than the application and storage layers. It provides caching for the various API components and includes all of the internal queues that help “coordinate” the different worker processes that handle each step of a TFE Run.
ptfe_redis — As mentioned above, this handles caching for the API layer.
rabbitmq — The internal queue that TFE utilizes.
Storage Layer
The storage layer contains four primary components.
Configuration Data — This includes the two settings files, one for Replicated and one for TFE, which are covered in the automated installation documentation.
Vault — Vault’s Transit Secrets Engine is used to encrypt most of the data stored by TFE in PostgreSQL and the blob storage system. Terraform Enterprise has the option of spinning up Vault within a container on the host or utilizing an external Vault cluster. The latter is generally not needed and should only be used by customers who need to manage the Vault encryption keys.
PostgreSQL — PostgreSQL is the backend database that stores transactional state for all the workspaces, users, teams, and more. See this document for TFE’s PostgreSQL requirements.
Blob Storage — TFE utilizes blob storage to store plans, state files, plan/apply logs, ingressed VCS data, and workspace variables. The underlying system can vary depending on the platform, but in general, you’re looking at a system like S3, Google Cloud Storage (GCS), Azure Blob Storage, or an S3-compatible object storage service like Minio.
For more information on TFE storage and its encryption, see this document.
Monitoring Terraform Enterprise
As we have seen, TFE is made up of several components. Each of these has a list of metrics that can be monitored. The goal here is to provide a set of recommended metrics that provide the necessary insight into the health of TFE and its underlying storage components.
Host Resource Utilization
The resource utilization metrics of the TFE host are a standard starting point for both monitoring and alerting. In most situations, CPU and memory are the two most critical host metrics when running TFE with external services, since the data itself is stored elsewhere. However, it’s always good to monitor disk usage and ensure the OS disk doesn’t run out of space.
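As a minimal sketch of what such a check looks like, the script below samples host CPU, memory, and OS-disk utilization. It assumes the psutil library is available, and the 90% thresholds are purely illustrative; any monitoring agent that exposes the same percentages works just as well.

```python
# Minimal host-utilization check for the TFE host (illustrative sketch).
# Assumes the `psutil` library is installed: pip install psutil
import psutil

# Illustrative alert thresholds, expressed as percentages (not recommendations).
THRESHOLDS = {"cpu": 90.0, "memory": 90.0, "disk": 90.0}

def sample_host():
    """Return current CPU, memory, and OS-disk utilization percentages."""
    return {
        "cpu": psutil.cpu_percent(interval=1),      # averaged over 1 second
        "memory": psutil.virtual_memory().percent,  # RAM in use
        "disk": psutil.disk_usage("/").percent,     # root filesystem usage
    }

if __name__ == "__main__":
    usage = sample_host()
    for metric, value in usage.items():
        status = "ALERT" if value >= THRESHOLDS[metric] else "ok"
        print(f"{metric:<6} {value:5.1f}% [{status}]")
```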
Containers Resource Utilization
Similar to the host metrics, we’re looking at both CPU and RAM for the various containers. While the host metrics give us the holistic picture of whether the system is overutilized, the container-level metrics let us monitor the individual worker containers that are running.
In general, we want to identify and alert operators when a specific worker container approaches the RAM limit, which is 512 MB by default. Monitoring worker resource utilization lets us determine whether the limit needs to be raised, or at least identify which workers exceed it.
I highly recommend reading through the TFE capacity documentation, which provides excellent recommendations on how many concurrent runs your TFE system can handle and describes settings you can tune.
In general, these metrics are captured from Docker’s stats engine and streamed by an agent to a metrics backend. In subsequent posts, we’ll dive into how to set up these agents for Azure Monitor and Google Cloud Operations.
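To make that concrete before those posts, here is a minimal sketch using the Docker SDK for Python (the docker package) that reads each running container’s memory usage from Docker’s stats API and flags anything approaching its limit. The 90% threshold is an illustrative assumption, and the script only approximates what a real agent would stream continuously to a metrics backend.

```python
# Illustrative sketch: read per-container memory usage from Docker's stats API
# and flag containers nearing their memory limit.
# Assumes the Docker SDK for Python is installed: pip install docker
import docker

ALERT_RATIO = 0.9  # illustrative threshold: 90% of the container's memory limit

def check_container_memory():
    client = docker.from_env()
    for container in client.containers.list():
        stats = container.stats(stream=False)       # one-shot stats snapshot
        mem = stats.get("memory_stats", {})
        usage, limit = mem.get("usage"), mem.get("limit")
        if not usage or not limit:
            continue                                 # stats not available yet
        ratio = usage / limit
        flag = "ALERT" if ratio >= ALERT_RATIO else "ok"
        print(f"{container.name:<30} {usage / 2**20:7.1f} MiB "
              f"of {limit / 2**20:7.1f} MiB ({ratio:5.1%}) [{flag}]")

if __name__ == "__main__":
    check_container_memory()
```

In practice you wouldn’t poll like this yourself; the Azure Monitor and Google Cloud Operations agents covered in the follow-up posts collect the same data continuously and ship it to their respective backends.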
Log Errors
I consider this the canary in the coal mine for your monitoring system. Looking for errors in the logs is a straightforward way to tell whether you have an issue in one of the various components of TFE. Given the architecture of TFE, there are several log sources and formats in which error messages are reported. I have documented these formats in the GitHub repo linked below and will dive deeper into them in subsequent blog posts.
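As a rough sketch, you can get a quick signal by scanning recent container logs for error-level lines, as below. The keyword match and the one-hour window are assumptions for illustration only; each TFE component has its own log format, so a real pipeline should parse per-component formats rather than grep for a few keywords.

```python
# Illustrative sketch: count error-looking lines in recent container logs.
# Assumes the Docker SDK for Python is installed: pip install docker
import time
import docker

LOOKBACK_SECONDS = 3600                       # scan the last hour (illustrative)
ERROR_MARKERS = ("error", "fatal", "panic")   # naive keyword match (assumption)

def count_log_errors():
    client = docker.from_env()
    since = int(time.time()) - LOOKBACK_SECONDS
    for container in client.containers.list():
        logs = container.logs(since=since, stdout=True, stderr=True)
        lines = logs.decode("utf-8", errors="replace").splitlines()
        errors = [l for l in lines if any(m in l.lower() for m in ERROR_MARKERS)]
        if errors:
            print(f"{container.name}: {len(errors)} error-looking lines "
                  f"in the last {LOOKBACK_SECONDS // 60} minutes")

if __name__ == "__main__":
    count_log_errors()
```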
Number of Active Workers
Along with health checks, I find this to be one of the most valuable metrics for understanding the state of your TFE instance. The maximum number of concurrent runs is an adjustable setting within the TFE Admin UI; tracking the actual number of concurrent runs allows you to understand the capacity and utilization of TFE as a service.
If you’re always running at the maximum number of concurrent runs, it might make sense to deploy TFE on a larger machine and increase the concurrent run limit. On the other hand, if you’re noticing usage spikes at specific times of day, this insight lets you identify and redistribute those deployments so that runs don’t queue up.
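One quick way to approximate this metric is to count the running worker containers, as in the sketch below. The name filter is a hypothetical placeholder; check the actual container names on your own instance (run docker ps while a run is in progress) and adjust the prefix accordingly.

```python
# Illustrative sketch: approximate the number of active TFE runs by counting
# running worker containers.
# Assumes the Docker SDK for Python is installed: pip install docker
import docker

WORKER_NAME_PREFIX = "ptfe_worker"  # hypothetical prefix; confirm with `docker ps`

def count_active_workers():
    client = docker.from_env()
    # The "name" filter matches containers whose names contain the prefix.
    workers = client.containers.list(filters={"name": WORKER_NAME_PREFIX})
    return len(workers)

if __name__ == "__main__":
    active = count_active_workers()
    print(f"Active worker containers: {active}")
    # Compare this against the concurrent-run limit configured in the Admin UI
    # to see how close you are to capacity.
```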
Healthchecks
- Terraform Healthcheck
- SQL Healthcheck
- Blob Storage Healthcheck
- Vault Healthcheck (Optional)
Healthchecks are pretty straightforward, but I did want to address them directly. The TFE health check endpoint returns a simple 200 OK status code if TFE is up and responding to requests. While this provides information on the service layer, it’s essential to also have health checks for each of your external services so that you can identify which storage system might be experiencing an outage.
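A minimal sketch of these checks follows. The TFE /_health_check path and Vault’s /v1/sys/health path reflect my understanding of those endpoints; the hostnames, the PostgreSQL TCP probe, and the blob storage note are placeholders you’d replace with the real checks for your platform.

```python
# Illustrative health-check sketch for TFE and its external services.
# Endpoints and hostnames are placeholders; replace them with your own.
# Assumes the `requests` library is installed: pip install requests
import socket
import requests

CHECKS = {
    # TFE's health check endpoint (returns 200 OK when the service is healthy).
    "terraform-enterprise": "https://tfe.example.com/_health_check",
    # Vault's health endpoint (relevant when using an external Vault cluster).
    "vault": "https://vault.example.com:8200/v1/sys/health",
}

def check_http(url):
    """Return True if the endpoint responds with HTTP 200 within 5 seconds."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def check_tcp(host, port):
    """Crude reachability probe, e.g. for PostgreSQL on port 5432."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, url in CHECKS.items():
        print(f"{name:<22} {'healthy' if check_http(url) else 'UNHEALTHY'}")
    pg_ok = check_tcp("postgres.example.com", 5432)   # placeholder host
    print(f"{'postgresql':<22} {'reachable' if pg_ok else 'UNREACHABLE'}")
    # Blob storage is best checked through your provider's own health/status
    # API or by reading a known test object; that check is platform-specific.
```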
A Monitoring Solution
The Azure Monitor dashboard shown below displays the various metrics described above. In the following two blog posts, we will explore how to set up metrics and log collection, along with the queries behind these metrics, for the native monitoring systems from Azure and GCP. The GitHub repository linked below also contains setup instructions and Terraform files for deploying this example in the various environments.
Conclusion
Monitoring and logging of any service is an opinionated topic, and a lot depends on the system and configuration you’re using. My goal here was to provide a generalized list of metrics and an understanding of the TFE components so that you can customize it for your own environment. Over the next few weeks, I will cover how to set up this dashboard with Azure Monitor and Google Cloud Operations.
Are there additional metrics that you utilize in your environment? Did something trip you up? Have additional questions? Drop a comment below or send me a message on Twitter, and I’ll gladly offer any help that I can.