Managing Your Consul Cluster with Terraform

Considerations regarding the Consul provider in Terraform

Jeff Hemmen
GlobalLogic UK&I
May 30, 2019


This is an article about the Consul provider for Terraform, not the Consul backend: it is about using Terraform to manage your Consul cluster’s configuration from a centralised Terraform project.
If you are only interested in using Consul as a backend for Terraform, i.e. to store state and locking information, this article will not be very relevant to you. (But hopefully still interesting.)

Disclosure: Jeff Hemmen works for ECS Digital, who are an official HashiCorp partner.

This article also presumes knowledge of what both Terraform and Consul are.

The Terraform documentation says: “The Consul provider exposes resources used to interact with a Consul cluster.”
Thus I imagined I could use one of Terraform’s major cloud providers to provision my infrastructure and bootstrap a few services (as is its most prominent use-case), and subsequently use the Consul provider to configure a pristine Consul cluster running on this environment and register services and health checks—all in the same Terraform project!

A brief glance at the list of resources in the Consul provider further strengthened my assumption, as it had both consul_node and consul_service in it.
Some other resources had a warning that they had been deprecated; not an issue, I thought! Terraform is in active development, and the folks over at HashiCorp are constantly working on improving their offering for the community. There were even references to an upgrade guide.

So: ho! it’s off to work I did go.

My project, in general terms, was a simple webserver/database setup, fronted by a loadbalancer. Each server was also running a Consul agent, and there were additionally three dedicated Consul servers.
The loadbalancer would route traffic to the webservers by default, but the Consul UI (exposed on the Consul servers) could be reached via a specific URL prefix¹.

Fig. 1: High-level view of my use-case

From here, Terraform can use the loadbalancer itself to reach the Consul cluster and make API calls. And the best part is: since Terraform excels at resolving dependencies, we do not need to specify any dependencies or triggers explicitly.
I am using the loadbalancer’s DNS name in the Consul provider configuration, so Terraform knows it can only start making API calls once the loadbalancer is up².

provider "consul" {
  address = "${aws_lb.hc-usecase-c1.dns_name}"
}

Since the Consul provider cannot be used to install Consul on the various nodes, I did this as part of the instance provisioning process, namely through user-data. I used this to do the minimum configuration that I figured the Consul provider would expect (a sketch of the resulting agent configuration follows the list below):

  • install the agents;
  • set the node names³;
  • define which nodes are servers and which ones clients;
  • set the server addresses for the initial connection⁴;
  • set the shared key for encrypted communication throughout the cluster.
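
For illustration, the user-data on a node ends up writing out an agent configuration along these lines (a minimal sketch rather than my exact script, with placeholder values for the data directory, server addresses and gossip key):

# /etc/consul.d/consul.hcl — minimal agent configuration (illustrative)
datacenter = "dc1"
data_dir   = "/opt/consul"
node_name  = "web-00"                  # one of the webserver nodes
server     = false                     # true (plus bootstrap_expect = 3) on the three Consul servers
retry_join = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]   # in practice, cloud auto-join via tags (see footnote 4)
encrypt    = "<shared-gossip-key>"     # same key on every node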

This really is the minimum viable setup of a healthy Consul cluster, devoid of any service and health check definitions. On to the Consul provider!

The Quirk

I defined my Consul services using the consul_service resource⁵:

resource "consul_service" "web" {
  name = "web"
  node = "web-00"
  port = 80

  check = [
    {
      check_id = "service:web-listen-00"
      name     = "Webserver listen health check"
      http     = "http://localhost:80/"
      interval = "10s"
      timeout  = "1s"
    }
  ]
}

Sure enough, when I navigated to the Consul UI endpoint, all my services showed up—but the Health Checks were failing. This is to be expected, as they are ‘critical’ by default, and only change to ‘passing’ once they have actually run and passed. (This is done by design, to prevent Consul advertising newly-registered Services before they are ready.)

Fig. 2: Consul UI Service view with three user-defined services and failing health checks.⁶

But then… my services started disappearing! First the number of health checks started dwindling, then—with the last health check gone—the whole Service definition would disappear.

My first suspicions—all wrong in the end—revolved around the health checks. Were they failing? Never running? Was the timeout too low? (Hardly.)

I set all my health checks to start off as ‘passing’ (except for one set—one always needs that control group when debugging ☝), but the same thing happened:

Fig. 3: [Animated GIF] Consul UI Service view with passing health checks which decrease in number. Eventually, each Service line disappears altogether, leaving only the built-in ‘consul’ Service.
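
For the record, ‘starting off as passing’ simply meant giving each check an initial status in the resource definition. A hedged sketch, assuming the provider’s check block accepts the same optional status field as Consul’s native check definitions:

resource "consul_service" "web" {
  name = "web"
  node = "web-00"
  port = 80

  check = [
    {
      check_id = "service:web-listen-00"
      name     = "Webserver listen health check"
      http     = "http://localhost:80/"
      interval = "10s"
      timeout  = "1s"
      status   = "passing"   # start out 'passing' instead of the default 'critical'
    }
  ]
}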

The Confusion

The issue—as I discovered in the end—stems from a bit of confusion around Consul’s architecture and API: namely between the concepts of Agents and Catalog.

Let’s first dig into the specificities at the heart of Consul itself, before we then look at how this affects Terraform.

API Endpoints

There are two similar and related Consul API endpoints: /agent and /catalog. At first glance, they look very similar. The documentation says the following about them:

/agent:

💬 … used to interact with the local Consul agent.

💬 Usually, services and checks are registered with an agent which then takes on the burden of keeping that data synchronized with the cluster.

/catalog:

💬 … endpoints register and deregister nodes, services, and checks in Consul.

💬 The catalog should not be confused with the agent, since some of the API methods look similar.⁷

Thus far, these looked sensibly similar to me! (Except that, when making API requests manually, one of them produced the desired result and the other resulted in disappearing services, just like above.)

Hence, we need to go deeper…

Consul’s Architecture

Consul.io has a section about Consul Internals. This goes into deeper detail about Consul’s components, but an understanding of this is normally not required to use and operate Consul successfully.
Nevertheless, this is where the heart of my misunderstanding lies. We’ll have a brief look at what the Internals documentation says about Agents and the Catalog, and then move to the crux of the issue:

Agents

💬 Each Consul agent maintains its own set of service and check registrations as well as health information.

💬 The agents are responsible for executing their own health checks and updating their local state.

The Catalog

💬 … formed by aggregating information submitted by the agents.

💬 … maintains the high-level view of the cluster […] (which services are available, which nodes run those services […]).

💬 The catalog is maintained only by server nodes.

This is when I started to get a sense of where the problem might lie. So I read on and found…

The Crux

Consul, being distributed and decentralised by design, carries out anti-entropy operations like most such systems.
Here’s what I found in the documentation:

💬 Anti-entropy is a synchronization of the local agent state and the catalog.

💬 If any services or checks exist in the catalog that the agent is not aware of, they will be automatically removed to make the catalog reflect the proper set of services and health information for that agent.

A-ha! This certainly explains the behaviour we observed, but does raise a few questions. Why is there a /catalog API endpoint at all then? And what would happen without anti-entropy?

Why is there a /catalog API endpoint at all?
Simply put: for external services. Consul allows you to register services outwith your cluster, such as third-party APIs. Since these don’t live on one of the nodes running a Consul agent, they can’t be registered with a specific agent. Rather, we can add them to the Catalog, which will keep track of them regardless of specific nodes joining or leaving the cluster.
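
This is also the one pattern where the current Terraform resources fit naturally. A hedged sketch of registering a third-party API as an external node and service (the names and address are invented for illustration):

# Register an external service that does not run a Consul agent
resource "consul_node" "billing_api" {
  name    = "billing-api"
  address = "billing.example.com"
}

resource "consul_service" "billing" {
  name = "billing"
  node = "${consul_node.billing_api.name}"
  port = 443
}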

What would happen without anti-entropy?
While it sounds tempting to think anti-entropy is an annoying thing we’d be better off without, that’s not the case. Without anti-entropy, we would see a state that is ‘incidentally consistent’ at first, but might become inconsistent without us noticing: the actual service and the Service registered with Consul would be completely disjoint, no health checks would run⁸, and Consul would continue to advertise the Service even if it went down or the whole node left the cluster.

Conclusion—Consul

In summary, Consul provides two similar API endpoints, /agent and /catalog, but only /agent should be used to register Services residing within the Consul cluster.

So: doesn’t Terraform use the /agent endpoint?

Terraform’s approach

No. Terraform indeed wants to use the /catalog endpoint, and for a very valid reason.

As seen in Fig. 1, my Terraform client sits wholly outside my infrastructure—as it is meant to. By design, there is no way of reaching Consul clients from outside the private network. Only Consul servers are reachable, through the loadbalancer¹, but even so we cannot choose which specific server to route a request to.
What we are exposing, conceptually, is the Consul cluster, not a set of Consul nodes. Thus, we cannot cater to any action that operates on a specific Consul node.

The Consul Terraform Provider documentation explicitly states this as well:

Because Terraform is intended to be run externally to the cluster, and for other internal reasons, [the /agent] API was the incorrect one to use.

Heed the warnings

In hindsight, the warnings about obsolete resources in the Consul Terraform Provider, which I had conveniently ignored, make sense.

consul_agent_service, consul_catalog_entry, and other resources have been made obsolete and/or renamed in favour of different resources with modified behaviour.
I had assumed that, as on previous occasions, I needn’t even bother worrying about this, as the fantastic folks over at HashiCorp usually make sure their products stay as intuitive to use as they are powerful.

However, breaking changes are sometimes inevitable. Terraform is currently at version 0.12, and the whole community is still figuring out the best approaches and most scalable practices, continually striving to improve the overall experience.
Sometimes, to make a vegan omelette, you have to break a few chia seeds down with almond milk!

Workarounds

Because of the design limitations of my infrastructure setup, I cannot use the obsolete resources made available by the Consul Terraform Provider (a sign that I’m doing something right!). Nor can I use the updated resources, as they don’t offer the functionality that I require.
So I’m afraid I cannot configure my Consul cluster using the Consul Terraform Provider at all, for the time being!

The way I went about this in the end was to define the respective Consul Services and Health checks in JSON files, and to install these on disk on each Consul node. I did this using the same mechanism I used for the basic Consul cluster configuration, i.e. user-data.
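
Concretely, each node ends up with a small service-definition file in the agent’s configuration directory, which the agent picks up on start-up (or after a consul reload). Consul accepts both JSON and HCL for these files; the sketch below is in HCL for readability (my actual files were the JSON equivalent), and the path and values are illustrative:

# /etc/consul.d/web.hcl — local service and check definition for a webserver node
service {
  name = "web"
  port = 80

  check {
    id       = "service:web-listen-00"
    name     = "Webserver listen health check"
    http     = "http://localhost:80/"
    interval = "10s"
    timeout  = "1s"
  }
}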

Feature requests

Because it is still desirable—and technically feasible—to provision and configure a whole Consul cluster from Terraform without unnecessarily exposing endpoints, my team here at ECS Digital and I are putting in a Feature Request with HashiCorp to allow for this.

A simple way of achieving this would be to add an “API Proxy” functionality to Consul servers. That way, a request destined for an [unreachable] Consul client node can be sent to any one of the [exposed] Consul servers, with an additional X-Consul-API-Target header for example. The initial recipient can then use available intra-cluster communication channels to send the request on to the intended recipient’s /agent endpoint.

If this is deemed the wrong approach (e.g. because it would renege on the concept of a unified cluster which external services such as Terraform are to interact with, and instead effectively have API callers address specific nodes again), a deeper change could be made to how Consul deals with Service registrations, e.g. allowing these definitions to come through the Catalog.

Footnotes

[1] I am ignoring my auth’n/z considerations here, but please do not leave your Consul cluster unsecured!

[2] There may be a race condition here whereby the loadbalancer is ready and enables the Consul API calls to take place before all/any of the Consul nodes themselves are ready—but this has not been an issue for me so far!

[3] A note on the consul_node resource: Nodes joining a cluster without node_name specified in their configuration use their hostname as their node name. Initially I attempted to also set the node names using the consul_node resource, but this would create a duplicate ‘node’ in the Consul GUI with the indicated name and IP address, which was disjoint from the original ‘hostname’-named node. It would also not run a Serf Health Status check, and other health checks would never run and stay in their initial state forever. Read on to find out why.
The consul_node data source was also not the solution here, as it retrieves all nodes, and doesn’t directly allow retrieving a single one, e.g. by IP address.

[4] I used a cloud provider for Consul here—yay, providers!—allowing me to define which nodes are servers using tags in my cloud provider. This enables a more dynamic setup, where all the servers in the initial set can change without causing any issues.

[5] Code simplified for didactic purposes.

[6] The passing health checks are Consul’s built-in Serf Health Status check, which exists by default for each node.

[7] yah… no kidding!

[8] Consul has a different mechanism to provide health checks for external services.
