Things I learned while integrating HashiCorp products together

Chris van Meer
HashiCorp Solutions Engineering Blog
11 min read · May 2, 2024


An article based on my HashiConf 2023 Hallway track

In my job as an IT consultant, I have been called upon multiple times to implement a so-called HashiStack. Today I would like to focus on a few of the products that make up such a stack: Consul, Vault, and Nomad, and especially on the glue that can bind them together.

$ grep -Eiw 'consul|vault|nomad' /hashicorp/hashistack

Documentation

When it comes to finding the right documentation, there is a one-stop shop for that: the HashiCorp Developer website. Here you will find loads and loads of documentation on each and every product in a structured manner.

Each product is covered in a couple of main topics: Installation, Tutorials, (Regular) Documentation, and API documentation. And each product does have some references to its neighboring products, but basically the product documentation is based on the specific product at hand, which is totally understandable.

However, this means that there is no complete integration example guide for combining these products. And when you think about it, no deployment is ever the same, right? I mean, a deployment for customer A could be fundamentally different from the deployment for customer B.

So when it comes to documentation, you’re pretty much on your own and you will have to take bits and pieces from the relevant documentation pages of the neighboring product(s).

Dependencies

An important thing to note while planning your deployment is that you will have to decide which product relies on what other product(s).

In theory, we could have the following HashiPendencies to name a few:

  • Consul depends on Vault for TLS certificates
  • Nomad depends on Consul for service discovery, and on Vault for TLS certificates and workload secrets
  • Vault depends on Consul for its storage backend

To zoom in on the latter dependency (Vault and Consul): the choice between Consul's KV store and Vault's integrated storage (Raft) makes a difference in what you have to configure for Vault to become part of Consul's service catalog.
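
For reference, here is a minimal sketch of the two storage stanzas in a Vault configuration file (the addresses and paths are illustrative):

# Option 1: Consul as Vault's storage backend
storage "consul" {
  address = "127.0.0.1:8501"
  scheme  = "https"
  path    = "vault/"
}

# Option 2: integrated storage (Raft)
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault01"
}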

If we choose consul as Vault's storage backend, we provide the necessary Consul connection details and Vault registers itself in the service catalog automatically. But if we choose raft as Vault's storage backend, we are not able to leverage Consul's DNS capabilities right away:

$ dig @127.0.0.1 -p 8600 +noall +question +answer active.vault.service.consul
;active.vault.service.consul. IN A

If we want Vault to register itself in the Consul service catalog while using integrated storage, we have to configure a service_registration stanza in the Vault configuration file. Below is an example of such a service registration.

service_registration "consul" {
  address     = "127.0.0.1:8501"
  scheme      = "https"
  token       = "<consul-token-here or inject-into-env-file>"
  tls_ca_file = "/etc/consul.d/consul-agent-ca.pem"
}

After applying the configuration and restarting the Vault service, we now have the power of Consul DNS to play with:

$ dig @127.0.0.1 -p 8600 +noall +question +answer active.vault.service.consul
;active.vault.service.consul. IN A
active.vault.service.consul. 0 IN CNAME vault01.node.service.consul.
vault01.node.service.consul. 0 IN A 10.156.189.163

Also keep in mind that without service registration for Vault, a Nomad cluster will probably have to rely on a third-party load balancer to provide the right address value in its vault block, pointing at the active Vault node.

With service registration, it’s a piece of cake:

vault {
  enabled          = true
  address          = "https://active.vault.service.consul:8200"
  task_token_ttl   = "1h"
  create_from_role = "nomad-cluster"
  token            = "<vault-token-here or injected in /etc/nomad.d/nomad.env>"
}

And lastly, if you are using Consul as Vault's storage backend, this also affects the way you will need to back up Vault. But more on that later.

Deployment

IaC versus CaC

A production deployment should be predictable, and how you deploy might depend on where, in your view, Infrastructure as Code (IaC) stops and Configuration as Code (CaC) starts.

In a non-SaaS deployment, IaC - for me - stops right after delivering the servers that will be used for our HashiStack. Combined with a cloud-init deployment file, the IaC should deliver the resources in a way that a configuration management tool can continue doing what it does best: configuration.
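
As an illustration, a minimal cloud-init user-data file could do just enough for the configuration management tool to take over (the user name and key below are assumptions):

#cloud-config
# User that the configuration management tool logs in as
users:
  - name: ansible
    groups: sudo
    sudo: "ALL=(ALL) NOPASSWD:ALL"
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...controller-public-key
# Ansible needs a Python interpreter on the managed node
packages:
  - python3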

My “weapons of choice” are:

  • IaC — Terraform
  • CaC — Ansible

But there are cases where people would like to manage the entire installation with IaC tooling. For Terraform, there are multiple ways we could achieve this, one of them being the remote-exec provisioner (most of the time combined with a file provisioner as well). Its inline code would look something like this:

(..)
provisioner "remote-exec" {
  inline = [
    "sudo apt update && sudo apt -y install gpg",
    "wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg >/dev/null",
    "echo \"deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/hashicorp.list",
    "sudo apt update && sudo apt -y install vault",
    "sudo mv ${var.tmp_vault_env} /etc/vault.d/vault.env",
    "sudo mv ${var.tmp_vault_config} /etc/vault.d/vault.hcl",
    "sudo chown vault:vault /etc/vault.d/vault.env",
    "sudo chmod 600 /etc/vault.d/vault.env",
  ]
}
(..)

With Ansible, we could achieve the same thing with the following snippet of a playbook:

- name: Ensure Hashicorp signing key
  ansible.builtin.apt_key:
    url: https://apt.releases.hashicorp.com/gpg
    keyring: "{{ keyring }}"
    state: present

- name: Ensure Hashicorp repository
  ansible.builtin.apt_repository:
    repo: "deb [signed-by={{ keyring }}] {{ url }} {{ ansible_distribution_release }} main"
    state: present

- name: Ensure apt update
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 3600

- name: Ensure vault package
  ansible.builtin.apt:
    name: vault
    state: present

- name: Ensure configuration
  ansible.builtin.template:
    src: "{{ item }}.j2"
    dest: "/etc/vault.d/{{ item }}"
    mode: "0600"
    owner: vault
    group: vault
  with_items:
    - vault.env
    - vault.hcl

Ansible uses an inventory as its source of truth about which hosts it should manage. In that inventory, hosts are declared either by a name that can be resolved through DNS or by an ansible_host argument that sets the IP address of the given host.

But when you use Terraform to deploy your HashiStack infrastructure, how would Ansible know which hosts Terraform just deployed for us? Well, the answer lies in Terraform itself, where we can make use of its templating capabilities and "deploy" the Ansible inventory through IaC as well.

We would first create a template file (called inventory.tmpl), which would look something like this:

[vault_servers]
%{ for name, ip in vault ~}
${name} ansible_host=${ip}
%{ endfor ~}

And then we use the Terraform local provider to create the inventory file for us:

resource "local_file" "ansible_inventory" {
content = templatefile("inventory.tmpl",
{
vault = tomap({
for instance in aws_instance.vault :
instance.tags.Name => instance.public_ip
})
}
)
filename = "../ansible/inventory.ini"
}
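
A typical run then looks something like this (the playbook name is an assumption):

$ terraform apply -auto-approve
$ ansible-playbook -i ../ansible/inventory.ini vault.yml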

Configuration segmentation

By default, when you install any one of the three products that we are discussing, you will be presented with a single configuration file (e.g. consul.hcl, vault.hcl and nomad.hcl).

Know that you don't have to stick to just that one file. Sometimes it makes more sense to segment your configuration into multiple files. If you run these products through systemd, make sure that your ExecStart entry starts the binary with the -config (Vault, Nomad) or -config-dir (Consul) argument followed by the configuration directory (e.g. /etc/vault.d).

This lets you break down your configuration into multiple files that will automatically be read on startup (given that they have the right extension, .hcl or .json). For instance, in Nomad I typically segment the configuration into product-specific files.

$ tree /etc/nomad.d
/etc/nomad.d
├── consul.hcl
├── nomad.env
├── nomad.hcl
└── vault.hcl

To achieve this, the nomad.service file would have to look something like this:

$ systemctl cat nomad
# /etc/systemd/system/nomad.service
(..)
[Service]
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
(..)

TLS certificates

Boy, the number of times I've had people start to sigh whenever I bring up the subject. Most people do not like dealing with certificates in general. They find it messy, complicated, and it can take up quite a bit of time.

Well, sometimes parts of this are true. When I started my journey to encrypt traffic for Consul, Vault, and Nomad in the same HashiStack, I quickly came to the conclusion that the different products have different needs regarding their SAN attributes.

Below you will find the products with their SAN attributes (in regex form), plus what I think is handy or needed to get the job done.

Consul

  • 127.0.0.1
  • (client|server).<datacenter>.<domain>
    Example: server.dc1.consul

Vault

  • 127.0.0.1
  • ((active|standby).)?vault.service.(<datacenter>.)?<domain>
    Example: active.vault.service.dc1.consul

Nomad

  • 127.0.0.1
  • (client|server).(<region>|global).<domain>
    Example: server.global.nomad

The 127.0.0.1 IP SAN helps a lot when you are working on the specific servers / clients themselves. For instance, the Vault CLI's default VAULT_ADDR is https://127.0.0.1:8200, and adding this IP SAN means the certificate is also valid for that address, so you do not have to override VAULT_ADDR every time.

As for the other SAN attributes, make sure you define your datacenter and domain within the specific product(s). This helps to integrate them more into your environment. For instance, for a fictitious company In The Picture, we could set Consul's domain to inthepicture.photo. Then, when we want to query the active Vault node, we would do a dig active.vault.service.inthepicture.photo, which will probably be easier to integrate somewhere down the line than the default consul domain.
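
If you run your internal CA from Vault's PKI secrets engine, issuing a server certificate with the right SANs might look something like this (the pki_int mount, the consul-dc1 role, and the addresses are assumptions):

$ vault write pki_int/issue/consul-dc1 \
    common_name="server.dc1.inthepicture.photo" \
    alt_names="consul01.node.dc1.inthepicture.photo" \
    ip_sans="127.0.0.1,10.156.189.161" \
    ttl="720h"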

Tokens (configuration)

When securing your products you will probably want to turn on ACLs, but beware! Once you are beyond the point of bootstrapping, you will need tokens; to be more precise, almost every request needs a token.

And if you are bootstrapping Consul with the default ACL policy set to deny, beware that this will break Consul's DNS capabilities, because from that point on DNS lookups also need a token. This might seem like something you could overcome by setting an environment variable with a token (whether or not you consider that safe, or a hassle), and that would indeed work on a Linux or Windows client.

But what if you rely on conditional DNS forwarding to your Consul domain? The forwarder (whether it is BIND, dnsmasq, Infoblox, etc.) cannot present a token. Here is where you have to create a Consul policy that allows read capabilities on the relevant parts, attach it to a token, and then tell the agent to use it for DNS with the consul acl set-agent-token default <token> command, as shown below.
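
A minimal sketch of that flow, assuming read access on all nodes, services, and prepared queries is acceptable in your environment:

$ cat > dns-policy.hcl <<'EOF'
node_prefix ""    { policy = "read" }
service_prefix "" { policy = "read" }
query_prefix ""   { policy = "read" }
EOF
$ consul acl policy create -name "dns-requests" -rules @dns-policy.hcl
$ consul acl token create -description "Token for DNS requests" -policy-name "dns-requests"
$ consul acl set-agent-token default <token-from-previous-output>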

Avoid putting tokens in the main configuration files. Put sensitive data like tokens in environment variables, most commonly by making use of the product's .env file. Sometimes you cannot avoid it, though; in that case, make sure file permissions are restricted to the user running the process.

$ sudo chown -R consul:consul /etc/consul.d
$ sudo chmod 0600 /etc/consul.d/consul.env /etc/consul.d/consul.hcl

Never put sensitive data in version control!

Logs / Logging

By default, Consul, Vault, and Nomad will log to journald. You probably want to change that.

For Consul and Nomad, we can change this relatively easily in the respective configuration files (with a tool like Ansible, for instance) and put in the following details:

enable_syslog        = true
log_level            = "INFO"
log_json             = false
log_file             = "/var/log/{{ product }}/{{ product }}.log"
log_rotate_duration  = "86400s"
log_rotate_max_files = 7

Where {{ product }} is your product of choice, of course. This will create a new log file every 24 hours and keep the latest 7 files. Make sure that the log destination directory exists before you use this.
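
For Consul, for instance, creating that directory would look like this (the consul user and group come with the package install):

$ sudo mkdir -p /var/log/consul
$ sudo chown consul:consul /var/log/consul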

For Vault, we will make use of audit devices. We will enable two of them, making sure that we at least have a fallback in case one of them fails. Vault will stop responding to requests when it cannot write to any of its enabled audit devices; as long as at least one audit device keeps working, Vault will remain online.

$ vault audit enable syslog
$ vault audit enable file file_path=/var/log/vault/audit.log

But this file audit device does not come with any form of rotation, which could end up with something like this:

$ ls -lh /var/log/vault/audit.log
-rw------- 1 vault vault 1.3G Apr 15 09:31 /var/log/vault/audit.log

Now, no one will be able to easily parse a 1.3 GB log file. So to overcome this, we will have to set up log rotation ourselves by placing an /etc/logrotate.d/vault file with the following content:

/var/log/vault/audit.log {
  rotate 7
  daily
  notifempty
  missingok
  compress
  delaycompress
  postrotate
    /usr/bin/systemctl reload vault 2> /dev/null || true
  endscript
  extension log
  dateext
  dateformat %Y-%m-%d.
}

Which would lead to a far better overview:

$ ls -lh /var/log/vault/
total 348K
-rw------- 1 vault vault 2.3K Apr 9 00:00 audit.2024-04-09.log.gz
-rw------- 1 vault vault 1.7K Apr 10 00:00 audit.2024-04-10.log.gz
-rw------- 1 vault vault 267K Apr 11 00:00 audit.2024-04-11.log.gz
-rw------- 1 vault vault 1.5K Apr 12 00:00 audit.2024-04-12.log.gz
-rw------- 1 vault vault 3.8K Apr 12 08:15 audit.2024-04-13.log.gz
-rw------- 1 vault vault 2.3K Apr 13 03:38 audit.2024-04-14.log.gz
-rw------- 1 vault vault 37K Apr 14 16:01 audit.2024-04-15.log
-rw------- 1 vault vault 15K Apr 15 00:00 audit.log

Backup

Are you running the community edition? Then you have your work cut out for you: you will have to create and maintain backups yourself. When you are running the enterprise edition, you can leverage the snapshot agent.

Below you will find an example of a backup script that keeps the last 7 days of snapshots. Note that the exact snapshot subcommand differs per product: consul snapshot save, nomad operator snapshot save, and vault operator raft snapshot save.

#!/bin/bash
# Assumes correct environment variables for
# authentication have been set and that
# PRODUCT is set (e.g. "consul"); adjust the
# subcommand for Nomad and Vault as noted above.

ts=$(date "+%Y%m%d%H%M%S")

find /srv/backups -name "*.snap" -type f -mtime +7 -exec rm {} \;
/usr/bin/${PRODUCT} snapshot save /srv/backups/${PRODUCT}_${ts}.snap
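
To keep those 7 days rolling, schedule the script daily, for instance through an /etc/cron.d entry (the script path is an assumption):

0 2 * * * root /usr/local/bin/hashistack-backup.sh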

Remember that for each product you will have to ensure that you have the right ACL in place and a corresponding token to be able to back up your data.

When you are using Consul as the storage backend for Vault, you cannot use Vault snapshots; instead, you would create a backup script that exports the vault/ prefix of the Consul KV store:

#!/bin/bash
# Assumes correct environment variables
# for authentication have been set

ts=$(date "+%Y%m%d%H%M%S")

find /srv/backups -name "*.json" -type f -mtime +7 -exec rm {} \;
/usr/bin/consul kv export vault/ > /srv/backups/vault_kv_${ts}.json

An example of using the enterprise snapshot agent, which will automatically take periodic snapshots for you, can be found below:

{
  "snapshot_agent": {
    "http_addr": "127.0.0.1:8501",
    "token": "<consul-token-here>",
    "datacenter": "",
    "license_path": "/etc/consul.d/consul.hclic",
    "snapshot": {
      "interval": "1h",
      "retain": 30,
      "stale": false,
      "service": "consul-snapshot",
      "lock_key": "consul-snapshot/lock",
      "max_failures": 3
    },
    "local_storage": {
      "path": "/srv/consul/snapshots"
    }
  }
}

Tokens (operations)

Are you using Vault? Did you know that Vault has both Consul and Nomad secrets engines? This allows you to authenticate to Vault and then retrieve limited-TTL access tokens for accessing and managing Consul and Nomad. How cool is that?! Remember that the Consul and Nomad policies you refer to from Vault have to be present within those products.

$ vault read consul/creds/operator
Key                Value
---                -----
lease_id           consul/creds/operator/EfssiLzg2Qt9zAbFrx5xBuOk
lease_duration     4h
lease_renewable    true
accessor           86feeb52-368e-691f-ee0d-598d49aca2cb
token              2117f9fb-414f-afee-a7d9-381f143b3f70

$ vault read nomad/creds/operator
Key                Value
---                -----
lease_id           nomad/creds/operator/wo7dGKyPwhH2236UDuGWEd3j
lease_duration     4h
lease_renewable    true
accessor_id        efe76155-4477-5bc3-5220-634750f30567
secret_id          79522702-5056-986e-d5d6-360c7b9f6b02

In this case, the policy operator should be present in Consul and Nomad.
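
For reference, wiring those secrets engines up might look something like this (the management tokens are placeholders, and the role parameter names differ slightly between Vault versions, so check them against your version):

$ vault secrets enable consul
$ vault write consul/config/access \
    address="127.0.0.1:8501" scheme="https" \
    token="<consul-management-token>"
$ vault write consul/roles/operator consul_policies="operator"

$ vault secrets enable nomad
$ vault write nomad/config/access \
    address="https://127.0.0.1:4646" \
    token="<nomad-management-token>"
$ vault write nomad/role/operator policies="operator"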

Observability and monitoring

Basically: enable metrics, they will save you someday.

Metrics

It’s as easy as adding a telemetry block to your configuration file with just a few lines (assuming we use Prometheus):

telemetry {
  disable_hostname          = true
  prometheus_retention_time = "12h"
}
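
A quick way to verify that metrics are actually exposed before wiring up Prometheus (tokens and TLS options depend on your setup):

$ curl -s -H "X-Consul-Token: <token>" "https://127.0.0.1:8501/v1/agent/metrics?format=prometheus"
$ curl -s -H "X-Vault-Token: <token>" "https://127.0.0.1:8200/v1/sys/metrics?format=prometheus"
$ curl -s "https://127.0.0.1:4646/v1/metrics?format=prometheus"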

Then on your Prometheus server, add a scrape job leveraging Consul service discovery:

- job_name: 'hashicorp_vault'
  consul_sd_configs:
    - server: 'consul.inthepicture.photo:8500'
      services: ['vault']
  relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)active(.*)'
      action: keep
  metrics_path: /v1/sys/metrics
  params:
    format: ['prometheus']
  scheme: https
  authorization:
    credentials_file: /etc/prometheus/vault-token

To maintain a current token for accessing the right API endpoint, Vault Agent could be leveraged:

exit_after_auth = false
pid_file        = "./pidfile"

vault {
  address = "https://active.vault.inthepicture.photo:8200"
}

auto_auth {
  method "approle" {
    mount_path = "auth/approle"
    config = {
      role_id_file_path                   = "/etc/vault.d/roleid"
      secret_id_file_path                 = "/etc/vault.d/secretid"
      remove_secret_id_file_after_reading = true
    }
  }

  sink "file" {
    config = {
      path = "/etc/prometheus/vault-token"
      mode = 0644
    }
  }
}

api_proxy {
  use_auto_auth_token = true
}

listener "tcp" {
  address     = "10.11.12.1:8007"
  tls_disable = true
}
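
Running the agent is then a matter of pointing it at this configuration (the file path is an assumption; in practice you would wrap this in a systemd unit):

$ vault agent -config=/etc/vault.d/vault-agent.hcl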

Monitoring

Besides metrics, be sure to monitor and alert on the basics with a monitoring tool like Zabbix, Nagios, or CheckMK, which all have free editions (a quick check for the Vault sealed state is sketched after the list):

  • Relevant ports listening
  • Vault sealed state
  • High CPU
  • High RAM
  • Disk usage
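
For the Vault sealed state, for instance, the unauthenticated health endpoint can be polled (the jq dependency and TLS options are assumptions):

$ curl -s "https://127.0.0.1:8200/v1/sys/health" | jq .sealed
false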

Conclusion

In a perfect world, we would install all three products and "enable" the neighboring product through a simple consul = enabled and the like. But the reality is that integrating these products takes some extra work, work that takes time to research. Once you have gathered all of the information, though, it can all be automated, and there lies the key: in automation.

Example code can be found on my GitHub account for multiple products. An all-in-one example is my AT-Hashistack repository.
