Things I learned while integrating HashiCorp products together

Chris van Meer
HashiCorp Solutions Engineering Blog
11 min read · May 2, 2024


An article based on my HashiConf 2023 Hallway track

In my job as an IT consultant, I have been called upon multiple times to implement a so-called HashiStack. Today I would like to focus on a few of the products that make up such a stack: Consul, Vault, and Nomad, and especially on the glue that can bind them together.

$ grep -Eiw 'consul|vault|nomad' /hashicorp/hashistack

Documentation

When it comes to finding the right documentation, there is a one-stop shop for that: the HashiCorp Developer website. Here you will find loads and loads of documentation on each and every product in a structured manner.

Each product is covered in a couple of main topics: Installation, Tutorials, (Regular) Documentation, and API documentation. And each product does have some references to its neighboring products, but basically the product documentation is based on the specific product at hand, which is totally understandable.

However, this means that there is no complete integration example guide for combining these products. And when you think about it, no deployment is ever the same, right? I mean, a deployment for customer A could be fundamentally different from the deployment for customer B.

So when it comes to documentation, you’re pretty much on your own and you will have to take bits and pieces from the relevant documentation pages of the neighboring product(s).

Dependencies

An important thing to note while planning your deployment is that you will have to decide which product relies on what other product(s).

In theory, we could have the following HashiPendencies to name a few:

  • Consul depends on Vault for TLS certificates
  • Nomad depends on Consul for service discovery, and on Vault for TLS certificates and workload secrets
  • Vault depends on Consul for its storage backend

To zoom in on the latter dependency (Vault and Consul): the choice between Consul's KV store and Vault's integrated storage (Raft) makes a difference in what you have to configure for Vault to become part of Consul's service catalog.
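
For reference, here is a minimal sketch of the two storage stanzas in a Vault configuration file (the addresses and paths are illustrative):

# Option 1: Consul as Vault's storage backend
storage "consul" {
  address = "127.0.0.1:8501"
  scheme  = "https"
  path    = "vault/"
}

# Option 2: integrated storage (Raft)
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault01"
}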

If we choose consul as Vault's storage backend, we provide the necessary Consul connection details and Vault registers itself in the service catalog automatically. But if we choose raft as Vault's storage backend, we are not able to leverage Consul's DNS capabilities right away:

$ dig @127.0.0.1 -p 8600 +noall +question +answer active.vault.service.consul
;active.vault.service.consul. IN A

If we want Vault to register itself in the Consul service catalog while using integrated storage, we have to configure a service_registration stanza in the Vault configuration file. Below is an example of such a service registration.

service_registration "consul" {
  address     = "127.0.0.1:8501"
  scheme      = "https"
  token       = "<consul-token-here or inject-into-env-file>"
  tls_ca_file = "/etc/consul.d/consul-agent-ca.pem"
}

After applying the configuration and restarting the Vault service, we now have the power of Consul DNS to play with:

$ dig @127.0.0.1 -p 8600 +noall +question +answer active.vault.service.consul
;active.vault.service.consul. IN A
active.vault.service.consul. 0 IN CNAME vault01.node.service.consul.
vault01.node.service.consul. 0 IN A 10.156.189.163

Also keep in mind that without service registration for Vault, a Nomad cluster will probably have to rely on a third-party load balancer to provide the right address value in its vault block, pointing at the active Vault node.

With service registration, it’s a piece of cake:

vault {
  enabled          = true
  address          = "https://active.vault.service.consul:8200"
  task_token_ttl   = "1h"
  create_from_role = "nomad-cluster"
  token            = "<vault-token-here or injected in /etc/nomad.d/nomad.env>"
}

And lastly, if you are using Consul as Vault's storage backend, this also affects the way you will need to back up Vault. But more on that later.

Deployment

IaC versus CaC

A production deployment should be predictable, and how you deploy might depend on where, in your view, Infrastructure as Code (IaC) stops and Configuration as Code (CaC) starts.

In a non-SaaS deployment, IaC - for me - stops right after delivering the servers that will be used for our HashiStack. Combined with a cloud-init deployment file, the IaC should deliver the resources in a way that a configuration management tool can continue doing what it does best: configuration.
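
As an illustration, a minimal cloud-init user-data file could do just enough for the configuration management tool to take over (the user name and key below are assumptions):

#cloud-config
# User that the configuration management tool logs in as
users:
  - name: ansible
    groups: sudo
    sudo: "ALL=(ALL) NOPASSWD:ALL"
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...controller-public-key
# Ansible needs a Python interpreter on the managed node
packages:
  - python3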

My “weapons of choice” are:

  • IaC — Terraform
  • CaC — Ansible

But there are cases where people would like to manage the entire installation with IaC tooling. For Terraform, there are multiple ways we could achieve this, one of them being the remote-exec provisioner (most of the time combined with a file provisioner as well). Its inline code would look something like this:

(..)
provisioner "remote-exec" {
  inline = [
    "sudo apt update && sudo apt -y install gpg",
    "wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg >/dev/null",
    "echo \"deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main\" | sudo tee /etc/apt/sources.list.d/hashicorp.list",
    "sudo apt update && sudo apt -y install vault",
    "sudo mv ${var.tmp_vault_env} /etc/vault.d/vault.env",
    "sudo mv ${var.tmp_vault_config} /etc/vault.d/vault.hcl",
    "sudo chown vault:vault /etc/vault.d/vault.env",
    "sudo chmod 600 /etc/vault.d/vault.env",
  ]
}
(..)

With Ansible, we could achieve the same thing with the following snippet of a playbook:

- name: Ensure Hashicorp signing key
  ansible.builtin.apt_key:
    url: https://apt.releases.hashicorp.com/gpg
    keyring: "{{ keyring }}"
    state: present

- name: Ensure Hashicorp repository
  ansible.builtin.apt_repository:
    repo: "deb [signed-by={{ keyring }}] {{ url }} {{ ansible_distribution_release }} main"
    state: present

- name: Ensure apt update
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 3600

- name: Ensure vault package
  ansible.builtin.apt:
    name: vault
    state: present

- name: Ensure configuration
  ansible.builtin.template:
    src: "{{ item }}.j2"
    dest: "/etc/vault.d/{{ item }}"
    mode: "0600"
    owner: vault
    group: vault
  with_items:
    - vault.env
    - vault.hcl

Ansible uses an inventory as its source of truth about which hosts it should manage. In that inventory, hosts are declared either by a name that can be resolved through DNS or by an ansible_host argument that sets the IP address of the given host.

But when you use Terraform to deploy your HashiStack infrastructure, how would Ansible know which hosts Terraform just deployed for us? Well, the answer lies in Terraform itself, where we can make use of its templating capabilities and "deploy" the Ansible inventory through IaC as well.

We would first create a template file (called inventory.tmpl), which would look something like this:

[vault_servers]
%{ for name, ip in vault ~}
${name} ansible_host=${ip}
%{ endfor ~}

And then we use the Terraform local provider to create the inventory file for us:

resource "local_file" "ansible_inventory" {
content = templatefile("inventory.tmpl",
{
vault = tomap({
for instance in aws_instance.vault :
instance.tags.Name => instance.public_ip
})
}
)
filename = "../ansible/inventory.ini"
}
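
A typical run then looks something like this (the playbook name is an assumption):

$ terraform apply -auto-approve
$ ansible-playbook -i ../ansible/inventory.ini vault.yml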

Configuration segmentation

By default, when you install any one of the three products that we are discussing, you will be presented with a single configuration file (e.g. consul.hcl, vault.hcl and nomad.hcl).

Know that you don't have to stick to just that one file. Sometimes it makes more sense to segment your configuration into multiple files. If you run these products through systemd, make sure that your ExecStart entry starts the binary with the -config (Vault, Nomad) or -config-dir (Consul) argument followed by the configuration directory (e.g. /etc/vault.d).

This lets you break down your configuration into multiple files that will automatically be read on startup (given that they have the right extension, .hcl or .json). For instance, in Nomad I typically segment the configuration into product-specific files.

$ tree /etc/nomad.d
/etc/nomad.d
├── consul.hcl
├── nomad.env
├── nomad.hcl
└── vault.hcl

To achieve this, the nomad.service file would have to look something like this:

$ systemctl cat nomad
# /etc/systemd/system/nomad.service
(..)
[Service]
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
(..)

TLS certificates

Boy, the number of times I've had people start to sigh whenever I bring up the subject. Most people do not like dealing with certificates in general. They find it messy, complicated, and it can take up quite a bit of time.

Well, sometimes parts of this are true. When I started my journey to encrypt traffic for Consul, Vault, and Nomad in the same HashiStack, I quickly came to the conclusion that the different products have different needs regarding their SAN attributes.

Below you will find the products with their SAN attributes (in regex form), plus what I think is handy or needed to get the job done.

Consul

  • 127.0.0.1
  • (client|server).<datacenter>.<domain>
    Example: server.dc1.consul

Vault

  • 127.0.0.1
  • ((active|standby).)?vault.service.(<datacenter>.)?<domain>
    Example: active.vault.service.dc1.consul

Nomad

  • 127.0.0.1
  • (client|server).(<region>|global).<domain>
    Example: server.global.nomad

The 127.0.0.1 IP SAN helps a lot when you are working on the specific servers / clients themselves. For instance, the Vault CLI's default VAULT_ADDR is https://127.0.0.1:8200, and adding this IP SAN means the certificate is also valid for that address, so you do not have to override VAULT_ADDR every time.

As for the other SAN attributes, make sure you define your datacenter and domain within the specific product(s). This helps to integrate them more into your environment. For instance, for a fictitious company In The Picture, we could set Consul's domain to inthepicture.photo. Then, when we want to query the active Vault node, we would do a dig active.vault.service.inthepicture.photo, which will probably be easier to integrate somewhere down the line than the default consul domain.
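
If you run your internal CA from Vault's PKI secrets engine, issuing a server certificate with the right SANs might look something like this (the pki_int mount, the consul-dc1 role, and the addresses are assumptions):

$ vault write pki_int/issue/consul-dc1 \
    common_name="server.dc1.inthepicture.photo" \
    alt_names="consul01.node.dc1.inthepicture.photo" \
    ip_sans="127.0.0.1,10.156.189.161" \
    ttl="720h"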

Tokens (configuration)

When securing your products you will probably want to turn on ACLs, but beware! Once you are beyond the point of bootstrapping, you will need tokens; to be more precise, almost every request needs a token.

And if you are bootstrapping Consul with the default ACL policy set to deny, beware that this will break Consul's DNS capabilities, because from that point on DNS lookups also need a token. This might seem like something you could overcome by setting an environment variable with a token (whether or not you consider that safe, or a hassle), and that would indeed work on a Linux or Windows client.

But what if you rely on conditional DNS forwarding to your Consul domain? The forwarder (whether it is BIND, dnsmasq, Infoblox, etc.) cannot present a token. Here is where you have to create a Consul policy that allows read capabilities on the relevant parts, attach it to a token, and then tell the agent to use it for DNS with the consul acl set-agent-token default <token> command, as shown below.
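
A minimal sketch of that flow, assuming read access on all nodes, services, and prepared queries is acceptable in your environment:

$ cat > dns-policy.hcl <<'EOF'
node_prefix ""    { policy = "read" }
service_prefix "" { policy = "read" }
query_prefix ""   { policy = "read" }
EOF
$ consul acl policy create -name "dns-requests" -rules @dns-policy.hcl
$ consul acl token create -description "Token for DNS requests" -policy-name "dns-requests"
$ consul acl set-agent-token default <token-from-previous-output>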

Avoid putting tokens in the main configuration files. Put sensitive data like tokens in environment variables, most commonly by making use of the product's .env file. Sometimes you cannot avoid it, though; in that case, make sure file permissions are restricted to the user running the process.

$ sudo chown -R consul:consul /etc/consul.d
$ sudo chmod 0600 /etc/consul.d/consul.env /etc/consul.d/consul.hcl

Never put sensitive data in version control!

Logs / Logging

By default, Consul, Vault, and Nomad will log to journald. You probably want to change that.

For Consul and Nomad, we can change this relatively easily in the respective configuration files (with a tool like Ansible, for instance) and put in the following details:

enable_syslog        = true
log_level            = "INFO"
log_json             = false
log_file             = "/var/log/{{ product }}/{{ product }}.log"
log_rotate_duration  = "86400s"
log_rotate_max_files = 7

Where {{ product }} is your product of choice, of course. This will create a new log file every 24 hours and keep the latest 7 files. Make sure that the log destination directory exists before you use this.
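
For Consul, for instance, creating that directory would look like this (the consul user and group come with the package install):

$ sudo mkdir -p /var/log/consul
$ sudo chown consul:consul /var/log/consul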

For Vault, we will make use of audit devices. We will enable two of them, making sure that we at least have a fallback in case one of them fails. Vault will stop responding to requests when it cannot write to any of its enabled audit devices; as long as at least one audit device keeps working, Vault will remain online.

$ vault audit enable syslog
$ vault audit enable file file_path=/var/log/vault/audit.log

But this file audit device does not come with any form of rotation, which could end up with something like this:

$ ls -lh /var/log/vault/audit.log
-rw------- 1 vault vault 1.3G Apr 15 09:31 /var/log/vault/audit.log

Now, no one will be able to easily parse a 1.3 GB log file. So to overcome this, we will have to set up log rotation ourselves by placing an /etc/logrotate.d/vault file with the following content:

/var/log/vault/audit.log {
  rotate 7
  daily
  notifempty
  missingok
  compress
  delaycompress
  postrotate
    /usr/bin/systemctl reload vault 2> /dev/null || true
  endscript
  extension log
  dateext
  dateformat %Y-%m-%d.
}

Which would lead to a far better overview:

$ ls -lh /var/log/vault/
total 348K
-rw------- 1 vault vault 2.3K Apr 9 00:00 audit.2024-04-09.log.gz
-rw------- 1 vault vault 1.7K Apr 10 00:00 audit.2024-04-10.log.gz
-rw------- 1 vault vault 267K Apr 11 00:00 audit.2024-04-11.log.gz
-rw------- 1 vault vault 1.5K Apr 12 00:00 audit.2024-04-12.log.gz
-rw------- 1 vault vault 3.8K Apr 12 08:15 audit.2024-04-13.log.gz
-rw------- 1 vault vault 2.3K Apr 13 03:38 audit.2024-04-14.log.gz
-rw------- 1 vault vault 37K Apr 14 16:01 audit.2024-04-15.log
-rw------- 1 vault vault 15K Apr 15 00:00 audit.log

Backup

Are you running the community edition? Then you have your work cut out for you: you will have to create and maintain backups yourself. When you are running the enterprise edition, you can leverage the snapshot agent.

Below you will find an example of a backup script that keeps the last 7 days of snapshots. Note that the exact snapshot subcommand differs per product: consul snapshot save, nomad operator snapshot save, and vault operator raft snapshot save.

#!/bin/bash
# Assumes correct environment variables for
# authentication have been set and that
# PRODUCT is set (e.g. "consul"); adjust the
# subcommand for Nomad and Vault as noted above.

ts=$(date "+%Y%m%d%H%M%S")

find /srv/backups -name "*.snap" -type f -mtime +7 -exec rm {} \;
/usr/bin/${PRODUCT} snapshot save /srv/backups/${PRODUCT}_${ts}.snap
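
To keep those 7 days rolling, schedule the script daily, for instance through an /etc/cron.d entry (the script path is an assumption):

0 2 * * * root /usr/local/bin/hashistack-backup.sh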

Remember that for each product you will have to ensure that you have the right ACL in place and a corresponding token to be able to back up your data.

When you are using Consul as the storage backend for Vault, you cannot use Vault snapshots; instead, you would create a backup script that exports the vault/ prefix of the Consul KV store:

#!/bin/bash
# Assumes correct environment variables
# for authentication have been set

ts=$(date "+%Y%m%d%H%M%S")

find /srv/backups -name "*.json" -type f -mtime +7 -exec rm {} \;
/usr/bin/consul kv export vault/ > /srv/backups/vault_kv_${ts}.json

An example of using the enterprise snapshot agent, which will automatically take periodic snapshots for you, can be found below:

{
  "snapshot_agent": {
    "http_addr": "127.0.0.1:8501",
    "token": "<consul-token-here>",
    "datacenter": "",
    "license_path": "/etc/consul.d/consul.hclic",
    "snapshot": {
      "interval": "1h",
      "retain": 30,
      "stale": false,
      "service": "consul-snapshot",
      "lock_key": "consul-snapshot/lock",
      "max_failures": 3
    },
    "local_storage": {
      "path": "/srv/consul/snapshots"
    }
  }
}

Tokens (operations)

Are you using Vault? Did you know that Vault has both Consul and Nomad secrets engines? This allows you to authenticate to Vault and then retrieve limited-TTL access tokens for accessing and managing Consul and Nomad. How cool is that?! Remember that the Consul and Nomad policies you refer to from Vault have to be present within those products.

$ vault read consul/creds/operator
Key                Value
---                -----
lease_id           consul/creds/operator/EfssiLzg2Qt9zAbFrx5xBuOk
lease_duration     4h
lease_renewable    true
accessor           86feeb52-368e-691f-ee0d-598d49aca2cb
token              2117f9fb-414f-afee-a7d9-381f143b3f70

$ vault read nomad/creds/operator
Key                Value
---                -----
lease_id           nomad/creds/operator/wo7dGKyPwhH2236UDuGWEd3j
lease_duration     4h
lease_renewable    true
accessor_id        efe76155-4477-5bc3-5220-634750f30567
secret_id          79522702-5056-986e-d5d6-360c7b9f6b02

In this case, the policy operator should be present in Consul and Nomad.
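
For reference, wiring those secrets engines up might look something like this (the management tokens are placeholders, and the role parameter names differ slightly between Vault versions, so check them against your version):

$ vault secrets enable consul
$ vault write consul/config/access \
    address="127.0.0.1:8501" scheme="https" \
    token="<consul-management-token>"
$ vault write consul/roles/operator consul_policies="operator"

$ vault secrets enable nomad
$ vault write nomad/config/access \
    address="https://127.0.0.1:4646" \
    token="<nomad-management-token>"
$ vault write nomad/role/operator policies="operator"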

Observability and monitoring

Basically: enable metrics, they will save you someday.

Metrics

It’s as easy as adding a telemetry block to your configuration file with just a few lines (assuming we use Prometheus):

telemetry {
  disable_hostname          = true
  prometheus_retention_time = "12h"
}
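
A quick way to verify that metrics are actually exposed before wiring up Prometheus (tokens and TLS options depend on your setup):

$ curl -s -H "X-Consul-Token: <token>" "https://127.0.0.1:8501/v1/agent/metrics?format=prometheus"
$ curl -s -H "X-Vault-Token: <token>" "https://127.0.0.1:8200/v1/sys/metrics?format=prometheus"
$ curl -s "https://127.0.0.1:4646/v1/metrics?format=prometheus"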

Then on your Prometheus server, add a scrape job leveraging Consul service discovery:

- job_name: 'hashicorp_vault'
  consul_sd_configs:
    - server: 'consul.inthepicture.photo:8500'
      services: ['vault']
  relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)active(.*)'
      action: keep
  metrics_path: /v1/sys/metrics
  params:
    format: ['prometheus']
  scheme: https
  authorization:
    credentials_file: /etc/prometheus/vault-token

To maintain a current token for accessing the right API endpoint, Vault Agent could be leveraged:

exit_after_auth = false
pid_file        = "./pidfile"

vault {
  address = "https://active.vault.inthepicture.photo:8200"
}

auto_auth {
  method "approle" {
    mount_path = "auth/approle"
    config = {
      role_id_file_path                   = "/etc/vault.d/roleid"
      secret_id_file_path                 = "/etc/vault.d/secretid"
      remove_secret_id_file_after_reading = true
    }
  }

  sink "file" {
    config = {
      path = "/etc/prometheus/vault-token"
      mode = 0644
    }
  }
}

api_proxy {
  use_auto_auth_token = true
}

listener "tcp" {
  address     = "10.11.12.1:8007"
  tls_disable = true
}
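
Running the agent is then a matter of pointing it at this configuration (the file path is an assumption; in practice you would wrap this in a systemd unit):

$ vault agent -config=/etc/vault.d/vault-agent.hcl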

Monitoring

Besides metrics, be sure to monitor and alert on the basics with a monitoring tool like Zabbix, Nagios, or CheckMK, which all have free editions (a quick check for the Vault sealed state is sketched after the list):

  • Relevant ports listening
  • Vault sealed state
  • High CPU
  • High RAM
  • Disk usage
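
For the Vault sealed state, for instance, the unauthenticated health endpoint can be polled (the jq dependency and TLS options are assumptions):

$ curl -s "https://127.0.0.1:8200/v1/sys/health" | jq .sealed
false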

Conclusion

In a perfect world, we would install all three products and "enable" the neighboring product through a simple consul = enabled and the like. But the reality is that integrating these products takes some extra work, work that takes time to research. Once you have gathered all of the information, though, it can all be automated, and there lies the key: in automation.

Example code can be found on my GitHub account for multiple products. An all-in-one example is my AT-Hashistack repository.
