Platform Engineering on the HashiCorp Ecosystem: Nomad

Andrew Cornies
HashiCorp Solutions Engineering Blog
16 min read · Apr 18, 2023

UPDATE Jul 13, 2023: The consumption modes referenced below (enforced by HashiCorp Sentinel) should be considered deprecated in favour of Nomad Node Pool Governance announced here.

This series of blog posts covers features available in HashiCorp commercial offerings. The goal of this series is to provide a practical guide on how to facilitate a multi-tenant developer PaaS using the HashiCorp ecosystem and why that matters to your overall business outcomes.

In this post, I will cover a few topics:

  • How a platform team topology is critical to achieving high velocity delivery with guardrails
  • Why you should consider HashiCorp’s Nomad Enterprise from a platform engineering perspective
  • How multi-tenancy works with HashiCorp Nomad, how to onboard, and why it matters in a platform team context
  • Validated consumption modes in Nomad from a shared service context: modes that facilitate low cognitive load on users to optimize the developer and operator experience
  • How to enforce the consumption modes using policy as code

Platform Team Criticality

In the book Team Topologies by Matthew Skelton and Manuel Pais, the platform team is described as a horizontal, mission-critical team and a path for stream-aligned teams (or product teams) to speed up the flow of change.

If you’ve confirmed that your engineering department is suffering from slow velocity, it may be because your teams aren’t structured in a way that is optimized for fast flow. To put it in my own words, your team structure, as well as the interaction modes between those teams, matters to achieving the following outcomes:

  • Reduced time-to-market — Product teams (the teams that build the business logic) rely on the platform team to help them ship code with very little toil as well as maintain autonomy.
  • Reduced risk — The platform team is concerned with not just software delivery velocity, but also security and governance. Product teams use the platform team’s XaaS offerings and established consumption modes to ship software with guardrails, reducing risk of breach and ultimately data exfiltration.
  • Reduced cost — Product teams who roll their own infrastructure tend to re-invent the wheel and duplicate components, adding costs. Time is taken away from shipping business code largely due to unplanned work and managing their own infrastructure. The platform team removes unknowns from the product team, saving countless hours of FTE toil.

In order to achieve these outcomes, there’s a whole lot of organizational and technological complexity that platform teams must streamline for their internal customers. HashiCorp’s cloud operating model is now centered around platform team criticality. If you’re serious about reaching those business outcomes, it is worth reading the entire paper.

Why Enterprise or HCP?

So, why would you consider powering your software delivery platform with Nomad, Vault, and Consul? If you’re already sold on the OSS tech, why would you pay for HashiCorp enterprise software or HCP? Apart from some more advanced use cases, HashiCorp’s OSS versions come with tech you can utilize for most projects. I spent years adopting HashiCorp workflows and going too far with OSS until I realized the limitations of offering this tech in a shared service capacity.

In my experience as a platform engineering team lead, a few factors went into my decision to purchase HashiCorp Vault, Consul, Nomad, and Terraform Cloud:

  1. At some point, things will break, and my team and I will need support regardless of how much we know about the OSS versions. The platform we’re building will support at least 3 business units, with around 300 engineers, as well as around 150 multi-tier applications with stateful workloads that serve millions of users per day. There’s a huge risk to the business in not having a paid support channel for critical platform components. Or, in terms of HCP, managed services remove a lot of responsibility from the team so you can focus on consumption.
  2. As a team in charge of a shared service platform, multi-tenancy needs to be a core part of the design. When I say multi-tenancy, I’m talking about logical segmentation and isolation of workloads with granular ACLs to control access to a tenant or workload. A tenant can be a team, project, project environment, etc. With this approach, we’re taking security seriously and mitigating the risk of an attacker gaining access to other tenants while still utilizing a shared cluster. This is something that HashiCorp unlocks with the enterprise feature set and is part of the foundation for zero trust security architecture.
  3. If I or a colleague leave the company, I don’t want their understanding of the DIY solution to leave with them. A DIY solution is usually poorly documented once and never updated, leaving the next person to pick up the pieces and spend countless hours figuring out how things work. In the meantime, the business loses delivery velocity, and has increased risks and costs. Partnering with HashiCorp is like an insurance policy to protect against people churn and lost productivity. I prefer to leave behind a solution that outlives my tenure.

In my mind, a large part of being an effective engineer is knowing when not to DIY and to value your time by solving a problem with a paid solution. Chances are it will actually be cheaper (and better) in terms of total cost of ownership. In all the complexity, don’t forget that platform teams are a means to an end — for developers to ship code for the business. The faster you can get to this step, the happier everyone will be.

HashiCorp Nomad

Nomad — Beginning with the end in mind

Let’s dive into how HashiCorp’s tech itself powers a platform, starting with Nomad. Why would I start with Nomad? Because Nomad (and Waypoint) are the most relevant tools that developers will utilize in the HashiCorp ecosystem. Remember, one of our main business outcomes is less time to market, so let’s begin with the end in mind and address the developer experience first.

There are already these great posts: Introduction to Nomad, Nomad vs Kubernetes, and a K8s to Nomad conversion post, so instead I will focus on how Nomad plays a critical role in powering multi-tenancy scheduling and how this is especially useful for a platform team looking to offer Nomad for multiple teams, all while keeping things as simple as possible.

Getting started in Nomad has never been easier thanks to features like native service discovery and secure variables. It is now possible to go quite far using just Nomad, without requiring Consul or Vault. This is great for developers who are just getting started and for simple architectures. However, if you find yourself serving multiple BUs and teams, advanced use cases for service mesh, service discovery, and secrets management become relevant very quickly.
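To make that concrete, here is a minimal job sketch using both features together. The job name, image, and variable path are illustrative, not taken from a real deployment:

```hcl
job "hello" {
  datacenters = ["dc1"]

  group "web" {
    network {
      port "http" {
        to = 8080
      }
    }

    # Native service discovery: the service is registered in Nomad
    # itself, no Consul required
    service {
      name     = "hello-web"
      port     = "http"
      provider = "nomad"
    }

    task "app" {
      driver = "docker"
      config {
        image = "example/hello:latest" # placeholder image
        ports = ["http"]
      }

      # Nomad variables ("secure variables") rendered into the task
      # environment, no Vault required
      template {
        data        = <<EOH
{{ with nomadVar "nomad/jobs/hello" }}API_KEY={{ .api_key }}{{ end }}
EOH
        destination = "secrets/app.env"
        env         = true
      }
    }
  }
}
```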

Instead of talking about day 0–1 topics of a Nomad cluster, let’s look at what is important to platform folks in enabling consumption of a Nomad cluster, assuming it is already fully operational.

Onboarding Namespaces & Policies

Namespaces exist in HashiCorp Nomad, Consul, and Vault and work quite well together. In this post, let’s focus purely on Nomad multi-tenancy.

Remember, we are platform engineers trying to create a centralized Nomad for all to use. Namespaces provide the means to logically isolate a group of jobs with ACLs. We don’t want everyone using the same namespace or even shared ACL tokens, as this would drastically increase the risk to the business. Instead, we need to isolate a team’s project and environment with a namespace, but also provide a great onboarding experience. A common question I hear in the field is “Is there a namespace limitation?” The short answer is no. Namespaces add little to no overhead on the Nomad servers and are there to be used. Consider the following example:

nomad namespace apply -description "Be descriptive!" qa-acme-darkhorse

The convention we’re using is {env}-{bu}-{project/team}. Since names matter, we’re choosing a convention that is descriptive, but also makes sense in a multi-tenant context. We’ve purposely left out things like datacenter or region in the namespace name since those data points are defined in the job itself. Organizationally, we want to provide separate access for the {env}-{bu}-{project/team} convention for increased security posture. Secure access can now be done using ACL roles w/ SSO and OIDC provider integration which are ultimately tied to a namespace and policy.

Nomad ACL fundamentals

What about the dev experience? If their primary function is to write business code, what’s the quickest path to that? How do we automate this in a workflow similar to developing code, one that doesn’t involve opening a ticket? At the end of the day, the developer simply needs access to their desired namespace to launch jobs. As a platform team, we need to get to this point as quickly as possible through automation.

In a previous life, I wrote a custom CLI in Go to do the namespace creation, but this approach was inherently flawed and ultimately increased risk to the business because it was yet another DIY project with no LTS plan. Writing code felt good, but we eventually decided to abandon the project and address the Nomad onboarding challenge using infrastructure as code with Terraform and GitOps.

Now Terraform isn’t the answer to everything (or is it?), but we identified that the solution to getting a developer appropriate access quickly was the following:

  1. A developer opens a pull request to a Terraform repo. The change they make consists of adding a name to a Terraform input map variable. This Terraform map variable is a collection of namespaces with a set of initial config.
  2. That pull request is viewed by code owners or an appropriate team which provides the necessary checks and balances to satisfy a healthy change management process.
  3. Once approved, the pull request is merged and Terraform Cloud picks up the change and executes the apply. The developer can then use nomad login to validate their identity and retrieve their Nomad token and start submitting jobs to their project namespace.
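From the developer’s terminal, the tail end of that flow looks something like the following. The file name is a placeholder, and the auth method is assumed to be configured as the cluster default:

```shell
# Authenticate via the cluster's default SSO auth method and cache a token
nomad login

# Submit jobs to the tenant namespace created by the merged pull request
nomad job run -namespace=qa-acme-darkhorse ./darkhorse.nomad.hcl
```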

Instead of waiting for access for an indefinite period of time, devs use a similar Git workflow as they do with their application lifecycle. They merely populate a bit of HCL and open a pull request with their desired name for their Nomad project. Let’s look at some Terraform code:

An example of a variables.tf

variable "tenants" {
  type = map(any)
  default = {
    dev-webteam-frontend = {
      description = "The dev env for the web team frontend services"
      team        = "webteam"
    }
    dev-apiteam-backend = {
      description = "The dev env for the API team backend services"
      team        = "apiteam"
    }
    qa-acme-darkhorse = {
      description = "The qa env for the acme team darkhorse project"
      team        = "data"
    }
  }
}

An example of a main.tf

resource "nomad_namespace" "tenants" {
  lifecycle {
    prevent_destroy = true
  }
  for_each    = var.tenants
  name        = each.key
  description = each.value.description
}

data "template_file" "default_policy" {
  for_each = var.tenants
  template = file("${path.module}/files/default-policy.hcl")
  vars = {
    namespace = each.key
  }
}

# Nomad policy names must be unique, so suffix the default and custom variants
resource "nomad_acl_policy" "default_policy" {
  for_each    = var.tenants
  name        = "${each.key}-default"
  description = "Default policy for ${each.key} namespace."
  rules_hcl   = data.template_file.default_policy[each.key].rendered
}

# Only render a custom policy for tenants that ship a files/<name>-policy.hcl
data "template_file" "custom_policy" {
  for_each = {
    for k, v in var.tenants : k => v
    if fileexists("${path.module}/files/${k}-policy.hcl")
  }
  template = file("${path.module}/files/${each.key}-policy.hcl")
  vars = {
    namespace = each.key
  }
}

resource "nomad_acl_policy" "custom_policy" {
  for_each    = data.template_file.custom_policy
  name        = "${each.key}-custom"
  description = "Custom policy for ${each.key} namespace."
  rules_hcl   = each.value.rendered
}

The default template file in ./files/default-policy.hcl:

namespace "${namespace}" {
  policy       = "read"
  capabilities = ["submit-job"]
}

The custom policy template file in ./files/qa-acme-darkhorse-policy.hcl:

namespace "${namespace}" {
  policy       = "write"
  capabilities = ["csi-register-plugin"]
}

node {
  policy = "read"
}

Since ACL roles are ultimately assigned to policies, we want to maximize flexibility and give each tenant the option to have their own set of policies. When it comes to mapping an ACL role to an OIDC auth method in Nomad, our binding rules need to follow the same flexible pattern since we’re facilitating multi-tenancy.
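As a rough sketch of what that wiring can look like with the Terraform Nomad provider — every endpoint URL, client ID, group name, and claim mapping below is a hypothetical placeholder:

```hcl
resource "nomad_acl_auth_method" "oidc" {
  name           = "company-sso"
  type           = "OIDC"
  token_locality = "global"
  max_token_ttl  = "1h"
  default        = true

  config {
    oidc_discovery_url    = "https://sso.example.com"
    oidc_client_id        = "nomad"
    oidc_client_secret    = var.oidc_client_secret
    bound_audiences       = ["nomad"]
    allowed_redirect_uris = ["http://localhost:4649/oidc/callback"]
    list_claim_mappings = {
      groups = "roles"
    }
  }
}

# One role per tenant, pointing at that tenant's policy
resource "nomad_acl_role" "darkhorse" {
  name        = "qa-acme-darkhorse"
  description = "Grants the darkhorse team access to its namespace"

  policy {
    name = "qa-acme-darkhorse" # an ACL policy already registered for this tenant
  }
}

# Bind an IdP group claim to the Nomad ACL role
resource "nomad_acl_binding_rule" "darkhorse" {
  auth_method = nomad_acl_auth_method.oidc.name
  selector    = "\"acme-darkhorse\" in list.roles"
  bind_type   = "role"
  bind_name   = nomad_acl_role.darkhorse.name
}
```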

Alternatively, Nomad ACL tokens can be issued through Vault with a TTL through the tenant’s respective Vault namespace using the Nomad secrets engine. More on that in the next post.

To expand on the Terraform onboarding approach, it can create everything from namespaces to resource quotas on a namespace. When operating a platform for engineering teams, resource quotas become critical to restrict aggregate resource usage and set limits. Luckily, the Terraform Nomad provider also supports managing quotas.
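For illustration, a quota per tenant can be stamped out with the same for_each pattern; the CPU and memory limits below are placeholders, not recommendations:

```hcl
resource "nomad_quota_specification" "tenants" {
  for_each    = var.tenants
  name        = each.key
  description = "Aggregate resource limits for the ${each.key} namespace"

  limits {
    region = "global"
    region_limit {
      cpu       = 4000 # MHz, illustrative
      memory_mb = 8192 # illustrative
    }
  }
}
```

The quota can then be attached to the corresponding namespace via the nomad_namespace resource’s quota argument.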

Another tool in our Nomad toolbox is Dynamic Application Sizing (DAS) which can be used to collect and report actual resource usage for running jobs and make recommendations. Together with DAS and resource quotas, platform teams running Nomad are better equipped to manage multi-tenancy at scale and the scheduling challenges that come with that scale.

Now that we’ve talked about the onboarding problem, how do developers engage with a shared Nomad cluster? What are some validated consumption modes that other Nomad customers are using? The following modes are influenced by a HashiCorp customer in the domains and telecommunications industry.

Nomad Consumption Modes

In an effort to address some Nomad Day 2 challenges, let’s break down a couple of Nomad consumption modes:

Shared Mode

Nomad “Shared” consumption mode

In this mode, a platform team manages the Nomad servers and a set of shared Nomad clients. The goal for this mode is to enable a developer or team to manage a job or set of jobs in a namespace, with no infrastructure knowledge required. This mode is useful for teams that have limited or no DevOps people dedicated to a project. It is important to point out that in this mode, Nomad feels like a managed service to the developer, reducing cognitive load and providing a quicker path to deploy their app or task. I have found this mode is suitable for the following use cases:

  • Any stateless API or frontend web app
  • Webhook handlers (GitHub webhooks, Terraform Cloud Run Tasks payloads)
  • Data engineering tasks, periodic batch, cron-style (with external blob storage)
  • Terraform Cloud Agents

Pros:

  • Manage only a job HCL, no infrastructure knowledge required
  • Ingress managed by the platform team, i.e. routing via Consul service tags in the job file
  • Teams get their own namespace, providing logical isolation even though the Nomad clients are shared

Trade-Offs:

  • Product teams do not have control over the shared Nomad clients
  • Less service isolation on the Nomad client
  • Jobs that require mounted persistent data are not recommended in this mode due to the previous point (external blob storage w/ auth is good)
  • A common API gateway is used to manage ingress, usually adding to the platform team’s list of things to get working for developers

We can further enhance this mode and bring more service isolation when we start using Consul service mesh (connect) Nomad integration, but we’ll cover this in a subsequent post.

BYONC Mode (bring your own Nomad client)

Nomad “Bring your own Nomad client” consumption mode

This mode (pronounced: bionic) allows cross-functional product teams to attach their own Nomad clients to a cluster of Nomad servers in a desired region. This mode is useful for teams that have dedicated DevOps folks who want autonomy and flexibility over their worker nodes without the added complexity of running Nomad servers. As long as there is routing and appropriate networking ports open, these Nomad clients can reside in any VPC while connected to a common set of Nomad servers in the same geographic region. I found this mode useful for the following use cases:

  • Projects that require a higher degree of service isolation or segmentation from the shared pool
  • A project or team that requires their own service catalog (i.e. using Consul namespaces or admin partitions)
  • Projects that require more control over ingress and egress
  • Jobs that need mounted persistent storage (CSI or host volumes)
  • HPC batch processing that requires more compute

Pros:

  • Users have complete management of Nomad clients connected to the Nomad server cluster (operator autonomy). This means you can also have your own set of node classes with your own drivers
  • No operational overhead of managing the Nomad servers (stays with the platform team, autonomy with less complexity)
  • Operate jobs with persistence (i.e. databases, message queues) due to more control over storage attachments, mounted volumes
  • Manage CSI storage plugins per namespace, allowing plugins to use different cloud credentials
  • Separate services and KV catalog when integrating with Consul enterprise

Trade-Offs:

  • Ingress solution or API gateway is self-managed
  • Requires systems knowledge and/or Terraform to manage the lifecycle and configuration of the VM/Nomad client
  • Requires more CSP (cloud service provider) knowledge in general, or Terraform

In BYONC mode, we tackle the onboarding challenge using a Terraform module to create and attach Nomad clients to existing Nomad servers. An example Nomad client module to facilitate BYONC (on AWS) can be found here.
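The shape of such a module call might look like the following. The module source and every input name here are hypothetical stand-ins, not the interface of the actual module linked above:

```hcl
module "darkhorse_nomad_clients" {
  source = "github.com/example-org/terraform-aws-nomad-client"

  client_count      = 3
  node_class        = "web-servers"
  nomad_server_addr = "nomad.shared.internal:4647"

  # Node metadata that ties these clients to the tenant's namespace
  node_meta = {
    namespace = "qa-acme-darkhorse"
  }
}
```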

Remember that as a platform team, the initial user experience for both modes is critical to the adoption of the platform, and Terraform plays a big part in powering that experience, especially with bringing your own Nomad clients (BYONC).

Now that we’ve talked about different consumption modes, how do we enforce those modes centrally? How, as a platform team, can we ensure that everyone using Nomad in our company is funneled into one of those two established modes?

Policy Enforcement and Guardrails

Having consumption modes is meaningless without some level of enforcement. Your users will inevitably do something that will introduce risk to the business if left unchecked. In order to protect the business, we need to put up some guardrails for our platform users, so let’s introduce some policy enforcement at job-submission time.

The great thing about policy as code is not only does it help enforce our platform rules but it allows for transparent, collaborative policy that is codified. If one of our users starts questioning the policy, we can refer them to our policy as code repository and encourage them to collaborate. As a platform team we want to walk the fine line between governance and giving autonomy to our end users. We ultimately want to build bridges between ourselves and the product teams we support to cultivate developer happiness. When our users see policy as code, they will begin to understand why it has been instituted and roll with it.

Let’s examine some Sentinel code:

# validate_mode function
validate_mode = func() {

  validated = false

  if job.meta contains "mode" and job.meta["mode"] is "shared" {

    for job.constraints as c {
      if c.l_target is "${meta.namespace}" and c.r_target is "default" {
        validated = true
        break
      }
    }

    if not validated {
      print("You tried to run a shared mode job in the", job.namespace, "namespace.")
      print("Each job must include a constraint with attribute set to",
            "${meta.namespace} and value set to 'default'.")
    }

  } else if job.meta contains "mode" and job.meta["mode"] is "byonc" {

    for job.constraints as c {
      if c.l_target is "${meta.namespace}" and c.r_target is job.namespace {
        validated = true
        break
      }
    }

    if not validated {
      print("You tried to run a BYONC job in the", job.namespace, "namespace.")
      print("Each job must include a constraint with attribute set to",
            "${meta.namespace} and value set to the namespace of the job.")
    }
  } else {
    print("Jobs must use either \"shared\" or \"byonc\" mode.")
  }

  return validated
}

# Call the validate_mode function
validated = validate_mode()

# Main rule
main = rule {
  validated
}

This is highly influenced by this Sentinel policy file in the nomad-guides GitHub repo. In this file we have two main conditionals: a check for shared mode and a check for BYONC mode. In those conditionals, the code iterates over any constraint {} blocks and validates the values.

The mechanism we use to enforce our modes is Nomad client metadata. This is represented by ${meta.namespace} in a job spec. In shared mode, developers must add a mode and a constraint to their Nomad jobs for Nomad to schedule jobs on the “shared” Nomad clients owned by the platform team:

job "shared-example" {
  ...

  meta {
    mode = "shared"
  }

  constraint {
    attribute = "${meta.namespace}"
    operator  = "="
    value     = "default"
  }

  ...
}

This means that shared mode Nomad clients must have a metadata namespace value set to default in this example. Remember, these Nomad clients are part of the offering from the platform team and the metadata values would have been populated either during the creation of those Nomad clients or updated using dynamic node metadata.
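With dynamic node metadata (available in recent Nomad versions), the platform team can also set or correct this value on a live client without editing its config file or restarting the agent; the node ID below is a placeholder:

```shell
# Set the metadata key the Sentinel policy constrains on
nomad node meta apply -node-id 2b0e9d7c namespace=default

# Verify the value the scheduler will see
nomad node meta read -node-id 2b0e9d7c
```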

Now, let’s look at how a product team can utilize our BYONC mode:

job "byonc-example" {
  ...

  meta {
    mode = "byonc"
  }

  constraint {
    attribute = "${meta.namespace}"
    operator  = "="
    value     = "qa-acme-darkhorse"
  }

  ...
}

Remember that in this mode, product teams are bringing their own Nomad clients to a centralized cluster in a nearby region. Regardless of how they’ve attached these Nomad clients, the important part is that these nodes need metadata that matches their respective Nomad namespace value.

Nomad client configuration example:

client {
  enabled = true
  ...
  node_class = "web-servers"
  meta = {
    "namespace" = "qa-acme-darkhorse"
  }
}

This way they can manage their own “pools” or groups of Nomad client nodes, with their own metadata and task drivers, all tied to an environment namespace with more isolation. The same namespace constraint on the job is still required; however, it can be grouped together with other constraints, such as node class.

Full example of a job using BYONC mode:

job "byonc-example" {

  region      = "global"
  type        = "service"
  datacenters = ["dc1"]
  namespace   = "qa-acme-darkhorse"

  # mandatory
  meta {
    mode = "byonc"
  }

  # mandatory
  constraint {
    attribute = "${meta.namespace}"
    operator  = "="
    value     = "qa-acme-darkhorse"
  }

  # optional
  constraint {
    attribute = "${node.class}"
    operator  = "="
    value     = "web-servers"
  }

  group "svc" {

    network {
      mode = "bridge"
      port "http" {
        to = 5678
      }
    }

    service {
      port = "http"
      check {
        type     = "tcp"
        interval = "10s"
        timeout  = "5s"
      }
    }

    task "server" {
      driver = "docker"
      config {
        args  = ["-text", "Hello BYONC mode job!"]
        image = "hashicorp/http-echo:latest"
        ports = ["http"]
      }
      resources {}
    }
  }
}

Finally, to tie this all together, we need to register the Sentinel rules on the Nomad cluster. Luckily, the Terraform Nomad provider has a nomad_sentinel_policy resource to register the policy:

resource "nomad_sentinel_policy" "consumption_modes" {
  name              = "consumption-modes"
  description       = "Only allow job submissions through shared or BYONC mode."
  policy            = file("${path.module}/files/consumption-modes.sentinel")
  scope             = "submit-job"
  enforcement_level = "hard-mandatory"
}

We want the Sentinel enforcement level to be hard-mandatory to enforce our consumption modes with no exceptions. It is worth noting that as a platform team you can register any number of Sentinel rules for the submit-job scope. Take a look at more example Sentinel rules in the nomad-guides repo.
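For completeness, the same registration can be done as a one-off with the Nomad CLI, assuming the policy file path from the Terraform example:

```shell
nomad sentinel apply \
  -description "Only allow job submissions through shared or BYONC mode." \
  -scope submit-job \
  -level hard-mandatory \
  consumption-modes ./files/consumption-modes.sentinel
```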

Summary

I would be remiss if I didn’t bring us back to our desired business outcomes for offering Nomad enterprise to our product teams:

  • To ship our products faster
  • To reduce risk and loss
  • To save money

Structuring our engineering teams to reflect the platform team topology plays a huge role in accelerating the SDLC of our apps. Developers and operators need to be able to do their jobs with as little cognitive overhead and toil as possible. The platform team topology, combined with established consumption modes in Nomad, offers a prescriptive path for teams to ship code in a fast, predictable, and safe manner. HashiCorp Nomad and its enterprise features play a critical role in powering a multi-tenant app scheduling strategy, which is arguably the most important component in our platform.

Thanks for reading.
