HashiCorp Nomad is an easy-to-use and flexible workload orchestrator that enables organizations to automate the deployment of any application on any infrastructure at any scale across multiple clouds. While Kubernetes gets a lot of attention, Nomad is an attractive alternative that is easy to use, more flexible, and natively integrated with HashiCorp Vault and Consul. In addition to running Docker containers, Nomad can also run non-containerized, legacy applications on both Linux and Windows servers.
Like all of HashiCorp’s solutions, Nomad has both open source and enterprise versions. Nomad Enterprise adds features that enable multiple teams within large organizations to run their applications on shared Nomad clusters without interfering with each other.
This blog post describes four of these features: Access Control Lists (ACLs), Namespaces, Resource Quotas, and Sentinel Policies. It then describes the Nomad Multi-Job Demo, which illustrates how these features combine to allow multiple teams to share one or more Nomad clusters. You can even try out the demo yourself after cloning HashiCorp’s nomad-guides GitHub repository.
On March 27, 2019, I delivered the Nomad Multi-Job demo in a webinar. You’ll find a link to the recording at the bottom of this blog post.
Each Nomad cluster consists of Nomad servers, which manage the cluster, and Nomad clients, onto which workloads are deployed.
A typical Nomad cluster runs 3 or 5 Nomad servers, but can schedule workloads to hundreds or even thousands of Nomad clients. Nomad typically uses a Consul cluster for automatic clustering, service discovery, health checking, and dynamic configuration.
When a user submits a job to a Nomad cluster, the Nomad servers decide whether the job is allowed and where to place the tasks of the job. In making the first decision, Nomad considers ACL and Sentinel policies, the namespace specified by the job, and the resource quota of that namespace. In making the second decision, Nomad factors in the number of tasks already running on the clients in the cluster, the resources those tasks are consuming, and various attributes of the job specification that might make certain clients more desirable for the job’s tasks.
Key Nomad Features
Nomad’s Access Control List (ACL) System restricts access to Nomad’s API, CLI, and UI with ACL tokens. These are assigned policies with capabilities that determine which actions the bearer of a token can perform. For instance, an ACL token might allow a developer to run jobs in the dev namespace, view the resource quota assigned to that namespace, and view all the nodes in one or more clusters. ACL tokens can be marked as global or limited to specific Nomad clusters.
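As an illustrative sketch, an ACL policy granting a developer those capabilities might look like the following in Nomad's HCL policy format (the policy name and exact capability levels here are assumptions, not the demo's actual policy):

```hcl
# Hypothetical ACL policy for members of the dev team.
# Grants job management in the "dev" namespace, read-only
# access to quotas, and read-only access to node information.
namespace "dev" {
  policy = "write"
}

quota {
  policy = "read"
}

node {
  policy = "read"
}
```

A policy like this would be registered by an administrator and then attached to the ACL tokens issued to dev team members.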
Namespaces allow multiple teams to co-exist without conflict on one or more Nomad clusters by segmenting jobs and associated objects. Within a single namespace, each job name must be unique, but two teams using different namespaces can run jobs with the same name in those namespaces. For example, the qa team could run a job that allocated more memory and CPU than the dev team allocated for their version of the same job. ACLs enforce the isolation of namespaces, ensuring that users can only read, run, and modify jobs in namespaces they belong to.
Resource Quotas ensure that teams using different namespaces cannot adversely affect each other by consuming excessive CPU and memory. Quotas can be applied to namespaces in specific clusters or globally. Typically, only Nomad administrators would have the ACLs needed to create, edit, and delete resource quotas, so ordinary users cannot circumvent them.
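A quota specification is itself a small HCL file applied with the Nomad CLI. As a sketch, a quota matching the limits used later in this post's demo might look like this (the description is an assumption):

```hcl
# Sketch of a quota specification like the demo's "dev" quota.
name        = "dev"
description = "Resource quota for the dev team"

# Limits can target a specific region; this one covers "global".
limit {
  region = "global"
  region_limit {
    cpu    = 4600  # MHz
    memory = 4100  # MB
  }
}
```

After an administrator applies this quota, it is attached to a namespace, and all jobs in that namespace draw against its limits.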
HashiCorp’s Sentinel is a language and framework that implements Policy as Code with fine-grained, logic-based policy decisions just as HashiCorp Terraform implements Infrastructure as Code. Sentinel policies can have the following enforcement modes: advisory, soft-mandatory, and hard-mandatory. Authorized users can override soft-mandatory policies, but no one can override hard-mandatory policies.
Nomad Sentinel Policies allow Nomad administrators to enforce policy restrictions on all jobs submitted to any cluster. These policies can limit any job attributes. For example, policies might only allow the Docker and Java drivers, restrict permitted Docker images, or require that Docker containers cannot use the host network of the Nomad clients they run on. In fact, the Nomad Multi-Job demo implements all three of these restrictions in Sentinel policies.
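To make this concrete, here is a hedged sketch of what the first restriction (allowing only the Docker and Java drivers) could look like as a Sentinel policy; the demo's actual policy in the nomad-guides repository may differ in detail:

```sentinel
# Sketch: allow only the docker and java task drivers.
main = rule { all_drivers_allowed }

all_drivers_allowed = rule {
  all job.task_groups as tg {
    all tg.tasks as task {
      task.driver in ["docker", "java"]
    }
  }
}
```

The policy iterates over every task group and task in the submitted job and fails the job if any task uses a driver outside the allowed list.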
Global ACL tokens, ACL policies, namespaces, resource quotas, and Sentinel policies are automatically replicated to all federated Nomad clusters.
The Nomad Multi-Job Demo
The README.md file of the Nomad Multi-Job Demo describes the demo in great depth, even giving step-by-step instructions so that readers can run the demo themselves. In this blog post, I focus more on what happens when various commands are used to run the demo’s jobs.
Here are some key facts about the demo:
- It is provisioned with Terraform, which outputs all the information we need to run the demo, including an SSH command for connecting to the Nomad server, the URL of the Nomad UI, and some ACL tokens.
- It runs a single Nomad server and three Nomad clients in AWS.
- There are two teams, dev and qa, each of which has its own namespace and associated resource quota (with the same names as the teams).
- The dev and qa resource quotas are each set to 4,600 MHz of CPU and 4,100 MB of memory.
- Alice is a developer on the dev team with her own ACL token.
- Bob is an engineer on the qa team with his own ACL token.
- The Nomad administrator has a bootstrap ACL token that can be used to run jobs in the default namespace and view all jobs in the Nomad UI.
There are three Sentinel policies deployed:
- The allow-docker-or-java-driver policy only allows the Docker and Java drivers to be used. It is a hard-mandatory policy.
- The prevent-docker-host-network policy prevents Docker containers from running on the host network. It is a soft-mandatory policy.
- The restrict-docker-images policy only allows the nginx and mongo Docker images to be run. It is also a soft-mandatory policy.
There are five Nomad job specification files:
- sleep.nomad uses the exec driver to invoke the Linux sleep command. It does not specify a namespace, so it will run in the default namespace (if allowed).
- catalogue.nomad runs a Go application and a MySQL database in two Docker containers, but tries to use the Docker host network. It does not specify a namespace, so it will run in the default namespace (if allowed).
- webserver-test.nomad is configured to run two instances of the Apache HTTP web server (httpd). It is configured to run in the qa namespace, so it can only be run by members of the qa team or by administrators.
- website-dev.nomad is configured to run two instances of nginx and two instances of mongo, each with 500 MHz of CPU and 512 MB of memory. It is configured to run in the dev namespace with the job name “website”.
- website-qa.nomad is also configured to run two instances of nginx and two instances of mongo, but it tries to allocate 500 MHz of CPU and 1,024 MB of memory to each instance. It is configured to run in the qa namespace with the job name “website”.
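To make the job files concrete, here is an abbreviated sketch of what website-dev.nomad might contain; the real file lives in the nomad-guides repository, and the datacenter name and image tags here are assumptions:

```hcl
# Abbreviated sketch of a job like website-dev.nomad.
job "website" {
  datacenters = ["dc1"]   # assumed datacenter name
  namespace   = "dev"     # run in the dev namespace

  group "nginx" {
    count = 2
    task "nginx" {
      driver = "docker"
      config {
        image = "nginx:1.15.6"
      }
      resources {
        cpu    = 500  # MHz
        memory = 512  # MB
      }
    }
  }

  group "mongo" {
    count = 2
    task "mongo" {
      driver = "docker"
      config {
        image = "mongo"  # image tag assumed
      }
      resources {
        cpu    = 500
        memory = 512
      }
    }
  }
}
```

The website-qa.nomad file would be nearly identical except for its namespace and the larger memory value in each resources stanza.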
We deliver the demo with the following flow after having already provisioned the Nomad cluster with Terraform:
1. We first set up two SSH sessions connected to the Nomad server, exporting the bootstrap ACL token in the first and Bob’s ACL token in the second.
2. We use the session with the bootstrap token to list and show the namespaces, resource quotas, and Sentinel policies.
3. With the bootstrap token, we try to run the sleep.nomad job. The allow-docker-or-java-driver policy prevents this because the job uses the exec driver (which can invoke arbitrary OS-level commands and scripts).
4. Still using the bootstrap token, we try to run the catalogue.nomad job. This is blocked by the prevent-docker-host-network policy because the job tries to run Docker containers on the host network.
5. However, we run the job again with the command `nomad job run -policy-override catalogue.nomad` to override the soft-mandatory policy violation. This time, the job runs successfully in the default namespace.
6. In the first session, we export Alice’s ACL token and then have her try to run the webserver-test.nomad job. She is prevented because her dev ACL token does not allow her to run a job in the qa namespace.
7. In the second session, we have Bob try to run the same job. It is blocked by the restrict-docker-images policy because the job tries to run the httpd Docker image.
8. Bob then changes the job file to use the nginx:1.15.6 image instead, and he is able to successfully run the job in the qa namespace. It uses 1,000 MHz of CPU and 1,024 MB of memory.
9. In Alice’s session, we begin to explore resource quotas by having her run the website-dev.nomad job. Since this job requests a total of 2,000 MHz of CPU and 2,048 MB of memory and the dev resource quota supports twice that, she is able to run the job in the dev namespace. Since the actual name of the job at the top of website-dev.nomad is “website”, that is what we see in the Nomad UI.
10. We continue our exploration of resource quotas by having Bob run the website-qa.nomad job in his session. The actual name of the job at the top of website-qa.nomad is “website”, which is the same as what website-dev.nomad has. This is allowed because the two jobs are run in different namespaces.
Since the website-qa.nomad job requests 4,096 MB of memory and the webserver-test job Bob ran earlier in the qa namespace is already using 1,024 MB of memory, running it will exceed the memory limit of the qa resource quota. Here is what the `nomad quota status qa` command shows us at this point:
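Illustratively, based on the usage figures above, the output resembles the following (the format is approximate and the description is an assumption):

```
Name        = qa
Description = Resource quota for the qa team
Limits      = 1

Quota Limits
Region  CPU Usage    Memory Usage
global  1000 / 4600  1024 / 4100
```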
Bob runs the website-qa.nomad job anyway. Nomad successfully starts three of the tasks but queues the fourth.
11. Realizing that he could not run all four tasks of the website-qa.nomad job and the webserver-test.nomad job at the same time, Bob stops the webserver-test.nomad job to free up some memory.
At this point, Nomad automatically starts the fourth task that had been queued.
12. Bob is now happy because his website job is fully running.
13. Alice is happy too because Nomad’s namespaces and resource quotas kept Bob and the rest of the qa team from hogging the resources of the Nomad cluster, allowing the dev team to continue its work.
14. The security team is very happy because Sentinel prevented Alice and Bob from running drivers or Docker images that would have violated their policies and might have caused security breaches.
In this blog post, I have discussed key Nomad Enterprise features that allow multiple teams to safely share Nomad clusters without adversely affecting each other and that also enforce policies against the jobs those teams try to run on the clusters. I also walked you through the Nomad Multi-Job Demo to illustrate these features in action.
Here is the recording of the webinar I delivered on March 27, 2019. I shared some slides about Nomad and then delivered the Nomad Multi-Job demo.