Cluster scheduling systems for large-scale Security Operations

There are a number of scenarios where it’s very useful to be able to run arbitrary jobs against any number of machines in a secure and automated fashion: Threat Hunting, Incident Response, coordinated eviction of entrenched adversaries, Forensics, etc.

I have recently spent some time with HashiCorp’s Nomad, a very fast and secure cluster scheduling system, to explore how such a system might be applicable to the field of Information Security.

TL;DR

Nomad can provide the building blocks for a distributed and secure job scheduler that is inherently useful for Security Operations teams that need to operate in large, distributed environments. In practice it’s easy to run Nomad jobs against tens of thousands of machines or a single box while supporting a diverse mix of Operating Systems.

Why Nomad?

Nomad is relatively simple to deploy and operate, yet flexible and secure enough to be useful in the context of Security Operations. It ships as a single binary with no external dependencies, and as such is much more lightweight than other tools out there.

A Nomad deployment usually consists of one or more machines running Nomad in server mode and a number of clients running Nomad in client mode. When TLS is configured, all communication between servers, clients and the operator is encrypted: both HTTP and RPC traffic run over mutual TLS.

In short, you can:

  • Prevent unauthorised Nomad access
  • Prevent observing or tampering with Nomad communication
  • Prevent client/server role or region misconfigurations
  • Prevent other services from masquerading as Nomad agents
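
These guarantees come from the tls block in each agent's configuration. The following is a minimal sketch, assuming certificates have already been issued by a private CA; the file paths are placeholders and not taken from a real deployment.

```hcl
# TLS portion of a Nomad agent configuration (sketch).
# All file paths are hypothetical placeholders.
tls {
  # Encrypt and authenticate both the HTTP API and the internal RPC layer.
  http = true
  rpc  = true

  ca_file   = "/etc/nomad.d/tls/nomad-ca.pem"
  cert_file = "/etc/nomad.d/tls/server.pem"
  key_file  = "/etc/nomad.d/tls/server-key.pem"

  # Require peer certificates to carry the expected role/region name
  # (e.g. server.global.nomad), which guards against role or region mix-ups.
  verify_server_hostname = true

  # Require API clients (CLI, wrappers) to present a certificate as well.
  verify_https_client = true
}
```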

Setting up a fully functioning Nomad cluster takes less than an hour, and Nomad clusters are easily federated to ensure logical separation of individual clusters while maintaining operational efficiency. In essence, you can run Nomad globally distributed :)

Besides the ability to schedule containers and virtual machines, Nomad can schedule arbitrary applications or binaries. This is the feature we will be abusing for this article.
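
Running arbitrary binaries without isolation is handled by the raw_exec driver, which is disabled by default and has to be switched on per client. A minimal sketch of the client-side setting, assuming the older client options syntax (more recent Nomad versions configure this through a plugin "raw_exec" block instead):

```hcl
# Client agent configuration (sketch): opt in to the raw_exec driver.
client {
  enabled = true

  options = {
    # raw_exec runs tasks as the Nomad agent's user with no isolation,
    # so only enable it on machines where that is acceptable.
    "driver.raw_exec.enable" = "1"
  }
}
```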

I am not going to spend much time detailing how to get started using Nomad; their documentation is much better than anything I could produce.

What’s a Job?

A job in Nomad is a specification of a piece of work that you would like Nomad to execute on your behalf, plus any constraints that you would like to associate with that job (i.e. instructing the scheduler what’s in or out of scope for job placements).

A sample job file could look like this:

job "autorun" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "system"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "my-machine"
  }

  group "autorun" {
    # Specify the number of these tasks we want.
    count = 1

    task "autorun" {
      driver = "raw_exec"

      config {
        command = "c:\\windows\\system32\\cmd.exe"
        args    = ["/c local\\autorunsc64.exe -accepteula -c"]
      }

      artifact {
        source = "https://1.2.3.4/autorunsc64.exe"

        options {
          checksum = "sha256:adf767759......"
        }
      }
    }
  }
}

In essence, this will pull the Autoruns executable from an internal HTTPS server, validate its checksum and execute it on a specific machine (“my-machine”).

Nomad can not only download files but also work with Git repositories and S3 buckets, unpack .zip or .tgz files, etc. (see their docs for more info), and the resulting job output can be handled in a number of ways.
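
To make the different artifact sources a bit more concrete, here is a hedged sketch of a few artifact blocks as they could appear inside a task. The repository URL, bucket name and archive name are made up for illustration; only the source prefixes and the unpack-by-extension behaviour are standard Nomad features.

```hcl
# Artifact blocks inside a task (sketch). All URLs and names are hypothetical.

# Clone a Git repository, e.g. a collection of detection rules.
artifact {
  source      = "git::https://git.example.internal/secops/rules.git"
  destination = "local/rules"
}

# Fetch an object from an S3 bucket.
artifact {
  source      = "s3::https://s3-eu-west-1.amazonaws.com/example-secops-bucket/tools/collector"
  destination = "local/tools"
}

# Archives such as .zip or .tgz are unpacked automatically,
# based on the file extension.
artifact {
  source      = "https://1.2.3.4/toolkit.zip"
  destination = "local/toolkit"
}
```

In practice you would add a checksum option to each download, as in the sample job above.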

The real beauty lies in the fact that Nomad can execute anything that the underlying OS can understand (PS or VB scripts, executables, binaries, shell scripts, etc) so you are not at all constrained — whatever gets the job done, right? :)

To give you a small idea of what you can do with this kind of technology, here’s a short list of jobs that I have written in the last few weeks:

  • rekall.nomad (supports memory forensics against running memory as well as taking memory dumps).
  • reg.nomad (searching & reading registry keys).
  • firewall.nomad (working with firewall rules on clients running Nomad).
  • autoruns.nomad (dumping autorun locations and their content).
  • sigcheck.nomad (run signature checks against files or folders).
  • snatch.nomad (copy artefacts from machines).
  • sandbox.nomad (search and submit files to our internal sandbox).
  • yara.nomad (running yara rules).
  • procdump.nomad (dump process memory from a single process).
  • isolate.nomad (isolate an offending machine from the network).

All in all our Nomad repo has 50+ jobs and is growing every week — hopefully I’ll be able to open source them soon!

Once a job has finished, its logs, artefacts, etc. are removed from the clients the job ran on (as soon as Nomad garbage-collects the finished allocation), which is really convenient. You won’t leave clues for your adversaries to find or clutter up your workstations :)

Operating Nomad

Nomad jobs are usually run using their command line tool (nomad), but that means distributing the CLI certificates (see their TLS documentation for more info) to all operators.

We actually built a wrapper around Nomad’s REST API and can now run Nomad through our chatbot. This also means that we can enforce additional security, such as 2FA approval and peer validation workflows, which is really nice.

The REST API also means that you can integrate Nomad with pretty much … anything. Triggering a forensics workflow directly from your SIEM, exposing jobs through webhooks, etc. is all pretty darn easy.

Adding more Security

Nomad can integrate with HashiCorp’s Vault (their secrets management solution), both to provide PKI services (for cert revocation and renewal) and to give your jobs a secure way to access secret material (API keys etc.), which is really nice.
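
As an illustration of the second use, a task can request a Vault token and render a secret into its environment. The sketch below is an assumption-heavy example rather than one of our real jobs: the policy name, secret path, key name and command are hypothetical, the secret is assumed to live in a KV version 1 mount, and the exact Vault settings depend on your Nomad version.

```hcl
# Task fragment (sketch); it belongs inside a group of a job file.
task "sandbox-submit" {
  driver = "raw_exec"

  # Ask Nomad for a Vault token limited to this (hypothetical) policy.
  vault {
    policies = ["secops-sandbox"]
  }

  # Render the secret into a file and expose it as an environment variable.
  # The path and the api_key field are placeholders (KV v1 layout assumed).
  template {
    data        = <<EOT
SANDBOX_API_KEY={{ with secret "secret/secops/sandbox" }}{{ .Data.api_key }}{{ end }}
EOT
    destination = "secrets/sandbox.env"
    env         = true
  }

  config {
    # Hypothetical submission tool, fetched via an artifact block (not shown).
    command = "local/submit-to-sandbox"
  }
}
```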

Nomad also provides a nice ACL system to give you fine-grained control.
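
A policy for an operator who may run and inspect jobs, but not administer the cluster, could look roughly like the sketch below. The choice of capabilities is an example rather than a recommendation; such a file is typically loaded with nomad acl policy apply and then attached to tokens.

```hcl
# ACL policy (sketch) for a job-running operator.
namespace "default" {
  # "read" allows listing and reading jobs; the extra capabilities
  # allow submitting jobs and reading their logs.
  policy       = "read"
  capabilities = ["submit-job", "read-logs"]
}

# Allow looking up client nodes, but not draining or modifying them.
node {
  policy = "read"
}

# Allow read-only access to agent information.
agent {
  policy = "read"
}
```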

Shortcomings

There are some shortcomings in Nomad for this particular use case (primarily around influencing scheduling decisions), such as the missing support for system batch jobs and for constraints on parameterised jobs, but many of these features will show up as Nomad matures.

Especially the ability to provide job constraints as input for parameterised jobs would make things much easier — but it’s nothing you cannot work around.
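
To illustrate the limitation, here is a sketch of a parameterised job. The job name, rule file and binary path are hypothetical. Metadata supplied at dispatch time can be used in the task arguments, but the constraint block is fixed when the job is registered, so the target machines cannot be chosen per dispatch.

```hcl
job "yara-scan" {
  datacenters = ["dc1"]
  type        = "batch"

  # Callers must supply target_path when dispatching the job.
  parameterized {
    meta_required = ["target_path"]
  }

  # Evaluated as written at registration time; dispatch metadata
  # cannot be used to change which machines are in scope.
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  group "scan" {
    task "yara" {
      driver = "raw_exec"

      config {
        # Assumes yara is already installed at this (hypothetical) path.
        command = "/usr/local/bin/yara"
        args    = ["-r", "local/rules.yar", "${NOMAD_META_target_path}"]
      }

      # Hypothetical rule file on the internal server.
      artifact {
        source = "https://1.2.3.4/rules.yar"
      }
    }
  }
}
```

Such a job would be started with something like nomad job dispatch -meta target_path=/tmp yara-scan. One possible workaround for the constraint problem (an assumption, not something Nomad provides) is to have a wrapper generate a regular job file with the desired constraints for each run.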

We have made it a habit to tag our Nomad clients with various internal information, such as the OU a machine is placed in. That allows us to target machines in a very efficient manner.
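
Tagging works through client metadata, which jobs can then reference in constraints. A minimal sketch, assuming a metadata key called ou and a made-up OU name:

```hcl
# --- Client agent configuration (sketch) ---
client {
  enabled = true

  # Arbitrary key/value tags; the key and value here are hypothetical.
  meta {
    ou = "finance-workstations"
  }
}

# --- In a job file (sketch) ---
# Only place the job on clients tagged with the matching OU.
constraint {
  attribute = "${meta.ou}"
  value     = "finance-workstations"
}
```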

Conclusions

The use of general purpose schedulers for Information Security purposes is largely overlooked — typically people tend to use tools that are either very, very expensive or operationally very demanding. Nomad might help people scratch itches that otherwise would be out of their reach :)