Secure Credential Management on a Budget: DC/OS with HashiCorp’s Vault — Part 3

Racter · MobileForGood · May 9, 2017 · 16 min read

This article is the third and last instalment in the Secure Credential Management on a Budget series. In the previous articles, we set up and configured Vault (version 0.8.1) for a simplified DC/OS production use case.

Part 3 will focus on the process of implementing an entry-level Vault workflow on DC/OS with Vault-Gatekeeper-Mesos, and some miscellaneous thoughts with regards to building out such a Vault workflow into a production-ready system. It will build on some of the terminology and concepts introduced in previous sections, so if you’re confused by some of the assumptions in this article, kindly head back to Part 1 and Part 2 for clarity.

What’s the value of doing this ‘on a budget’? Aside from working within the budget constraints that apply to security management in an NGO context, I believe that reasoning about secret management from a non-Enterprise viewpoint can grant vendor-independent insight into what factors could strengthen a secret management workflow and what factors could hinder it. Although we will use Django-on-Docker-on-DC/OS as the setup grounding the exploration, the learnings here are, hopefully, transferable to other platforms.

Introduction to Vault Workflows

The Vault workflow refers to a software process, designed with Vault at its core, whose actors are orchestrated to automate the secure storage, distribution, and management of secrets. A very simple Vault workflow in a non-containerised, non-cluster setup might simply consist of a webapp service with knowledge of a valid Vault token, fetching database credentials from Vault before launch.
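To make this concrete, here is a minimal sketch of the app-side logic in that simple workflow, assuming the response shape documented for Vault’s database secret backend; the role name and sample values are illustrative:

```python
import json

def parse_db_credentials(vault_response_body):
    """Extract the dynamic username/password and lease metadata from a
    Vault database-credential response (GET /v1/database/creds/<role>)."""
    body = json.loads(vault_response_body)
    return {
        "username": body["data"]["username"],
        "password": body["data"]["password"],
        "lease_id": body["lease_id"],
        "lease_duration": body["lease_duration"],
    }

# Example response shape, per Vault's HTTP API documentation:
sample = json.dumps({
    "lease_id": "database/creds/psql-readwrite-public/abc123",
    "lease_duration": 3600,
    "data": {"username": "v-token-abc", "password": "s3cr3t"},
})
creds = parse_db_credentials(sample)
```

The webapp would perform this fetch once at launch, then use the returned username/password to open its database connection.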

Things get a bit trickier if we want to design a Vault workflow for a cluster — specifically, a cluster used for high-availability container orchestration. They also get trickier if we want our workflow to be reasonably secure against common threats that are applicable to secret management systems.

Naive Vault Cluster Workflow

Let’s build onto the basic DC/OS cluster setup described in Part 2 of this series: instead of just a single webapp, we’re manually deploying multiple Dockerised webapps on our cluster through the Marathon UI. All services on the cluster, including Vault, are networked on a private interface. Each webapp needs access to their private PostgreSQL schema to work properly. In order to request PostgreSQL credentials from Vault, each webapp must have knowledge of a valid Vault token with the appropriate policies applied.

A naive Vault workflow for this setup could look like this: Before an app is launched, a PostgreSQL schema is created for it — let’s call this new schema public. Once the DC/OS Vault Maintainer actor knows the schema name, it will add a read/write PostgreSQL backend policy for that schema, at database/roles/psql-readwrite-public (see the explanation for Vault secret backend policies vs. Vault ACL policies here). After the backend policy is written, the Vault Maintainer will then update Vault with an appropriate policy that grants the permission to read credentials from database/creds/psql-readwrite-public.

Vault tokens, with this new policy applied, are manually generated before an app container needs to be launched. The Vault token is then passed as an environment variable to the Docker container as part of the launch command on Marathon. When the app container is launched, some logic in the app uses the token to make calls to Vault, requesting credentials to access the PostgreSQL schema it needs. Some additional logic in the app periodically renews the lease on those credentials.
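The renewal logic in that last step is commonly scheduled at a fraction of the returned lease TTL, leaving headroom for retries before the lease expires. A minimal sketch; the 2/3 fraction and the `renew_fn` hook are my own assumptions, not anything Vault mandates:

```python
import time

def next_renewal_delay(lease_duration_seconds, fraction=2 / 3):
    # Renew well before expiry so transient failures can be retried
    # before the lease actually runs out.
    return max(1, int(lease_duration_seconds * fraction))

def renew_forever(renew_fn, initial_ttl):
    # renew_fn is whatever performs the actual renewal call against
    # Vault's lease-renewal endpoint and returns the new TTL.
    ttl = initial_ttl
    while True:
        time.sleep(next_renewal_delay(ttl))
        ttl = renew_fn()
```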

Improving the Workflow

There are a couple of issues with the workflow described above:

  1. This workflow does not scale well with a large increase in apps to be deployed. The requirement of manual intervention during token generation presents a bottleneck if there are more apps that need to be deployed than there is capacity to generate tokens at any one time.
  2. It relies on getting the Vault token to the Dockerised app through environment variables, which can be logged in plaintext and exposed by debugging output. Storing secrets in environment variables also makes them easily available to threat actors that have compromised any process, and often any subprocess, running in the same environment as the consumer.

For weakness 1, it’s feasible to remove the requirement for manual intervention by writing some adaptor code that automates the initial generation and storage of Vault tokens before passing them in our docker run commands. For weakness 2, however, we’ll need to rethink how we get Vault tokens into the containers without unnecessarily exposing the token values along the way; a good solution should not undermine the benefits of using Vault to manage our secrets in the first place. The concept central to this problem space is that of Secure Introduction.

Secure Introduction

[This article will only explore Secure Introduction in terms of its implications on our Vault workflow design. To understand SI with any sort of holistic depth from a security perspective, it’s highly recommended that you read/watch Vault developer Jeff Mitchell’s talk on SI with containers here.]

The premise of Secure Introduction (SI) can be phrased as follows: “If we can securely get the first secret from an originator to a consumer, then all subsequent secrets transmitted between that originator and consumer can be authenticated with the trust established by the successful distribution and use of that first secret.”

In our workflow, the originator is Vault, and the consumer(s) are our Dockerised webapps. The first secret in question would be the Vault token needed by the webapp to authenticate to Vault. If we can get this token from Vault to the app securely, then the app can use the token to authenticate requests for more credentials controlled by Vault.

The Secure Introducer Pattern

One well-established way of achieving secure introduction on a cluster is to implement the Secure Introducer pattern.

Various implementations of the Secure Introducer pattern can differ in their exact architectures and protocols. However, (almost) all implementations of the pattern involve the addition of a new system actor to the Vault workflow, the Secure Introduction Client.

How does a secret consumer prove to the secure introduction client and the secret producer that it is the legitimate recipient for a secret? How can we avoid persisting raw token values during our secure introduction ceremony? Let’s look at two principles and Vault primitives that help solve these problems on Vault-on-DC/OS:

Schedulers As A Source of Truth/Trust
Most modern cluster solutions include an actor called a scheduler. There isn’t strong consensus on what exactly is and isn’t a scheduler, but for the sake of this article we’ll refer to Marathon on DC/OS as the scheduler for the cluster setup we’re interested in.

The main role that such a scheduler plays on your cluster is launching tasks on your available cluster resources. On our example cluster, Marathon executes the docker run commands to launch our webapp containers. The nice thing is that this intimate relationship between the scheduler and the apps means that the scheduler is a source we trust for information on which tasks (ie. containers) are running, when they’re running, and various parameters associated with the launch and deployment of those containers.

The above properties are useful for SI because we can use the scheduler as a reasonably trustworthy means of verifying the identity of the target consumer for the first secret. Coupling the secure introduction client with the scheduler means that the client can query the scheduler to verify the identity of consumers as needed — negating the need for the consumers themselves to keep any secrets to prove their identities.

Wrapped Tokens
A wrapped token is a token whose raw value has been encapsulated in a different value. An actor with knowledge of the wrapped token value can then supply it to Vault to obtain the raw token value. The key attributes of a wrapped token are:

  1. Its usage window (Time-To-Live), which describes the validity period after creation where it can be unwrapped successfully.
  2. Its usage limit, which describes the number of times a token can be successfully unwrapped. This defaults to 1 for Vault wrapped responses.

The nice thing about a wrapped token with a usage limit of 1 is that it mitigates the risk posed by passing secrets to containers in environment variables and persisted storage. If the wrapper value is compromised after its usage limit has been reached, it is effectively useless to the attacker. This means that we can, with caution, pass wrapped tokens as environment variables into our containers, or persist them to storage.
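The single-use property can be illustrated with a toy, in-memory model (real wrapping is performed server-side by Vault, so this class is purely illustrative):

```python
import secrets
import time

class ToyWrapper:
    """Toy model of Vault response wrapping: a single-use, TTL-limited
    wrapper value that redeems for the real secret exactly once."""

    def __init__(self, secret_value, ttl_seconds, uses=1):
        self.token = secrets.token_hex(16)   # the wrapper value
        self._secret = secret_value          # the raw token inside
        self._expires = time.time() + ttl_seconds
        self._uses = uses

    def unwrap(self, token):
        if token != self.token or time.time() > self._expires or self._uses <= 0:
            raise PermissionError("wrapper invalid, expired, or already used")
        self._uses -= 1
        return self._secret

w = ToyWrapper("raw-vault-token", ttl_seconds=300)
first = w.unwrap(w.token)  # succeeds exactly once
```

A second `unwrap` call against the same wrapper fails, which is what makes a leaked, already-used wrapper value worthless to an attacker.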

Tying Things Together: Using Vault-Gatekeeper-Mesos as an SI Client

Let’s iterate on our naive workflow to implement the Secure Introducer pattern, using ChannelMeter’s vault-gatekeeper-mesos (Gatekeeper). Gatekeeper is an open-source, DC/OS-compatible project whose sole purpose is to act as a secure introduction client.

Gatekeeper is a service whose primary function is to expose an HTTP API for Mesos/Marathon tasks to request wrapped Vault tokens. It keeps its own Gatekeeper policy file, which maps Mesos task IDs to their applicable Vault policies. The source of this policy file is stored in Vault, on the generic secret backend. Whichever actor is assigned the DC/OS Vault Maintainer role is given the additional responsibility to maintain the source of the mappings and refresh the version kept by Gatekeeper.

The Gatekeeper SI client’s HTTP API allows two calls, POST /token and POST /policies/reload. The first endpoint is used by secret consumers to request wrapped Vault tokens from the client, and the second endpoint is used to update the policy mappings kept by the client. When a task is launched on Mesos, it obtains its wrapped Vault token by sending its Mesos task ID to Gatekeeper. A token is successfully obtained if:

  1. A Mesos task exists with the supplied task ID.
  2. The task ID matches an entry in the Gatekeeper policy file.
  3. The task is below a certain age. This narrows the attack window for any particular Mesos task’s associated credentials to the brief period after it is launched.

The below call flow diagram illustrates the aforementioned interaction:

Secure Introduction with Marathon/Mesos. Huge thanks to Vinit Mahedia for creating and sharing this resource.
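The three conditions above can be sketched as a toy authorisation check. The function and field names here are illustrative, not Gatekeeper’s actual internals:

```python
import time

def authorize_token_request(task_id, running_tasks, policy_map,
                            max_task_age_seconds=120, now=None):
    """Toy sketch of the three checks a Gatekeeper-style SI client makes
    before issuing a wrapped token. Returns the policies to attach."""
    now = time.time() if now is None else now
    task = running_tasks.get(task_id)
    if task is None:                                  # check 1: task exists
        raise LookupError("no Mesos task with that ID")
    entry = policy_map.get(task_id) or policy_map.get("*")
    if entry is None:                                 # check 2: policy match
        raise PermissionError("task ID not present in the policy file")
    if now - task["started_at"] > max_task_age_seconds:
        raise PermissionError("task too old")         # check 3: task age
    return entry["Policies"]

tasks = {"webapp.abc123": {"started_at": 1000.0}}
policies = {"*": {"Policies": ["default", "dcos-app"], "Ttl": 5000}}
granted = authorize_token_request("webapp.abc123", tasks, policies, now=1030.0)
```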

So, let’s implement it. The first thing to note when setting up Gatekeeper is that it needs its own Vault token, scoped with the correct permissions, to make Vault tokens on behalf of other secret consumers. Set up Gatekeeper’s Vault policy by saving the following policy to /etc/vault/policies/gatekeeper.hcl:

path "auth/token/create" {
  capabilities = ["create", "read", "sudo", "update"]
}

path "auth/token/create/*" {
  capabilities = ["create", "read", "sudo", "update"]
}

path "auth/token/create-orphan" {
  capabilities = ["create", "read", "sudo", "update"]
}

path "secret/gatekeeper" {
  capabilities = ["read"]
}

In this policy, we give the token holder the right to create child tokens and orphan tokens by granting sudo capabilities to read and write on the auth/token/create and auth/token/create-orphan paths. The sudo capability allows access to paths that are otherwise accessible only to root policy holders. We also grant sudo capabilities to write to auth/token/create/*, which allows the token holder to create tokens scoped with named policies. Lastly, we give read permission on the secret/gatekeeper path. This is where Gatekeeper is configured to read its mapping of Mesos task names to Vault policies. Now that you’ve given your future Gatekeeper tokens read access to that path, it’s time to write that Gatekeeper policy to secret/gatekeeper. Note that the dcos-app policy referenced here is the Vault policy created in Part 2 of this series:

echo -n '{"*":{"Policies":["default","dcos-app"], "Ttl":5000 } }' | vault write secret/gatekeeper -

This isn’t a very comprehensive policy, but hopefully it’ll help with illustrating how Gatekeeper works. What this policy means is that any Mesos task requesting a wrapped token from Gatekeeper will receive one with the default and dcos-app policies applied. Obviously, if you want different tasks to receive differently-scoped access tokens, you’ll need to modify the Gatekeeper policy to reflect this. Set the Ttl value to taste — this is the Time-to-Live value of the token wrapper.
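For example, a more granular policy map might key entries by task name and fall back to the wildcard entry for everything else. The task names below are hypothetical; check the Gatekeeper documentation for its exact matching rules:

```json
{
  "webapp-alpha": { "Policies": ["default", "dcos-app"], "Ttl": 300 },
  "webapp-beta":  { "Policies": ["default"], "Ttl": 300 },
  "*":            { "Policies": ["default"], "Ttl": 60 }
}
```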

Gatekeeper borrows the concept of sealing and unsealing itself from Vault — the service will start in a sealed state if it is not provided with a Vault token on startup. Sealed Gatekeeper instances will reject all token requests and requests to modify the server state, and need to be unsealed in order to be useful in our Vault workflow.

Let’s create the Vault token we will use to unseal Gatekeeper and allow it to create credentials on behalf of other applications. Note that a token can only create tokens with policies that are a subset of the creator’s policies. This applies on the level of path privileges, and not necessarily by policy name. As an example, a root token can create tokens associated with any policy name(s), even though it was not assigned those policies at its creation.

With this in mind, go ahead and create a wrapped token for your Gatekeeper instance like this:

$ vault token-create -orphan -policy="default" -policy="gatekeeper" -policy="dcos-app" -wrap-ttl=2000s

We include the dcos-app policy for the Gatekeeper token so that it can create new tokens with the dcos-app policy.

You’ll notice here that the wrap-ttl value is quite long, because for this part of the tutorial we’ll be giving it to Gatekeeper manually, but in an automated scenario this will need to be much shorter. Keep the returned token value somewhere safe for the next step.

Next, it’s time to deploy Gatekeeper. In this tutorial, Gatekeeper will be run from one of the VMs with a Vault node on it.

Ensure that Docker is installed on your target machine, then get yourself a copy of the Gatekeeper Docker image from the public Docker Hub:

$ docker pull channelmeter/vault-gatekeeper-mesos

Now deploy your Docker image — the below instruction will work if you run it on the same host as your Vault server, so you should make the appropriate modifications to this step if you’re launching this as an app on your cluster:

$ docker run --rm -it -v /etc/ssl/vault:/etc/ssl/vault -p 9201:9201 \
--add-host [Vault server hostname]:[Vault server IP] \
--name gatekeeper channelmeter/vault-gatekeeper-mesos \
-listen=:9201 -tls-cert=/etc/ssl/vault/fullcert.pem \
-tls-key=/etc/ssl/vault/privkey.pem \
-mesos=[Mesos hostname]:[Mesos port] \
-vault=[Vault hostname]:[Vault service port] \
-ca-cert=/etc/ssl/vault/fullchain.pem \
-task-life=2m -self-recreate-token=false \
-wrapped-token-auth=[wrapped token]

The above command will set up Gatekeeper to listen on port 9201 and serve its API over HTTPS, using the combined LetsEncrypt certificate made in Part 1. Most of the flags for Docker and Gatekeeper should be self-explanatory. Refer to the Gatekeeper documentation for more information on its launch flags. Supply your wrapped Vault token value as the -wrapped-token-auth argument to start Gatekeeper in an unsealed state.

If Gatekeeper fails to unseal with an i/o timeout message, you may need to modify your firewall rules to ensure that your Vault server accepts connections from its host’s public IP.

Once your Gatekeeper server is running in an unsealed state, you can test token creation by launching a service on DC/OS, obtaining its task ID, and running the following command:

$ curl -H "Content-Type: application/json" -X POST -d '{"task_id":"[Mesos task ID]"}' https://[Gatekeeper hostname]:9201/token

If all goes well, you should receive a wrapped token encapsulating a Vault token with the default and dcos-app policies applied. If you were an app, you could now unwrap this token and obtain the PostgreSQL credentials required for your database needs.
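Under the hood, “unwrapping” is a single call to Vault’s sys/wrapping/unwrap endpoint, authenticated with the wrapping token itself. A sketch of how an app might construct that request (the Vault address here is a placeholder):

```python
import urllib.request

def build_unwrap_request(vault_addr, wrapping_token):
    """Construct the HTTP request an app would send to redeem a wrapped
    token: POST /v1/sys/wrapping/unwrap, authenticated with the wrapping
    token itself in the X-Vault-Token header."""
    return urllib.request.Request(
        url=vault_addr.rstrip("/") + "/v1/sys/wrapping/unwrap",
        data=b"",  # empty body: unwrap the token used for authentication
        headers={"X-Vault-Token": wrapping_token},
        method="POST",
    )

req = build_unwrap_request("https://vault.example.com:8200", "wrapper-uuid")
```

The JSON response to this call contains the raw Vault token under `auth.client_token`, which the app then uses for its credential requests.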

Let’s recap on this Secure Introduction workflow in higher-level terms:

  1. Before deploying the Gatekeeper client, a Vault token must be created for it. This token must grant Gatekeeper all the privileges that it might pass to the tokens that it creates.
  2. Before an app is launched, a PostgreSQL schema is created for it — let’s call this new schema public. Once the DC/OS Vault Maintainer knows the schema name, it will add a read/write PostgreSQL backend policy for that schema, at database/roles/psql-readwrite-public. After the backend policy is written, the Vault Maintainer will then update Vault with an appropriate policy that grants the permission to read credentials from database/creds/psql-readwrite-public; let’s call this policy dcos-app.
  3. Once these resources and policies are set up, the container can be launched. During the initialisation phase of the webapp, if required, the Vault Maintainer writes to Vault to update the Gatekeeper policy mappings with an entry that matches the Mesos task ID of the newly-created container. The Vault Maintainer would then make a call to Gatekeeper’s /policies/reload endpoint to update Gatekeeper’s records of the mappings.
  4. Once this is done, the app can call the SI client to request a wrapped token. After the app receives the wrapped token, it can use it to retrieve the raw Vault token that will allow it to retrieve database credentials.

In this workflow, the DC/OS Vault Maintainer role needs to be closely coupled with the scheduler. I’ve specifically avoided mentioning exactly which application in your infrastructure should take ownership of the Maintainer role, because every system is different. For example, in the DC/OS setup that I work with, my organisation has specifically developed a Django app called Mission Control, which gives users a clean, graphical interface to launch apps and Docker containers on our cluster via Marathon. Because it’s intended as the first subsystem in our infrastructure with knowledge of an impending app launch, as well as the first subsystem to know the intended app launch parameters (eg. what level of access to which database schemas), it would be flagged as the first candidate for assuming the policy maintenance role. In a different system, the policy maintenance role could be played by specialised Marathon tasks, or even by a separate, ad-hoc service. You will need to decide which solution works best for your infrastructure.

Miscellaneous Challenges

This section discusses some challenges to overcome with implementing a Vault workflow on the DC/OS system I work with. I would love to hear from anyone who has experience solving the problems described here, or with re-architecting any portions of their cluster setup and/or Vault workflow to solve these issues.

Using Dockerised Django Applications with Gatekeeper

This one is pretty specific to Praekelt.org’s infrastructure. Currently, we deploy Dockerised Django apps with Mesos/Marathon (those who are keen will notice that this tutorial series is skewed towards contriving a similar setup). One of the biggest challenges facing this setup is getting the Dockerised Django application to participate meaningfully in the Vault workflow by:

  1. Obtaining the first secret in the secure introduction process
  2. Renewing the leases on any resource credentials it obtains from Vault

Ideally, the secure introduction protocol moves the first secret (ie. the unwrapped token value) safely into the app memory of the Django process, and not the environment of its parent Docker container. Additionally, since the Django application is the ultimate consumer of any resource credentials tied to this first token, we’d also want the validity of those credentials to be tied to the life-cycle of the app. The simplest way to achieve this is to engineer the Django application to request Vault tokens from Gatekeeper, and to make calls to renew the leases on the credentials in its possession. Unfortunately, owing to the architecture and process model of Django applications, solving problem 1 is not as straightforward as inserting the logic for token requests to Gatekeeper into your app’s settings.py bootstrapping routine — or, rather, it’s not as straightforward to do this and still yield a robust and maintainable app deployment workflow. Since each Gunicorn worker process imports settings.py when it starts up, you would end up with many more sets of dynamic credentials than would be useful to maintain.

One possibility to get around this problem is to use a separate service in the Docker container that makes the token request call to Gatekeeper, and writes the wrapped token to pass into the Django app. The Django app would then unwrap the token and write the lease IDs of the credentials to file, which the separate service would use to renew the leases on those credentials.

One of the main issues with this proposal is that, because wrapped tokens should only be single-use, this presents a problem when multiple Python processes are spawned from Gunicorn — in this case, only the first Gunicorn/Django instance is able to unwrap the token, after which the wrapper becomes useless.

What if that separate service in the container unwrapped the token before writing it to disk, and owned the responsibility for fetching the resource credentials from Vault as well? These would be written to disk, where it can be read by Django secret consumers. For this solution, we’ll need to negotiate a couple of risks:

  1. An actor with root permission on the container would be able to read the file containing the Vault token value and the raw resource credentials.
  2. The lifecycle management on tokens and credentials may not be as closely coupled with the Django application as we like, since we’re doing it in a separate program. This means that we cannot assume that requests for credential renewals are legitimately on behalf of an active consumer.

The mitigative solutions for these issues are simple enough, in theory:

  1. Ensure that the file permissions on the credential files are restricted to authorised users only — ie. the user running the Django application and the user running the credential fetching/renewal service.
  2. Where possible, ensure that no processes run as root in the container.
  3. Build in processes to ensure that when the Django application dies, the whole container dies — or that the credential fetching/renewal service stops renewing credentials if the Django application dies.
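Mitigations 1 and 3 can be sketched as follows; the file layout and function names here are my own, not Vaultkeeper’s:

```python
import json
import os
import subprocess
import tempfile

def write_credentials_file(path, creds):
    # Mitigation 1: create the file with mode 0600 so only the owning
    # user (the app user) can read the raw credentials.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(creds, f)

def run_app_with_credentials(path, creds, argv):
    # Mitigation 3: run the app as a child process; when it exits, this
    # call returns, and the caller can stop renewing and revoke leases.
    write_credentials_file(path, creds)
    return subprocess.call(argv)

creds_path = os.path.join(tempfile.mkdtemp(), "creds.json")
write_credentials_file(creds_path, {"username": "v-user", "password": "s3cr3t"})
```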

We’ve prototyped such a service for Django-on-Gunicorn-on-Docker-on-DC/OS. It’s called Vaultkeeper, and you can check it out here.

Vaultkeeper is a by-the-book implementation of the Secure Introducer pattern. It’s a service that participates in the Secure Introduction handshake facilitated by vault-gatekeeper-mesos and fetches Vault credentials on behalf of an arbitrary application. Once the initial secret has been disseminated, Vaultkeeper uses it to fetch the credentials that the application needs, saves them as a file, and executes the application as a subprocess. Vaultkeeper handles the renewal of credential leases (where applicable) as long as the child application is still running. When the child application terminates, Vaultkeeper revokes the leases of its credentials. Any leases attached to dynamic credentials not revoked during this step should automatically expire.

The advantages of the Vaultkeeper model are the following:

  1. It guarantees that the client application will only run once the credentials are ready.
  2. It allows wrapping the lifetime of the application with the lifetime of the dynamic credentials it needs, narrowing the temporal attack surface of those credentials.
  3. It allows all Gunicorn worker processes to consume the credentials without placing them in leakable environment variables.
  4. Its encompassing workflow is reasonably platform-independent.

Vaultkeeper is in Proof-of-Concept stage. Please feel free to check it out or contribute. 🌻

End of Part 3

That’s the end of Part 3! It’s been a wild ride.

If you were completely new to Vault or secret management before this, hopefully this series provided more clarity on some of the finer points of setting up a Vault and Gatekeeper workflow. If not, hopefully this series provided you with some tools for reasoning about your own, unique Secret Management workflow.


How much stack could a full-stack stack if a full-stack could stack full? Security Engineer at Praekelt.org.