Secrets and LIE-abilities: The State of Modern Secret Management (2017)
Covers KeyWhiz, Vault, Docker 1.13, DC/OS 1.8, Rancher 1.4, and Kubernetes
I consult for a living. Among other things, I help guide teams through some pretty nuanced tech decisions. If you ask me what service orchestration platform I'd suggest you adopt, I'd ask you a series of important questions about your team, your workload, and your development and production environments. One of those questions is "How do you protect your secret material, like secret keys, database passwords, and bitcoin wallet private keys?" For the bulk of 2015 and 2016, if you needed a solution that orchestrated secrets as well as services, I was pointing you to Kubernetes or to a more customized solution built with Docker and HashiCorp's Vault or Square's KeyWhiz. If you had already adopted an advanced cloud like AWS then we'd have a whole different conversation. But it's a new year, the space is a little bigger this month, USB-C kills all my YubiKeys, Donald Trump is president, and there are like… a bunch of state-funded snoops trying to be the NSA. So it's a good idea to review what tools we can pick up.
This article is not sponsored. This is candid insight based on merits of each project not marketing material. If you’ve got feedback, insight, or other valuable information please share. I’m your student.
Edit: the awesome engineers over at Rancher have already provided some fantastic insight into their architecture. I hope more of that makes it into sensitive documentation.
I provide container/cloud/microservice adoption consulting and corporate training, I wrote Docker in Action, and I perform other software engineering services for hire. I work with some of the largest and smallest companies. If you’d like to work together please reach out.
Current Best Practices
Evaluating solutions is easier if you have a common framework for evaluation. Here’s a quick list of centralized secret management and distribution best practices:
- No secret should be written to disk in cleartext — ever
- No secret should be transmitted over a network in cleartext — ever
- All secret lifecycle and access events should be recorded in an incorruptible audit log
- Secret distribution should be coordinated by an authoritative delegator, such as a container/service scheduler, or by something working in a close trust relationship with the scheduler
- Operator access to secret cleartext should be limited — if not impossible without subversive efforts
- Secret versioning or rolling should be easier to accomplish than revealing cleartext
- All infrastructure components related to secret management and distribution should be mutually authenticated
- Secure system configuration should be easier than advanced (and likely insecure) configuration
- The attachment of a secret to a service or container should be protected by rich (pluggable) access control mechanisms — role based access control is a plus
- Anything that can be done to minimize the value of a secret should be done
It's rare that any one solution will fit all of these criteria, but we can dream. You should note that I'm going to try to keep terms consistent across solutions here rather than use the exact terms used by individual tools. For example, what one tool calls an "unseal key" is called a "private key" by another, and that second tool also uses "unseal key" for something else.
I’ve been asked by a few reviewers to provide a rating system. I think doing that would make this information a bit too easy to gloss over. This topic is far too nuanced and critical to gloss over so please take the time to dig in.
Standalone Secret Managers
KeyWhiz by Square
KeyWhiz (Apache-2) is a secret manager produced by and in use at Square. KeyWhiz relies on an existing Public Key Infrastructure (PKI) and will use whatever database backend you wish to provide. KeyWhiz itself is the little golden nugget at the center of a secret management system. It does one thing and does it well: it wraps secret data in an encryption/decryption barrier and vends that data to trusted recipients.
Building a PKI is difficult… no, a PKI is a perfect example of typical InfoSec best practice: a brilliant tool, and such a significant burden to establish and maintain that few use one, even among those who advocate their use.
Operators are authenticated and authorized via LDAP (one more thing that KeyWhiz delegates instead of integrates). This will likely be a benefit to organizations that already have a directory in place, but it adds a dependency for those that do not. Once authenticated and authorized, an operator can add, update, list, and describe secrets.
Secrets are encrypted before they hit the network on the way to the database and decrypted after the manager retrieves them. The authors strongly recommend using an HSM for encryption and decryption, but you can configure KeyWhiz to use an unlock key instead.
After secrets are decrypted they are sent to consumers over the network. Network connections use mutually authenticated TLS which means that your workers can trust the manager, your manager can trust and authorize your workers to access a secret, and that those secrets are safe on the wire.
KeyWhiz mentions deployment orchestrator integration, but I’ve not seen this or how to accomplish it in my limited experience with the product. The documentation could be described as “light” but the product is fully open source and available for inspection. I have seen Docker integrations in the wild with Volume plugins and the people I’ve talked to who use them didn’t have any complaints. That plugin uses KeyWhizFS which makes secrets available over a FUSE file system (making use of that mutually authenticated TLS under the covers).
Overall, KeyWhiz seems like one of those projects that gets the important stuff right and actually offers much more than is initially apparent. I didn't see anything about access or lifecycle audit logging. Because the project is so specific in scope, it will best meet the needs of advanced and well-funded adopters who can invest in building and maintaining a public key infrastructure (they recommend starting with certstrap, another Square project) and a personnel directory, and who want to manage another database. I believe that secret updates are reflected in live changes to KeyWhizFS, but I have no idea how they recommend reacting to those changes and coordinating service updates. I'm a bit disappointed at the current state of the docs. If you want to take a look at what I found, check out:
- Main docs page with a simple slide show — not much else
- Older docs that actually exist
- Great article about KeyWhiz adoption at Square
Vault by HashiCorp
Vault is newer than KeyWhiz and a full-blown open source (MPL-2) HashiCorp product. The documentation is robust and integrations are plentiful. Vault is the current gold standard in secret management and provisioning, but it isn’t perfect either.
Vault's documentation, examples, and specificity in design and purpose not only help the reader understand how to use the product, but also help to shape and inform how the reader thinks about the problem space. If you want to learn more, check out the Internals > Architecture and Internals > Security Model sections.
Vault does a few things better than all the other projects in this list. First, secret access and lifecycle auditing is a top-tier concern. Second, there is no single database or type of secret. Instead there are several pluggable secret backends that might persist secrets or — and this is cool — provision/rotate/destroy secrets on the fly.
There are three types of backends: Audit, Secret, and Auth. All backend systems have useful defaults but are pluggable with minor effort.
Users consume secrets with names similar to paths in a file system, prefixed with the name of the secret backend being used. Vault implements "barriers" between itself and different secret backends, ensuring that one bad or compromised actor cannot impact the operation of other backends. All data that flows between Vault and a secret backend also passes through an encryption barrier. No cleartext is ever passed over a network (secure or otherwise), and none is ever written to disk.
As mentioned earlier Vault integrates with dynamic secret backends that can provision, rotate, and destroy secrets on the fly. This is a powerful design that decreases the value of any specific secret drastically. There are dynamic secret backends for AWS, MySQL, MSSQL, MongoDB, Postgres, Consul, Cassandra, and others. There are even dynamic secret backends for SSH keys and PKI which will generate an X.509 certificate/key on the fly.
Not all secrets can be generated dynamically, but you should consider adopting the practice in cases where you can.
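As a sketch of what the dynamic workflow looks like with the Vault CLI of this era (the connection string and role below are placeholders I made up, not from any real deployment):

```shell
# Mount the PostgreSQL secret backend and point it at a database
vault mount postgresql
vault write postgresql/config/connection \
    connection_url="postgresql://vault:CHANGEME@db.internal:5432/postgres"

# Define a role; Vault runs this SQL to mint credentials on demand
vault write postgresql/roles/readonly \
    sql="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';"

# Every read returns a fresh, short-lived username/password pair
vault read postgresql/creds/readonly
```

Each credential pair is leased and revoked on schedule, so a leaked credential ages out on its own.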
Vault seems to have it all, including audit. The main sticking point with Vault is workload orchestrator integration. You're going to have to do some work for the integration, but there are a few projects that help. You can look at rancher/secrets-bridge for Rancher and ehazlett/docker-volume-libsecret, a Docker Volume plugin. I've seen projects and issues for integrating with Kubernetes but, meh. Read on.
KeyWhiz is great for its specific scope. Adopting it would be straightforward for teams already running many of its dependencies. If you're stuck in a world full of static secrets and don't need much in the way of audit logging, it will surely fit well with your existing infrastructure.
Vault, on the other hand, is the sort of tool that makes you want to change your infrastructure. There have been many articles covering it, so I know I'm not the first to sing its praises, but maybe I am the first in this context and in 2017. If you can use it without taking away from critical development of your product, I highly recommend it.
Integrated Secret and Workload Management
Docker and SwarmKit (1.13)
I’d like to cover Docker first because it is a new kid on this block and many people may not be familiar with its secret management yet. Docker made some pretty significant shifts with their release of SwarmKit and Docker 1.12. At the time I felt that absorbing Swarm into the main project was a mixed bag. On one hand the feature set they released was clearly something that I’d rather have seen on an experimental branch. From what I could tell it also killed any momentum that composable Swarm v1 had going for it. On the other hand, based on my knowledge of the inner workings of this and other orchestration platforms and their bottlenecks I could tell that Swarm mode would scale much better and be an order of magnitude simpler to operate. It was a pivot to be sure.
The 1.13 release of Docker rolled out with integrated secret management. The foundation laid in 1.12 and Swarm mode included cryptographically identifiable nodes and encrypted traffic by default, with mutual authentication. In 1.13, setting up a cluster, creating a secret, and using the secret is so simple that I've included the shell commands for doing so here. Install Docker Machine and follow along for a bit.
Let’s create 4 machines for our cluster.
docker-machine create -d virtualbox m1
docker-machine create -d virtualbox m2
docker-machine create -d virtualbox m3
docker-machine create -d virtualbox w1
Now that the machines are created we have to initialize our cluster on one of them. I’ve included the “autolock” flag on cluster init so that restarting a cluster manager requires an unlock key.
# Please pardon the sed usage - wanted you to be able to copy
# Create a cluster and put the unlock key somewhere safe
read UK <<<$(docker $(docker-machine config m1) \
  swarm init \
    --autolock \
    --advertise-addr $(docker-machine ip m1) \
  | sed -n 's/.*\(SWMKEY.*\).*/\1/p')

# Get a handle to both the manager and worker join tokens
read MJT1 <<<$(docker $(docker-machine config m1) \
  swarm join-token manager \
  | sed -n 's/.*\(SWMTKN.*\) .*/\1/p')

read WJT1 <<<$(docker $(docker-machine config m1) \
  swarm join-token worker \
  | sed -n 's/.*\(SWMTKN.*\) .*/\1/p')
Now that the cluster has been initialized join two peers as managers:
# add the other two nodes as peer managers
docker $(docker-machine config m2) \
  swarm join \
    --token $MJT1 \
    --advertise-addr $(docker-machine ip m2) \
    $(docker-machine ip m1)

docker $(docker-machine config m3) \
  swarm join \
    --token $MJT1 \
    --advertise-addr $(docker-machine ip m3) \
    $(docker-machine ip m1)

# Congrats on the 3 manager Swarm cluster...
A cluster that only has manager nodes is a bit boring, so go ahead and add a worker node as well and check out the list of nodes in your cluster.
# Add your worker node
docker $(docker-machine config w1) \
  swarm join \
    --token $WJT1 \
    $(docker-machine ip m1)

docker $(docker-machine config m1) node ls
# Should show 4 nodes in the list with 3 managers and 1 worker
You might be thinking, “I’ve done all this before… What’s new?” Well now you can use the docker secret subcommand to list, create, inspect, and update secrets.
docker $(docker-machine config m1) secret ls

# Create a secret named "testsecret"
echo This is the plaintext \
  | docker $(docker-machine config m1) secret create \
      --label justatest=1 \
      testsecret -

docker $(docker-machine config m1) secret ls
# Should include your new secret in the list
You just defined a new "secret." A secret is a first-class entity in Docker Swarm mode clusters, with a lifecycle independent from any specific workload. When you ran the "secret create" command above, your shell piped the cleartext data into the docker command, which (without writing anything to disk) made an API request over TLS to one of your Swarm manager nodes. The receiving node took the secret name and cleartext material and created a new record in the cluster's encrypted Raft log. In Docker 1.13 the database that is shared between managers is encrypted separately. This bears emphasis: Docker secret cleartext is never written to disk and never sent over the network unencrypted.
Now let’s see it in action:
# Start a service and inject a secret
docker $(docker-machine config m1) \
  service create -t \
    --secret testsecret \
    -p 1500:1500 \
    allingeek/secret-leak

# Query the service to see the leaked secret content
curl http://$(docker-machine ip m1):1500/
The cleartext secret material is never even written to disk on any node, manager or worker. Secrets are injected into containers running a Swarm service on a tmpfs volume (RAM that looks like a filesystem).
There is no means of recovering the secret from the docker CLI. Instead you would have to read it from a service where it has been injected and expose the contents. Try not to do that. In fact, you should almost never need to.
The reason is that the docker CLI makes it ridiculously simple to version secrets and rotate your material. If you forget a secret, don't recover it; change it.
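To make that workflow concrete, a rotation sketch against the demo cluster might look like this (the `-v2` suffix is my convention, not Docker's, and `<service>` stands in for whatever name your service was given):

```shell
# Create the next version of the secret (secrets are immutable)
echo This is the new plaintext \
  | docker $(docker-machine config m1) secret create testsecret-v2 -

# Swap it into the running service; Swarm rolls the update out
docker $(docker-machine config m1) \
  service update \
    --secret-rm testsecret \
    --secret-add testsecret-v2 \
    <service>

# Once nothing references the old version, delete it
docker $(docker-machine config m1) secret rm testsecret
```

Because the swap is a normal service update, a bad rotation can be rolled back like any other deployment.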
Enough demo… How does it stack up?
Secret management with Docker is pretty good, but it isn’t perfect yet. It is simple to configure. Cleartext never hits the network. Cleartext is never written to disk anywhere. Secrets are immutable. Secrets are “pushed” to nodes by centralized orchestration. Any single compromised node should have limited access to secret material. All actors in the system are cryptographically identified with automatic rotation. Manager nodes require an unlock key to rejoin a cluster. Operators can easily version/rotate secrets, and when they do service upgrades are automatically triggered making secret updates simple to rollback.
What OSS Docker is missing is operator RBAC and a strong audit log. There is a 500KB size limit per secret, but that probably will not be an issue.
Docker Datacenter (Docker’s commercial offering) ships with several security enhancements like RBAC and LDAP/AD integration so you can identify your operators. That is a pretty critical value add for larger teams at enterprise class companies who want to run something in their own cloud or on-prem. Saving that feature for the enterprise does not weaken the OSS offering for small teams.
DC/OS by Mesosphere (Enterprise subscription only)
DC/OS (Mesosphere) is a collection of open source projects (Apache-2) that have been married well by people who really know what they’re doing. Starting in 1.8 they have included secret management APIs in the Enterprise version of Marathon (the primary control-plane and resource scheduler component).
Mesosphere also uses TLS 1.2 connections, which means a dependency on a PKI. Mesosphere clusters have built-in certificate authorities for this purpose. Unfortunately, the documentation makes several references to occasions where users need to interact with secured endpoints directly, and it talks about certificate pain. They do make an effort to instruct users how to import custom certificate authorities into their local trust chains, but I think they make it a bit too easy to disable TLS in the cluster and even instruct users to implement a MITM when interfacing with their own system. This is silly and reminds me of exactly why we are in our current security quagmire. That being said, the secrets implementation is pretty complete.
Secrets are encrypted (using AES-GCM) before being sent to and persisted in ZooKeeper (ZK is a central component of any Mesosphere installation and provides storage as well as clustering mechanics). This means that they are encrypted both on the wire (regardless of transport security) and at rest. The key used to encrypt raw secret data is encrypted at rest using an unseal key (4096-bit GPG).
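This layering is classic envelope encryption. Here is a rough, self-contained illustration of the concept; this is not Mesosphere's code, and it substitutes openssl's AES-256-CBC for the AES-GCM and GPG machinery described above:

```shell
# Envelope encryption sketch: a data key encrypts the secret, and an
# unseal key encrypts the data key. Only the small data-key blob needs
# the unseal key in order to recover everything.

# 1. Generate a random data key and use it to encrypt the secret.
DATA_KEY=$(openssl rand -hex 32)
printf 'db-password' | openssl enc -aes-256-cbc -pass pass:"$DATA_KEY" -out secret.enc

# 2. Encrypt the data key itself with the unseal key.
printf '%s' "$DATA_KEY" | openssl enc -aes-256-cbc -pass pass:"unseal-key" -out datakey.enc

# 3. To decrypt: recover the data key first, then the secret.
RECOVERED_KEY=$(openssl enc -d -aes-256-cbc -pass pass:"unseal-key" -in datakey.enc)
openssl enc -d -aes-256-cbc -pass pass:"$RECOVERED_KEY" -in secret.enc
# prints: db-password
```

The payoff is that rotating the unseal key only means re-encrypting the small data-key blob, not every secret in the store.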
Mesosphere does not appear to keep any hardened secret access or lifecycle audit log. However, it does have a built-in user authentication and RBAC system that controls access to secrets. That system can be used to implement per-secret access control (which is really nice). You should also note that individual secrets are limited to 1MB. If you want to learn more about secrets in Mesosphere, check out their docs:
Rancher and Cattle (1.4.0) (Edit: Several modifications)
Rancher can be used to provision different kinds of platforms including Docker, Mesos, and Kubernetes. The default workload orchestrator, named Cattle, is also open source and included with any Rancher installation. Rancher 1.4 includes an experimental secrets management implementation. Command line and Rancher Compose integration will ship with a later version. As of this writing you can manage secrets via the web user interface like you would storage, registries, etc.
The feature is experimental (1.4.0 hasn’t even hit the stable pipeline yet) and with that in mind I’ll forgive some of the more confusing parts of the documentation. Like other Rancher docs the secrets page is straight to the point, provides likely configuration tweaks, and highlights common use-cases concisely.
Secrets are encrypted on disk. The docs note right away that, by default, Rancher uses AES and a locally sourced key to encrypt secrets before storing them in its MySQL backend. My main concern with the default implementation is that I was never prompted for, nor presented with, the vault key. I don't see any way to recover that key or any mechanism for rotating it. So the data is encrypted on disk, but that's not all. The primary configuration change the user docs describe is a Vault integration, which uses Vault's transit backend to perform the encryption instead of the local AES key. Cool integration, but I'd at least like some interaction with the key in the default configuration. Use Vault.
The implementation provided by 1.4.0 can leak all secrets if the Rancher manager is compromised. The vulnerability is documented and that documentation suggests that this will be addressed in a future release.
[Secrets are decrypted by the manager prior to sending a workload to a target node.] Edit: This is not accurate. See a better description in the next paragraph. Once a secret has reached a node it is written to a tmpfs volume that is mounted into any containers where it has been attached. A single compromised host will leak all secrets that have been distributed to that host (they are in tmpfs volumes in cleartext). This vulnerability is noted in the documentation.
Mutual authentication is implemented between secret handling components. Note: prior content in this section was incomplete and its conclusions were inaccurate. Cryptographic identity is established per node. Like other similar platforms, Rancher is weak to fake nodes joining a cluster using stolen tokens. Nodes join a cluster/environment by providing identical join tokens; at join time each node generates a unique cryptographic identity, and public keys are exchanged with the manager. Those keys are used for encryption and identity going forward. A fraudulent node could join a cluster and wait until the manager delivers it a workload with a secret.
I cover key exchange vulnerabilities in the next section.
Secrets are transmitted in cleartext between the GUI and the manager, not between nodes. Edit: This section has been heavily modified from the original. Secrets are dispatched to nodes that have established unique cryptographic identities. First the secret volume definition is dispatched; the volume driver on the node realizes it by creating a tmpfs mount and requesting the secret from the manager. The manager fetches the secret from the secret store, decrypts it, re-encrypts the data with the node's public key, and returns it to the requesting node. The node then decrypts the secret and writes it into the tmpfs volume. There is a minimal weakness worth discussing.
Today, launching Rancher with TLS is not the default. Enabling secure communication with the manager requires the introduction of a TLS-terminating reverse proxy. Depending on an external proxy introduces an undocumented vulnerability: an attacker who could intercept traffic between the Rancher manager and the TLS-terminating proxy would be able to read cleartext secrets on ingest (coming from the user, but not between nodes), even with such a proxy in place. This vulnerability is not documented in the secrets user documentation.
Other than potentially leaking secrets on ingest, intercepting node cryptographic identities during the initial handshake is a more difficult problem to solve. Those identities are used to encrypt secrets during communication between the manager and nodes. If an attacker were to intercept and manipulate the initial key exchange then they’d be able to decrypt secrets in transit. This is a super specific attack profile, and depending on your risk tolerance you might be able to treat this as if it is 100% encrypted on the wire. In either case, a solution for the key exchange problem is in the works.
Rancher validates access to secret material on a per node basis as part of the handoff.
Rancher provides an audit log. I have not been able to determine what secret lifecycle events are logged, but I could imagine that they are likely targets for 1.4.0 or future releases.
Rancher provides per-environment access control tooling. Using that access control mechanism will limit any individual operator's ability to steal secrets, even by pushing payloads that intend to leak secret material.
Operator access to secret cleartext is limited. It does not seem possible to recover secrets from the user interface. However, I haven't seen any tooling for updating, versioning, or rolling a secret. You can update the name and description, or you can delete the secret. There is room for improvement here.
Rancher secrets are high value. I suspect that secrets will change infrequently because they are difficult to provision, update, and version. Long lived secrets are particularly high value because they can likely be used and reused many times.
The Rancher secrets implementation is on par with both the Docker and Mesosphere implementations. If you are considering using it in a production environment you should definitely bring your own Vault server and use the integration. Most of the vulnerabilities that have not been addressed are present in other platforms. Even without addressing some of the more serious vulnerabilities it is important to reduce secret value by providing simple key versioning or rolling tooling. With Rancher the tooling around that workflow could use a bit of work. I’m looking forward to checking back and seeing how the tooling develops at 1.5.0.
Kubernetes (1.5.2 and earlier)
Until late 2016 I had long been advocating that community users who needed integrated secret and service orchestration look into Kubernetes. I had not used it for any projects that I would consider critical, and so my depth on its secrets implementation was minimal. There are many other reasons to use Kubernetes, and so directing people toward an integrated solution made sense.
Then, for one reason or another, I got my hands really dirty with Kubernetes secrets for the first time. Digging through the docs I saw something horrific. At the bottom of the ridiculous wall of text describing Kubernetes secrets and their operation is a fine-print section labeled "Risks."
- “In the API server secret data is stored as plaintext in etcd.”
- “If multiple replicas of etcd are run, then the secrets will be shared between them. By default, etcd does not secure peer-to-peer communication with SSL/TLS, though this can be configured.”
- It is not currently possible to control which users of a Kubernetes cluster can access a secret.
- Currently, anyone with root on any node can read any secret from the apiserver, by impersonating the kubelet. It is a planned feature to only send secrets to nodes that actually require them, to restrict the impact of a root exploit on a single node.
Knowing that, in addition to this list, authentication, authorization, and TLS are all optional when setting up a Kubernetes cluster means that, well, in my opinion there is no such thing as a Kubernetes secret.
Kubernetes does not encrypt secrets.
Etcd does not encrypt network communication by default.
Kubernetes will share your secrets with all users.
Kubernetes will give any kubelet any data it wants, no questions asked.
Even if you establish secure communication channels between components and within your etcd cluster, you haven't secured your secrets any more meaningfully than your pod specs. They might not just fall out of the CLI's STDOUT into your terminal in cleartext, and they won't survive a node reboot. But, just… no. You have to at least encrypt at rest in the database. That is step #1.
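To make the "plaintext in etcd" point concrete: the base64 encoding Kubernetes applies to secret values in manifests and in etcd is an encoding, not encryption, and anyone can reverse it in one line (the value here is made up for illustration):

```shell
# A Kubernetes-style "protected" secret value
ENCODED=$(printf 's3cr3t-db-pass' | base64)
echo "$ENCODED"                       # => czNjcjN0LWRiLXBhc3M=

# Anyone with read access to etcd or the manifest recovers the cleartext
printf '%s' "$ENCODED" | base64 -d    # => s3cr3t-db-pass
```

Base64 exists so arbitrary bytes can ride in YAML and JSON; it provides exactly zero confidentiality.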
This is a proof-of-concept quality implementation, only useful for pretending what it would be like to consume an actual secret from a pod. Including secrets in your feature list more than a year before anyone else by just not actually implementing secret protections is inexcusable. Of the features you would expect from any secret manager — not just a good one — stock Kubernetes delivers exactly zero. But that doesn't stop them from using secrets as a selling point.
Kubernetes distros like OpenShift Origin might have better config out of the box but they are still Kubernetes and last I checked OpenShift Origin still stores secret data in cleartext. OpenShift and Google Container Engine are hosted PaaS that let you interact with their platform over a Kubernetes-like interface. But as covered in the next section hosted services might actually be worse depending on your needs.
My advice in 2017 is to skip Kubernetes secrets. They are secrets in name only and a really lame attempt to, “out feature box” other tools. If you want to use Kubernetes for a different reason then you should investigate integrating it with a real secret manager.
Hosted Secret Solutions
PaaS companies provide a significant value add. That being said, hosted secret management is only appropriate for a specific segment of users: the risk tolerant. If hacking is more than just a financial risk to your project then never use a hosted solution.
If hacking is mostly a financial risk then the only thing that matters in selecting a secret hosting platform is a clause in the service agreement that shifts liability for breaches onto the hosting provider. This is the same game that credit card companies have been playing for a while (PCI).
The worst hosted solutions will offer no such shift in liability, and these companies have no mechanism for aligning their interests with your security interests. That means their "secure vault" is probably publicly accessible and encrypted with a ROT13 cipher (basically cleartext). If a customer doesn't own the systems, they can never fully assess the risk of breach.
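For anyone who hasn't seen it, ROT13 is a single `tr` invocation, and applying it twice returns the input. Basically cleartext, as promised:

```shell
# ROT13: "encrypting" and "decrypting" are the same trivial substitution
printf 'hunter2' | tr 'A-Za-z' 'N-ZA-Mn-za-m'   # => uhagre2
printf 'uhagre2' | tr 'A-Za-z' 'N-ZA-Mn-za-m'   # => hunter2
```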
Integrated Platform Summary
If you're only interested in comparing secret management implementations that are included with service/container platforms, then I think in the absence of other factors you'd be well served by Docker, Mesosphere, or now Rancher. You can use Docker and Rancher secrets today for free and enhance Docker's offering later with a Docker Datacenter license, or you can pony up for an Enterprise DC/OS license. Just avoid any Kubernetes-related secrets. Don't use them. Anywhere.
Docker, DC/OS, and Rancher secrets are all relatively new. When I compare the relative momentum of the three projects, I'm most excited to see what Docker puts out next. Maybe that is just because I have a bit of roadmap insight. I've had conversations about service-level X.509 cert generation on the fly and others about enhanced payload signing with the Notary infrastructure in Docker. The more we can make security a force multiplier instead of an overhead burden for developers, the better.
I did not cover Nomad here because, as a HashiCorp product, it delegates appropriately to Vault, which was covered above. I have not evaluated the distribution mechanism; if secrets are moved over the network outside the context of Vault then it should be evaluated more closely.
I think the most important thing to take away from this survey is that in 2017 there is some incredibly powerful secret management software that everyone is free to adopt today. Even more inspiring is that this is all open source software and everyone reading is encouraged to participate in pushing this concern forward on both a technological and prioritization front. These tools and new ones are going to keep getting better so don’t be afraid to adopt and iterate.
Keep it secret, keep it safe… but make sure to read the “Risks” section.