Secrets Management: why we love HashiCorp Vault!

Xan Manning
Published in Koodoo
10 min read · Apr 14, 2022

Introduction

As a Platform Engineer at Koodoo, one of the many challenges my colleagues and I face is the management of secrets: ensuring that they are protected and that only the right people and applications are able to access them. We have been somewhat fortunate that in the early days of Koodoo we adopted Kubernetes, which has its own secret storage objects, although these present their own challenges. On top of the challenges within Kubernetes, we also have to address “single source of truth”, secret life cycles and access management, all of which become very difficult when multiple systems outside of Kubernetes also rely on secrets.

Challenges with Kubernetes Secrets

Writing secrets to Kubernetes requires either a manifest containing the secret values base64 encoded (NOT ENCRYPTED!) or a raw kubectl command that reads the secret from stdin.
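As a quick illustration of why base64 encoding offers no protection, anyone who can read the manifest can recover the plaintext in a single command (the secret value here is obviously made up):

```shell
# Encode a secret value the way a Kubernetes Secret manifest expects it
printf 'hunter2' | base64
# aHVudGVyMg==

# Anyone with read access to the manifest can trivially reverse it
printf 'aHVudGVyMg==' | base64 -d
# hunter2
```

Base64 is an encoding for transport, not a cryptographic operation; it provides no confidentiality at all.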

The temptation to commit the secret manifest to a GitHub repository is high, especially so that it can be used in our pipelines. However, this could end very badly given the fairly open and distributed nature of our team’s GitHub Organisation. On the other hand, writing a secret using a kubectl command means one or more individuals are responsible for ensuring that the secret exists and is backed up somewhere should it need to be re-created. This is a mess either way, because we could very easily lose control of who has access to secrets, particularly when restricting access in line with our Joiners/Movers/Leavers procedures. Equally messy is knowing which version of the secret is the right one, where the secret is ultimately stored outside of the cluster, and where we need to go to change or delete the secret when requirements change.

Once the secrets are in Kubernetes, you need to configure RBAC to ensure only the right people and principals can access them; when you manage multiple clusters, however, this becomes harder to administer as you scale out.

You also have to take steps towards safely configuring the storage. By default, secrets are stored unencrypted in etcd (the cluster’s underlying data store). Normally this is resolved by enabling encryption at rest for secrets, a capability that, since we are using GKE, is available to us out of the box.

Another challenge is how to manage secrets that are shared between different environments and services. In an ideal world, every environment would be completely separate and isolated: dev, test and RC should have absolutely nothing in common in terms of resources or data storage. Different services and workloads should also be isolated, which in Kubernetes can be achieved by creating namespaces. In reality, though, some of the 3rd-party services we rely on mean that complete isolation is not always possible. While we always ensure separation between production and non-production workloads, dev, test and RC might be required to share certain secrets, for example to connect to an email delivery service. That secret might also need to be shared between services, so it needs to be available across multiple namespaces. Imagine a situation where our provider comes along and says “you’re using a v1 format authentication token, we’ve moved on to v2 and you need to update by the end of the month”. That update has to be made multiple times across multiple namespaces and clusters. Yes, toil can always be automated away and tied into some sort of automated process, but you often have to experience the pain before you do anything to fix it, balancing the time to automate against the manual effort when your working day is stretched across dozens of workstreams.

Lastly, there is also a need to access secrets from outside of our Kubernetes cluster, for example in our CI/CD pipelines in GitHub, or for developers to share secret keys for their Postman calls. Argh! Nightmare! How do we access those from outside of the cluster?!

Enter HashiCorp Vault

After some consideration, we decided to look for a solution that would avoid some of the headaches associated with the above challenges and give us new capabilities that lack a convenient answer within Kubernetes. What we chose to implement was HashiCorp Vault, which promises to “Manage Secrets and Protect Sensitive Data”.

Predominantly we were interested in the following key features in Vault that would potentially be advantageous to our platform:

  1. Configuration management — all of this can be defined as Infrastructure as Code (IaC) using Terraform!
  2. Identity-based Access — Not only can we authenticate and apply policies to users, authorising them to manage secrets, but we can do the same for applications and machines.
  3. Kubernetes Integration — secrets can be securely injected into our applications across multiple clusters, all obtained from our single source of truth.
  4. Secrets Management — we are able to centralise our secrets into one place and make them available to multiple applications, systems and environments.
  5. Storage backends — there’s a lot of choice as to where to store our secrets data!
  6. “Encryption as a Service” — our applications can pass data to Vault to be encrypted before it is stored in an event or database.
  7. Automated PKI Infrastructure — we can reduce the manual steps to update certificates for inter-service communications, for example with monitoring.
  8. Dynamic Secrets — credentials can be created on-demand rather than having to store static secrets that are manually managed. Dynamic secrets can often be configured with a time-to-live (TTL), ensuring that they expire when no longer needed.

Configuration Management

At Koodoo, the Platform Team’s number 1 tool for Infrastructure as Code is Terraform. Whenever we look for a new infrastructure solution we tend to favour those that can be defined in code using Terraform.

The perceived advantages of adopting IaC are reduced cost, in terms of the number of people and the amount of effort required to configure a solution. New requirements and configuration changes are often faster to deliver, particularly with modular, reusable code. IaC also reduces risk, since automation avoids the human error of manual reconfiguration (this works particularly well when you include testing in your pipelines) while increasing reliability and repeatability.
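As a flavour of what this looks like, here is a minimal sketch of managing Vault itself through the Terraform Vault provider — a KV v2 secrets mount and a read-only policy. The mount path and policy name are purely illustrative, not our actual configuration:

```hcl
# Hypothetical sketch: a KV v2 secrets mount and a read-only policy,
# both defined as IaC via the Terraform Vault provider.
resource "vault_mount" "app_secrets" {
  path        = "apps"
  type        = "kv-v2"
  description = "Application secrets, managed via Terraform"
}

resource "vault_policy" "app_read" {
  name   = "app-read"
  policy = <<-EOT
    # KV v2 reads go through the "data/" sub-path of the mount
    path "apps/data/*" {
      capabilities = ["read"]
    }
  EOT
}
```

Because the mounts and policies live in code, a change is a pull request rather than a one-off CLI command someone has to remember running.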

Identity-based Access and Kubernetes Integration

This was a big selling point for us in the Platform team, as we have many squads working on multiple services, with engineers responsible for application development, infrastructure and data pipelines. There are secrets that we want our applications to access while restricting which people within the organisation can read them, and secrets that only specific people can access in production; before you know it, we have a crazy matrix of authorisation that needs to be applied across multiple environments. There are secrets that only the developers should have access to and be responsible for managing (such as for authenticating with third-party integrations), as well as secrets that only the data team should see (such as data warehouse credentials). On top of all this, we need to consider production support team members, who are the only individuals allowed to access secrets pertaining to the production environment.

With Vault we are able to perform identity brokering with our main identity service provider that our users rely on for their day-to-day activities. This means that when a new engineer joins our team, as part of their onboarding process they are added to specific groups and their group membership grants access to specific secret mounts in Vault.
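A sketch of how group membership can map to secret access, using Vault’s identity system via the Terraform provider (the group and policy names here are made up for illustration):

```hcl
# Hypothetical sketch: an external identity group, synced from our identity
# provider, whose members automatically receive the "developer-secrets" policy.
resource "vault_identity_group" "developers" {
  name     = "developers"
  type     = "external"
  policies = ["developer-secrets"]
}
```

Onboarding then reduces to adding the new engineer to the right group in the identity provider; no Vault-side changes are needed per person.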

In a similar way, we use Kubernetes to perform identity brokering for our applications. Kubernetes service accounts provide an identity for processes running in a pod, and what we have found awesome about Vault is that we can authenticate a service against Kubernetes and then apply a policy to the application based on its service account name and namespace. This means service A cannot access secrets for service B, because it does not have the correct service account and/or namespace! All of this is made possible with the vault-agent-injector, which mounts secret data into the pod.
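The service-account binding described above can be sketched with the Terraform Vault provider like this (service, namespace and policy names are illustrative, not our real setup):

```hcl
# Hypothetical sketch: a Vault role bound to exactly one service account in
# one namespace. A pod running as any other service account, or in any other
# namespace, cannot log in with this role.
resource "vault_kubernetes_auth_backend_role" "service_a" {
  backend                          = "kubernetes"
  role_name                        = "service-a"
  bound_service_account_names      = ["service-a"]
  bound_service_account_namespaces = ["service-a-ns"]
  token_policies                   = ["service-a-read"]
}
```

The vault-agent-injector then uses this role when authenticating the pod, so the policy boundary is enforced by Vault itself rather than by convention.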

Secrets Management

What can I say? It’s a lot easier now to update secrets because we only need to do it in one place — Vault. All of our environments connect to and pull secrets from our shared Vault, so we have immediately reduced our workload. Secrets are versioned, so if someone mucks up we can roll back to a previous version of the secret. In addition to this, we can now start to consolidate the secrets used for GitHub Actions and other third-party cloud services, using AppRoles as an authentication method.

Storage Backends

Vault can write its secrets storage to so many different types of backend. At Koodoo we host a lot of our infrastructure in GCP, so it made sense for us to leverage Google Cloud Storage buckets as a storage backend. We then keep the GCS bucket in sync with an Azure Storage Account container as part of our backup procedure. On top of this, we use GCP Cloud KMS to unseal Vault, so we have this rather cool situation where the data in Vault can only be accessed when it lives in a specific GCP Project. It’s like having a safe that you can only unlock with your key when you’re in a specific location.
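A Vault server configuration along these lines looks roughly like the following (bucket, project and key names are made up; this is a sketch of the shape, not our actual config):

```hcl
# Hypothetical vault.hcl sketch: GCS bucket for storage, Cloud KMS for
# auto-unseal. Vault's data is useless outside the project holding the key.
storage "gcs" {
  bucket = "example-vault-data"
}

seal "gcpckms" {
  project    = "example-project"
  region     = "global"
  key_ring   = "vault-keyring"
  crypto_key = "vault-unseal-key"
}
```

Because the unseal key never leaves Cloud KMS, copying the storage bucket alone (for example, our Azure backup) yields only encrypted data.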

That’s cool and all, but given that we use GCP heavily, why not just put secrets into Google Cloud Secret Manager? Surely everything discussed above is available with the solution provided in GCP? In the initial stages Secret Manager was considered and would have provided a lot of the features we required; however, as discussed below, Vault goes that little bit further in what it can offer.

“Encryption-as-a-Service”

When handling PII you should always follow security best practices: encryption at rest and, even more importantly, encryption in transit. For this we use Vault’s Transit secrets engine, which is personally one of my favourites!

At Koodoo our applications are event-driven, so we use tools like Google Cloud’s Pub/Sub to move data from the application into the data pipeline for processing and analytics. Ultimately, the data goes on a journey that even traverses cloud providers! Whilst we use TLS to encrypt communications from the browser all the way to the end of the data pipeline, we also encrypt any private data before it starts its journey to its final destination. Typically such an operation would require asymmetric encryption (such as GnuPG), where the backend API of Koodoo’s platform would hold the public key, meaning it could only encrypt data, and the end of the data pipeline would hold the private key, allowing decryption of that data. To avoid the overhead of managing PGP keys, we have written policies in Vault that make the Transit secrets engine behave a bit like asymmetric encryption whilst using AES-256 encryption (which is symmetric). Our APIs (and some of our engineers) only have access to the encryption endpoint, so they can pass data to Vault and receive an encrypted response body. If an API were to attempt to post to the decrypt endpoint, it would receive an authorization error. All along the data pipeline the PII remains encrypted; only when it reaches the data warehouse is it decrypted by a job prior to loading, after which it is re-encrypted at rest.
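The “encrypt-only” behaviour comes down to the Vault policy attached to the API. A sketch of such a policy, assuming a Transit key named “pii” (the key name is illustrative):

```hcl
# Hypothetical encrypt-only Transit policy. The holder can encrypt with the
# "pii" key but any call to the decrypt endpoint is explicitly denied,
# mimicking having only the public half of an asymmetric key pair.
path "transit/encrypt/pii" {
  capabilities = ["update"]
}

path "transit/decrypt/pii" {
  capabilities = ["deny"]
}
```

Only the job at the end of the pipeline holds a token whose policy grants the decrypt path, so the asymmetry lives in the authorisation rather than the cryptography.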

Automated PKI Infrastructure

As mentioned previously in “Encryption-as-a-Service”, we use TLS for encrypting our inter-service communication as well as verifying the identity of each service. At the moment we still often use static certificates from well-known, trusted Certificate Authorities (CAs); however, updating these certificates is a burden as they appear in multiple places within our infrastructure. Vault has a solution! Vault can act as a Certificate Authority and issue certificates to services within our infrastructure automatically, which is particularly effective in Kubernetes when combined with something like cert-manager. This works well for authenticating and encrypting communication between monitoring services such as Prometheus and Loki, where we hope to issue short-lived certificates dynamically as part of the process. Why short-lived certificates? Because if a certificate is leaked, the potential damage is limited to the TTL of the certificate, and we reduce the need to manage long revocation lists.
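A sketch of what such a PKI setup might look like in Terraform; the mount path, role name, domain and TTLs are illustrative assumptions rather than our production values:

```hcl
# Hypothetical sketch: a PKI mount whose role issues short-lived certificates
# for in-cluster service names.
resource "vault_mount" "pki" {
  path                  = "pki"
  type                  = "pki"
  max_lease_ttl_seconds = 86400 # certificates live at most 24 hours
}

resource "vault_pki_secret_backend_role" "internal" {
  backend          = vault_mount.pki.path
  name             = "internal-services"
  allowed_domains  = ["svc.cluster.local"]
  allow_subdomains = true
  ttl              = "1h"
}
```

cert-manager (or a sidecar) can then request a fresh certificate against this role whenever one is close to expiry, removing the manual renewal step entirely.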

Dynamic Secrets

As with our PKI strategy, any chance to use short-lived secrets is a bonus! Vault provides a KV store for keeping static secrets, but with static secrets comes the burden of updating and deleting them as they age. Dynamic secrets, on the other hand, are great for providing short-term access to services, such as credentials for a database or service accounts for Google Cloud. These can be issued to both users and applications with a short TTL, so that if credentials were to leak they would only be valid for maybe an hour or so before being revoked. It’s a lot harder to hit a moving target!
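For example, a dynamic database credential role might be sketched like this with the Terraform Vault provider (the backend, role and database names are made up; this assumes a PostgreSQL connection has already been configured on the mount):

```hcl
# Hypothetical sketch: each read of this role makes Vault create a brand-new
# PostgreSQL user with a one-hour lifetime, then revoke it when the TTL ends.
resource "vault_database_secret_backend_role" "app" {
  backend = "database"
  name    = "app"
  db_name = "app-db"

  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';"
  ]

  default_ttl = 3600
  max_ttl     = 3600
}
```

An application (or an engineer debugging an incident) reads the role to get fresh credentials, and there is nothing long-lived to rotate or leak.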

Where next?

At the moment, it often feels like we are only scratching the surface of what Vault can do. New releases bring new functionality which might even replace or enhance the features of another third-party service or application. As Koodoo develops its platform and ventures into new areas, we re-assess the way we do things and, where there’s a use case, will increase our usage of Vault.
