Protecting infrastructure secrets with Keywhiz
Our newly open-sourced secret management and distribution service
Written by Justin Cummins.
At Square, our number one priority is security. We needed something to protect secrets, especially as their number increased with our adoption of a service-oriented microservice architecture. Although protecting infrastructure secrets is a common need, we weren’t able to find an adequate secret management system. (More on this under “Existing Practices.”) So, we built Keywhiz.
Keywhiz is a secret management and distribution service that is now available for everyone. Keywhiz helps us with infrastructure secrets, including TLS certificates and keys, GPG keyrings, symmetric keys, database credentials, API tokens, and SSH keys for external services — and even some non-secrets like TLS trust stores. Automation with Keywhiz allows us to seamlessly distribute and generate the necessary secrets for our services, which provides a consistent and secure environment, and ultimately helps us ship faster.
Assumptions around secrets
To better understand our secret management solution, it’s helpful to expound on what we require from a system.
- Secret content shouldn’t be widely accessible: neither present on development systems nor checked into a public GitHub repository.
- Services need raw access to secrets. Our opinion is that services should have access to the secrets they use. For example, internal Square systems use TLS extensively. An internal service running a web server should negotiate TLS directly rather than have multitudes of proxy servers. The alternative to direct secret access is for helper agents to use secrets on a service’s behalf — but writing an agent for each use-case, platform, and language isn’t feasible or simple.
- There are a few situations where our services don’t have direct secret access. For example, certificate authorities use hardware security modules (i.e. external hardware designed for this purpose). In this case, Keywhiz still supplies a secret needed to use the hardware security module to the certificate authority service.
- Centralized management is necessary to prevent “too many secrets.” A microservice architecture rapidly increases the number of secrets. Without a centralized system, secret files can be misplaced, copied, or forgotten over time, which makes them more susceptible to being leaked. A centralized system also allows us to more easily analyze secrets. For example, we can monitor for weak keys and upcoming certificate expirations.
- Access to secrets should be auditable. It may be necessary to track down when and by whom a secret was accessed. A secret management system should provide a log of every access and in what context the secret was accessed. Dropping secrets on servers as files doesn’t provide audit capability. (Auditd can help, but it will take additional engineering effort.)
- The system needs to support a variety of services. Since its inception, Keywhiz has provided secrets for a multitude of services and tools at Square. To name just a few: Rails, Jetty, Netty, Nginx, GPG, curl, and MySQL.
- The system must be reliable. Our most important services can’t run without their secrets. The system for delivering secrets has to be reliable and highly available.
- The system should be easy for consumers to use. A system for managing secrets must be easy to use; otherwise, people will be tempted to find shortcuts.
- Key rotation must be detached from software deployment. Key rotation is a requirement for cryptographic systems, but there’s no one-size-fits-all solution. For example, some keys must be rotated before expiration, which may not fit the timing or frequency of software deploys. We need the ability to detach automated key rotation from software deployment.
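As one illustration of the centralized analysis mentioned in the list above, here is a minimal sketch of flagging certificates that are close to expiring. It is not Keywhiz code: the `expiring_soon` function, the 30-day window, and the inventory of secret names and dates are all hypothetical, and it assumes the expiration times have already been extracted from the certificates.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expiring_soon(
    expirations: Dict[str, datetime],
    now: datetime,
    window: timedelta = timedelta(days=30),
) -> List[str]:
    """Return names of secrets expiring within `window` of `now`, soonest first.

    Secrets that have already expired are included as well.
    """
    cutoff = now + window
    due = [(expiry, name) for name, expiry in expirations.items() if expiry <= cutoff]
    return [name for _expiry, name in sorted(due)]


# Hypothetical inventory: secret name -> certificate expiration time.
inventory = {
    "web-tls.crt": datetime(2015, 5, 1, tzinfo=timezone.utc),
    "db-client.crt": datetime(2016, 1, 1, tzinfo=timezone.utc),
    "already-expired.crt": datetime(2015, 3, 1, tzinfo=timezone.utc),
}
today = datetime(2015, 4, 15, tzinfo=timezone.utc)

print(expiring_soon(inventory, today))
# ['already-expired.crt', 'web-tls.crt']
```

A check like this is only practical when every secret lives in one place, which is the point of the centralized-management requirement.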
Existing Practices
We’ve found there are a few common patterns for storing infrastructure secrets, including storing secrets in source code, manually deploying to servers, and using configuration management.
Storing secrets in source code is a prevalent anti-pattern in security. Source code must be accessible on development, revision control, testing, and continuous integration systems — none of which are designed to securely store or distribute secret information. Additionally, updating secret content shouldn’t be tied to a revision of code; merely rotating keys shouldn’t cause system changes.
On a small scale, manually deploying secrets to servers is a reasonable approach. However, without a secret management system, this approach quickly becomes unwieldy as more secrets are inevitably created, replaced, and replicated across more systems. This approach leaves secrets prone to being left in home directories, temporary folders, and backup copies. Some secrets are inevitably not updated or are stored with incorrect permissions. Auditing access to secrets or reasoning about them comprehensively, such as determining upcoming certificate expirations, becomes difficult. If deploying from an encrypted store, there may be one master key to decrypt everything or a complex mapping of what should be deployed where. An old, misplaced, or improperly erased disk can lead to secrets being leaked.
Secret management schemes based on configuration management systems have the same disadvantages as storing secrets in source code and on server disks. Although they have the advantage of being able to decouple secret changes from code changes, configuration management systems are meant to be widely visible and replicated, and to retain change history, all of which are antithetical to secret management. Many projects have been made to encrypt secrets before placing them in a configuration management system, typically using GPG or a home-grown use of AES. Then, a trusted individual enters the key at deployment time and plaintext secrets are deployed onto server disks. Conclusively deleting a secret ranges from hard to nearly impossible. Automated key rotation in a configuration management system needs autonomous authority to make changes to configuration, and sometimes needs access to decryption keys. Also, coupling secret management and configuration management makes it difficult to migrate to another system in the future.
The Keywhiz system is primarily composed of Keywhiz servers and a FUSE filesystem client called KeywhizFs. FUSE enables a program to expose a virtual filesystem without actually storing anything on disk. Administration of Keywhiz servers is done through a web app, CLI, or an automation REST API. Communication between servers, KeywhizFs, and automation clients is protected using mutual authentication with TLS.
Within Keywhiz, access control is defined in terms of clients, groups, and secrets. Each certificate that authenticates for secrets is called a client. Clients are assigned membership to an arbitrary number of groups. To allow a client to access a secret, the secret must be granted to at least one of the groups the client is in. In practice, we create a group for each service on a specific server, a group for each service, and a group that everyone is included in. These three groups cover most use-cases.
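The client, group, and secret relationship described above can be sketched as a small in-memory model. This is an illustration only, not the Keywhiz server's implementation or API: the `AccessModel` class, its method names, and the client, group, and secret names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AccessModel:
    # client name -> groups the client is a member of
    memberships: Dict[str, Set[str]] = field(default_factory=dict)
    # secret name -> groups the secret has been granted to
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def enroll(self, client: str, group: str) -> None:
        """Add a client to a group."""
        self.memberships.setdefault(client, set()).add(group)

    def grant(self, secret: str, group: str) -> None:
        """Grant a secret to a group."""
        self.grants.setdefault(secret, set()).add(group)

    def can_access(self, client: str, secret: str) -> bool:
        """A client may read a secret if they share at least one group."""
        client_groups = self.memberships.get(client, set())
        secret_groups = self.grants.get(secret, set())
        return bool(client_groups & secret_groups)

    def accessible_secrets(self, client: str) -> List[str]:
        """All secrets the client may read, sorted by name."""
        return sorted(s for s in self.grants if self.can_access(client, s))


model = AccessModel()

# The three kinds of group mentioned above (names are made up):
# service-on-a-host, service, and a group that includes everyone.
for group in ("payments@host1", "payments", "everyone"):
    model.enroll("payments-host1-client", group)
model.enroll("search-host7-client", "search")
model.enroll("search-host7-client", "everyone")

model.grant("payments-db-password", "payments")
model.grant("internal-ca-bundle", "everyone")

print(model.can_access("payments-host1-client", "payments-db-password"))  # True
print(model.can_access("search-host7-client", "payments-db-password"))    # False
print(model.accessible_secrets("search-host7-client"))  # ['internal-ca-bundle']
```

Granting to groups rather than to individual clients means a new host for an existing service only needs the right memberships; no per-secret changes are required.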
To protect secrets stored on the server side, every secret is AES-GCM encrypted with a unique key before being stored in a database. This unique key is generated using HKDF. Square uses hardware security modules to contain derivation keys.
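To make the key-derivation step concrete, here is a minimal sketch of HKDF (RFC 5869) with HMAC-SHA256, written against the Python standard library. Several things here are assumptions rather than facts about Keywhiz: using the secret's name as the HKDF `info` input, the empty salt, the 32-byte key length, and the `derive_secret_key` helper are all hypothetical. The placeholder derivation key stands in for one that, at Square, would live in a hardware security module. The AES-GCM encryption itself is omitted because the standard library does not provide it.

```python
import hashlib
import hmac


def hkdf_sha256(key_material: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with HMAC-SHA256 as specified in RFC 5869: extract, then expand."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")

    # Extract step: an empty salt is replaced by hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, key_material, hashlib.sha256).digest()

    # Expand step: chain HMAC blocks until enough output has been produced.
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_secret_key(derivation_key: bytes, secret_name: str) -> bytes:
    """Hypothetical helper: one 256-bit key per secret, bound to the secret's name."""
    return hkdf_sha256(derivation_key, salt=b"", info=secret_name.encode("utf-8"))


# Placeholder only; a real derivation key must be random and kept secret.
derivation_key = bytes(range(32))

key_a = derive_secret_key(derivation_key, "db-password")
key_b = derive_secret_key(derivation_key, "api-token")

print(len(key_a))      # 32
print(key_a != key_b)  # True
```

Deriving keys this way means only the derivation key needs long-term protection, while each stored secret is still encrypted under its own key.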
Services get access to secrets through KeywhizFs. At Square, each service on every host has a directory where a KeywhizFs filesystem is mounted. Services merely have to open a read-only “file” in that directory to access a secret. Performing a directory listing shows which secrets are accessible. Local access control is straightforward; traditional Unix file permissions are used for the secret “files.” The advantage of a file-based representation is that nearly all software is compatible with reading secrets from files.
KeywhizFs uses UNIX permissions to provide local access control and separation. KeywhizFs client certificates, processes, and virtual directories are owned by a special KeywhizFs user, distinct from the user a service uses. Assuming service users are non-privileged, the KeywhizFs mount point (owned by the KeywhizFs user) is the sole interface. To make secrets accessible to a service, the KeywhizFs mount point is assigned to the service user’s group and all secrets are group-readable. This works in the majority of cases, but the occasional software package has strict requirements on file ownership or permission. In such cases, extra metadata is stored with the secret on Keywhiz servers and instructs KeywhizFs to present special ownership or permissions.
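From a service's point of view, consuming secrets this way needs nothing beyond ordinary file operations. The sketch below shows what that might look like; the helper names and the secret name are hypothetical, and a temporary directory stands in for a real KeywhizFs mount point so the example can run anywhere.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import List


def list_secrets(mount_dir: Path) -> List[str]:
    """A directory listing shows which secrets are accessible."""
    return sorted(entry.name for entry in mount_dir.iterdir() if entry.is_file())


def read_secret(mount_dir: Path, name: str) -> bytes:
    """Read one secret "file"; reject names that would escape the mount."""
    if name in ("", ".", "..") or os.path.basename(name) != name:
        raise ValueError(f"invalid secret name: {name!r}")
    return (mount_dir / name).read_bytes()


def is_group_readable(path: Path) -> bool:
    """Check the group-read permission bit on a secret "file"."""
    return bool(path.stat().st_mode & stat.S_IRGRP)


# Stand-in for a mount point: a temporary directory with one fake secret.
with tempfile.TemporaryDirectory() as tmp:
    mount = Path(tmp)
    secret_file = mount / "db-password"
    secret_file.write_bytes(b"not-a-real-password\n")
    secret_file.chmod(0o440)  # read-only for owner and group

    print(list_secrets(mount))                 # ['db-password']
    print(read_secret(mount, "db-password"))   # b'not-a-real-password\n'
    print(is_group_readable(secret_file))      # True
```

Because the interface is just a file read, the same approach works for any language or tool that can open a path.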
Rather than copying files to a remote server, KeywhizFs queries Keywhiz servers for information and caches data in KeywhizFs process memory. In the event of a network disruption or Keywhiz server failure, KeywhizFs will continue to serve authorized secrets that were previously accessed. This is an additional safety mechanism, on top of clustering Keywhiz servers, to guard against a foundational outage. Secrets are cached only in memory and never written to disk, so no data persists if a server is powered off.
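The fallback behavior described above can be sketched as a small client that keeps previously fetched secrets in memory. This is an illustration under stated assumptions, not KeywhizFs itself: the `CachingSecretClient` and `FakeServer` classes are hypothetical, and the policy of trying the server first on every read and falling back to the cache only on `ConnectionError` is assumed for simplicity; the source does not describe the actual cache policy.

```python
from typing import Callable, Dict


class CachingSecretClient:
    """Fetch secrets from a server, keeping an in-memory copy for outages."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}  # held in process memory only

    def get(self, name: str) -> bytes:
        try:
            content = self._fetch(name)
        except ConnectionError:
            # Server unreachable: serve the secret only if it was read before.
            if name in self._cache:
                return self._cache[name]
            raise
        self._cache[name] = content
        return content


class FakeServer:
    """Stand-in for a Keywhiz server that can be switched off."""

    def __init__(self) -> None:
        self.up = True
        self.secrets = {"api-token": b"token-v1", "never-read": b"other"}

    def fetch(self, name: str) -> bytes:
        if not self.up:
            raise ConnectionError("server unreachable")
        return self.secrets[name]


server = FakeServer()
client = CachingSecretClient(server.fetch)

print(client.get("api-token"))  # b'token-v1', fetched from the server
server.up = False
print(client.get("api-token"))  # b'token-v1', served from memory
```

A secret that was never read before the outage is not in the cache, so `client.get("never-read")` raises `ConnectionError` while the server is down.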
KeywhizFs has additional benefits over actual files. For example, every secret access is logged, including the user behind it. Other ideas — such as client-side cryptography and exposing older versions of secrets — are under consideration.
Deployment at Square
Keywhiz uses many TLS certificates, one for each server and each KeywhizFs mount point. That’s a certificate for every service on every server that it’s deployed to. This presumes a PKI system for creating trusted certificates and a deployment system that determines where software should be running.
Companies devise various PKI systems — from internal Certificate Authorities on special hardware to using the portal of a public Certificate Authority. Keywhiz only requires TLS certificates with particular Common Name fields, so it is compatible with most PKI systems. If you don’t have an existing PKI, certstrap is a simple starting point.
Square’s deployment system is authoritative for what software runs where. When a service is first being deployed to a server, the deployment system will insert secrets and authorize a new client via Keywhiz’s automation APIs. A new certificate for KeywhizFs is generated by our certificate authority, an fstab entry is written, and KeywhizFs is mounted in a standard directory for the service to read from. On subsequent deploys, some secrets are automatically renewed, including the certificate used by KeywhizFs. When the service is decommissioned, the relevant access is removed and its secrets are deleted.
Keywhiz has been extremely useful to Square. It has supported both widespread internal use of cryptography and a dynamic microservice architecture. Initially, adopting Keywhiz separated secret content from the configuration it had been mixed into, which made secrets more secure and configuration more accessible. Over time, improvements have led to engineers not even realizing Keywhiz is there. It just works. Please check it out.