HashiCorp Vault and Terraform on Google Cloud — Security Best Practices
Use this guide when deploying Vault with Terraform in Google Cloud for a production-hardened architecture following security best practices that enable DevOps and the business to succeed!
HashiCorp’s Terraform is a tool for provisioning and managing resources through structured configuration files, an approach commonly called infrastructure as code (IaC). Security is always important and one of the most common security exposures involves storing credentials or other secrets in configuration files. HashiCorp’s Vault helps by providing secrets management which eliminates the requirement to store secrets such as credentials in configuration files.
In this post, I’ll describe a reference architecture for deploying and configuring Vault in GCP with Terraform that follows cloud security best practices and adheres to the Principle of Least Privilege. If you stick around long enough, I’ll also list out some security best practices for each of the components of this system.
Within GCP, you can isolate groups of resources into projects that have a hard boundary, which allows you to adhere to the Principle of Least Privilege. Each resource in these projects has its own set of permissions, and resources can only talk to one another if explicitly allowed. In this case there’s a Shared VPC between multiple projects, which allows the application projects to communicate over the network with the Secrets project.
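As a rough illustration of the Shared VPC relationship described above, here is a minimal Terraform sketch. The project IDs are hypothetical, and treating the Secrets project as the Shared VPC host is an assumption; the architecture does not say which project hosts the network.

```hcl
# Hypothetical project IDs; replace with your own.
locals {
  host_project     = "example-secrets-project"
  service_projects = ["example-app-project-a", "example-app-project-b"]
}

# Enable Shared VPC on the host project.
resource "google_compute_shared_vpc_host_project" "host" {
  project = local.host_project
}

# Attach each application project as a service project so that its
# resources can use subnets from the host project's network.
resource "google_compute_shared_vpc_service_project" "apps" {
  for_each        = toset(local.service_projects)
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = each.value
}
```

Attaching a service project only makes the network available to it. Firewall rules and IAM still decide which resources may actually talk to each other.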
This architecture lays out four major components. I’ll describe each one and then provide some best practices.
Infrastructure as Code (IaC) Project
This is the GCP project where the CI/CD pipeline for Terraform should be deployed. This project is granted a large number of Cloud IAM privileges since it is responsible for creating and maintaining the rest of your infrastructure. If someone creates a trigger with undesirable behavior, the impact can be huge. As a result, monitoring the service account that runs the pipeline and reviewing code both become crucial.
As a caveat, this is only really necessary if you are using the open source version of Terraform, since Terraform Enterprise can handle automated deployments for you. In any case, this project should contain the Cloud Build configuration necessary to run terraform plan and terraform apply in an automated fashion and send the right logs to the right folks when a run fails. You should also store your Terraform state file in GCS within this project, protecting it with VPC Service Controls.
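A minimal sketch of pointing Terraform at a state bucket in the IaC project. The bucket name and prefix are hypothetical.

```hcl
terraform {
  backend "gcs" {
    # Hypothetical bucket that lives in the IaC project.
    bucket = "example-iac-tfstate"

    # Object prefix that keeps this configuration's state separate
    # from any other state stored in the same bucket.
    prefix = "vault/prod"
  }
}
```

No credentials or encryption keys appear in the block itself. The pipeline's attached service account and environment variables supply those at run time.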
Secrets Project
This GCP project includes the necessary infrastructure for a Vault cluster: the cluster itself (which could run on GCE or GKE), the storage backend, an internal load balancer, and a bastion host (running on a Compute Engine VM) used to maintain Vault using its API.
Typically a bastion is placed on the public internet as a hardened VM whose only responsibility is to accept SSH connections. With Cloud IAP SSH tunnelling, you gain this functionality without exposing the VM to the public internet, which also shields it from DDoS attacks. You may be asking then, why do I need a bastion host at all? Well, in many cases you don’t, but a bastion host becomes useful for maintaining internal services over HTTP where the service itself doesn’t need SSH. This means I don’t have to make the Vault server listen on port 22, only on 443 as it should. The same concept can be used to maintain private GKE clusters as well. You can also turn off the bastion when you aren’t using it to save some money.
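The network side of this setup could look roughly like the following sketch. The project, network name, and tags are hypothetical. The source range is the one Google documents for IAP TCP forwarding.

```hcl
# Allow SSH to the bastion only from the IAP TCP forwarding range.
resource "google_compute_firewall" "iap_to_bastion" {
  project       = "example-secrets-project" # hypothetical
  name          = "allow-iap-ssh-to-bastion"
  network       = "example-shared-vpc" # hypothetical network name
  direction     = "INGRESS"
  source_ranges = ["35.235.240.0/20"]
  target_tags   = ["bastion"]

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}

# Allow the bastion to reach Vault's API port. No rule opens port 22
# on the Vault nodes.
resource "google_compute_firewall" "bastion_to_vault" {
  project     = "example-secrets-project" # hypothetical
  name        = "allow-bastion-to-vault-api"
  network     = "example-shared-vpc"
  direction   = "INGRESS"
  source_tags = ["bastion"]
  target_tags = ["vault"]

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}
```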
You’ll notice Vault is also behind an internal load balancer, which, though not depicted explicitly in the diagram, should be a TCP/UDP load balancer. The reason is that Vault can terminate TLS within the process itself, ensuring end-to-end encryption. If you use an HTTPS load balancer, you would have to re-encrypt traffic to get the same effect, so you might as well use the TCP listener with TLS that Vault provides.
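For illustration, a Vault server configuration along these lines might look like the sketch below. This is Vault's own configuration file, not Terraform. The file paths, bucket name, and DNS name are hypothetical. Vault's default API port is 8200, so binding to 443 as described above assumes the process is allowed to bind a privileged port.

```hcl
# TLS is terminated by Vault itself, so a TCP load balancer can pass
# the encrypted traffic straight through.
listener "tcp" {
  address         = "0.0.0.0:443"
  tls_cert_file   = "/etc/vault/tls/vault.crt" # hypothetical path
  tls_key_file    = "/etc/vault/tls/vault.key" # hypothetical path
  tls_min_version = "tls12"
}

# Cloud Storage as the storage backend, with HA coordination enabled.
storage "gcs" {
  bucket     = "example-vault-data" # hypothetical bucket
  ha_enabled = "true"
}

# Address advertised to clients for redirection; hypothetical internal name.
api_addr = "https://vault.internal.example:443"
```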
Version Control System
This is not a GCP project, but the system that stores your Terraform code. I’ll talk about some best practices around securely configuring Terraform for a production environment a bit later. In general, you should pick a version control system (VCS) that gives you a high level of control over access to the master branch. As an example, many VCSs let you require review by multiple users before merging into master, or prevent the rebase command from being used to rewrite history on a particular branch. This level of control is important, especially when moving toward an automated system where a merge to a branch triggers another automated job.
Application Projects
These GCP projects contain the GCP resources that are the consumers of this shared infrastructure. They are the projects that are maintained by the Terraform config files and need to access secrets from Vault to function. For example, let’s say we have a Java app running in a GKE container that needs to talk to a MySQL database. You might keep a file in source code that specifies environment config such as the host of that MySQL database, but you would not want to keep the credentials in source. Instead of baking these values into the container, you can use Vault to pull them into the container at run time. The same process can be used for GCE images as well.
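To keep the Java app example within least privilege, the Vault policy attached to that app could be as narrow as the sketch below. This is a Vault policy, not Terraform. The role name "java-app" and the assumption that Vault's database secrets engine is mounted at "database/" are both illustrative.

```hcl
# Allow the app to request short-lived MySQL credentials and nothing else.
# Vault policies deny by default, so every other path stays inaccessible.
path "database/creds/java-app" {
  capabilities = ["read"]
}
```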
Flow of the Architecture
The flow of the architecture above indicates the primary flow of data or interactions from one entity to the other. Starting from the top-left:
- The Vault Admin goes through two flows: (a) pushing configuration changes to the Terraform repo for Vault, and (b) updating secrets in Vault via the bastion VM (through Cloud IAP). Since secrets should not live in Terraform, they must be added manually.
- The repo update triggers Cloud Build which pulls the Terraform code from that same repository and interacts with the Terraform state file on GCS. The net result is that the GCP resources are deployed or updated to match the state described in the Terraform config files.
- The Vault cluster stores or updates the secrets in the “storage backend” which is GCS in this case.
- Finally, the application projects, depicted here as GKE clusters, pull secrets from Vault at startup as well as periodically using the Vault Agent.
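The step where a repo update triggers Cloud Build could be declared roughly as follows. This sketch assumes the Terraform repo is hosted in or mirrored to Cloud Source Repositories. The project, repo name, and build file name are hypothetical.

```hcl
resource "google_cloudbuild_trigger" "terraform_apply" {
  project = "example-iac-project" # hypothetical
  name    = "terraform-apply"

  # Fire on pushes (including merges) to the master branch.
  trigger_template {
    repo_name   = "vault-terraform" # hypothetical repository
    branch_name = "master"
  }

  # The build steps that run terraform plan and apply live in the
  # same repository as the Terraform code.
  filename = "cloudbuild.yaml"
}
```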
So when should I use this?
This architecture applies when Terraform is the primary means of deploying Google Cloud infrastructure and Vault is used for secrets management within that infrastructure. Vault is not always an ideal solution for secrets management. If only static secrets are needed in certain contexts, you should consider using Cloud KMS to encrypt secrets and storing them in source code or GCS buckets. It’s perfectly fine to store secrets in source code if they are encrypted. Vault is an ideal solution for disparate teams storing secrets at a large scale or when you need some of Vault’s dynamic secret generation capability.
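For the static-secrets case, a Cloud KMS key for encrypting values before they are committed or uploaded might be provisioned as in this sketch. The project, names, location, and rotation period are all illustrative choices, not recommendations from the architecture.

```hcl
resource "google_kms_key_ring" "static_secrets" {
  project  = "example-secrets-project" # hypothetical
  name     = "static-secrets"
  location = "global"
}

resource "google_kms_crypto_key" "app_config" {
  name            = "app-config"
  key_ring        = google_kms_key_ring.static_secrets.id
  rotation_period = "7776000s" # 90 days, expressed in seconds

  # Guard against accidental deletion; data encrypted with a destroyed
  # key cannot be recovered.
  lifecycle {
    prevent_destroy = true
  }
}
```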
Terraform Security Best Practices
The IaC project should be the single source from which the Vault cluster and its environment are deployed using Terraform. Terraform should run just like any other build step with Cloud Build. Once a merge happens in the relevant repositories, a Cloud Build job should execute to run terraform apply. We won’t go into details of a Terraform CI/CD pipeline here, but suffice to say that the IaC project is where that pipeline should live. (This is all assuming you are using the open source tool, since Terraform Enterprise handles automated deployment for you.) Some key points to make here are:
- Execute Terraform programmatically. From a security perspective, Terraform is a very powerful product that has vast control over your infrastructure. Much like you should use systems like Cloud Build and Spinnaker to deploy applications, you should deploy infrastructure with Terraform programmatically as part of a pipeline, preferably using Service Accounts instead of Cloud Identity users.
- Run pre-apply checks. When running Terraform in an automated pipeline, you can use either Google’s Terraform Validator, which checks the terraform plan output against existing Forseti policies, or HashiCorp’s Sentinel to ensure that an apply action will not cause security regressions.
- Run post-apply checks. Once the terraform apply command has executed, your deployment system should automatically run integration checks to verify the security of the deployment. Tools like Forseti, HashiCorp’s Sentinel, InSpec and Serverspec can all perform this type of check.
- Enforce separation of duties. Ideally, you should run Terraform from an automated system to which no individual users have access, except in a break-glass scenario. At the very least, you should separate permissions in GCP and directories in Terraform in such a way that you limit access to only those who need it. For example, a network project should correspond to a network Terraform service account or user who only has access to this project.
- Store your state remotely. Google recommends that you store state remotely. On GCP, you can use Terraform’s GCS state backend. This not only provides state locking, which lets a team collaborate safely, but also keeps the state and all its potentially sensitive information out of version control. This state file should also be protected with a VPC Service Controls perimeter to prevent exfiltration and access from other projects.
- Avoid storing secrets in state. There are many resources and data sources in Terraform that store secret values in plain text in the state file. It is best to avoid storing secrets in state, though at times automation may be sacrificed. Some examples of resources and data sources that store secrets in plaintext are: vault_generic_secret, tls_private_key, google_service_account_key, and the google_client_config data source.
- Encrypt your state. The GCS backend in Terraform allows you to pass in customer-supplied encryption keys (CSEKs) at run time using the GOOGLE_ENCRYPTION_KEY environment variable. Even though Cloud Storage buckets are already encrypted at rest, this gives you an added layer of protection. And even though there shouldn’t be any secrets in the state file, you should always encrypt the state for additional defense in depth. As a side note, using a customer-managed encryption key (CMEK) doesn’t require extra IAM permissions to access the state file since only server-side encryption is supported by Terraform.
- Modularize where possible. Terraform modules accept input variables at run time. Modularizing reduces repetitive code, which can otherwise lead to configuration drift and errors over time. Variable injection at run time also enables and encourages unit testing of Terraform code as part of a CI/CD pipeline.
In general, if sensitive values are created and managed by a Terraform resource, or a sensitive value is pulled in by a data source, those secrets will be stored in the state. If you need to get around this and handle a secret only temporarily and in memory during a Terraform run, consider using a null_resource with a provisioner, whose command output is not stored in state (its triggers are, so keep secrets out of them).
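To make the separation-of-duties item from the list above more concrete, here is a minimal sketch of a Terraform service account that can manage only the network project. The project IDs, account ID, and choice of role are hypothetical.

```hcl
# Service account used only by the Terraform run for the network project.
resource "google_service_account" "network_terraform" {
  project      = "example-iac-project" # hypothetical home for pipeline identities
  account_id   = "tf-network"
  display_name = "Terraform (network project)"
}

# Grant it a network-scoped role on the network project only. It receives
# no bindings on any other project.
resource "google_project_iam_member" "network_admin" {
  project = "example-network-project" # hypothetical
  role    = "roles/compute.networkAdmin"
  member  = "serviceAccount:${google_service_account.network_terraform.email}"
}
```

Pairing one such account with one Terraform directory keeps the blast radius of a bad change, or of a compromised pipeline, limited to a single project.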
Vault Security Best Practices
The Secrets project needs to be locked down further than most other projects considering the information it contains. In this case we are using a few key security controls including Cloud IAP, a bastion host, VPC Service Controls and others since this project will contain secrets for the entire environment (Prod, Staging, Dev, etc.). In addition to HashiCorp’s Vault Hardening Guide, here are some security best practices to keep in mind for using Vault with Terraform in Google Cloud environments.
- Isolate the installation with single tenancy. You can host Vault on Google Kubernetes Engine (GKE) or Compute Engine, but it should be completely isolated in its own project, cluster, and/or private VPC/subnet. If running on Compute Engine, Vault should be the only main process running on a machine. This reduces the risk that another process running on the same machine could be compromised and interact with Vault. If you’re using Cloud Storage, you can also use VPC Service Controls to ensure no other project has access to the Vault backend.
- Use a bastion host for admin access. Vault must be accessible by humans via API, but never by SSH or RDP. Identity-Aware Proxy (IAP) is an ideal solution to allow Vault to run in a private network and still be accessible for administration and usage. Once you situate Vault on a private subnet, you can access it using a bastion host. This allows you to disable SSH access to the Vault nodes altogether.
- Use Shielded VMs. A Shielded VM has verified boot and kernel integrity, helping to defend against boot and kernel level vulnerabilities as well as rootkits and bootkits. Recently Google announced support for Shielded VMs on GKE nodes as well. So now regardless of the underlying infrastructure, you can have the verified secure boot, vTPM and integrity monitoring that Shielded VMs enable.
- Restrict storage access. Vault encrypts all data at rest, regardless of which storage backend you use. Although the data is encrypted, an attacker with arbitrary control can cause data corruption or loss by modifying or deleting keys. To avoid unauthorized access or operations, restrict access to the storage backend to only Vault. When using a Cloud Storage backend, you can use VPC Service Controls to ensure no other project has access to the Vault backend.
- Run in high availability mode. If your storage backend supports it, take advantage of Vault’s HA mode to gain high availability for a production cluster. If using GKE, you should expose the Vault nodes via an internal load balancer. With Compute Engine, however, you can use server-side redirection and avoid a load balancer altogether. In either case, Vault should be situated on a private subnet behind a bastion host.
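Several of the items above (single tenancy, a private subnet, Shielded VMs, and an attached service account) can be combined in one Compute Engine definition. The following is a sketch only. Every name, the machine type, the zone, and the image are assumptions.

```hcl
resource "google_compute_instance" "vault" {
  project      = "example-secrets-project" # hypothetical
  name         = "vault-0"
  machine_type = "n1-standard-2"
  zone         = "us-central1-a"
  tags         = ["vault"]

  boot_disk {
    initialize_params {
      # The image must support Shielded VM features.
      image = "debian-cloud/debian-12"
    }
  }

  # No access_config block, so the node gets no external IP address.
  network_interface {
    subnetwork         = "example-vault-subnet" # hypothetical private subnet
    subnetwork_project = "example-secrets-project"
  }

  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }

  # Dedicated identity for Vault. Access to the storage bucket should be
  # granted to this account alone.
  service_account {
    email  = "vault-server@example-secrets-project.iam.gserviceaccount.com" # hypothetical
    scopes = ["cloud-platform"]
  }
}
```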
Application Projects Best Practices
In this architecture, I used GKE clusters as an example, but this could represent any type of compute product with an application running on it, such as Compute Engine or Cloud Run. These applications should exist on the Shared VPC with Vault and be able to pull dynamic secrets from Vault. These projects do not necessarily need to talk to one another, and if necessary, VPC Service Controls can be used to further isolate project resources.
- Authenticate to Vault with the proper method. When autonomous systems connect to Vault, you need to make sure that authentication keys are used and stored appropriately. Instead of downloading Service Account keys and storing them on disk, you should attach service accounts to the compute service using them. When a Compute Engine instance authenticates, use the GCE authentication method. Similarly with GKE, use the Kubernetes authentication method or the GCE authentication method if you are using Workload Identity (Beta).
- Use Immutable Infrastructure where possible. When configuring an application deployment, an ideal pattern is to fully “bake” a Compute Engine or container image, minus the secrets. From there, you can use Vault to pull secrets onto disk at boot time using a startup script and Vault Agent. If secrets exist in any form outside of Vault (such as in plaintext in Terraform state), least privilege ceases to exist.
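The two items above can be sketched together as a Vault Agent configuration for a Compute Engine workload. This is Vault Agent's own configuration file, not Terraform. The Vault address, role name, and file paths are hypothetical, and the GCP auth method is assumed to be mounted at its default path.

```hcl
# Where the agent finds Vault; hypothetical internal address.
vault {
  address = "https://vault.internal.example:443"
}

auto_auth {
  # Authenticate with the instance's attached service account identity
  # instead of a downloaded key file.
  method "gcp" {
    config = {
      type = "gce"
      role = "java-app" # hypothetical Vault role
    }
  }

  # Write the resulting Vault token to a file for local use.
  sink "file" {
    config = {
      path = "/run/vault/token"
    }
  }
}

# Render secrets into a file the application reads at startup. The
# template itself stays in the baked image; the secret values do not.
template {
  source      = "/etc/vault-agent/db.properties.ctmpl"
  destination = "/run/secrets/db.properties"
}
```

Starting the agent from a startup script with this file keeps the image free of secrets while still delivering them at boot.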