Zero to Cloud Part 2: Take your code and DEPLOY IT!
Written By: John Pribula
Welcome back to our discussion about how to successfully implement a cloud environment using a standardized infrastructure as code (IaC) design.
In this part, we will discuss:
- How we defined resource layout
- How we implemented Terraform and CI/CD pipelines
- How we used Open Policy Agent to enforce security and compliance policies
If you haven’t already, I strongly recommend reading part one before continuing. There, we discussed:
- How we started our design
- Questions we had to answer about our cloud needs
- How the journey went
How did we define the layout of resources?
When reviewing our applications, we came to the following conclusions (which still apply today).
- We had several major external services we provided to customers.
- Each of our external services had a dev/stage/prod environment.
- Those services ingested, processed, and output data to several backend data warehouses.
- Most of these services DID (and still do) need to talk to one another, on-prem services, and SaaS providers. This required us to leverage GCP Shared VPC and VPC peering to meet our needs.
- Because of the use cases, these services were written in different languages based on the best fit.
Based on this knowledge, we decided on the following:
There were four main organizational units directly under our GCP organization.
- Workloads
  This OU included our main applications and related projects.
  i. Shared VPC projects
  ii. Business unit workloads
     a. Dev/Stg/Prod
  iii. Monitoring and logging for workloads
  iv. Data warehouses for the workloads
- Corp
  This OU consisted of all business unit services needed to run the company.
  i. Corp IT
  ii. Infosec
  iii. HR
  iv. Finance, etc.
- Sandbox
  This OU included all sandbox projects created for testing and experimentation.
- Common
  This OU contained GCP projects related to the overall management of our cloud as a whole.
  i. Terraform service project
  ii. Security tooling for all GCP projects
When creating a new workload, we created individual development, staging, and production GCP projects. Each fell under the corresponding environment folder in GCP.
For example, a production workload project would have been under:
Workloads -> Business Unit -> Prod -> service-region-zone-provider-env
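This folder hierarchy can itself be expressed in Terraform. As a rough sketch (the display names and organization ID below are placeholders, not our actual values):

```hcl
# Sketch of a Workloads -> Business Unit -> Prod folder chain.
# The organization ID and display names are illustrative only.
resource "google_folder" "workloads" {
  display_name = "Workloads"
  parent       = "organizations/123456789012" # placeholder org ID
}

resource "google_folder" "business_unit" {
  display_name = "Measurement" # example business unit
  parent       = google_folder.workloads.name
}

resource "google_folder" "prod" {
  display_name = "Prod"
  parent       = google_folder.business_unit.name
}
```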
GCP UI access to these environments was restricted to read-only; no manual changes to the environment were allowed.
Each project also had some APIs enabled by default:
- BigQuery
- Cloud Logging
- Cloud Monitoring
- Compute Engine
- Kubernetes Engine (GKE)
- Cloud Storage (GCS)
Other APIs could be added on a per-project basis via a Terraform module. Each API and service outside the default list had to go through a review before being provisioned, to ensure a secure configuration.
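As a rough illustration of a per-project API addition (the project ID and service below are hypothetical examples, not part of our actual module interface):

```hcl
# Enabling one extra API for a single project with the standard
# google_project_service resource; values here are placeholders.
resource "google_project_service" "pubsub" {
  project = "my-workload-dev"       # example project ID
  service = "pubsub.googleapis.com" # API approved via review
}
```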
How did we implement this strategy?
At this point, we had identified our applications/services, what the applications do, and who needed to access them. But how do we actually make this work today?
Terraform seed project
First, let’s define a seed project. Terraform automation in GCP starts with a base seed GCP project located in the Common folder of our GCP organization. This project has a service account that provides the ability to manage all base resources from a central location. It is also where we store our centralized TF state files.
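A centralized GCS state backend can be configured roughly like this (the bucket name and prefix are illustrative; ours differ):

```hcl
# Sketch: storing Terraform state in a GCS bucket owned by the seed
# project. Bucket and prefix are placeholder values.
terraform {
  backend "gcs" {
    bucket = "example-terraform-state" # bucket in the seed project
    prefix = "workloads/example-service"
  }
}
```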
The service account has a set of IAM roles that allow it to create, get, update, and delete the base resources required for a workload. Required permissions include, but are not limited to:
- GCP project administration, including enabling APIs
- Network connectivity
- Creation of additional service accounts for later use
This project and service account are also responsible for managing our folder/OU structure in GCP, as well as org-level policy and IAM. Additional services managed with Terraform have their own service accounts for controlling those resources, created when their modules are instantiated.
Access to this project and service account is highly restricted to reduce the potential of an attack via this service account or from insider threats.
Code layout
Now that we have a seed project, we use it to deploy resources with our GitOps tooling. All IaC tooling at DoubleVerify is created and managed under a Terraform path in our GitLab.
Resource module repositories
Each module is versioned with Git tags and releases so that downstream usage and feature rollout can be scheduled and controlled by the end user of the module. All resource modules used at DoubleVerify are managed under the Modules subgroup.
Accessing resources between the various projects is handled via TF remote state or using data sources.
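For example, one project can read another project's outputs through a `terraform_remote_state` data source (the bucket, prefix, and output names below are hypothetical):

```hcl
# Sketch: reading a shared VPC project's outputs from its remote state.
data "terraform_remote_state" "shared_vpc" {
  backend = "gcs"
  config = {
    bucket = "example-terraform-state" # placeholder bucket
    prefix = "shared/vpc"              # placeholder prefix
  }
}

# The shared network could then be referenced as, e.g.:
# data.terraform_remote_state.shared_vpc.outputs.network_self_link
```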
Once a module is created, it can be consumed in Shared or Workload infrastructure modules to define what resources are created.
Examples of repositories in the modules group include:
- Networks
- Folders
- Projects
- IAM
- GKE
Workload infrastructure repositories
Workload infrastructure repositories are created and managed under the Workloads subgroup. This is to maintain a consistent set of permissions and processes across all use cases.
As an example, a GKE cluster workload is housed under GitLab at DoubleVerify -> Terraform -> Workloads -> GCP -> Measurement -> Dev -> terraform-measurement-dev, as shown in the above diagram.
The infrastructure repository GitLab project template creates a GitLab CI file and a standard set of files for the workload type. For example, a typical workload will include:
- project.tf
  Defines the GCP project module config, including IAM access, enabled APIs, and subnets.
- gke.tf
  Defines the GKE module config to deploy a Kubernetes cluster. It creates the cluster, network ACLs, node pools, etc.
- bootstrap.tf
  A module that connects the GKE module/cluster to our internal tooling.
- gcs.tf
  Defines GCS buckets for storing data in a project.
- service_accounts.tf
  Defines any additional service accounts needed in the project that are not created by other modules.
- .gitlab-ci.yml
- README.md
These files are initially commented out with a link to the module README. We do this to ensure old templates with outdated information are not in use.
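Once uncommented, a project.tf might look roughly like the sketch below; the module source, inputs, and values are hypothetical and do not reflect our real module interface:

```hcl
# Hypothetical project.tf for a dev workload; all values are examples.
module "project" {
  source = "../../modules/projects" # placeholder; in practice a pinned Git tag

  name      = "measurement-dev"
  folder_id = "folders/123456789012"    # example Dev folder ID
  apis      = ["pubsub.googleapis.com"] # additions beyond the defaults
}
```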
Shared workload infrastructure repositories
Shared infrastructure resources used across GCP projects are managed in the Workloads/Shared subgroup. This space has strict controls that allow changes to be approved only by the cloud infrastructure team, as these resources are the core services required to run workloads.
A few examples are:
- Shared VPCs
- Monitoring
Corp and Sandbox workload repositories
Our Corp and Sandbox Terraform code is very similar to our workload code, and we re-use it here. The main difference is that Sandbox has a separate but similarly configured shared network layout, along with a few isolated projects that have no cross-project network access.
Tooling repositories
All custom Docker containers, GitLab CI pipelines, and other script repositories related to the Terraform automation and infrastructure are housed in the Tooling subgroup.
Templates
As mentioned previously, we use templates to guide the usage of our environment, and we have multiple GitLab templates to manage Terraform.
These templates MUST link to a comprehensive README that demonstrates required and optional usage. This documentation, along with internal training materials, is key to allowing developers to manage their own infrastructure.
Pulling it all together
So far, we have discussed our cloud environment, how we use it, and how we define our code. Now how do we create something?
Our .gitlab-ci.yml is the glue that holds it all together. We have three end-user stages in our CI file as well as a base setup stage.
Base/Setup
This is a custom container image that runs our Terraform processes. It is built to interact with the Terraform GCS state backend for us, with few additional changes required.
We set up SSH access to our private GitLab projects that contain our Terraform modules. This is required when one Terraform project references another module via Git. This is further explained in Hashicorp documentation: https://www.terraform.io/docs/language/modules/sources.html#generic-git-repository
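A module referenced from a private GitLab repository over SSH looks roughly like this (the host, path, and tag are placeholders):

```hcl
# Sketch: consuming a versioned module from a private Git repository.
module "network" {
  source = "git::ssh://git@gitlab.example.com/terraform/modules/networks.git?ref=v2.0.1"

  # ... module inputs go here ...
}
```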
This stage is not represented in the image above but is part of the included terraform.yml file.
Format
We begin by running `terraform fmt` as a check for syntax and formatting errors in the Terraform code. This check fails the pipeline if it detects an issue, which must then be resolved before proceeding. It only runs prior to merging to main, in order to reduce pipeline run time.
Plan
In the planning stage, we begin by pulling the credentials to access the GCP environment from our internal Vault server. These are stored in Vault instead of GitLab to keep a separation of ownership and reduce the potential risk of exposing them.
This stage runs a `terraform plan`, which means it:
- Connects to GCP
- Reviews existing resources found in any existing state files
- Compares that to the proposed changes in the merge request
- Exports the proposed changes in a report that is added to the merge request to be reviewed
- Runs policy and security checks on proposed changes (See policy enforcement)
At this point, the pre-merge pipeline is complete.
Apply
The apply stage only happens post-merge. Once the main branch is updated, a new GitLab pipeline is started. This pipeline has two stages, a Terraform plan and a Terraform apply. This is the point at which changes to the infrastructure in GCP actually take place. Once apply is complete, all infrastructure in GCP should be up to date and match that of our code in Git.
Terraform resource policies
DoubleVerify uses Open Policy Agent (OPA) to determine what resources, and what scale of those resources, can be created by departments within our GCP environment. Policies are defined in a policy folder in our Terraform/Tooling subgroup. Policy files are written in Rego, the domain-specific language that OPA uses. The policy check is added to a GitLab pipeline by running Conftest (an OPA client) as part of the planning stage, which enforces the policies.
An example OPA policy that allows only certain resources to be created:
```rego
package main

allowed_resources = [
  "google_folder",
  "google_folder_iam_binding"
]

array_contains(arr, elem) {
  arr[_] = elem
}

deny[reason] {
  r = input.resource_changes[_]
  action := r.change.actions[count(r.change.actions) - 1]
  array_contains(["create", "update"], action) # allow destroy action
  not array_contains(allowed_resources, r.type)
  reason := sprintf(
    "%s: resource type %q is not allowed",
    [r.address, r.type]
  )
}
```
An example policy that denies deleting resources:
```rego
package main

denied_actions = [
  "delete"
]

array_contains(arr, elem) {
  arr[_] = elem
}

deny[reason] {
  r = input.resource_changes[_]
  action := r.change.actions[count(r.change.actions) - 1]
  array_contains(denied_actions, action) # don't allow destroy action
  reason := sprintf(
    "Resource deletion is not allowed for %s",
    [r.address]
  )
}
```
Glossary
Google Cloud Platform (GCP): A cloud computing resource provider
GCP project: A project in GCP is a way to organize multiple compute resources.
GitLab project: A location in GitLab that contains resources such as a Git repository, CI/CD pipelines, and GitLab services.
Folder: A GCP term to separate and organize multiple GCP projects.
Workload: A DoubleVerify term used to refer to a GCP project and its related resources (GKE, buckets, etc.). Each piece of software/microservice will have a corresponding GCP project(s) and be considered a workload.
Infrastructure as Code (IaC): Managing compute resource lifecycle via configuration files.
Terraform (TF): An IaC tool created by HashiCorp. It has become one of the most popular IaC tools, with huge community support.
HashiCorp Configuration Language (HCL): The configuration language used by TF to define infrastructure.
Resource (TF): A block of code in TF that describes an infrastructure object, such as a compute instance, GCP project, DNS record, etc.
Resource (Cloud): An asset, such as a VM, Kubernetes cluster, or Cloud SQL instance, that is deployed in a cloud service provider environment.
Resource module: A TF construct that contains a definition of multiple resources used together. You can think of it as similar to a class in object-oriented programming. You instantiate a module to create resources from another configuration file.
Infrastructure module: Builds resources based on the definition from a resource module.
Provider: A plugin that allows TF to interact with remote services to create resources.
Data source: Exposes arbitrary data for use elsewhere in TF configuration. This can be an external program or Terraform remote state.
Terraform State file: A file that maps resources in your infrastructure to the configuration in your modules and any related metadata. State files can also include sensitive information like passwords and access credentials.
Remote state: A state file that contains information about a different set of resources but has relevant information needed by the resources you are working with.