Zero to Cloud Part 2: Take your code and DEPLOY IT!
Written By: John Pribula
Welcome back to our discussion about how to successfully implement a cloud environment using a standardized infrastructure as code (IaC) design.
In this part, we will discuss:
- How we defined resource layout
- How we implemented Terraform and CI/CD pipelines
- How we used Open Policy Agent to enforce security and compliance policies
If you haven’t already, I strongly recommend reading part one before continuing. There, we discussed:
- How we started our design
- Questions we had to answer about our cloud needs
- How the journey went
How did we define the layout of resources?
When reviewing our applications, we came to the following conclusions (which still apply today).
- We had several major external services we provided to customers.
- Each of our external services had a dev/stage/prod environment.
- Those services ingested, processed, and output data to several backend data warehouses.
- Most of these services DID (and still do) need to talk to one another, on-prem services, and SaaS providers. This required us to leverage GCP Shared VPC and VPC peering to meet our needs.
- Because of the use cases, these services were written in different languages based on the best fit.
Based on this knowledge, we decided on the following:
There were four main organizational units directly under our GCP organization.
- Workloads
  This OU included our main applications and related projects.
  i. Shared VPC projects
  ii. Business unit workloads
     a. Dev/Stg/Prod
  iii. Monitoring and logging for workloads
  iv. Data warehouses for the workloads
- Corp
  This OU consisted of all business unit services needed to run the company.
  i. Corp IT
  ii. Infosec
  iii. HR
  iv. Finance, etc.
- Sandbox
  This OU included all sandbox projects created for testing and experimentation.
- Common
  This OU contained GCP projects related to the overall management of our cloud as a whole.
  i. Terraform service project
  ii. Security tooling for all GCP projects
When creating a new workload, we created individual development, staging, and production GCP projects. Each fell under the corresponding environment folder in GCP.
For example, a production workload project would have been under:
Workloads -> Business Unit -> Prod -> service-region-zone-provider-env
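This folder hierarchy can itself be expressed in Terraform. As a rough sketch (the display names and organization ID below are placeholders, not our actual values):

```hcl
# Sketch of a Workloads -> Business Unit -> Prod folder chain.
# The organization ID and display names are illustrative only.
resource "google_folder" "workloads" {
  display_name = "Workloads"
  parent       = "organizations/123456789012" # placeholder org ID
}

resource "google_folder" "business_unit" {
  display_name = "Measurement" # example business unit
  parent       = google_folder.workloads.name
}

resource "google_folder" "prod" {
  display_name = "Prod"
  parent       = google_folder.business_unit.name
}
```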
GCP UI access to these environments was restricted to read-only; no manual changes to the environment were allowed.
Each project also had some APIs enabled by default:
- BigQuery
- Cloud Logging
- Cloud Monitoring
- Compute Engine
- Kubernetes Engine (GKE)
- Cloud Storage (GCS)
Other APIs could be added on a per-project basis via a Terraform module. Each API and service outside the default list had to go through a review before being provisioned, to ensure a secure configuration.
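As a rough illustration of a per-project API addition (the project ID and service below are hypothetical examples, not part of our actual module interface):

```hcl
# Enabling one extra API for a single project with the standard
# google_project_service resource; values here are placeholders.
resource "google_project_service" "pubsub" {
  project = "my-workload-dev"       # example project ID
  service = "pubsub.googleapis.com" # API approved via review
}
```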
How did we implement this strategy?
At this point, we had identified our applications/services, what the applications do, and who needed to access them. But how do we actually make this work today?
Terraform seed project
First, let’s define a seed project. Terraform automation in GCP starts with a base seed GCP project located in the Common folder of our GCP organization. This project has a service account that provides the ability to manage all base resources from a central location. It is also where we store our centralized TF state files.
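A centralized GCS state backend can be configured roughly like this (the bucket name and prefix are illustrative; ours differ):

```hcl
# Sketch: storing Terraform state in a GCS bucket owned by the seed
# project. Bucket and prefix are placeholder values.
terraform {
  backend "gcs" {
    bucket = "example-terraform-state" # bucket in the seed project
    prefix = "workloads/example-service"
  }
}
```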
The service account has a set of IAM roles that allow it to create, get, update, and delete the base resources required for a workload. Required permissions include, but are not limited to:
- GCP project administration, including enabling APIs
- Network connectivity
- Creation of additional service accounts for later use
This project and service account are also responsible for managing our folder/OU structure in GCP, as well as org-level policy and IAM. Additional services managed with Terraform have their own service accounts for controlling those resources, created when their modules are instantiated.
Access to this project and service account is highly restricted to reduce the potential of an attack via this service account or from insider threats.
Code layout
Now that we have a seed project, we use it to deploy resources with our GitOps tooling. All IaC tooling at DoubleVerify is created and managed under a Terraform path in our GitLab.
Resource module repositories
Each module is versioned with Git tags and releases so that downstream usage and feature rollout can be scheduled and controlled by the end user of the module. All resource modules used at DoubleVerify are managed under the Modules subgroup.
Accessing resources between the various projects is handled via TF remote state or using data sources.
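For example, one project can read another project's outputs through a `terraform_remote_state` data source (the bucket, prefix, and output names below are hypothetical):

```hcl
# Sketch: reading a shared VPC project's outputs from its remote state.
data "terraform_remote_state" "shared_vpc" {
  backend = "gcs"
  config = {
    bucket = "example-terraform-state" # placeholder bucket
    prefix = "shared/vpc"              # placeholder prefix
  }
}

# The shared network could then be referenced as, e.g.:
# data.terraform_remote_state.shared_vpc.outputs.network_self_link
```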
Once a module is created, it can be consumed in Shared or Workload infrastructure modules to define what resources are created.
Examples of repositories in the modules group include:
- Networks
- Folders
- Projects
- IAM
- GKE
Workload infrastructure repositories
Workload infrastructure repositories are created and managed under the Workloads subgroup. This is to maintain a consistent set of permissions and processes across all use cases.
As an example, a GKE cluster workload is housed under GitLab at DoubleVerify -> Terraform -> Workloads -> GCP -> Measurement -> Dev -> terraform-measurement-dev, as shown in the above diagram.
The infrastructure repository GitLab project template creates a GitLab CI file and a standard set of files for the workload type. For example, a typical workload will include:
- project.tf
  Defines the GCP project module config, including IAM access, enabled APIs, and subnets.
- gke.tf
  Defines the GKE module config to deploy a Kubernetes cluster. It creates the cluster, network ACLs, node pools, etc.
- bootstrap.tf
  A module that connects the GKE module/cluster to our internal tooling.
- gcs.tf
  Defines GCS buckets for storing data in a project.
- service_accounts.tf
  Defines any additional service accounts needed in the project that are not created by other modules.
- .gitlab-ci.yml
- README.md
These files are initially commented out with a link to the module README. We do this to ensure old templates with outdated information are not in use.
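Once uncommented, a project.tf might look roughly like the sketch below; the module source, inputs, and values are hypothetical and do not reflect our real module interface:

```hcl
# Hypothetical project.tf for a dev workload; all values are examples.
module "project" {
  source = "../../modules/projects" # placeholder; in practice a pinned Git tag

  name      = "measurement-dev"
  folder_id = "folders/123456789012"    # example Dev folder ID
  apis      = ["pubsub.googleapis.com"] # additions beyond the defaults
}
```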
Shared workload infrastructure repositories
Shared infrastructure resources used across GCP projects are managed in the Workloads/Shared subgroup. This space has strict controls that allow changes to be approved only by the cloud infrastructure team, as these resources are the core services required to run workloads.
A few examples are:
- Shared VPCs
- Monitoring
Corp and Sandbox workload repositories
Our Corp and Sandbox Terraform code is very similar to our workload code, and we re-use it here. The main difference is that Sandbox has a separate but similarly configured shared network layout, along with a few isolated projects that have no cross-project network access.
Tooling repositories
All custom Docker containers, GitLab CI pipelines, and other script repositories related to the Terraform automation and infrastructure are housed in the Tooling subgroup.
Templates
As mentioned previously, we use templates to guide the usage of our environment, and we have multiple GitLab templates to manage Terraform.
These templates MUST link to a comprehensive README that demonstrates required and optional usage. This documentation, along with internal training materials, is key to allowing developers to manage their own infrastructure.
Pulling it all together
So far, we have discussed our cloud environment, how we use it, and how we define our code. Now how do we create something?
Our .gitlab-ci.yml is the glue that holds it all together. We have three end-user stages in our CI file as well as a base setup stage.
Base/Setup
This is a custom container image that runs our Terraform processes. It is built to interact with the Terraform GCS state backend for us, with few additional changes required.
We set up SSH access to our private GitLab projects that contain our Terraform modules. This is required when one Terraform project references another module via Git. This is further explained in Hashicorp documentation: https://www.terraform.io/docs/language/modules/sources.html#generic-git-repository
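A module referenced from a private GitLab repository over SSH looks roughly like this (the host, path, and tag are placeholders):

```hcl
# Sketch: consuming a versioned module from a private Git repository.
module "network" {
  source = "git::ssh://git@gitlab.example.com/terraform/modules/networks.git?ref=v2.0.1"

  # ... module inputs go here ...
}
```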
This stage is not represented in the image above but is part of the included terraform.yml file.
Format
We begin by running `terraform fmt` as a check for syntax and formatting errors in the Terraform code. This check fails the pipeline if it detects an issue, which must then be resolved before proceeding. It only runs prior to merging to main, in order to reduce pipeline run time.
Plan
In the planning stage, we begin by pulling the credentials to access the GCP environment from our internal Vault server. These are stored in Vault instead of GitLab to keep a separation of ownership and reduce the potential risk of exposing them.
This stage runs a `terraform plan`, which means it:
- Connects to GCP
- Reviews existing resources found in any existing state files
- Compares that to the proposed changes in the merge request
- Exports the proposed changes in a report that is added to the merge request to be reviewed
- Runs policy and security checks on proposed changes (See policy enforcement)
At this point, the pre-merge pipeline is complete.
Apply
The apply stage only happens post-merge. Once the main branch is updated, a new GitLab pipeline is started. This pipeline has two stages, a Terraform plan and a Terraform apply. This is the point at which changes to the infrastructure in GCP actually take place. Once apply is complete, all infrastructure in GCP should be up to date and match that of our code in Git.
Terraform resource policies
DoubleVerify uses Open Policy Agent (OPA) to determine what resources, and what scale of those resources, can be created by departments within our GCP environment. Policies are defined in a policy folder in our Terraform/Tooling subgroup. Policy files are written in Rego, the domain-specific language that OPA uses. The policy check is added to a GitLab pipeline by running Conftest (an OPA client) as part of the planning stage, which enforces the policies.
An example OPA policy that allows only certain resources to be created:
```rego
package main

allowed_resources = [
  "google_folder",
  "google_folder_iam_binding"
]

array_contains(arr, elem) {
  arr[_] = elem
}

deny[reason] {
  r = input.resource_changes[_]
  action := r.change.actions[count(r.change.actions) - 1]
  array_contains(["create", "update"], action) # allow destroy action
  not array_contains(allowed_resources, r.type)
  reason := sprintf(
    "%s: resource type %q is not allowed",
    [r.address, r.type]
  )
}
```
An example policy that denies deleting resources:
```rego
package main

denied_actions = [
  "delete"
]

array_contains(arr, elem) {
  arr[_] = elem
}

deny[reason] {
  r = input.resource_changes[_]
  action := r.change.actions[count(r.change.actions) - 1]
  array_contains(denied_actions, action) # don't allow destroy action
  reason := sprintf(
    "Resource deletion is not allowed for %s",
    [r.address]
  )
}
```
Glossary
Google Cloud Platform (GCP): A cloud computing resource provider
GCP project: A project in GCP is a way to organize multiple compute resources.
GitLab project: A location in GitLab that contains resources such as a Git repository, CI/CD pipelines, and GitLab services.
Folder: A GCP term to separate and organize multiple GCP projects.
Workload: A DoubleVerify term used to refer to a GCP project and its related resources (GKE, buckets, etc.). Each piece of software/microservice will have a corresponding GCP project(s) and be considered a workload.
Infrastructure as Code (IaC): Managing compute resource lifecycle via configuration files.
Terraform (TF): An IaC tool created by HashiCorp. It has become one of the most popular IaC tools, with huge community support.
HashiCorp Configuration Language (HCL): The configuration language used by TF to define infrastructure.
Resource (TF): A block of code in TF that describes an infrastructure object, such as a compute instance, GCP project, DNS record, etc.
Resource (Cloud): An asset, such as a VM, Kubernetes cluster, or Cloud SQL instance, that is deployed in a cloud service provider environment.
Resource module: A TF construct that contains a definition of multiple resources used together. You can think of it as similar to a class in object-oriented programming. You instantiate a module to create resources from another configuration file.
Infrastructure module: Builds resources based on the definition from a resource module.
Provider: A plugin that allows TF to interact with remote services to create resources.
Data source: Exposes arbitrary data for use elsewhere in TF configuration. This can be an external program or Terraform remote state.
Terraform State file: A file that maps resources in your infrastructure to the configuration in your modules and any related metadata. State files can also include sensitive information like passwords and access credentials.
Remote state: A state file that contains information about a different set of resources but has relevant information needed by the resources you are working with.