Zero to Cloud Part 1: How to build a production-ready IaC solution from scratch

DV Engineering
DoubleVerify Engineering
6 min read · Sep 27, 2023

Written By: John Pribula

DoubleVerify has been rapidly migrating from on-premises compute resources to Google Cloud Platform (GCP). One of the main goals of this move was to have a standardized, secure, and repeatable way to deploy to the cloud rapidly. To make our move to the cloud a success, we knew we needed a strategy.

This series focuses on how we created our strategy, the prep work it took to get us where we are today, and what we learned.

In part one of this series, I will discuss:

  • How we started our design
  • Questions we had to answer about our cloud needs
  • How we answered those questions
  • How our move to the cloud went and what we learned

In part two, I will deep dive into the technical solutions we implemented. Additionally, a glossary is included at the end of each article in this series.

Where did we start?

Before we began writing code, we had to understand our infrastructure and how we used it, then define standards based on that understanding.

There are some key things we needed to consider when designing our environment:

  1. Would our cloud environment be internal only, external only, or both?
  2. Did our company have multiple applications? Were they multitenant? Did our applications need cross-team/application access?
  3. Did we have existing cloud resources, and how were they being managed?

When you start your strategy, make sure each of these questions is answered BEFORE you start thinking about your code. This is not the only set of questions you need to answer; you should also think about security, external third-party integrations, compliance reports for your industry, etc. All of this information will guide you in building your IaC code.

How did we answer these questions?

Would our cloud environment be internal only, external only, or both?

We chose to host both internal and external services in the cloud in a single multi-application, multi-tenant cloud solution. We implemented four base environments in our cloud (a code sketch follows the list):

  • Development
  • Staging
  • Production
  • Common (used as a central location for CI/CD, IaC, security, and monitoring tooling)
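As a rough illustration, these environments could be modeled as GCP folders managed in Terraform. The sketch below is illustrative, not our actual code; the org_id variable and the folder-per-environment layout are assumptions:

variable "org_id" {
  description = "Numeric GCP organization ID"
  type        = string
}

# One GCP folder per base environment
resource "google_folder" "environment" {
  for_each     = toset(["development", "staging", "production", "common"])
  display_name = each.key
  parent       = "organizations/${var.org_id}"
}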

Did our company have multiple applications? Were they multitenant? Did our applications need cross-team/application access?

The way you answer this question completely depends on your business case and how you build your applications. In the case of DoubleVerify, we deploy a suite of applications across industry verticals.

Each of these verticals requires different application performance, real-time analytics, and, in some cases, extremely low latency. In addition, we have multiple shared data lakehouses.

Due to this, we created a flexible cloud architecture suitable for 80% of our use cases. The remaining cases still follow our standard model but with slight modifications, such as additional DMZ layers or fully standalone networks.

Did we have existing cloud resources, and how were they being managed?

We reviewed our existing cloud resources and found that we:

  • Were manually creating resources via the GCP UI
  • Were manually executing command line tools from end-user devices (laptops)
  • Had some Terraform code that was also run manually from end-user devices

NOTE: While these methods work for small-scale deployments to the cloud, they also introduce significant risks and challenges in managing the cloud environment’s overall state. Without a clear and consistent management approach, ensuring stability, security, and repeatability is difficult.

What conventions did we create?

Using the answers to these questions, DoubleVerify defined conventions for how we handle our various services and applications. For our cloud environment, we needed to define a basic unit of deployment. We call it a workload.

A workload is defined as a GCP project and its related resources (GKE, buckets, etc.). Each piece of software has a corresponding GCP project and is considered a workload; a workload can host a single application or microservice, or several related microservices. We also have “helper” workloads for shared resources such as networking, organization-level policies, etc.
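To make this concrete, here is a hypothetical workload sketched in Terraform: one GCP project plus its related resources, defined together. The service name (reporting-api), the resource names, and the dev_folder_id and billing_account variables are all illustrative assumptions, not our real configuration:

variable "dev_folder_id" {
  type = string # assumed: numeric ID of the development environment folder
}

variable "billing_account" {
  type = string
}

# The workload's GCP project
resource "google_project" "workload" {
  name            = "reporting-api"     # hypothetical service
  project_id      = "reporting-api-dev" # project IDs are globally unique
  folder_id       = var.dev_folder_id
  billing_account = var.billing_account
}

# Related resources are defined alongside the project
resource "google_container_cluster" "primary" {
  name               = "reporting-api-gke"
  project            = google_project.workload.project_id
  location           = "us-east1-b"
  initial_node_count = 1
}

resource "google_storage_bucket" "data" {
  name     = "reporting-api-dev-data" # bucket names are globally unique
  project  = google_project.workload.project_id
  location = "US-EAST1"
}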

DoubleVerify chose to map one workload Git repository to one GCP project. All infrastructure resources for that workload are defined in that Git repository when and where possible (exceptions will be discussed later). For IaC purposes, each workload also has exactly one state file managing its resources.

DoubleVerify also standardized naming conventions for all resources where possible. I won’t reveal our exact convention for security reasons, but a good convention should be clear, human-readable, and programmatically parsable. An example would be:

service-region-zone-provider-env

or

gke-useast1-b-gcp-dev
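A convention like this is straightforward to assemble programmatically. Here is a minimal Terraform sketch using the example values above:

locals {
  service = "gke"
  region  = "useast1"
  zone    = "b"
  cloud   = "gcp" # the "provider" component of the name
  env     = "dev"

  # service-region-zone-provider-env
  resource_name = "${local.service}-${local.region}-${local.zone}-${local.cloud}-${local.env}"
}

output "example_name" {
  value = local.resource_name # => "gke-useast1-b-gcp-dev"
}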

You need to decide what your basic unit is. For example, in AWS, you may base your deployment unit on VPCs, or each service may have its own AWS account. Maybe each customer is the unit, and you deploy as a single tenant. You need to decide this for your environment.

About DoubleVerify’s journey

We started this journey in February 2021. We accomplished most of our goals quickly, but there were some issues along the way.

Existing cloud resources

DoubleVerify began using the cloud heavily in mid-2020. In February 2021, when we began this process, we had some previously built infrastructure, like our shared VPC networks, some production systems, and basic GKE usage. We had to decide whether to import existing resources into our new layout or delete everything and start from scratch. Because we already had some production workloads, and they were mostly aligned with our design, we chose to build tooling that made it easy to import existing resources into IaC.
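Our import tooling is internal, but to illustrate the mechanics: since Terraform 1.5, an existing resource can be adopted declaratively with an import block (older versions can do the same with the terraform import CLI command). The bucket name below is hypothetical:

# Adopt an existing, manually created bucket into Terraform state
import {
  to = google_storage_bucket.legacy_data
  id = "legacy-data-bucket" # name of the pre-existing bucket (hypothetical)
}

resource "google_storage_bucket" "legacy_data" {
  name     = "legacy-data-bucket"
  location = "US"
}

Running terraform plan first shows exactly what will be imported, which makes adopting resources in bulk much safer than editing state by hand.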

As you design your environment, keep this in mind. The overhead of importing and managing resources that are out of spec with your design can add tech debt to the modules or pipelines you build.

Terraform state rate limits

When we first began deploying our Terraform code, we used the GitLab Terraform state backend built into the platform. Initially, this was a good choice. It kept our state files very close to the repository and was a standard HTTP backend.

Because GitLab’s Terraform state backend shares the same REST API (and the same rate limits) as the rest of GitLab, our API call volume grew as our TF usage grew, and we began hitting those limits. Because of this, we pivoted our state files to the GCS bucket backend. This solution scales effectively without limit and provides better security because access to state files is NOT controlled by GitLab repository access.
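For reference, a minimal GCS backend configuration looks like the sketch below; the bucket name and prefix are hypothetical. Access to the state bucket is governed by IAM rather than by repository permissions:

terraform {
  backend "gcs" {
    bucket = "example-terraform-state" # hypothetical state bucket
    prefix = "workloads/reporting-api" # one state path per workload
  }
}

After switching the backend block, existing state can be moved with terraform init -migrate-state.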

Where is DV going now?

Our journey to the cloud and IaC has been successful, but it is not over. We are pushing forward with new ideas and concepts. We are now looking into additional IaC options to make it easier to deploy more ephemeral resources and applications.

We are taking the knowledge and concepts we learned managing the cloud and applying them back to our on-prem data centers.

Lastly, we are working closely with our InfoSec and AppSec teams to build security into our infrastructure code instead of remediating issues after the fact.

Glossary

Google Cloud Platform (GCP): A cloud computing resource provider

Workload: A DoubleVerify term used to refer to a GCP project and its related resources (GKE, buckets, etc.). Each piece of software/microservice has a corresponding GCP project(s) and is considered a workload.

Infrastructure as code (IaC): Managing computing resource lifecycle via configuration files.

Terraform (TF): An IaC tool created by HashiCorp. It has become one of the most popular IaC tools with huge community support.

Resource (TF): A resource is a block of code in TF that describes an infrastructure object, such as a compute instance, GCP project, DNS record, etc.

Resource (Cloud): A cloud resource is an asset, such as a VM, Kubernetes cluster, or cloud SQL instance, that is deployed in a cloud service provider environment.

Make sure to also check out part two of this post where I dive into the technical solutions we implemented.
