Network Infrastructure At O’Reilly

Using Terraform to its Fullest Potential

Chris Thompson
O'Reilly Media Engineering
Dec 3, 2019


Migrating to the Cloud

O'Reilly Media's engineering organization is at the tail end of migrating our Online Learning platform from a self-hosted data center to a cloud-hosted solution on Google Cloud Platform (GCP). This involves translating our deployment methodologies from a physical, host-centric model to a cloud-centric one. To that end, we've migrated most of our platform to GCP, where we lean heavily on Google Kubernetes Engine (GKE). GKE has enabled us to update our tooling, expedited the move from a monolithic application to a microservice-oriented architecture, and decreased the time it takes to turn an idea into a feature that customers are using.

Over the course of this migration, we've learned a lot about the benefits and complexities of deploying and maintaining applications in a cloud-hosted environment. Along the way, we've found ways to mitigate those complexities and developed some practices that work quite well for our organization.

A 10,000-Foot View

Before going over how we deploy things, let’s take a high-level look at the architecture of our learning platform from front to back:

  • Our microservices are written in various languages (mainly Python and JavaScript) and are based on a Python microservice framework called the Chassis.
  • These microservices leverage Redis for caching, Celery as a task queue, RabbitMQ as a message broker, and Postgres as a datastore.
  • Deployments are done via Jenkins, which leverages Vault for secret management and Nexus as a package repository.
  • With the exception of RabbitMQ and Postgres, all of this runs on GKE.
  • We manage all of this with a Python CLI tool called InfraCTL (pronounced “infra-see-tee-ell”).

That’s a lot of information, but what it means for our engineers on a day-to-day basis is that their local, staging, and production environments are all very similar and that there’s a single, streamlined process to get code from their laptops to production.

Thanks to the Chassis, which provides easy access to the basic components of a web application, we can spin up new microservices very quickly without having to write the boilerplate application and deployment code. This abstract approach to managing and deploying applications has served us extremely well, so we thought we’d like to mimic it in the way we manage and deploy our network infrastructure.

Terraform and Infrastructure-As-Code

The CTO of HashiCorp explains Infrastructure-As-Code

Like so many other teams, we decided to use HashiCorp's Terraform as the underlying tool with which we manage our infrastructure. If you're not familiar with it, Terraform is a tool written in Go that pairs a declarative configuration language with state files to give users a way to deploy and manage network infrastructure. Its main advantages are its state file and its plan-then-apply workflow, which make deployments to the different components of our platform much safer and easier to manage.

A Terraform file that creates a DNS entry might look like this:

module "clouddns-record" {
source = "git@github.com:myGithubOrg/tf-dns-module.git"
dns_record_prefix = "terraform-test."
dns_record_type = "A"
dns_zone_name = "my-test-zone"
rrdatas = "10.0.0.1"
}

The beauty of this declarative language, HCL, is that users who are familiar with the underlying networking concepts can easily grasp what the Terraform script is designed to accomplish. This module will create a DNS A record for terraform-test. that points to the IP address 10.0.0.1 in our my-test-zone DNS zone.

Because of the advantages it provides, we use Terraform to deploy just about all of our infrastructure in the cloud. Providers (Terraform plugins that let it interact with many different APIs) exist for almost all of the tools we use, including many of GCP's services, Helm, RabbitMQ, and more.

Over time, we've learned that while Terraform makes managing infrastructure easy, it does not make managing its state files easy. State files are sometimes large, often complex JSON objects that Terraform uses to track the state of everything you've deployed with it, and sometimes things you haven't.

Trading Problems

Using Terraform means trading the problem of deploying safely for the problem of keeping your state files safe. It can be a frustrating endeavor at times, but Terraform is a new tool and both it and the community around it continue to mature.

To minimize the risk of working with state files, we split them up by environment and by module, which keeps each file rather small. These state files are all stored and versioned in Google Cloud Storage, where we have a bucket per environment and a folder per module in each bucket. Taken together, these choices give us durable, versioned storage and minimize the blast radius if state file corruption does occur. For us, the end result of that decision looks something like this:

➜  ~ gsutil ls gs://tf-state-test-env
gs://tf-state-test-env/core/
gs://tf-state-test-env/datadog/
gs://tf-state-test-env/gcs/
gs://tf-state-test-env/istio/
gs://tf-state-test-env/k8s/
gs://tf-state-test-env/rabbitmq/
➜ ~ gsutil ls gs://tf-state-test-env/k8s/
gs://tf-state-test-env/k8s/default.tfstate
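
That layout means a module's remote state location can be derived from nothing more than the environment and module names. The helper below is a minimal sketch of that convention only; the function name is hypothetical, and the keys simply mirror what Terraform's GCS backend expects:

# Sketch of the state-path convention shown above: one bucket per
# environment, one folder (prefix) per module. Hypothetical helper,
# not actual InfraCTL code.
def gcs_backend_config(env: str, module: str) -> dict:
    return {
        "bucket": f"tf-state-{env}",  # e.g. tf-state-test-env
        "prefix": module,             # e.g. k8s -> k8s/default.tfstate
    }

print(gcs_backend_config("test-env", "k8s"))
# {'bucket': 'tf-state-test-env', 'prefix': 'k8s'}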

The modules you see listed above are built from smaller modules that serve functions like creating Compute Engine instances or provisioning DNS entries. These modules are all versioned and tagged, which allows us to propagate upgrades of those smaller modules through environments one by one.

Inside our modules, we rely heavily on templating and Terraform's tfvars files. By combining the two, we can use the same Terraform module to deploy to multiple environments. Over time, though, this approach led to a proliferation of tfvars files in our repos, making things difficult to manage.

Liberal use of variables and small state files make Terraform a bit easier to work with, but a typical workflow still looks like this:

cd terraform-modules/gcp/k8s
terraform init -backend-config=test-env-k8s-backend.tfvars
terraform plan -var-file=test-env-k8s-vars.tfvars
terraform apply
terraform init -backend-config=prod-env-k8s-backend.tfvars
terraform plan -var-file=prod-env-k8s-vars.tfvars
terraform apply

These Terraform invocations are condensed versions of what we'd run in a real environment; the full commands tend to grow long and difficult to read. Still, this workflow accurately mimics how we'd use vanilla Terraform to deploy a change to our development Kubernetes cluster and then to our production cluster.

At this point, we’ve done a lot to mitigate the issue of managing state files, but we’re left with the problem of messy Terraform invocations and proliferating tfvars files.

In Comes InfraCTL — Managing State at Scale

In order to make Terraform easier to work with, we developed the aforementioned InfraCTL. It's a Python CLI tool that uses the Click library to wrap Terraform and a few other APIs. The flags are entirely optional; the tool will prompt you for any you don't provide. An explicit invocation of InfraCTL for the same workflow as above looks like this:

infractl terraform plan --module k8s --env test-env
infractl terraform apply --module k8s --env test-env
infractl terraform plan --module k8s --env prod-env
infractl terraform apply --module k8s --env prod-env

When you invoke infractl terraform plan, the CLI will:

  • Confirm that you’d like to run Terraform with the selected flags.
  • Run a terraform init with the variables corresponding to the flags you entered.
  • Run a terraform plan and save the resulting binary plan file.

When you invoke infractl terraform apply, the CLI will:

  • Check whether a binary plan file has been generated.
  • Check whether that file is the most recently created plan.
  • If both conditions are met, run terraform apply (a sketch of this wrapper follows below).
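
The sketch below shows roughly how a Click command group can wrap those steps with subprocess. It is a simplified illustration under our own assumptions: the helper functions, file names, and freshness check are stand-ins, not InfraCTL's actual implementation.

# Simplified sketch of a Click-based Terraform wrapper (not the real InfraCTL).
# `plan` runs terraform init and plan and saves a binary plan file; `apply`
# only runs if that plan file exists and is newer than the module's .tf files.
import os
import subprocess
import click

PLAN_FILE = "tfplan.binary"  # hypothetical name for the saved plan

def module_dir(module):
    # Mirrors the repo layout shown earlier, e.g. terraform-modules/gcp/k8s
    return os.path.join("terraform-modules", "gcp", module)

@click.group()
def terraform():
    """Wrap common Terraform workflows."""

@terraform.command()
@click.option("--module", prompt=True, help="Terraform module, e.g. k8s")
@click.option("--env", prompt=True, help="Target environment, e.g. test-env")
def plan(module, env):
    click.confirm(f"Run terraform plan for {module} in {env}?", abort=True)
    cwd = module_dir(module)
    subprocess.run(
        ["terraform", "init", f"-backend-config={env}-{module}-backend.tfvars"],
        cwd=cwd, check=True)
    subprocess.run(
        ["terraform", "plan", f"-var-file={env}-{module}-vars.tfvars",
         f"-out={PLAN_FILE}"],
        cwd=cwd, check=True)

@terraform.command()
@click.option("--module", prompt=True)
@click.option("--env", prompt=True)
def apply(module, env):
    click.confirm(f"Apply the saved plan for {module} in {env}?", abort=True)
    cwd = module_dir(module)
    plan_path = os.path.join(cwd, PLAN_FILE)
    if not os.path.exists(plan_path):
        raise click.ClickException("No plan file found; run plan first.")
    # One possible freshness check: the plan must be newer than any .tf file.
    newest_tf = max(
        (os.path.getmtime(os.path.join(cwd, f))
         for f in os.listdir(cwd) if f.endswith(".tf")),
        default=0.0)
    if os.path.getmtime(plan_path) < newest_tf:
        raise click.ClickException("Plan file is stale; re-run plan.")
    subprocess.run(["terraform", "apply", PLAN_FILE], cwd=cwd, check=True)

if __name__ == "__main__":
    terraform()

In practice this group would be nested under a top-level infractl entry point, which is what gives the invocations above their infractl terraform plan spelling.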

By wrapping Terraform in Python, we’ve removed the need for the user to manage their local state files and streamlined the process for switching between environments and modules. This is a great time-saver, but it does nothing to mitigate the proliferation of tfvars files.

Templating Variable Files

One of the things we noticed about our tfvars files was that a high percentage of the data was duplicated. For example, we'd have a variable like environment = "prod-env" in every production tfvars file. Spreading the data across multiple files also made it cumbersome to discern the differences between tfvars files.

To solve this problem, we migrated our variables to a single JSON file per module. These files each contain a dictionary with the structure:

{
  "k8s": {
    "global": {"common": "variables"},
    "test-env": {"some": "variables"},
    "prod-env": {"some": "different variables"}
  }
}

Grouping our variables by module makes it easy to see the different configurations we're applying to each environment. The global key allows us to store values that are common to all environments, such as subdomains or module names. This structure makes variable changes obvious and has simplified the process of reviewing pull requests.

Terraform can't read these JSON files directly, so we convert them to tfvars files at runtime. When we invoke infractl terraform plan, the variables are automatically transcribed to a temporary tfvars file, which is used to generate the plan and then deleted.
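
A minimal sketch of that transcription step, assuming the JSON structure shown above (the file path, function name, and string-only values are simplifications for illustration, not InfraCTL's actual code):

# Sketch: merge a module's "global" and per-environment variables and write
# them to a temporary tfvars file for terraform plan to consume.
# Assumes values are plain strings, which keeps the HCL quoting trivial.
import json
import tempfile

def render_tfvars(vars_json_path: str, module: str, env: str) -> str:
    with open(vars_json_path) as f:
        module_vars = json.load(f)[module]
    merged = {**module_vars.get("global", {}), **module_vars.get(env, {})}
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".tfvars", delete=False)
    with tmp:
        for key, value in merged.items():
            tmp.write(f'{key} = "{value}"\n')
    return tmp.name  # the wrapper runs terraform plan, then deletes this file

# e.g. render_tfvars("k8s.json", "k8s", "test-env")  # hypothetical file name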

In order to make sure that developers can still use Terraform without the wrapper, we added infractl terraform render. This command takes a module and an environment, just like the plan and apply commands, and uses those inputs to generate a tfvars file from the variables in the corresponding JSON file.

Using the render command and then running Terraform manually looks like this:

infractl terraform render --module k8s --env test-env
cd terraform/gcp/k8s
terraform init -backend-config=backend.tfvars
terraform plan -var-file=terraform.tfvars
terraform apply

This flow is very similar to the original Terraform workflow but without the complexity of managing many variable files. It’s worth noting that render is not a real Terraform command; InfraCTL lets us staple our own functions onto APIs it uses!

Looking Forward

The addition of the render command and the subsequent deletion of some 250 tfvars files let us shrink our codebase by over 2,000 lines. At this point, InfraCTL is a great boon to our engineering organization, and we're starting to see adoption from other teams who are adding Terraform modules and extending its functionality. Most importantly, we use InfraCTL to deploy all of our cloud VPCs, firewalls, Kubernetes clusters, and more.

In the year and a half we've been using it, we've learned quite a lot about Terraform best practices, and InfraCTL has given us a reliable way to manage all of our deployments. In the future, we hope to move the majority of InfraCTL's functionality into a REST API that calls Terraform for us. The intention is for the API to integrate with our microservice Chassis so that it can provision resources for our microservices (DNS entries, databases, etc.) as well as the underlying network infrastructure.

How is your engineering organization managing network infrastructure in the cloud? Are you using Terraform? What are some of the challenges you face? How do you think these tools and concepts will develop in the coming years? We’d love for you to reply or reach out to us via Twitter!
