Microservice-centric Infrastructure as Code with Terraform at Voi

Ronan Barrett
Published in Voi Engineering · 10 min read · Dec 15, 2022

by Ronan Barrett and the Voi platform team

Voi is a Scandinavian micromobility company offering electric vehicles in partnership with cities and local communities around Europe. At Voi we run hundreds of microservices, mainly on Google Cloud Platform (GCP) using Google Kubernetes Engine (GKE) and a mix of managed database and messaging solutions.

In this blog post we discuss our microservice-centric Infrastructure as Code (IaC) journey and the solution that has proven a massive success with our engineering teams. Our solution puts engineers in control through self-service: teams provision everything their service needs, without being overly reliant on a central platform team.

At Voi we follow the principle of shared nothing for the hundreds of microservices we run. All microservices have clear ownership, and resources, like databases, are not shared between services. When we looked at our Terraform setup two years ago, this principle was not respected. We had a single repository for all Terraform infrastructure as code and a single state file for each deployment environment. Whilst this setup has some obvious advantages, such as the ability to plan across the whole environment, and is the logical way to start using Terraform, it presents a number of critical problems as an organization scales:

  • Teams can accidentally destroy resources owned by other teams
  • Teams must coordinate when they want to make changes to the Terraform code
  • If teams don’t coordinate changes, one team gets blocked, as concurrent infrastructure pipelines are not permitted (a constraint we added using GitHub Actions concurrency control)
  • A mistake by any team breaks the pipeline for everyone
  • Resources are often grouped by infrastructure type rather than by owning team, e.g. all database resources together instead of segregated by microservice
  • A broken pipeline means critical issues cannot be fixed, causing incidents to run longer than necessary
  • The build time of the IaC CI/CD pipeline gets longer and longer as the amount of state it manages grows
  • Because builds are slow, planning failures are reported slowly, making the full build cycle very slow and tedious
  • Slow pipelines cost more in terms of resource utilization

In our case we saw many hours wasted every week on coordination between teams. The platform team was frequently called on to fix broken pipelines, as the engineers couldn’t locate the root problem, which was often outside the changes they were making. These problems are not compatible with our vision of providing a self-service platform to our engineers. The setup was clearly platform-centric.

A pain point shared by engineers and platform engineers alike was the time taken to run the pipelines, which often failed and needed to be re-run many times. A single environment could take up to an hour.

Before embarking on our solution we looked for a similar microservice-centric solution and were surprised to find none publicly documented. There are surely many such solutions in place at other companies, but none we could directly borrow. In fact, the closest well-documented solution was published by Slack only recently, long after we had implemented ours. The Slack solution shares some of the benefits of our solution and is well worth reading.

Before talking about our solution it is worth mentioning a few practices that we follow:

  • We use Google Cloud Storage to version and store the state files (a minimal backend sketch follows this list)
  • We use Workload Identity Federation between GitHub and GCP for keyless secure authentication and authorisation
  • We store Terraform IaC in a single GitHub repository
  • We require peer reviews of all PRs before merging is allowed
  • We enable auditing on all GCP resources using Cloud Audit Logs
  • We use versioned Terraform modules stored in a separate GitHub repository
  • We use versioned Open Policy Agent constraints stored in a separate GitHub repository
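
As a concrete illustration of the first practice, every Terraform folder points its backend at the shared, versioned GCS bucket. The following is a minimal sketch, not our exact configuration; the bucket name matches the state layout shown later, and how the per-environment state file (dev.tfstate, stage.tfstate, and so on) is selected is handled by our tooling, described below. One plausible mechanism is Terraform workspaces, which the GCS backend stores as prefix/workspace-name.tfstate:

terraform {
  backend "gcs" {
    bucket = "terraform-states-bucket" # versioned GCS bucket holding all state files
    prefix = "service-a"               # one prefix per microservice/domain/platform folder
  }
}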

Now that we have covered the basic foundations we can discuss the requirements for the new solution:

  • Clear ownership of Terraform resources
  • No coordination between teams required when changing infrastructure
  • It should be possible to run as many parallel infrastructure CI/CD build jobs as there are microservices
  • No need for platform experts when changing infrastructure
  • No git branch per environment, which causes configuration drift
  • No special git branching strategy, which can confuse engineers
  • A failure in one team’s infrastructure CI/CD pipeline should have no effect on other teams’ ability to deploy infrastructure
  • A mistake by one team cannot destroy infrastructure owned by other teams
  • Use normal Terraform code so we can benefit from the large community and documentation
  • Fast build times, i.e. sub-10 minutes including linting, planning & applying
  • Tooling must be testable locally and must not require complex CI/CD coding or bash scripts, i.e. no complex, error-prone GitHub Action code

Our solution is as follows:

  • Clear ownership of Terraform resources in folders, defined using the GitHub CODEOWNERS file
  • Resources in each folder are owned by exactly one team, enforced using the GitHub CODEOWNERS file
  • No GitHub concurrency control is required and many infrastructure CI/CD pipelines can run concurrently
  • The git main branch is the desired state for all environments
  • Use Open Policy Agent rules to replace platform expertise and enforce best practices
  • Separate state files per folder and environment are used to isolate microservice resources from each other
  • Separate state files per folder and environment are used to ensure a small blast radius when things go wrong
  • A clearly defined folder structure is used to build up a simple dependency graph between resources
  • An internally developed golang binary tool (called deploy-tool) that wraps the Terraform binary and is used to lint, plan & apply changes
  • A tool that can be run and tested locally, built using golang best practices

Folder Structure

A specific folder structure is required to define a simple dependency graph between resources. The tooling we built relies on a known directory structure and known naming conventions:

  • Each microservice has a separate folder in this repository & state file in GCS
  • Each domain has a separate folder in this repository & state file in GCS, where a domain is a logical grouping of resources shared within that domain, e.g. a domain could define a common Bigtable cluster resource for the Wireless domain
  • Each platform has a separate folder in this repository & state file in GCS, where a platform provides support for microservices and domains, e.g. one platform could define a Kubernetes cluster and another could define networking resources
  • Each environment (GCP Project) has a separate folder
  • A common directory for common variables
  • Each microservice/domain/platform & environment combination has a separate Terraform state file

The file structure in the GitHub repository would look like this:

+-- service-a
|   bucket.tf
|   sa.tf
|   main.tf
|   terraform.tfvars
|   variables.tf
|   versions.tf
+-- service-b
|   bucket.tf
|   sa.tf
|   main.tf
|   terraform.tfvars
|   variables.tf
|   versions.tf
+-- domain-a
|   bucket.tf
|   sa.tf
|   main.tf
|   terraform.tfvars
|   variables.tf
|   versions.tf
+-- platform-kubernetes
|   main.tf
|   cluster.tf
|   nodepool.tf
|   terraform.tfvars
|   variables.tf
|   versions.tf
+-- platform-networking
|   main.tf
|   vpc.tf
|   ips.tf
|   terraform.tfvars
|   variables.tf
|   versions.tf
+-- environment-dev
|   terraform.tfvars
+-- environment-stage
|   terraform.tfvars
+-- environment-prod
|   terraform.tfvars
+-- common
|   terraform.tfvars
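
To make the layout concrete, here is a minimal sketch of what a service folder’s Terraform might contain. The resource names and variables (var.gcp_project, var.region) are illustrative, not our actual code:

# service-a/bucket.tf - a storage bucket owned exclusively by service-a
resource "google_storage_bucket" "service_a" {
  name     = "${var.gcp_project}-service-a" # var.gcp_project is illustrative
  location = var.region                     # var.region is illustrative
}

# service-a/sa.tf - the service account the microservice runs as
resource "google_service_account" "service_a" {
  account_id   = "service-a"
  display_name = "Service account for service-a"
}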

The file structure in Google Cloud Storage, where the states are stored, would look like this:

+-- terraform-states-bucket
|   +-- service-a
|   |   dev.tfstate
|   |   stage.tfstate
|   +-- service-b
|   |   dev.tfstate
|   |   stage.tfstate
|   +-- domain-a
|   |   dev.tfstate
|   |   stage.tfstate
|   +-- platform-infra
|   |   dev.tfstate
|   |   stage.tfstate

Naming Conventions

  • The name of the directory for environment configurations must be environment-$GCP_PROJECT
  • The name of the folder for service resources should be the same as the service’s application name in Argo CD (which we use for microservice deployment), e.g. service-bank
  • The name of the folder for domain resources should express the logical grouping of a number of services e.g. domain-wireless
  • The name of the folder for platform resources should express the logical grouping of some resources that form the base foundation of a platform e.g. platform-networking

Ownership

Each directory in the infrastructure repository has an owner as defined and enforced by the CODEOWNERS file. Owners of service and domain folders are engineering team(s). Owners of platform folders are platform engineers.
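
A minimal CODEOWNERS sketch of this scheme (the team handles are illustrative, not our real ones):

/service-a/            @voi/team-payments
/service-b/            @voi/team-rides
/domain-wireless/      @voi/team-wireless
/platform-kubernetes/  @voi/platform-engineers
/platform-networking/  @voi/platform-engineers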

Interdependencies

The recommended way of working with dependencies is as follows:

  • Microservice resources in the service folders can depend on resources in both the domain and platform folders
  • Domain resources in the domain folders can depend on resources in the platform folders
  • Platform resources in the platform folders should not depend on resources in either the service or domain folders

The recommended way of referencing a dependency is as follows, in order of preference (a sketch follows the list):

  • Reference the resource in another directory by its string name
  • Create a data source and dependency to the resource in another state file
  • Create a terraform_remote_state data source and reference the output of another state file
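
A minimal sketch of the three styles, assuming a service that depends on the Kubernetes cluster defined in the platform-kubernetes folder; the names, variables and outputs are illustrative:

# 1. Preferred: refer to the resource by its well-known string name
locals {
  cluster_name = "voi-primary" # agreed naming convention, no coupling between states
}

# 2. Next best: look the resource up with a provider data source
data "google_container_cluster" "primary" {
  name     = "voi-primary"
  location = var.region
}

# 3. Last resort: read outputs directly from the other folder's state file
data "terraform_remote_state" "platform_kubernetes" {
  backend = "gcs"
  config = {
    bucket = "terraform-states-bucket"
    prefix = "platform-kubernetes"
  }
}
# ...then reference e.g. data.terraform_remote_state.platform_kubernetes.outputs.cluster_endpoint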

This way of working means platforms, then domains, and finally services should be “terraform applied” in that specific order when projects are created or resources are added. This ordering is managed by our golang wrapper on top of Terraform, which plans and applies the infrastructure in the desired order as long as our ways of working are followed.

Tooling

To support our way of working we developed a simple golang wrapper around the Terraform binary. The tool is called deploy-tool, as it assists in deploying microservices and infrastructure. It replaces complex, hard-to-write, hard-to-test and hard-to-maintain bash and GitHub Action code, and is tested using traditional golang best practices. The binary is called from our GitHub Action CI/CD pipelines.

The tool has 3 main commands, as follows:

  • lint-structure (lints the folder structure to ensure its layout is as expected by the tool)
  • plan-all (identifies folders with changes in a PR and prepares a terraform plan with output written to a GitHub PR as comments)
  • apply-all (identifies folders with changes in a merged PR and applies each based on a dependency graph)

The tool in action looks as follows:

❯ deploy-tool terraform --help
Run terraform on a pull request for each modified directory

Usage:
  deploy-tool terraform [command]

Available Commands:
  apply-all       Run terraform apply
  lint-structure  Checks the directory structure
  plan-all        Run terraform plan

Flags:
  -h, --help             help for terraform
      --project string   GCP project ID

Use "deploy-tool terraform [command] --help" for more information about a command.

The tool invokes the OPA API to ensure constraints and best practices are followed by the engineering teams when creating their infrastructure. Our OPA constraints, written in Rego, are defined in a private GitHub repository and were inspired by the EmbarkStudios OPA Policies.

An example of one of our policies, which ensures a minimum TLS version is always enforced, is as follows:

package terraform.gcp.network

import data.lib as l
import future.keywords.in
import input as tfplan

deny_min_tls_version[result] {
    resource_change := tfplan.resource_changes[_]

    # Assert type
    resource_change.type == "google_compute_ssl_policy"

    # Check if the resource is being created or updated
    l.is_create_or_update(resource_change.change.actions)

    # Match when the TLS version is not "TLS_1_2"
    not resource_change.change.after.min_tls_version in ["TLS_1_2"]

    msg := sprintf("Error: `DENY_MIN_TLS_VERSION` - The TLS version `%v` is not allowed. Resource in violation: `%v`", [resource_change.change.after.min_tls_version, resource_change.address])
    result := {"msg": msg, "resource_address": resource_change.address}
}
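
While deploy-tool calls the OPA API directly, the same policy can be exercised locally against a plan using the standard Terraform and OPA CLIs, roughly as follows. The file and directory names are illustrative, and the policies directory must also contain the lib helpers the rule imports:

❯ terraform plan -out=plan.out
❯ terraform show -json plan.out > plan.json
❯ opa eval --data policies/ --input plan.json "data.terraform.gcp.network.deny_min_tls_version"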

Migration

Once we had the tooling in place we began to migrate the IaC and state files from the old structure to the new one. This phase took a lot longer than expected, as the work required going through each microservice, importing its state into a new state file, and then recreating the matching Terraform code in the new GitHub repository.

Using an automated tool was not possible, as every microservice had different Terraform code. For example, some used modules, some didn’t, and the versions of the Terraform resources varied widely. We also did not want to risk destroying infrastructure and causing incidents, so the time spent doing the job carefully was worth it.

The time spent migrating the IaC resources and states was probably double the time taken to build the tools that support the new way of working.

Limitations

As we no longer have a single state file per environment, we cannot immediately see when child resources have been broken by changes to their parent resources. For example, deleting a service account from the platform directory that is used by a service will not fail when running “terraform apply” on the platform. The error will only be seen the next time “terraform apply” is run on the service.

This limitation hasn’t proved to be a major problem in the year we have been running the system.

Alternative Solutions

Alternative solutions are available, such as the frameworks Terraspace and Terragrunt. We don’t use them for now, as we wanted to avoid any special domain-specific languages above and beyond Terraform, and the additional complexity they bring. We want to empower engineering teams that are not infrastructure experts to be comfortable making infrastructure changes safely.

The time and effort put into our microservice-centric IaC solution has already paid for itself many times over in the last year. We have seen increased productivity from our teams deploying IaC, and a significantly reduced dependency on our platform team’s expertise when issues arise, thanks to the safe isolation provided by the new ways of working and tooling.

We can highly recommend using a binary instead of bash and CI/CD pipeline specific code like GitHub Actions. Our golang binary has proved very robust and easy to maintain, unlike our GitHub Action code.

Finally, we can highly recommend the self-service model we live and breathe at Voi. Good tooling, policy as code using OPA, and a clear, documented way of working make this possible and highly desirable for all engineers.

Huge kudos to the platform engineers at Voi for being a great team to work with, and specifically to Iury Alves de Souza for his dedication in making this migration happen.

Originally published at https://ronan-barrett.medium.com on December 15, 2022.
