Terraform at LumApps: Part 1

Jerome Pin · LumApps Experts · May 23, 2022

Context

In our early years, the product ran entirely on Google App Engine, with Google Datastore as the database engine. App Engine gave us planet-scale autoscaling of our web applications without the hassle of managing the underlying infrastructure.

In 2018, we started seeing the limits of this model: running a monolith was causing issues, using a PaaS solution wasn’t very cost-effective, and scaling our Engineering organization was going to be a struggle. This is where we started our journey towards a Service-Oriented Architecture. The goal was to overcome our current struggles and to be future-proof.

Now that the migration to our new architecture is well underway, we want to show, over a series of articles, what we have done on the infrastructure side.

The Platform domain

The Platform domain is responsible for building the platform hosting LumApps. It brings together developers and platform engineers. Our focus is not only on stability, but on cost-efficiency and developer velocity too.

We chose to build our platform together with the feature teams to tailor it to the LumApps product and its developers’ needs.

The Platform domain’s roles are pretty diverse:

  • Build and maintain the platform hosting LumApps.
    This is a traditional CloudOps function: infrastructure setup and maintenance, CI/CD toolchain, SaaS solutions analysis and implementation, etc.;
  • Define (together with the feature teams) and enforce best practices (through tooling) for development, deployment and run of the different services composing the platform;
  • Allow feature teams to focus on building features with added value by removing as much complexity as possible from infrastructure-related processes;
  • Continuously monitor and improve the developer experience;
  • Work closely with feature teams to provide architecture advice early in a service’s development;

Use cases and requirements

Our first requirement is to automate and secure the building of the platform.

The platform is built around Kubernetes, PostgreSQL, Redis and Elasticsearch, with the last two running on Kubernetes.

High-level view of a LumApps cell

Each customer is deployed into a cell, which is a complete deployment of LumApps (with its own Kubernetes, PostgreSQL, Redis and Elasticsearch clusters). By sizing cells correctly, we aim to eliminate the noisy-neighbor phenomenon and to limit the blast radius in case of an outage. This is obviously a trade-off against maintenance and infrastructure cost.

Our second requirement is to provide developers with autonomy using GitOps practices.

From day one, we wanted to provide feature teams with self-service infrastructure following the motto “you build it, you run it”. We also needed to enforce best practices and be opinionated about what we allow (or not) on the platform.

Giving feature teams more autonomy over their services built up their confidence and allowed the Platform domain to focus on providing expertise and improving the platform itself.

Implementation

LumApps is available on both Google Cloud Platform and Microsoft Azure as our customers are usually linked to one or the other.

Self-service flow for infrastructures and services deployment

Cells are independent, located in different regions, and run the entirety of LumApps. They are all deployed the same way and they all run the same version of our services.

The infrastructure

The Platform domain is composed of several teams, including the Infra team, which is responsible for deploying all the underlying infrastructure. We chose Terraform to industrialize this work.

Drake endorses Infrastructure-As-Code

Deploying everything with Terraform (and versioning it in GitHub) allows us to ensure consistency across cells, perform code reviews, audit our code and its changes, and quickly revert a change if needed.

Common infrastructure

First, using Terraform, the Infra team creates the cell itself (the GCP project or the Azure Resource Group) and the storage that Terraform will later use for its states. Every cell and every service has its own state, stored on the provider’s object storage service.
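
As an illustration, such a per-cell state layout can be expressed with a Terragrunt remote_state block at the root of a cell, which every subfolder then inherits with its own prefix; the bucket name below is made up for the example:

    # Root terragrunt.hcl of a cell (illustrative bucket name).
    remote_state {
      backend = "gcs"

      config = {
        bucket = "lumapps-cell-example-tfstate"   # one bucket per cell (hypothetical name)
        prefix = path_relative_to_include()       # one state per subfolder/service
      }

      generate = {
        path      = "backend.tf"
        if_exists = "overwrite_terragrunt"
      }
    }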

Then we create all the common infrastructure components (one of which is sketched after the list):

  • The virtual network and the subnets we will use;
  • A Kubernetes cluster (Google Kubernetes Engine or Azure Kubernetes Service);
  • The API gateway;
  • All message-bus components (Pub/Sub or Event Hubs);
  • An Elasticsearch cluster for our search engine;
  • A HashiCorp Vault cluster;
  • The Datadog agent;
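
As an illustration, here is roughly what the Kubernetes cluster and its network could look like on the GCP side. Resource names, the CIDR range and the sizing are placeholders rather than our actual values:

    # Illustrative sketch: a GKE cluster and its network for one cell.
    resource "google_compute_network" "cell" {
      name                    = "cell-network"
      auto_create_subnetworks = false
    }

    resource "google_compute_subnetwork" "nodes" {
      name          = "cell-nodes"
      network       = google_compute_network.cell.id
      ip_cidr_range = "10.0.0.0/16"
      region        = var.region
    }

    resource "google_container_cluster" "cell" {
      name       = "cell-gke"
      location   = var.region
      network    = google_compute_network.cell.id
      subnetwork = google_compute_subnetwork.nodes.id

      # Node pools are managed separately from the cluster itself.
      remove_default_node_pool = true
      initial_node_count       = 1
    }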

Services infrastructure

Finally, the service infrastructure can be deployed. Each service has (or may have, depending on its needs) the following, two of which are sketched after the list:

  • A Kubernetes namespace;
  • A set of Network Policies to restrict access to the service on an as-needed basis;
  • A Redis cluster;
  • A PostgreSQL cluster (Google Cloud SQL or Azure Database for PostgreSQL) with PgBouncer in front;
  • Access to third-party services: Elasticsearch, Pub/Sub or Event Hubs, and Vault;
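
As a rough sketch of what such a module could create on a Google cell, here are two of these resources in plain Terraform; the variables, naming scheme, PostgreSQL version and tier are illustrative:

    # Illustrative only: the real module is driven by many more inputs.
    resource "kubernetes_namespace" "service" {
      metadata {
        name = var.service_name
      }
    }

    resource "google_sql_database_instance" "service" {
      name             = "${var.service_name}-pg"   # hypothetical naming scheme
      database_version = "POSTGRES_13"
      region           = var.region

      settings {
        tier = "db-custom-2-7680"
      }
    }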

This part is GitOps-ed using Atlantis (more on that in our second article). Every developer has the ability to create a service’s infrastructure and modify it to tailor it to the service’s needs.

The service

The service itself is deployed using ArgoCD, which reads its configuration from the corresponding repository on GitHub.

This gives feature teams total agency to deploy whatever they want, whenever they want. We do, however, limit the available Kubernetes objects to an extent through ArgoCD’s configuration.

Architecture of a typical in-house service

A classic service usually creates several Kubernetes resources: a Deployment for the main API, some event-processor Deployments subscribing to the message buses, a Kubernetes Service (obviously), multiple ConfigMaps and Secrets, some HPAs, and various (Cron)Jobs, depending on the service and its infrastructure.

Our main repository

Structure

Most of our infrastructure-related code is stored in a single repository. It is organized around two main components: our cells and our modules.

Since we are in an era of big changes at LumApps, having everything in a mono-repo lets us perform quick iterations and deployments.

Organization of our infrastructure mono-repo

Terragrunt plays a huge part in reducing our code base by providing reusable elements, code inclusion, and dependencies between modules.
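
To give an idea of the kind of reuse this enables, a subfolder can include shared configuration from a parent file and consume another subfolder’s outputs through a dependency block. The folder name and the output below are hypothetical:

    # terragrunt.hcl of a subfolder (illustrative).
    include "root" {
      path = find_in_parent_folders()   # pulls in the cell-wide configuration
    }

    dependency "kubernetes" {
      config_path = "../kubernetes"     # hypothetical subfolder creating the cluster

      mock_outputs = {
        cluster_name = "mock-cluster"   # lets `terragrunt plan` run before the cluster exists
      }
    }

    inputs = {
      cluster_name = dependency.kubernetes.outputs.cluster_name
    }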

Modules

We rely on both Terraform and Terragrunt modules to deploy our infrastructure. Terraform modules are used to package a set of tightly-coupled Terraform resources while Terragrunt modules are used to manage all resources for a given service (including Terraform modules).

Cells

The cells folder contains all our cells, spread out into their respective environments. Every cell has the same structure: a set of subfolders, each in charge of calling a Terragrunt module to create something specific.

Subfolders in a cell

A subfolder doesn’t contain much besides a terragrunt.hcl file, which instantiates the Terragrunt module with the appropriate inputs.

All in-house services are deployed in their own subfolder (with a name starting with svc_). They all instantiate the same Terragrunt module (svc-go for Google cells and svc-ms for Microsoft cells).

Call graph of a subfolder, its Terragrunt module and its Terraform modules

We created a Terragrunt module generic enough to handle all our use cases for these in-house services (a sketch of such a subfolder follows the list):

  • database deployment, provisioning and configuration;
  • Redis configuration;
  • message queues access;
  • synchronous call to other services;
  • custom secrets;
  • etc.
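
For illustration, a service subfolder’s terragrunt.hcl could look roughly like this; the service name, module path and inputs are invented for the example (the real module’s interface differs):

    # cells/production/cell-eu-1/svc_example/terragrunt.hcl (illustrative path).
    # The root include providing the remote state is omitted for brevity.
    terraform {
      source = "../../../modules/terragrunt/svc-go"   # hypothetical location of the svc-go module
    }

    inputs = {
      service_name  = "example"

      database      = { enabled = true }    # PostgreSQL cluster with PgBouncer in front
      redis         = { enabled = true }    # Redis configuration
      subscriptions = ["content-updated"]   # message-bus topics to subscribe to (hypothetical)
      callers       = ["svc_search"]        # services allowed to call this one synchronously
      extra_secrets = ["smtp-password"]     # custom secrets
    }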

Timeline towards the self-service

While we chose to do self-service from the get-go, we iterated and improved a lot of things throughout the whole project, which is still ongoing for that matter.

  • Late 2018: The beginning of the project with our first Terraform modules at LumApps;
  • Mid 2019: Design is still ongoing for the new architecture. Realization that we will need to do a lot of copy/paste to make Terraform work;
  • Early 2020: Terragrunt prototyping and implementation to reduce said copy/paste;
  • Late 2020: Heavy effort towards factorization and generalization of our Terragrunt modules;
  • Early 2021: Atlantis prototype and MVP;
  • Mid 2021: Infracli prototype (see below) and self-service for the Platform domain only;
  • Late 2021: Self-service MVP go-live for the whole Engineering department;

We chose to assume the role of an Ops team for the first half of this timeline in order to really understand what the platform should be. Working together with the feature teams, we completed our design of what a LumApps service is.

Only in mid-2020 did we start to focus our efforts on making the self-service project a reality. This was very important, as it allowed us to answer 95% of the feature teams’ needs through self-service alone.

Issues

We encountered a few roadblocks during our journey towards deploying and automating our infrastructure.

Bootstrapping

Currently, bootstrapping anything requires human intervention.

The infra repository’s initial remote state and the Google project storing it (along with the associated service accounts) all have to be created and stored by a human being.

We said earlier that each cell stores its Terraform state (and the states of its subfolders) in a dedicated bucket. But where does the cell store its state when creating its own bucket? This is a chicken-and-egg dilemma.

Every cell contains a subfolder named cell_bootstrap which is used to bootstrap the cell itself. For GCP, it creates a new project along with a bucket to hold the cell’s state (and its subfolders’ states); for Microsoft cells, it creates a new resource group and a storage account with the associated container.

Terraform remote state storage

The state of the cell_bootstrap folder is, however, stored in an environment-specific bucket outside of the cell’s project or resource group.
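
As a minimal sketch (with a made-up bucket name and prefix), the cell_bootstrap subfolder can simply point its Terragrunt remote_state at that environment-level bucket instead of the cell’s own one:

    # cell_bootstrap/terragrunt.hcl (illustrative): this state cannot live in the
    # cell's own bucket, since that bucket does not exist yet.
    remote_state {
      backend = "gcs"

      config = {
        bucket = "lumapps-production-bootstrap-states"   # environment-wide bucket (hypothetical name)
        prefix = "cells/cell-eu-1/cell_bootstrap"        # illustrative prefix
      }
    }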

How to authenticate on Azure using AAD’s service principals?

We use Azure Active Directory to authenticate both machines (using service principals) and humans (using our own accounts) when performing Terraform operations on Azure.

However, granting permissions to service principals on certain services requires a delegation that only a human can grant. To simplify the delegation process as much as possible, and as we did for cell bootstrapping, every Azure cell has a special subfolder that handles this and prints the URL for a human to click.
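
As a sketch of the idea, the subfolder can expose the link as a plain Terraform output built from Azure AD’s admin-consent endpoint; the variable names are invented for the example:

    # Print the admin-consent URL so a human can grant what Terraform cannot.
    output "admin_consent_url" {
      value = format(
        "https://login.microsoftonline.com/%s/adminconsent?client_id=%s",
        var.tenant_id,   # hypothetical variable: the Azure AD tenant ID
        var.client_id    # hypothetical variable: the service principal's application (client) ID
      )
    }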

How to handle a lot of terragrunt.hcl files?

As we’re writing this, we have 15 cells and north of 50 subfolders in each cell. Even though a subfolder contains almost nothing, that still means around 750 terragrunt.hcl files!

Barney doesn’t like having a lot of terragrunt.hcl files

Modifying even a small part of these files represents a tremendous amount of work, and that would greatly reduce the adoption of the self-service as we conceived it.

We designed a small command-line interface called “infracli” to make this procedure as simple as possible while reducing code duplication. More on this command in the second article of this series :)

We are hiring :)
