
Shaping LeaseLock’s Infrastructure with Terraform (Part 1)

Timothy Ng
LeaseLock Product & Engineering

--

This is the first part in a series of articles chronicling the LeaseLock Engineering team’s efforts in migrating from a hand-managed AWS infrastructure to a fully-automated Infrastructure as Code setup.

Building software on cloud infrastructure is a lot like putting together your own desktop computer. It may seem trivial on the surface, but the process is quite involved if you want to optimize for what you have; one must not only find the right components but also make sure that the system ends up greater than the sum of its parts.

As one of the first engineers to join LeaseLock, a nascent startup at the time, one of the things that excited me most was the prospect of not only building the foundations of that desktop PC but also improving on them as the company matured. Towards the end of 2021, we’d reached a stage of the company where it made sense to rethink the cloud infrastructure that powers the entire business — here’s the story of our “desktop rebuild”.

LeaseLock’s Desktop PC

Our business experienced phenomenal growth over the course of 2021. As an engineering team, we can take pride in having built a platform that facilitated this growth with ease: tripling our insured lease value from $1B to $3B in just eleven months and quadrupling the number of Zero Deposit homes on our platform year-over-year are no feats to scoff at.

Like many rapidly growing companies, we made tradeoffs along the way. In working quickly, we incurred technical debt that would have to be repaid somewhere down the line. Many of those tradeoffs were not in favor of infrastructure — LeaseLock’s very own “desktop computer.” Its core AWS components like EC2 and RDS have stood strong since their initial provisioning in 2016. However, we’ve since loaded several new programs (products) and data onto our figurative storage drives, in addition to tacking some new components (infrastructure) onto our figurative case.

Technology moves fast in the hardware world. As consumer demands evolve, new and improved products are released at lightspeed. Before you know it, you’re longing for more processing power, more memory, and more storage. The prospect of upgrading introduces a dilemma: despite having meticulously built the existing system with intention, the temptation to place new parts where it’s easiest (but not where they belong) can lead to cutting corners.

Cloud infrastructure is no different. After all, the cloud is just hardware hidden under various API layers.

We hastily added more parts to our “desktop PC” to cope with our growth rate; think messy cables and components mounted in places they shouldn’t be. It all worked, but it culminated in an AWS infrastructure in a state of disarray: among the prevalent issues were outdated platform versions on core infrastructure components, overly permissive network design, and less-than-stellar identity and access management (IAM).

A desktop PC with exposed internals showing messy cables
Our infrastructure became more analogous to this picture over the years… (source: Cooler Master)

Above all, managing infrastructure by hand was not the smoothest experience. In a sea of configuration options in the AWS dashboard, it’s easy to miss a detail when making a change, especially when you have to do so for every deployment environment.

From a documentation and auditing perspective, every infrastructure change was accompanied by a Jira ticket; it was technically possible to use Jira or AWS CloudTrail to see why and when a change occurred, but those workflows don’t hold a candle to the simplicity of a Git commit history that we have in our application code.

We knew we’d reached the point where our desktop PC had too much going on; the waning harmony among our infrastructure components was becoming more and more noticeable. Identifying this as a ticking time bomb for our business, we decided that the “somewhere down the line” to pay down this tech debt was now.

This image is also accurate in the microcosm that is LeaseLock’s tech. (source: xkcd)

For an engineering team of about 30, Infrastructure as Code (IaC) was a natural evolution of our tech stack. The fewer manual workflows we have as a team, the easier our lives are as developers. Instead of building a PC by hand, why not have a robotic arm do it for us? We’re already writing code to automate away manual business workflows — IaC merely extends that power to cloud infrastructure: give it a set of cloud resources to provision, and it does as you ask.

The benefits of IaC are clear. When all your cloud infrastructure is defined in a handful of parameterized configuration files, you gain the ability to…

  • Version control those configuration files
  • Reproduce your infrastructure in any region or account
  • Automate infrastructure changes via continuous integration pipelines

Heading into what would become a months-long battle, we chose Terraform as our weapon of choice. We’d considered other options like Pulumi and AWS CloudFormation, but ultimately preferred the framework that was more battle-hardened. We didn’t have any DevOps specialists on our relatively small team, nor did we have experience with IaC, so we figured that Terraform’s larger community would benefit us when we inevitably encountered problems that required domain expertise to solve.
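To give a flavor of those parameterized configuration files, here’s a minimal Terraform sketch of a single resource whose environment and region are inputs; the variable names, bucket name, and region below are illustrative rather than anything from our actual setup.

```hcl
# Hypothetical, minimal example of a parameterized resource definition.
# The same file can be applied to any environment, account, or region by
# changing the input variables rather than the code itself.
variable "environment" {
  description = "Deployment environment (dev, staging, production)"
  type        = string
}

variable "aws_region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-west-2"
}

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "reports" {
  bucket = "example-reports-${var.environment}"
}
```

Because files like this live in Git, every change to them shows up as a commit and can flow through the same review and CI machinery as application code.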

After some deliberation, we came up with an outline of how to best start managing our AWS infrastructure with Terraform. Our main goals for this infrastructure migration were as follows:

  • Low-friction transition: Minimize downtime on existing production resources, and minimize the impact on existing developer workflows.
  • Better change management: Get developers out of the AWS dashboard; all infrastructure changes should be managed through Terraform. Keep our Terraform config under version control and handle infrastructure changes through CI.
  • Clearer documentation: Each change to our infrastructure (each pull request) must be accompanied by a feature/bug ticket. Git commit history can be used to quickly get a bird’s-eye view of past infrastructure changes.
  • Refactoring: Clean up unused resources, update platform versions and instance classes, and apply industry best practices to our infrastructure (some components, particularly on the networking side, were built to a “just make it work” spec).

Workflow Changes

Before diving into the infrastructure itself, let’s talk workflow.

We wanted infrastructure changes to be treated with the same (if not more) scrutiny as code changes, so we kept all our Terraform configurations version-controlled in a GitHub repository. We structured this repository in the same way we do our application repositories, providing a familiar workflow where we could carry out code reviews for infrastructure just as we do for our application code.

This alone was a huge win security-wise. Priority numero uno was to get our teammates out of the AWS dashboard; the dashboard should be read-only for everyone, and we didn’t want to make any infrastructure changes outside of Terraform except in an emergency.

Hand-in-hand with the developer workflow is our continuous integration pipeline. Our setup is pretty simple: we create feature branches that merge into our default branch, develop. Once reviewed, merges into develop are deployed to the development environment automatically. Changes are then promoted to the staging and production environments by merging develop into staging and staging into master, respectively. To mirror this in the Terraform world, the continuous integration workflow we devised is as follows:

Our Terraform GitHub + GitHub Actions Workflow

Note: To keep this article from growing longer than it already is, I’ve left out how we specifically set up Terraform Cloud to run our Terraform changes for us. For the purposes of this article, imagine that all it does is run Terraform CLI commands when triggered by GitHub Actions.

New Infrastructure: Same But Different

Now, back to the infrastructure.

For the reasons previously stated, our battle plan was to “build a new PC” alongside our existing one.

To minimize downtime while learning the ins and outs of Terraform, the idea was to spin up a near* mirror image of our existing infrastructure with Terraform. This would allow us to start off by running Terraform changes against a sandbox AWS account, mitigating the risk of affecting production infrastructure.

Once we were comfortable with our Terraform configurations, we moved on to provisioning this mirror infrastructure in our production accounts, alongside the existing (legacy) resources, simply by switching out the AWS credentials that Terraform Cloud uses.
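In practice, that credential swap is possible because nothing account-specific lives in the Terraform code itself. As a rough, hypothetical sketch (not our exact configuration):

```hcl
# The provider block holds no credentials. Terraform Cloud injects
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as workspace environment
# variables, which the AWS provider picks up automatically, so pointing a
# workspace at the sandbox account versus a production account is just a
# matter of swapping those workspace variables.
variable "aws_region" {
  type    = string
  default = "us-west-2"
}

provider "aws" {
  region = var.aws_region
}
```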

Without revealing too much about what goes into our secret sauce at LeaseLock, a rough overview of our AWS account is as follows:

An overview of our AWS account, showing abstractly the components that power our platform together with the newly-provisioned Terraform mirror.

Note that while all of these resources were provisioned, they were not yet used or referenced anywhere outside of Terraform — how we migrated from our legacy infrastructure over to the Terraformed infrastructure is a story for another time!

Remember that asterisk when I said, “near* mirror image”? One of the biggest benefits of building a completely new set of infrastructure components was that we could identify which parts of our existing infrastructure were no longer in use. We could then avoid provisioning said resources in Terraform, saving us precious time and money.

The greenfield nature of our strategy also provided us with a couple of low-hanging but significant wins not shown in the figure above:

Better VPC & Subnet Structure

Our VPCs and subnets were set up to meet the bare minimum requirements of getting our platform up and running: a basic strategy of placing publicly accessible resources in public subnets and everything else in private subnets.

With a nagging feeling that we could beef up our overall durability on this front, we scoured the internet to see what practices were recommended by the industry and ultimately happened upon this fantastic blog post by Gruntwork.

We followed most of their recommendations while also adjusting them for our needs. The key improvements in the networking layer for us (sketched in Terraform after this list) were:

  • Isolation of environments: Each deployment environment should have its own VPC. A dev resource should only ever be aware of the existence of other dev resources as if other environments (like staging and production) don’t exist.
  • Tiered subnets and subnet access: The defense in depth strategy. Some resources, like Redis, have no business being in a public subnet or even a private subnet adjacent to one, so they can be tucked away deep in our VPC in a subnet that only internal resources can access.
  • Multi-Availability Zone (AZ) deployments: For redundancy, every resource that can exist in a multi-AZ configuration, such as RDS, should be provisioned as such.
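A rough Terraform sketch of that layout for a single environment might look like the following; the CIDR ranges, availability zones, and names are made up, and supporting pieces like route tables, NAT gateways, and network ACLs are omitted.

```hcl
# Rough sketch of a tiered, multi-AZ layout for one environment's VPC.
locals {
  azs = ["us-west-2a", "us-west-2b"]
}

resource "aws_vpc" "dev" {
  cidr_block = "10.10.0.0/16"
}

# Public subnets: load balancers and other internet-facing resources.
resource "aws_subnet" "public" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.dev.id
  cidr_block        = cidrsubnet(aws_vpc.dev.cidr_block, 8, count.index)
  availability_zone = local.azs[count.index]
}

# Private subnets: application servers, reachable only through the public tier.
resource "aws_subnet" "private" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.dev.id
  cidr_block        = cidrsubnet(aws_vpc.dev.cidr_block, 8, 10 + count.index)
  availability_zone = local.azs[count.index]
}

# Data subnets: databases and caches like RDS and Redis, with no internet
# route at all and reachable only from the private tier.
resource "aws_subnet" "data" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.dev.id
  cidr_block        = cidrsubnet(aws_vpc.dev.cidr_block, 8, 20 + count.index)
  availability_zone = local.azs[count.index]
}
```

The point is the shape: one VPC per environment, with each tier duplicated across availability zones and the data tier kept unreachable from the outside world.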

Consistent Resource Naming & Tagging

Much like variable names in code, good cloud resource names are crucial for understanding a resource’s purpose. While most of our resources were aptly named over the years, we didn’t really have a naming convention for them like we do in code.

Using the terraform-null-label module, we guaranteed that each resource provisioned with Terraform would be named in a consistent format. Since we were standing up the Terraform infrastructure in the same AWS account as our legacy infrastructure, this naming convention made it easy to identify what was a “Terraformed” resource and what was not.

A tagging convention also came for free with the terraform-null-label module. Consistent tags across resources help us catalog our resources and also simplify the generation of various reports, such as billing costs by deployment environment.
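As an illustration of how the module is wired in (with made-up namespace, environment, and name values rather than our actual convention), a labeled resource might look like this:

```hcl
# Illustrative use of terraform-null-label; the values below are hypothetical.
module "api_label" {
  source  = "cloudposse/label/null"
  version = "0.25.0"

  namespace   = "ll"
  environment = "dev"
  name        = "api"

  tags = {
    ManagedBy = "terraform"
  }
}

# Resources then borrow both their name and their tag set from the label,
# yielding consistent names across everything Terraform manages.
resource "aws_sqs_queue" "api_events" {
  name = module.api_label.id
  tags = module.api_label.tags
}
```

With the delimiter and label order left at their defaults, the generated id comes out as ll-dev-api, and the tags output carries the module’s generated tags plus anything extra passed in.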

To recap, we’ve now reached a point where we have an improved version of our existing infrastructure running alongside it. Without even using the new infrastructure, the benefits of IaC were already showing as we iterated on our Terraform configuration. From redeploying our entire infrastructure in a sandbox account to rolling back a bad change, the usefulness of having our entire cloud infrastructure defined in code cannot be overstated.

Our next task was to start utilizing this new “Terraformed” infrastructure with our application code. This needed to be done with surgical precision to keep downtime at a minimum. To find out how we did it, keep your eyes peeled for a follow-up blog post — follow our page!

If you’re interested in tackling problems like this with us, visit our careers page for information about available roles. If you don’t see the role you’re looking for, you can also reach out to us directly at talent@leaselock.com.
