What, Why, How and Who? A story about Review Apps!

CAB · Yaguara Office Hours · Sep 22, 2020 · 7 min read

The What

I have often thought to myself:

Wouldn't it be great if I could deploy a brand new application, hosted in a production-like environment, that I can share with the different departments of my company without causing any side-effects?

Lucky for me, that's precisely what Review Apps help to solve.

The Why

The Review Apps concept came about when we decided to rebuild our infrastructure. As a philosophy, we always like to build software with a fail-thoughtfully approach.

Failing thoughtfully is the process of building something while testing the unknown or critical path to see whether the assumption holds. If it doesn't, learn from it and iteratively attempt a different approach until you hit something that works.

In the case of the infrastructure redesign, we had to come up with a specific use-case to test the design with, and that use-case was Review Apps. Fortunately, it checked both boxes: solving a problem we had internally and testing the new infrastructure.

The problem we faced internally was this: when a feature was ready to be sent to the quality assurance team (so that our customers wouldn't see broken features), we would deploy it onto Staging and test it there.

This meant that many different features were being tested at the same time, which caused confusion. More importantly, our Staging environment was also the one used to demo to clients. A feature might be pushed to Staging for testing, turn out to have bugs that reduced the quality of our product, and have to be reverted quickly to ensure that demos could still happen.

In summary, the problem we were trying to solve was having environments that could easily be deployed for a specific branch.

Consequently, we've used that same concept to create demo environments: a space for every member of the sales team that can be used to demo our platform to clients. Each demo environment starts with pre-created companies, users and OKRs, and resets every day without a significant amount of lift.

To be transparent, the first time we heard about this concept was through Heroku's Review Apps functionality, and I found the idea pretty thoughtful.

Unfortunately, AWS is our provider and we wanted the Review Apps to be as close to our production environment as possible, so we ended up not picking Heroku.

The How

By going at it iteratively.

Using Terraform, we're able to extract common concepts, such as a service from our set of micro-services, into modules. These modules can then be parameterized so they can easily be deployed in any environment.

For instance, we know that for our application to work we need:

  • A Database (Amazon RDS)
  • A Cache (Amazon ElastiCache)
  • A Network (Amazon VPC)
  • Our services (Amazon ECS)

So we built these pieces of the infrastructure, creating modules where we could so as not to repeat ourselves.
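To make that concrete, here is a minimal sketch of what one of those reusable modules and a per-environment call could look like. The module layout, variable names and values below are illustrative assumptions, not our exact configuration.

```hcl
# modules/cache/main.tf - a simplified, hypothetical ElastiCache module
variable "environment" {
  description = "Environment name, e.g. production, staging or review-my-branch"
  type        = string
}

variable "node_type" {
  description = "Cache instance size; kept small for review apps"
  type        = string
  default     = "cache.t3.micro"
}

resource "aws_elasticache_cluster" "this" {
  cluster_id      = "cache-${var.environment}"
  engine          = "redis"
  node_type       = var.node_type
  num_cache_nodes = 1
  port            = 6379
}

output "cache_endpoint" {
  value = aws_elasticache_cluster.this.cache_nodes[0].address
}
```

```hcl
# environments/review/main.tf - consuming the module for a given environment
module "cache" {
  source      = "../../modules/cache"
  environment = "review-my-feature"
}
```

The same building block can then serve production, staging and every review app simply by passing a different environment value.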

We started with our API service, as it required the cache to queue specific events as well as a few other dependencies, such as load balancers.

The first iteration was one big file that contained all the different resources the API required to run. But we hit our first problem real quick.

Challenge 1: Code Modularity

When we would run terraform apply on the API service, it would spin up an ElastiCache instance and other resources, which is what we wanted in theory. In practice, when we deployed the Web Sockets service, which also required a cache, we would spin up another ElastiCache instance.

The solution was to better modularize our Terraform files so that we could reference a single ElastiCache instance shared between both services. Here is what our folder structure looks like:

An example of what our folder structure looks like.
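Roughly, the layout groups the reusable modules in one place and gives each deployable unit its own folder; the exact names below are assumptions:

```
terraform/
├── modules/
│   ├── service/      # generic ECS service (task definition, load balancer, ...)
│   ├── cache/        # ElastiCache cluster
│   ├── database/     # RDS instance
│   └── network/      # VPC, subnets, security groups
├── cache/            # deploys the single shared cache
├── api/              # API service, references the shared cache
└── websockets/       # Web Sockets service, references the same cache
```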

Challenge 2: Shared States

So how do we share state between the different resources? For example, how does the API know the hostname of the ElastiCache instance? The answer is simple: Backends!

Backends allow you to save the “state” or “information” about the resources that you’re building.

An example of what one of our services might look like.
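As a minimal sketch, assuming an S3 bucket named acme-terraform-states and a cache state that exports a cache_endpoint output (both hypothetical), a service's configuration could read the shared cache like this:

```hcl
# api/main.tf - hypothetical API service reading the shared cache's state
terraform {
  backend "s3" {
    bucket = "acme-terraform-states"       # assumed bucket name
    key    = "review-apps/api.tfstate"
    region = "us-east-1"
  }
}

# Pull in the outputs exported by the cache's own state file
data "terraform_remote_state" "cache" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-states"
    key    = "review-apps/cache.tfstate"
    region = "us-east-1"
  }
}

# The API can now be pointed at the shared cache endpoint
locals {
  redis_host = data.terraform_remote_state.cache.outputs.cache_endpoint
}
```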

There are multiple backends available; by default, Terraform saves the state in a local file relative to where you ran the terraform apply command. We've opted to save the state on Amazon S3, because many engineers may need to use the same cache instance, and if the state only lived on one engineer's machine, running Terraform from another machine would create a brand new ElastiCache instance.

Challenge 3: Ensuring build orders

How do you ensure that the cache is built before anything tries to access it? Tough question! You could deploy the cache by hand by running terraform apply in the cache folder and, once it's done, do the same for the API service.

Unfortunately, that doesn't lend itself well to automation. When you run terraform apply, you'll see all the changes that are about to be applied, and you need to manually approve them by typing yes, or decline them by typing anything else.

An alternative is to use the -auto-approve flag, which automatically approves any changes that may occur.
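Concretely, and assuming the cache and api folders from the structure above, the hands-off version of that two-step deploy boils down to:

```sh
# Build the shared cache first, then the API, without interactive prompts
cd cache  && terraform init && terraform apply -auto-approve
cd ../api && terraform init && terraform apply -auto-approve
```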

Having clarified that, we've opted to automate our Review Apps deployment using GitLab Pipelines. While it's possible that a Terraform change could break something, it's a risk we're ready to take for our Review App deployments, as they're not business-critical.

We've created multiple stages in our GitLab Pipeline that reflect the different dependencies, so that resources are built in the right order.
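As a simplified sketch (the stage and job names here are assumptions, not our actual pipeline), that ordering can be expressed with GitLab stages:

```yaml
# .gitlab-ci.yml - ordered infrastructure stages (simplified sketch)
stages:
  - network
  - data-stores     # cache and database
  - services        # API, Web Sockets, ...
  - post-deploy

deploy_cache:
  stage: data-stores
  script:
    - cd terraform/cache
    - terraform init
    - terraform apply -auto-approve

deploy_api:
  stage: services   # only starts once every data-stores job has finished
  script:
    - cd terraform/api
    - terraform init
    - terraform apply -auto-approve
```

Because a stage only starts once the previous stage has finished, the cache is guaranteed to exist before the API job tries to read its state.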

There are many other stages that we've created, such as post-deploy and post-build, which serve different purposes, such as automatically generating an internal changelog for each release.

For critical environments such as Production, we've opted for the standard approach of running terraform apply manually when we want to update resources.

Challenge 4: Concurrent runs

All is great: we've automated our different services and everything is being deployed in order. But what happens to the state when an engineer decides to manually update the Review App ElastiCache while an automated GitLab job is running that also modifies the ElastiCache? We'd run into undefined behaviours.

Fortunately, Terraform also has an answer for that: State Locking, which "locks" the state so it can't be modified concurrently and potentially corrupted.

With our backend, this can easily be done by simply adding a dynamodb_table value.

Saving the state of that specific piece of infrastructure to S3 and using a DynamoDB table to manage the state lock.
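Assuming the same hypothetical bucket as earlier and a DynamoDB table named terraform-state-locks, the backend block looks roughly like this:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-states"   # assumed bucket name
    key            = "review-apps/api.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"   # acquires a lock before each run
  }
}
```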

Challenge 5: Manual automation

We're in good shape! Every time we create a Merge Request (or a Pull Request, in GitHub's lingo), we deploy our application "automatically" using GitLab's pipeline feature. The states are saved remotely, and concurrent runs, which could cause unexpected side-effects, are prevented. What's left?

A few things!

The whole deployment flow would take approximately 5–7 minutes. That is not a huge amount of time, but it still is "some" time. We didn't always want to deploy a review app automatically for every git push and wait an additional 5–7 minutes; sometimes we would just want to rebase onto the base branch, and we'd have to wait.

Fortunately, GitLab offers "manual" pipeline jobs. Unfortunately, while you can manually start all jobs in a stage with the click of a button, that wasn't useful for us. For example, we have a stage named "deploy" that pushes our containers to ECS for every single service, adding up to 7 jobs, and for each of those jobs there is a corresponding "destroy" job that tears the resources down. If we were to start all the jobs within that stage, it would both deploy and destroy the services, which isn't really useful.

We came up with the idea of creating a job that calls the GitLab API via a curl request, passing a specific environment variable. We then modified each job so that it only runs if that environment variable is set to true. That way, we can have a manual job that, when pressed, spins up another GitLab pipeline with a specific environment variable that automatically runs all the jobs required to build and deploy the review app.

An example of the manual “deploy to review app” button once the linter and tests have passed for a given merge request.
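A rough sketch of the pattern, assuming a variable named DEPLOY_REVIEW_APP (the variable name and job names are made up for illustration):

```yaml
# Manual button: triggers a new pipeline on the same branch with the flag set
deploy_review_app:
  stage: review
  when: manual
  script:
    - >
      curl --request POST
      --form "token=$CI_JOB_TOKEN"
      --form "ref=$CI_COMMIT_REF_NAME"
      --form "variables[DEPLOY_REVIEW_APP]=true"
      "https://gitlab.com/api/v4/projects/$CI_PROJECT_ID/trigger/pipeline"

# Infrastructure jobs only run in the triggered pipeline where the flag is set
deploy_api:
  stage: services
  only:
    variables:
      - $DEPLOY_REVIEW_APP == "true"
  script:
    - cd terraform/api
    - terraform init
    - terraform apply -auto-approve
```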

Challenge 6: Multiple of the same

Now that we can deploy our review app by pressing a button, we're running into one last challenge. Remember how we were saving the states using Terraform's Backend feature?

Imagine the scenario where you have two feature branches and you want to deploy a review app for each. Every time you pressed the deploy review app button, it would overwrite the most recently deployed API, because the "key" of the state is static.

Configuring our services based on variables passed from the CI/CD pipeline.

Unfortunately, you can't use Terraform variables in a backend block to set the key dynamically, but there is a "hacky" workaround that works just the same.

You can use the -backend-config flag during Terraform's initialization phase (terraform init), which takes care of installing all the modules for a specific set of resources. That flag allows you to override the key of the state; in other words, it lets you dynamically set where the state is going to be saved for a group of resources.
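In a GitLab job, that could look like the following sketch, where the branch slug is injected into the state key so that each branch gets its own state (the bucket layout and job name are assumptions):

```yaml
deploy_api:
  stage: services
  script:
    - cd terraform/api
    # Each branch writes to its own state key, so review apps don't collide
    - terraform init -backend-config="key=review-apps/${CI_COMMIT_REF_SLUG}/api.tfstate"
    - terraform apply -auto-approve
```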

This then allows us to deploy many instances of the same services under different keys, giving us a unique API for every review app.

A visualization of the hosts for our Review Apps using Datadog. Each hexagon represents an EC2 machine hosting one of our services.

The Who

While these review apps are specifically targeted at improving developers' happiness, there are also other clear benefits to having this concept in place.

  • It increases efficiency towards testing and demoing features
  • It isolates the data (i.e. changes made on one review app will not impact another review app)
  • It gets cross-functional teams involved quicker (better feedback loop during development time)
  • It makes testing more accurate, since it uses infrastructure similar to Production

It also facilitates the data team's day-to-day activities. In a usual setting, tests are run on Production or Staging, and the data team needs to constantly filter out that test data to provide an accurate picture of key metrics. In other words, it separates test data from real data.

What are your different environments and what are the problems that they are solving? We’d love to hear your thoughts!

CAB

Data engineer at Chord.co. ex-Yaguara, ex-Glossier and ex-Dynamo.