How to Scale Your Terraform Quickly and Easily

Lee Chechik
HiredScore Engineering

--

In this article, I will share how we scaled our IaC (Infrastructure as Code) across teams, and our approach to handling and maintaining Terraform changes.

When we first started using Terraform, we created one repository and one state file to manage all our cloud resources (from all services and all environments).

For running Terraform operations, we had a dedicated Jenkins job that we triggered manually on every change to master.

What’s the problem?

As we grew, we added more and more services and infra resources to this single repo. This became a bottleneck and resulted in:

  • Having all our cloud resources in one state poses a high risk to production infrastructure. For instance, a change to a resource in a non-production environment can impact production. Remember, we had one state for everything.
  • The state got very big, and Terraform operations (plan & apply) took longer and longer to run.
  • More developers (not only the DevOps team) started making Terraform changes, so we had to coordinate every change with each other. We couldn’t run anything in parallel.
    This is a huge bottleneck when you need to move fast and scale out R&D.
  • We created the ECS (Amazon Elastic Container Service) resources using Terraform, but we didn’t have a proper way to update the task-definition configurations.
    This was mainly because we manage secrets and images outside of Terraform.
    As a result, we had to change things manually in the AWS console and suffered conflicts between manual changes and the Terraform state.

Considering all of the above, we realized we needed to change our Terraform workflow.

What do we want to achieve?

  • Reduce the risks of accidentally impacting production and critical infra
  • Deliver faster
  • Zero manual changes in the process — GitOps
  • Allow people to use Terraform in parallel, without blocking each other
  • Have better clarity on what cloud resources each repository has
  • Gradual updates: we can’t break our current flow

The new workflow

Each service has its own Terraform code in its own repository, and a dedicated Terraform state.
Terraform operations are now part of our CI, and we update ECS configurations automatically on deploy time.
No manual operations anymore!

How does it work?

Let’s deep dive into the implementation details -

Splitting the Terraform state & code

The main change was moving from one single repository to multiple repositories.
We separated each service’s code into its own repo, meaning the application code and the infra code live in the same repository.
So the old approach looked like this -
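The original post showed the layout as an image; a rough sketch of that single-repo structure (the paths are illustrative, not our actual repo):

```
terraform/                    # one repository for everything
├── modules/
├── services/                 # all services, all environments
└── terraform.tfstate         # one state file for all cloud resources
```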

We decided to split the state by service and by environment, while still keeping the core components and the modules in the main Terraform repository.
So the new hierarchy looks like this -
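Again as a rough sketch (repo and path names are illustrative):

```
main-terraform-repo/          # core components + shared modules
├── modules/
│   └── http-service/
└── core/                     # VPC and other shared infra

service-a/                    # application code + infra code together
├── src/
└── terraform/
    ├── staging/              # dedicated state per service, per environment
    └── production/
```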

What is the benefit of this change?

Faster Delivery
Splitting the state dramatically reduces the time it takes to run Terraform operations (fewer resources in each state) ⏰
In addition, it lets us work in parallel on different parts without blocking each other. Any change we run affects only a specific service and a specific environment.

Make it more accessible to developers
Instead of a big Terraform repository with tons of code, developers see only the infra relevant to their service. Now it’s much easier to apply changes.

Lower risk on production and critical infra
In the old setup, when all the resources were in the same state, every change or human error could affect critical infra and lead to downtime.
Splitting the Terraform code into multiple repos enabled separate states, which lowers the risk and lets us separate responsibilities.

How did we make this change?

To achieve this we needed a way to share modules & variables (as the code now resides in separate repositories).

For sharing modules, we decided to use a GitHub source (instead of local paths).

In our case, every service in each environment uses a specific module (for example, an HTTP service module that contains all the relevant resources).
So using the module is as easy as -
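A minimal sketch of such a module call, assuming a hypothetical terraform-modules repository with an http-service module (the org, repo, ref, and variable names are illustrative):

```hcl
# Pull the shared module from GitHub instead of a local path.
# Pinning a ref (tag or branch) keeps consumers on a known module version.
module "my_service" {
  source = "github.com/our-org/terraform-modules//http-service?ref=v1.2.0"

  service_name = "my-service"
  environment  = "production"
}
```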

This is still not enough, as we also need a way to share variables.
For example, in the main Terraform repository we created core components like the VPC.
Now, we want all the relevant services to be able to consume its outputs (such as vpc_id).

So here we decided to leverage Terraform’s remote state feature.
It is helpful when you need to inject the outputs of one Terraform configuration as inputs to other Terraform configurations.
Now our code for a micro-service will look like this -
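A sketch of what that can look like, assuming the core state lives in an S3 backend (the bucket, key, and names are illustrative, not our actual setup):

```hcl
# Read the outputs of the core (VPC) configuration from its remote state.
data "terraform_remote_state" "core" {
  backend = "s3"

  config = {
    bucket = "our-terraform-states"
    key    = "core/production/terraform.tfstate"
    region = "us-east-1"
  }
}

# Feed the shared vpc_id into the service's module.
module "my_service" {
  source = "github.com/our-org/terraform-modules//http-service?ref=v1.2.0"

  vpc_id = data.terraform_remote_state.core.outputs.vpc_id
}
```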

and of course, we need to make sure the outputs are defined at the root level (only root-level outputs from the remote state are accessible, see more about it here)
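In the core repository, that means exposing the value as a root-level output (the module and output names are illustrative):

```hcl
# Root-level output in the main Terraform repository.
# Only root-level outputs are visible through terraform_remote_state.
output "vpc_id" {
  value = module.vpc.vpc_id
}
```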

Automate the ECS configuration updates (instead of manual changes)

As I mentioned before, changes to the ECS resources were done manually. These were mainly task-definition configurations: CPU, memory, logging, volumes, etc.
To fix this, we added a Terraform apply step that creates the relevant resources at deploy time.
Later in the process, we inject the result into the ECS deploy command, and there are no manual operations anymore!

What is the benefit of this change?

  • Reduce manual operations to the point of eliminating them completely, removing the risk of human error
  • Manual configuration costs time; now that time is saved and the process is more efficient
  • Single source of truth — changes are managed only in git
  • No conflicts between Terraform changes and manual changes

How did we make this change?

First, we added another step to our deployment process that creates and updates the relevant configurations on the ECS resources (by running Terraform apply).
Remember, we have dedicated code for each service, so this step changes only the infra of that service, in that environment.

Now that we have the updated resources in ECS (meaning a new task-definition, created by Terraform, with the updated configurations), we just need to inject it into the ECS deploy command as the “base” task-definition to start from (by default, it takes the active one).
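As a sketch, the apply step in CI can look something like this (the directory layout and flags are assumptions, not our exact pipeline):

```shell
# Apply only this service's infra for the target environment.
cd terraform/production
terraform init -input=false
terraform apply -input=false -auto-approve
```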

We are using the ecs-deploy CLI, so the deployment command now includes the task-definition:

$ ecs deploy "cluster-name" "service-name" \
    --task "task-definition-from-terraform"

But how do we get the task-definition that was created by Terraform? 🤔
We set the task-definition ARN as an output in our Terraform code; then, using the terraform output command, we can extract it and inject it into the ECS deploy command.

$ terraform output task_def_revision
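Putting it together, the deploy step can capture that output and pass it straight to ecs-deploy (a sketch; the -raw flag requires Terraform 0.15+, on older versions the quotes need stripping):

```shell
# Extract the task-definition ARN created by the apply step...
TASK_DEF="$(terraform output -raw task_def_revision)"

# ...and use it as the base task-definition for the deployment.
ecs deploy "cluster-name" "service-name" --task "$TASK_DEF"
```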

Conclusion

In this article, I described our journey with Terraform. Here is a recap of the most important takeaways:

  • Consider automating your team’s Terraform process as much as possible, and make it part of the CI. You don’t have to pay for or grant access to managed services; with a little work, it’s easy to build a good experience yourself.
  • Having all your Terraform resources in one state can be very risky. Consider separating your states by services and environments, as this is a much healthier and safer approach.
  • A best practice is to apply Terraform changes gradually.
  • Improve the DX (Developer Experience): things like separating code into modules and keeping the infra code together with the application code can really help.

I hope you enjoyed this post, and that it inspires you to rethink and improve your Terraform processes! ❤️

Interested in this type of work? We’re always looking for talented people to join our team!

Thanks to Avner Cohen, Regev Golan, Yossi Cohn, Zion Sofer, Tal Suhareanu and Ezra Wanetik.
