Finding the right balance between Terraform and CloudFormation
At Melio, we built our system on top of AWS’ cloud native infrastructure. When you do so, you shouldn’t configure the environment and applications manually, but use tools designed for that job instead. To help us do that, we’re using a combination of Terraform and CloudFormation. It might sound counterproductive, but this combination brings out the best of both. In this post I’ll explain why and how we did it, and some of the challenges we faced.
A contextual introduction
Terraform, one of the industry’s leading tools, was our first choice. What we liked about it was its flexibility and vast ecosystem of modules and plugins. We used this tool to manage Melio’s entire stack and system, including our databases, the configuration, queues, Lambda functions, you name it. There were, however, a few compromises.
A general recommendation for Terraform is to avoid running it in a fully automated manner. This meant that using it as our deployment tool came with a couple of policies:
- Unsupervised (automated) runs of Terraform had to use a very narrow “-target” flag, to avoid accidental resource destruction.
- To reduce our usage of the “-target” flag, Lambda code was directly pushed from our deploy pipeline.
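A narrow, targeted run like the one described above might look like this (the resource name is illustrative, not from our actual stack):

```shell
# Automated apply scoped to a single resource: only the targeted
# resource may change, so an unrelated diff in the plan cannot
# destroy anything else in the environment.
terraform apply -target=aws_lambda_function.payments -auto-approve
```

The trade-off is that targeted applies can leave the rest of the state drifted, which is one reason Terraform discourages routine use of -target.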
So Terraform mainly managed the structure of the environment, and code was pushed from “outside” into resources allocated by Terraform. This created a chicken-and-egg problem of varying degrees: if a code update required an infrastructure change, we had to meticulously coordinate the code deployment with a terraform apply, which made a big part of the deployment process manual, and tedious as a result.
Alongside these technical challenges, we also hit a few bumps in the form of operations. The goal was to let the teams manage their stack, and when Melio was small enough this was fairly straightforward. However, once we passed a certain number of people and teams working on the same system, coordination became a real issue.
We needed to adjust the process, and find a tool that would help us package code changes along with the related cloud resources’ changes.
Enter CloudFormation (and SAM)
For a while, we tried to avoid CloudFormation. We didn’t want to mix tools that solve similar problems, and we thought CloudFormation was a complex and inflexible tool. However, CloudFormation’s ecosystem has some very neat features.
- It’s hosted and managed by design, which fits our overall cloud native approach.
- It is accessed through regular AWS APIs, so permissions can be granted with the same granularity as any other AWS service.
- With SAM, services’ code and resources can be packaged and made ready for deployment when needed, regardless of the deployed environment.
Combining the above, this means that we can also give engineers and teams the ability to manage and deploy their own stacks. But of course, as with any technological change, there were some challenges to overcome first.
As a side note, we considered using CDK instead of SAM to create our CloudFormation templates. Eventually we decided to go with SAM, but this decision is beyond the scope of this post.
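To illustrate the packaging point above, a typical SAM flow separates building an artifact from deploying it to a given environment. A sketch (bucket and stack names here are illustrative):

```shell
# Package once: uploads code artifacts to S3 and rewrites the
# template to point at them.
sam package --template-file template.yaml \
    --s3-bucket my-artifact-bucket \
    --output-template-file packaged.yaml

# Deploy the same packaged artifact to any environment.
sam deploy --template-file packaged.yaml \
    --stack-name my-service-staging \
    --parameter-overrides Environment=staging \
    --capabilities CAPABILITY_IAM
```

The packaged template is environment-agnostic; the environment is chosen only at deploy time via parameters.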
Splitting ownership and responsibilities
One of the first challenges was deciding which components and resources would be managed through Terraform and which through CloudFormation. Going back to our reasons for using CloudFormation, we devised a very simple policy to guide us:
- Database resources are managed by Terraform.
- Infrastructural resources (like SNS topics or VPCs) that are shared by many services are managed by Terraform.
- Application level resources (S3 buckets, Lambda functions, Step Functions, EMR clusters, DynamoDB, etc.), and anything that doesn’t fall into the shared category above, is managed by the teams’ CloudFormation stack.
As with any policy, there are always exceptions. For example, although DynamoDB tables are essentially a database they’re still managed by the teams’ CloudFormation templates. This is more of a tactical decision to make it easy for the teams to manage more of their resources.
One of the guiding principles was that resources managed by Terraform, being owned by a single team, should be updated rarely if at all. VPCs, for example, are created once with a set of subnets and almost never change; the same goes for SNS topics and similar shared resources.
So now that we had Terraform managing infrastructural components, and CloudFormation managing the application stacks, we needed a way to share resource identifiers from Terraform to CloudFormation (and possibly between CloudFormation stacks).
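One natural way to share such identifiers, sketched below with assumed names, is for Terraform to publish them as SSM parameters that CloudFormation stacks can later resolve:

```hcl
# Sketch (resource and variable names are illustrative): publish a
# Terraform-managed identifier to SSM Parameter Store so that
# CloudFormation stacks can look it up at deploy time.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/${var.environment}/infra/vpc-id"
  type  = "String"
  value = aws_vpc.main.id
}
```

Keeping the parameter path keyed by environment lets the same CloudFormation template resolve the right value in every environment.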
Referencing resources from Terraform to CloudFormation
One thing that Terraform does well is managing dependencies between resources, even when fetching references from external resources. This is a key feature that we needed in our CloudFormation templates, since our application stacks depend on environmental resources (like SNS topics, RDS endpoints and VPC ids).
External references from the environment are required during CloudFormation’s convergence phase, not after the stacks are deployed. We needed this so that we could grant exact permissions between Melio’s services, create resources which depend on a resource reference (like an SNS topic subscription), and inject environment variables into a Lambda function.
There are a few ways to fetch or pass external resources using a CloudFormation template.
- Passing them in a template’s Parameters section.
- Using “ImportValue” intrinsic function.
- Using dynamic references.
Passing parameters to a template is a good place to start, but you quickly realize that you need to either manually look up values from the system, or write some kind of wrapper around CloudFormation that does the lookups for you before applying. We didn’t want to do this, because we wanted our CloudFormation templates to have as few requirements as possible.
Besides the complexity of wrapping CloudFormation calls with reference lookups, our guideline is that CloudFormation templates should require a single “Environment” parameter. Using that parameter they should be able to provision themselves and get any parameter they need.
ImportValue is also a very nice feature. It is used when another CloudFormation stack has exports that you would like to use. It works as expected, but it has one very big gotcha: you cannot update an exported value while another stack imports it. The import essentially locks the exported value, preventing modifications.
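For illustration, the export/import pair might look like this (stack and resource names are assumed, not from our actual templates):

```yaml
# In the exporting (infrastructure) stack:
Outputs:
  SharedTopicArn:
    Value: !Ref SharedTopic
    Export:
      Name: shared-topic-arn

# In the importing (application) stack -- as soon as this import
# exists, the export above can no longer be changed or removed:
Resources:
  Subscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !ImportValue shared-topic-arn
      Protocol: sqs
      Endpoint: !GetAtt SomeQueue.Arn
```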
Dynamic references are a little trick CloudFormation has up its sleeve. They allow a simple “string replace” of external references from either SSM or Secrets Manager. You simply put {{resolve:ssm:PARAMETER}} anywhere in your template and CloudFormation fetches the value for you. There is a small caveat, however: the resolution happens when a resource is provisioned, and the value then stays as it is. If you update the SSM parameter or secret, CloudFormation will not re-fetch the value on the next run. It only re-fetches the reference when something in the template changes and triggers an update to the resource itself.
Another limitation of dynamic references is that intrinsic functions of CloudFormation (e.g. Fn::Split) will process the pre-resolved string and not the resolved one. For example, splitting a string expecting to get a list of subnets will not generate what a template author would normally expect.
```yaml
# Consider the following part
...
SubnetIds: !Split [ ",", "{{resolve:ssm:/subnet-ids}}" ]

# Although this is what you expected...
...
SubnetIds:
  - subnet-7b5b4112
  - subnet-7b5b4115

# ...the above actually happens in reverse.
# 1. CloudFormation runs the Fn::Split function:
...
SubnetIds:
  - "{{resolve:ssm:/subnet-ids}}"

# 2. And only then resolves, so you end up with:
...
SubnetIds:
  - subnet-7b5b4112,subnet-7b5b4115
```
In the example above, !Split generates an array with a single value, and only then is that value resolved using the dynamic references feature. Obviously this is not what we wanted.
This behavior is counter-intuitive to what we were used to from Terraform, so we looked for something that could provide both.
Early resolving references
We wanted the simplicity of a dynamic resolve, but we also wanted it to stay up-to-date if one of the references changed. We wanted real dynamic references, but also have the intrinsic functions applied to them. Going with the SAM theme, we wrote a very simple macro that performs an “early resolve” and mutates the template to include referenced parameters.
The way it works is very similar to dynamic resolve: it fetches values from SSM, but it modifies the template directly to include the referenced values early in the CloudFormation run. When the macro is specified, it looks up structures like {{early-resolve:ssm:/some/path}} and replaces them with the resolved value. It also does simple parameter replacement, similar to the Fn::Sub intrinsic function, to allow for environment-specific lookups.
```yaml
Transform:
  - AWS::Serverless-2016-10-31
  - EarlyResolve
Parameters:
  Environment:
    Type: String
Resources:
  SomeResource:
    Type: AWS::Resource::Type
    Properties:
      PropKey: "{{early-resolve:ssm:/${Environment}/infra/vpc-id}}"
```
With this, we can safely export parameters from Terraform (or other CloudFormation stacks) for other application stacks to use. We use that to point applications to our RDS proxy, our SNS topics, and other infrastructure shared objects.
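The core of such a macro is just a recursive string rewrite over the template fragment. The sketch below is a hypothetical illustration of that idea, not Melio’s released implementation: it substitutes ${Param} references first (as Fn::Sub would), then replaces each early-resolve marker using a pluggable resolver (SSM in a real macro, a plain dict here).

```python
import re

# Marker format assumed from the template example above.
PATTERN = re.compile(r"\{\{early-resolve:ssm:([^}]+)\}\}")

def early_resolve(fragment, parameters, resolver):
    """Recursively rewrite strings in a CloudFormation template fragment.

    parameters: stack parameter names mapped to their values.
    resolver:   callable taking an SSM parameter path, returning its value.
    """
    if isinstance(fragment, dict):
        return {k: early_resolve(v, parameters, resolver)
                for k, v in fragment.items()}
    if isinstance(fragment, list):
        return [early_resolve(v, parameters, resolver) for v in fragment]
    if isinstance(fragment, str):
        # 1. Substitute ${Param} occurrences, similar to Fn::Sub.
        for name, value in parameters.items():
            fragment = fragment.replace("${%s}" % name, value)
        # 2. Replace each marker with the looked-up value.
        return PATTERN.sub(lambda m: resolver(m.group(1)), fragment)
    return fragment
```

Because the replacement happens inside the template itself, any intrinsic function applied afterwards (Fn::Split, Fn::Sub, etc.) operates on the already-resolved value, which is exactly the behavior dynamic references lack.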
Our CloudFormation templates are now Melio-specific, but environment-agnostic. They allow us to provision most of our applications (stacks) in any Melio-enabled environment. Of course, we still have a few prerequisites, like the macro mentioned above and some infrastructure resources.
Summary
Although the solution above is fairly straightforward and simple, that is exactly why it’s so powerful. It keeps the templates simple, requiring only a single parameter as input, and it fits in well with other CloudFormation features (dynamic resolve, macros).
It decouples the infrastructure layer from the application stacks, allowing each to be updated independently while still maintaining good integration between them. This frees the DevOps team to focus on infrastructure and tooling, and at the same time gives engineering teams the power to build their applications as they see fit.
Sometimes there’s a missing piece from the puzzle that makes everything work together. In this case this macro was what we needed. I hope this post inspires you to look for this puzzle piece in the next project you’re working on.
As part of this post, we also released the early resolve macro source for anyone to use.