Scaling Infrastructure Workflows
At Flatiron Health, we unite clinicians, designers, product managers, data scientists and engineers in building novel technology that powers oncology practices and helps curate real-world datasets to accelerate cancer research.
In order to outsmart cancer, we need our platforms and cloud infrastructure to meet our business and technology demands.
As we grow and innovate, we continue to face an important challenge: designing infrastructure workflows that scale with our business and technology growth. If we can’t respond to our ever-increasing infrastructure needs, our technology will lag behind.
This is the story of how we’re overcoming this challenge and learning from our past mistakes.
Like many engineering organizations, we relied heavily on Continuous Integration and Continuous Delivery (CI/CD), powered by our Jenkins automation servers. We reaped the many benefits of CI/CD, from fault isolation to repeatability, which let us collaborate effectively on the technology we were building.
While we used Terraform for managing infrastructure, that was only half of the equation. We needed to orchestrate our Terraform workflows in a way that leveraged the collaborative benefits of CI/CD to support our infrastructure software development life cycles (SDLCs).
Our First Solution
We were already using Jenkins for our CI/CD needs, and we required CI/CD for Terraform. Simply enough, we combined them by running Terraform on Jenkins.
As we had expected, this solution worked quite well and fit alongside our other CI/CD workflows, helping engineers hit the ground running.
Engineers also needed to configure certain Terraform settings, which require some background knowledge to understand.
Now, a brief lesson on Terraform.
A key component of Terraform is its state, which maps code to real-world resources so that Terraform can accurately apply infrastructure changes. State sometimes contains sensitive data, like database passwords. It is absolutely critical to Terraform’s operation, and by default it is stored locally on disk, unencrypted. Terraform backends are often used to store state remotely and sometimes encrypt it, enabling teams to work together on infrastructure without manually sharing sensitive state information with each other. Certain backends support state locking, which permits only one state-writing operation (such as terraform apply) at a time, to prevent concurrent operations from conflicting with each other and potentially corrupting state or causing other undefined behavior. Terraform supports many different backend types with varying features.
With that background in mind, engineers could easily configure an S3 backend in their Terraform settings block.
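The original snippet is not reproduced here, so the following is only a minimal sketch of what such an S3 backend block might look like. The bucket name, state key, region, and DynamoDB table name are hypothetical placeholders, and using a DynamoDB table for state locking is an assumption about the setup rather than something the original confirms.

```hcl
terraform {
  backend "s3" {
    # Hypothetical bucket and key; each project would use its own state key.
    bucket = "example-terraform-state"
    key    = "my-project/terraform.tfstate"
    region = "us-east-1"

    # Ask S3 to encrypt the state object at rest.
    encrypt = true

    # Hypothetical DynamoDB table, assumed here to provide state locking.
    dynamodb_table = "example-terraform-locks"
  }
}
```

After adding or changing a backend block like this, running terraform init initializes the backend for the working directory.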
With that in place, they were ready to go. Engineers onboarded to this system by configuring their Jenkins pipeline and Terraform settings, allowing them to execute Terraform on Jenkins and leverage the benefits of Terraform’s S3 backend.
Engineers could safely and effectively collaborate on infrastructure, or so we thought.
Terraform on Jenkins technically worked, but a lot of engineers were pretty unhappy with it, for a few reasons:
All Jenkins pipelines had to use the same Terraform version installed on the Jenkins worker. Engineers had to install the exact same Terraform version for local Terraform inits and plans, or they would hit version mismatch errors. The entire company had to align on a single Terraform version and coordinate version updates, which kept engineers from using functionality in newer versions.
Each Jenkins pipeline required tedious setup. Between writing Jenkinsfiles (which we somewhat mitigated with cookiecutter templates) and configuring the pipelines themselves, engineers spent more time toiling and less time fighting cancer.
We had little to no insight into our Terraform infrastructure costs. Infrastructure as code was nice, until we saw our cloud provider bill each month. We had no easy way to aggregate and analyze costs from each Terraform configuration, so we were flying blind.
All Terraform changes required detailed code review by our internal security team. Our infrastructure touches Protected Health Information (PHI) and therefore must conform to Flatiron’s cloud infrastructure security standards: IAM permissions for accessing specific PHI must be explicitly approved, EC2 instances must use a Flatiron security-hardened AMI, security group ingress CIDR blocks must be restricted to the VPC only (10.0.0.0/8 or stricter), databases and object stores must enable encryption at rest, and so on. Unfortunately, there was no easy way to check all of this programmatically, leaving engineers to wait for security code reviews on every Terraform code change.
Terraform on Jenkins was a functional system, but it created considerable friction for engineers. Something needed to change.
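To make two of those standards concrete, here is a minimal sketch of the kind of Terraform code a reviewer would have been checking by hand: an ingress rule limited to 10.0.0.0/8 and a database with encryption at rest enabled. The resource names, variables, port, engine, and instance sizing are all hypothetical and are not Flatiron's actual configuration.

```hcl
# Hypothetical inputs, declared so the sketch is self-contained.
variable "vpc_id" {
  type = string
}

variable "db_password" {
  type = string
}

resource "aws_security_group" "app" {
  name   = "example-app"
  vpc_id = var.vpc_id

  # Ingress restricted to private address space, per the CIDR rule above.
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }
}

resource "aws_db_instance" "app" {
  identifier          = "example-app-db"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 20
  username            = "app"
  password            = var.db_password
  skip_final_snapshot = true

  # Encryption at rest, per the database rule above.
  storage_encrypted = true
}
```

A reviewer had to confirm details like cidr_blocks and storage_encrypted on every change, which is the manual step that caused the wait.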
Build vs. Buy
We started our usual Build vs. Buy deliberation to evaluate how we’d solve these issues with in-house or external solutions.
As any engineer would, we first brainstormed solutions to Build our way out.
Some of the problems were easy to fix, like our Terraform versioning issue — engineers could specify their desired version through a Jenkins build parameter, and the Jenkins worker would install that Terraform version at runtime, or use a pre-built Docker image with the correct Terraform version.
Some were more difficult, like our internal security code reviews. We could create an automatic rule-based review system to parse Terraform code and evaluate whether it requires security review, based on our security standards. However, it would be yet another complex system to maintain, and at best it would only reduce, not eliminate, these code reviews.
Almost every problem with our system came with a large cost in person-hours, and unfortunately, nearly every proposed in-house solution only added to that cost through development and maintenance.
If we chose to Build, we would allocate months of engineering time towards inadequate solutions, and take on the huge opportunity cost of not having a good solution months earlier. This was unreasonable, since we have more important things to do, like fighting cancer.
So we decided to Buy.
Our Better Solution
We acquired a license to self-host Terraform Enterprise, and deployed it to serve as our central service for Terraform execution.
Terraform Enterprise fit our existing CI/CD patterns, and also opened up our infrastructure workflow options with support for direct Terraform execution.
With Terraform Enterprise, each project requires setup of a Workspace to apply infrastructure changes.
Instead of the S3 backend, engineers can use the remote backend to store encrypted state and enforce state locking in Terraform Enterprise. In addition, it offers the option to run Terraform commands from a local machine while executing them remotely in Terraform Enterprise. Just like any other backend, it’s configured in the Terraform settings block.
This was a decent improvement, but more importantly, Terraform Enterprise solved our critical issues with Terraform on Jenkins, in a few ways:
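The original snippet is not reproduced here either, so this is a minimal sketch of a remote backend block. The hostname, organization, and Workspace name are hypothetical placeholders for a self-hosted Terraform Enterprise instance.

```hcl
terraform {
  backend "remote" {
    # Hypothetical hostname of the self-hosted Terraform Enterprise instance.
    hostname     = "tfe.example.com"
    organization = "example-org"

    # The Workspace that stores this configuration's state and runs.
    workspaces {
      name = "my-project"
    }
  }
}
```

As with the S3 backend, terraform init connects the working directory to the backend. It assumes credentials for the hostname are already available to the Terraform CLI.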
Simple Terraform version management. Terraform Enterprise executes each Terraform run in a Docker container, so each Workspace can independently choose its Terraform version, and we no longer need to coordinate on a company-wide version.
Minimal setup for each Workspace. Configuration for a Terraform Enterprise Workspace is much less verbose than a Jenkins pipeline, and can live adjacent to the source Terraform configuration.
Cost Estimation for each plan and apply. Now we can better understand our Terraform infrastructure costs.
Sentinel (policy-as-code) for automated security checkpoints. Sentinel parses Terraform plans, state and configuration, enabling us to replace internal security code reviews with automated security checkpoints for each plan and apply.
Extra layers of security through SAML Single Sign-On and granular access controls. These are nice-to-haves that align our identity management with our identity provider (Okta) and control permissions for each Flatiron team’s Terraform configurations.
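The post does not show how Workspaces are declared, so the following is only one possible sketch, assuming the hashicorp/tfe Terraform provider is used to manage them as code. The hostname, organization, Workspace name, and version number are hypothetical.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

# Hypothetical hostname; the provider can read its API token from the
# TFE_TOKEN environment variable.
provider "tfe" {
  hostname = "tfe.example.com"
}

resource "tfe_workspace" "my_project" {
  name         = "my-project"
  organization = "example-org"

  # Each Workspace pins its own Terraform version, independent of others.
  terraform_version = "0.12.29"
}
```

Declaring the version per Workspace like this is what removes the need for a company-wide Terraform version, and a definition this short can sit next to the configuration it serves.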
Terraform Enterprise offers us a nice interface with great features, but more importantly it helped us overcome our infrastructure workflow challenges, and eliminated some of our major sources of engineering friction.
Most of all, it empowered us to quickly deliver MVPs to our early alpha and beta users — who’ve been yearning for a better Terraform experience — and release a generally available service to Flatiron shortly afterwards.
As Terraform Enterprise recently became Flatiron’s preferred engine for Terraform execution, we’re currently leading a company-wide effort to migrate Terraform workflows off Jenkins onto it.
We’ve also built internal tooling and documentation to assist with onboarding engineers to Terraform Enterprise, helping make our infrastructure SDLCs even smoother.
For our situation, the Build vs. Buy answer was clear — Terraform Enterprise vastly improved our infrastructure workflows by solving our critical issues and enabling us to meet our growing business and technology demands.