Box CMF: Infrastructure as Code, Then What?!?

Garth Booth
Published in Box Tech Blog
Aug 8, 2022

Co-Authors: Xaviea Bell, Matt Bowes, Raul Flores, Jared Newell, Quynh Tillman

Illustrated by Madeline Horwath/ Art Directed by Erin Ruvalcaba Grogan

Welcome back to our blog series on the Box Cloud Management Framework. In our previous blogs, we introduced our approach to delivering a Box Cloud Management Platform (CMP), walked through our Buy vs. Build analysis for Box CMP, and described our Multi-Cloud Identity and Access Methodology in Part 1, Part 2, and Part 3.

In this blog series, we will focus on one of the most critical aspects of this Framework: Infrastructure as Code (IaC). This initial blog describes the foundational elements of building IaC pipelines to support lifecycle management of cloud resources. Specifically, we will explore three key areas of the Box IaC pipeline Framework:

  • IaC Development Standards using Terraform and Terragrunt
  • IaC pipeline technology stack
  • IaC pipeline Development and Governance

One of the key questions you should ask your teams when building an IaC pipeline Framework is what comes next after you have established a base model for delivering IaC. The next two blogs in this series will help answer this question and describe the need to establish a focus on Shift Left Testing and DevOps Reporting in your IaC model.

Establishing IaC Development Standards, The Early Days

Infrastructure as Code is a well-known practice that brings Software Development Lifecycle (SDLC) practices to the management and provisioning of infrastructure. Just as with the SDLC, it is critical to establish standards for how to develop IaC. When we embarked on this effort to create an IaC Pipeline and Development Framework, we decided to define clear standards around how Box will design and develop IaC automation. One of the first decisions we made was to use Terraform and Terragrunt as the foundation for this standard. Most infrastructure developers are well aware of Terraform, but may not be familiar with Terragrunt, an open source tool released in 2016 by gruntwork.io to help solve two key problems in Terraform at the time: the lack of state locking and the inability to configure state as code. Both of those problems have long since been solved, but other problems emerged around keeping your code DRY (Don’t Repeat Yourself) and maintainable. We will not go into a deep dive on Terragrunt, but will provide some details on how we used it to significantly improve our IaC code structure.

Before going into how we developed our IaC standards, it’s important to share some of the ways Box was using Terraform prior to establishing our standards. Early on in our Terraform journey (circa 2015), Box developers used Terraform to automate a number of cloud infrastructure operations. One of the primary problems we discovered was around the practices of structuring and executing the Terraform code. Although this code was maintained in GitHub Enterprise (GHE) with pull request reviews to release code, that’s really where any structure or standards stopped. Some repositories had a basic GHE code hierarchy, but it was not applied consistently across all Terraform code. In addition, there were no IaC Pipelines to support secure execution of the Terraform code, so developers typically executed the operations from their laptops. This resulted in inconsistent practices around securing and managing the service accounts required to execute the Terraform code.

Another problem area was the way the actual Terraform code was being developed. One example that illustrates this problem was our initial effort to build the IaC to support management of GCP projects. We developed our initial Terraform to provision GCP projects and provide basic access to project owners. Each time a new project was needed, we had to copy tfvars and Terraform code to a new project-based folder to build out the environment. In addition, if we wanted to add another resource to every project we had previously created, we had to add Terraform resource code to all of those projects. Further, if we wanted to ensure ordering of resource creation, we had to use the Terraform ‘depends_on’ meta-argument. While this construct is not necessarily bad, using it excessively, which would have been required as we scaled our IaC code base, would lead to confusion about resource dependency mappings. Collectively, these practices can lead to large Terraform modules and potentially long deployment times.

In the following section, we’ll discuss how we systematically addressed these and other problems to help establish better practices around how we structure, develop, and execute IaC at Box.

Box IaC Standards and Best Practices

One of the early standards we defined was to use Terragrunt to help solve the code replication issue and provide clear separation of the Terraform input values from the Terraform modules themselves. This allowed us to adhere to DRY principles and easily extend resources across multiple projects with minimal inputs. We also leveraged Terragrunt to manage dependencies using the dependency block construct. This made it intuitive and simple to express module ordering and execution sequences.
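As a rough sketch of the dependency block in practice, the following terragrunt.hcl for a service account unit consumes the project ID produced by a separate project unit. The module source URL, version tag, relative path, and input names are illustrative assumptions, not Box's actual configuration.

```hcl
# Hypothetical terragrunt.hcl for a service account unit.

# Inherit shared settings (backend, common inputs) from a parent terragrunt.hcl.
include {
  path = find_in_parent_folders()
}

# Placeholder module location, pinned to an explicit tag.
terraform {
  source = "git::https://ghe.example.com/iac-modules/terraform-gcp-service-account.git?ref=v1.0.0"
}

# Declares that the project unit is applied first and exposes its outputs here.
dependency "project" {
  config_path = "../../project"

  # Mock values let `validate` and `plan` run before the project exists.
  mock_outputs = {
    project_id = "mock-project-id"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

# Values passed to the module as input variables (names are assumed).
inputs = {
  project_id = dependency.project.outputs.project_id
  account_id = "orchestrator-gsa"
}
```

Because ordering is derived from the dependency blocks, a command such as `terragrunt run-all plan` can work out the execution sequence without any depends_on wiring inside the modules themselves.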

The ability to cleanly separate Terraform module input and output variables enabled us to update one input file or change one Terraform module and make changes to multiple resources. As an example, in our current structure, GCP projects are the main object. Every resource and service in GCP (outside of a few exceptions such as billing) needs to be based in a GCP project. The inputs to create projects are all defined in a standard way, and the outputs of those project creations are project IDs. These project IDs can then be easily passed down as output variables to create custom roles, IAM bindings, service accounts and Google API enablement. The following example Terragrunt directory hierarchy illustrates these key points:

├── .terraform-version
├── box-development
│   ├── box-development-tenant-terragrunt
│   │   ├── box-dev-project-1
│   │   │   ├── project
│   │   │   │   └── terragrunt.hcl
│   │   │   ├── project.hcl
│   │   │   ├── rolesets
│   │   │   │   └── terraform
│   │   │   │       └── terragrunt.hcl
│   │   │   ├── service_account
│   │   │   │   └── orchestrator_gsa
│   │   │   │       └── terragrunt.hcl
│   │   │   └── vault_orchestrator
│   │   │       └── terragrunt.hcl
│   │   └── box-dev-project-2
│   │       ├── project
│   │       │   └── terragrunt.hcl
│   │       ├── project.hcl
│   │       ├── rolesets
│   │       │   └── terraform
│   │       │       └── terragrunt.hcl
│   │       ├── service_account
│   │       │   ├── orchestrator_gsa
│   │       │   │   └── terragrunt.hcl
│   │       │   └── wavefront-integration
│   │       │       └── terragrunt.hcl
│   │       └── vault_orchestrator
│   │           └── terragrunt.hcl
│   ├── common_vars.json
│   ├── dev_vault.hcl
│   ├── environment.hcl
│   └── terragrunt.hcl
├── box-production
├── box-staging
└── organization.hcl

The tree structure above shows a top-level input file (organization.hcl) that is applicable to all environments (box-production, box-staging, and box-development). In addition, there are several common files (common_vars.json, dev_vault.hcl, environment.hcl, and terragrunt.hcl, located directly under box-development) that apply to all GCP projects provisioned in the box-development environment. This model provides a simple way to apply common changes across all our environments or to a specific environment.
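One way such a layered layout can be wired together is sketched below as an environment-level terragrunt.hcl that every project unit includes. This is a minimal sketch under stated assumptions: it assumes organization.hcl, environment.hcl, and project.hcl each define a locals block, and the bucket, state project, and location values are placeholders rather than Box's real settings.

```hcl
# Hypothetical box-development/terragrunt.hcl, included by every project unit.

locals {
  # Each lookup walks up the folder tree from the including unit's directory.
  org     = read_terragrunt_config(find_in_parent_folders("organization.hcl"))
  env     = read_terragrunt_config(find_in_parent_folders("environment.hcl"))
  project = read_terragrunt_config(find_in_parent_folders("project.hcl"))
  common  = jsondecode(file(find_in_parent_folders("common_vars.json")))
}

# Generate the backend configuration once instead of repeating it per unit.
remote_state {
  backend = "gcs"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket   = "example-terraform-state" # placeholder bucket name
    prefix   = path_relative_to_include() # one state prefix per unit
    project  = "example-state-project"   # placeholder GCP project
    location = "us"                      # placeholder bucket location
  }
}

# Later arguments win, so project values override environment and org values.
inputs = merge(
  local.org.locals,
  local.env.locals,
  local.project.locals,
  local.common,
)
```

With this shape, a change to organization.hcl reaches every environment, while a change to environment.hcl or common_vars.json only reaches the units beneath that environment folder.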

Another important standard we focused on was defining a secure execution environment in which all IaC would be required to execute. This meant that developers would no longer be able to use their laptops to deploy critical infrastructure to our cloud environments. It also had the added benefit of introducing more controls to secure and manage the service accounts required to execute our Terraform code. We go into more detail on these capabilities in the IaC Pipeline Technology Stack section below.

Defining and documenting the best practices for developing IaC code was another key focus of the standardization process. We defined the following best practices:

  • Develop custom Terraform modules for reusability and locate them in a central IaC GHE Organization. Not all custom Terraform modules will be widely reusable, but it is still a best practice to develop with reusability in mind. In addition, locating those modules in a centralized GHE organization makes them easy to discover and leverage as new IaC developers build new infrastructure deployments.
  • Maintain proper version control and version numbering for custom Terraform modules. External Terraform resources can change without a user’s knowledge, so it’s best to test against a specific version and hard-code that version into the configuration or Terragrunt code. We review the released versions quarterly and update for added security or functionality. Always include specific version numbers for Terraform modules (both provider and custom) in all module references. This prevents automatic module updates, which will often break infrastructure deployments!
  • Define Naming Standards for the following resources: Folders, Projects, General Service Accounts, Terraform-specific Service Accounts, and Terraform State Buckets. For example, in terms of Folder naming conventions, Terragrunt can be used to separate the different resources that are created based on their location in a folder hierarchy. The folder structure allows you to take advantage of the resource hierarchy in terms of properties (e.g. a service account cannot exist without a project). For this reason, it’s best to come up with documented naming standards for resources. This gives users a general idea of what a resource is used for or where it’s located. It also enables a flexible and robust folder hierarchy for deploying Terragrunt code. With this naming scheme and folder hierarchy, shared variables can be set at the highest point and used in code for more specific resources, conforming to the DRY principle.
  • Define a standard GHE IaC Hierarchy. Terragrunt can remove the tedious repetition of Terraform backend and provider configuration. With a proper GHE IaC hierarchy, we are able to create separate configs for different cloud providers, different steps in the software development lifecycle, and different projects/accounts. All of these are important for building an IaC service that scales to hundreds or thousands of different cloud resources.
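The version pinning practice above might look like the following in a Terraform configuration. This is an illustrative sketch only: the version numbers, the module URL, and the project_id input are placeholders chosen for the example, not versions or modules that Box uses or recommends.

```hcl
terraform {
  # Pin the Terraform CLI and the provider to exact, tested versions.
  required_version = "= 1.2.6" # placeholder version

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "= 4.31.0" # placeholder version
    }
  }
}

variable "project_id" {
  type        = string
  description = "ID of the GCP project that receives the custom roles."
}

# Pin the custom module to a tag with ?ref= so it never floats to the
# default branch. The repository URL and tag are hypothetical.
module "rolesets" {
  source     = "git::https://ghe.example.com/iac-modules/terraform-gcp-rolesets.git?ref=v2.1.0"
  project_id = var.project_id
}
```

Exact pins trade convenience for predictability: nothing changes until someone edits the version in a pull request, which fits a scheduled review-and-update cadence.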

Core Principles of Box IaC

Defining our IaC Development Standards and best practices first gave us a set of guidelines from which we could set clear requirements for all of our IaC pipelines. In addition, we defined a set of core principles that ensure we maintain our development standards and also drive future enhancements to our IaC pipelines:

  • Use Terraform/Terragrunt as our primary IaC development languages.
    – All IaC is automatically executed using Terraform/Terragrunt “plan” and “apply”
    – All IaC will run in a Secure Execution Environment developed for executing terraform and Terragrunt plan and apply
  • Centralized Governance and automated policies (see our next blog, Box CMF: Shift Left Testing, Infrastructure as Code for details on specific policies)
    – IaC Development Standards are enforced as part of GHE pull request reviews
    – IaC policies to enforce key parts of the standards (see the next blog on “Shift Left Testing” for details)
    – Auditability to ensure proper governance of pipeline events.
  • Follow the Principle of Least Privilege access
    – Restrict actions to only those permitted by a provided Service Account (No “Super Admins”, even for the pipeline)
    – Self-hosted, keeping credentials on-site and in secure secret-management infrastructure
  • Ensure Ease-of-Use
    – Provide a familiar interface for developers to execute and see output from Terraform/Terragrunt plan/apply.
    – Historical data for deployments, tests, etc.
    – Hooks for Box’s custom business logic (customizability)
  • Start with Shift Left Testing
    – Native support for best-in-class test frameworks
    – Ability to execute developer-defined tests for terraform modules, end-to-end tests and more

Defining the IaC Pipeline Technology Stack

Atlantis

The heart of our IaC pipeline tech stack is Atlantis. Atlantis meets many of our IaC standards requirements out of the box, providing a familiar interface for developers (GitHub Pull Requests) and a secure execution environment for developers deploying Terraform/Terragrunt code using plan and apply commands. Atlantis also has native support for Open Policy Agent and Conftest, which are key technologies in our focus on “Shift Left Testing”.

Pylantis

While Atlantis had a lot of functionality straight out of the box, we still had a number of internal systems that required custom support. Pylantis is a custom-written Box library that provides the bridge between open-source Atlantis and Box’s internal systems. Specifically, we needed to support a token-minting process that met Box’s needs, and we needed integrations with internal APIs such as our Global Technical Operations Centers (GTOC) Change-Freeze API and our DevOps Reporting Frameworks.

Box operates on the Principle of Least Privilege, granting only the required permissions to each service account individually, and this policy extends into the IaC Pipeline. This meant that we would need individual service accounts for all Terraform Repositories and cloud projects, and each account would only be permitted a subset of the available actions. While there is significant overhead in managing these accounts, there are significant security benefits as well. One pipeline-specific benefit of this approach is that the pipeline itself does not need to worry about which actions are authorized per repo, and can instead rely on the underlying permissions of the given service account.
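To make the per-repository service account idea concrete, here is a minimal Terraform sketch that creates one pipeline account for a project and grants it only an explicit list of roles. The account name, variable names, and example role are assumptions made for illustration and do not describe Box's actual accounts or role sets.

```hcl
variable "project_id" {
  type        = string
  description = "GCP project managed by a single Terraform repository."
}

variable "pipeline_roles" {
  type        = list(string)
  description = "Only the roles this repository's pipeline account needs."
  default     = ["roles/storage.admin"] # example role, not a recommendation
}

# One dedicated account per repository/project (the name is hypothetical).
resource "google_service_account" "pipeline" {
  project      = var.project_id
  account_id   = "tf-example-service"
  display_name = "Terraform pipeline account for example-service"
}

# Bind each listed role individually; nothing broader is granted.
resource "google_project_iam_member" "pipeline_roles" {
  for_each = toset(var.pipeline_roles)

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.pipeline.email}"
}
```

If a plan attempts an action outside those roles, the cloud provider rejects it, so the pipeline needs no separate authorization table per repository.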

Policy Enforcement

We will cover this topic more in the next blog in the series “Shift Left Testing”, but it’s worth mentioning briefly here.

Using software testing practices as a guide, we knew roughly what testing capabilities our IaC pipeline would need to support. During our development and early deployments, Atlantis didn’t support any testing frameworks (although it now supports Conftest), so we had to build our own testing capabilities using Atlantis’ custom workflows. With them, we developed test capabilities for:

  • Static Analysis: Lint and policy enforcement
  • Component Tests for custom Terraform Modules
  • End to End tests

We’ll discuss each of these in more detail in the IaC “Shift Left Testing” blog.

Distributed Terraform Repositories (dedicated GHE IAC and IAC-Modules Organizations)

Previous deployment pipelines at Box taught us about the scalability issues of mono-repo deployment designs. We knew early on that we wanted every service to have its own Terraform Repository. These individual repositories enable granular access control and reduce the number of merge conflicts and queues as engineers manage infrastructure through our deployment pipelines. Additionally, the blast radius is reduced to just the infrastructure defined in the individual Terraform repository, and ownership is built in by design. Patterns are important to engineers, and individual repositories allow different teams to manage their code in their own preferred way. We do enforce certain global standards that the pipeline expects across all repositories, but some team-level preference in code organization is permitted.

Distributed Terraform Repositories also introduced some challenges for us along the way. Our source control management tool, GitHub Enterprise, supports webhooks and “Checks” that are configured per repo. Blocking checks are enabled on individual branches within a repo (like “Main”). If you want to enforce policy across dozens or hundreds of IaC repos, those GitHub Enterprise “Checks” must be configured on every IaC Repository individually. At first we solved this with repo-creation automation, but as time went on and new checks were introduced, we needed a better way of updating existing IaC Repositories so that they all remain identical for governance. It is important to note that this challenge is not fully solved, and work continues to optimize policy enforcement across our IaC repositories.

Service Catalog (Box’s internal Service Catalog)

Service Catalog (ServiceCat) is a custom-built catalog that we use at Box to define every Service and its Owner; every Service must have an Owner defined. A Service can’t be deployed at Box unless it has a well-defined entry in ServiceCat. ServiceCat is central and critical to Operations, including Infrastructure and Service Deployments. It is the source of truth not only for Services and their Owners, but also for all of the metadata about each service, such as:

  • Globally Unique “Service Integration Name” or Identifier
  • Owner
    – JIRA Project
    – Email
    – Manager/Director (Integrated with Workday)
    – Slack Channel
    – Slack Alerts Channel
    – LDAP Group for Permissions
    – PagerDuty Escalation Policy (Oncall Information)
  • Service Criticality
  • Service Class
  • Data Classification
  • Lifecycle of the Service
  • Service Dependencies
  • Related Links:
    – Documentation (Runbooks, Architecture, etc)
    – Deployment Pipeline Job (Typically Jenkins)
    – Metrics Dashboard
    – Logs Dashboard
    – GHE Repo (Code for the Service Itself)
    – GHE Repo (IAC Terraform Repo)

Within the context of our IAC Pipeline, every Service in ServiceCat that is deployed to our cloud environments has a link to the GHE Repo which contains the Terraform Code responsible for deploying infrastructure for that service. In some cases, multiple Services point to the same IAC Terraform Repo if the services get deployed to the same project.

Cloud Compute Lifecycle Management and IAC Drift Detection

There will be an upcoming blog entirely on the tool we use for Cloud Compute Lifecycle Management and Terraform Drift Detection, but it’s worth mentioning here. This tool is custom-built and is used for managing day-to-day activities and information about IaC Repos / Projects. It has capabilities for visualizing Terraform Code, and also for performing Cloud Compute (e.g. Google Compute Engine) deployments. In the context of this blog, though, its primary feature is IaC Drift Detection.

While we have policies and procedures in place that expect all Cloud Assets to be managed via IAC code and deployed through our pipeline, exceptions do come up. We needed a way to discover when a project had drifted from its declared state for a few reasons:

  • From a security perspective, if an unexpected change happened in our Production environment, we needed to know about it and alert on it.
  • From an IAC Pipeline Owner perspective, we had an interest in understanding why Service Owners might need to deploy or modify a cloud asset outside of our standard IAC Pipeline. If the pipeline didn’t have a needed feature, we would want to work on adding it. If Service Owners were running into common problems, we needed to know to resolve them. These sorts of things do come up when working with a cloud provider like GCP which has features that Terraform might not support yet, and we needed to keep an eye on them.

The other features of this tool will be discussed in detail in future blogs on Automated Stage Deployments and Shift Left Testing.

Conclusion

Using IaC can provide significant benefits to your cloud operations, but it is critical that you establish some clear standards and frameworks on how to actually structure, develop, and execute IaC. As we shared earlier in this blog, failure to put these standards in place will lead to a number of problems that will make it difficult to deploy and manage cloud infrastructure in a secure and operationally efficient manner. Our journey is not finished and we had many “false starts” in a number of areas, including technology choices, processes, and framework definitions. These were expected and have only helped to improve our overall approach to IaC.

We hope you enjoyed this blog and learned from some of our lessons and improvements for how to develop a baseline IaC Framework. So, what’s next?!? Well, as mentioned earlier, the initial IaC Framework is only the beginning and we look forward to you discovering our efforts around “Shift Left Testing” and “DevOps Reporting”.

Interested in learning more about Box? We are hiring. Check out our careers page!
