Box CMF: DevOps DORA, Infrastructure as Code
Co-Authors: Xaviea Bell, Matt Bowes, Raul Flores, Jared Newell, Quynh Tillman
This is the last blog in our series on Box Infrastructure as Code (IaC). If you have not yet read the first two blogs in this series, Box CMF: Infrastructure as Code, then What?! and Box CMF: Shift Left Testing, Infrastructure as Code, we’d encourage you to do so. Together, all three blogs provide a detailed perspective on how Box thinks about Infrastructure as Code.
To gain deep insight into how our overall IaC model is performing, we need to measure the processes and technology stack that drive delivery of cloud infrastructure at Box. This lets us focus on actionable changes that improve both the velocity of managing infrastructure and the operational efficiency of our overall IaC approach. In this blog, we will dive into how we leveraged DevOps DORA metrics to define and instrument various components of Box IaC.
DevOps DORA applied to IaC
As part of our overall technical strategy discussions, we wanted a data-driven way to measure how well we were delivering on our framework for developing IaC at Box. One of the key methodologies we leveraged was DevOps DORA metrics. As stated in the most recent Google State of DevOps 2021 report, DORA metrics allow you to measure both the velocity and stability of your software delivery.
Our objective in applying these metrics to our IaC framework is to improve the processes, quality, and technology decisions behind how we build, test, and deploy infrastructure at Box. Our approach was to define an initial set of use cases, specific metric types, and a set of high-level dashboards (one for each use case). This let us be very specific about exactly where we needed to instrument the IaC framework to collect the right data for our metrics reporting.
Metrics and Stakeholders
At Box, we categorized our metrics by three key stakeholder groups. Although DORA metrics are the primary focus, we also added metrics to track other aspects of our IaC framework.
- Management Facing Metrics (DORA focused)
– “Lead Time For Changes”
– “Deployment Frequency”
– “Change Failure Rate”
– “Time to Restore Service”
- Service-Owner Facing Metrics
– Time to push changes through the pipeline
– Number of Test Case errors Per Push
- Platform-Owner Facing Metrics
– Plan Run Times
– Apply Run Times
– Deployment Times
– Total Run Times
– Memory Usage
– CPU Usage
We will define these high-level metrics in more detail throughout this blog.
The primary use cases we decided to focus on are: General IaC Pipeline metrics, IaC Validation Metrics, and IaC Drift Detection Metrics.
IaC Pipeline Metrics
The IaC Pipeline is the heart of how we deploy all cloud infrastructure at Box. Therefore, it is critical that we continuously measure how the various IaC Pipelines are performing. As described in the first blog in this series, Terraform, Terragrunt, Atlantis, and Pylantis (a custom Box service) are the primary technologies that drive all of our IaC Pipelines. Our goal was to focus on both velocity and stability DORA metrics. However, one challenge with DORA stability metrics is linking failed deployments to the actual IaC changes that caused them, which adds complexity to instrumenting the various IaC delivery systems so these metrics can be correlated. To show early value from this effort, we decided to limit the immediate focus to velocity DORA metrics.
The velocity metrics cover the following areas: Terraform/Terragrunt Plan time, Terraform/Terragrunt Apply time, and GitHub enterprise pull request (PR) and merge times. Together, these metrics provide a good measure of the lead time for changes and deployment frequency for IaC Pipelines. We measured these results per IaC Pipeline as well as an overall IaC average.
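As a simple illustration of how plan and apply durations can be captured, the sketch below wraps a CLI invocation in a timer. This is not our exact instrumentation; the metric name and the stand-in command are illustrative assumptions.

```python
import subprocess
import sys
import time

def timed_run(cmd):
    """Run a CLI command and return (elapsed_seconds, return_code).

    In a real pipeline, cmd would be something like
    ["terragrunt", "plan"]; here we only demonstrate the timing wrapper.
    """
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True)
    return time.monotonic() - start, result.returncode

# Harmless stand-in command instead of a real terragrunt run.
elapsed, rc = timed_run([sys.executable, "-c", "pass"])
print(f"iac.plan.duration_seconds={elapsed:.3f} rc={rc}")
```

A wrapper like this can emit one duration sample per plan or apply, which the dashboards then aggregate per pipeline and overall.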
In addition to the above metrics, we will also incorporate IaC Validations into the overall velocity metrics. The time it takes to complete validation tests is a key metric to ensure we are not significantly impeding the ability to release IaC changes in a timely manner. In the sections below, we’ll go into more detail on the specifics of how we instrumented our Pipelines as well as some example Dashboards with these metrics.
IaC Validation Metrics
In our previous blog, Box CMF: Shift Left Testing, Infrastructure as Code, we described the importance of focusing on validation of IaC. We created static, component, and end-to-end validation frameworks to ensure we are deploying high-quality, compliant infrastructure into our cloud environments. The static (except OPA), component, and end-to-end validation frameworks all run in Jenkins jobs. This allowed us to focus on measuring total Jenkins job run time as a way to collect metrics that determine the impact of these validation tests on the velocity of our IaC Pipelines.
As of this blog, we have not yet started measuring the impact of OPA based static analysis tests. These are executed in Conftest, so we will need to extract the metrics from that part of our IaC validation framework to include them in our measurements for velocity. The end-to-end validation framework does not impact velocity metrics as it is executed post deployment. However, we will still collect these execution time metrics to understand how the framework is performing and whether there are any scaling or functional issues.
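To give a flavor of the aggregation involved, the sketch below totals build durations in the shape of data a Jenkins job's JSON API returns, where each build carries a `duration` in milliseconds and a `result` string. The sample build records are made up for illustration.

```python
def total_validation_seconds(builds):
    """Sum durations (ms -> s) of successful Jenkins builds.

    `builds` mirrors the `builds` array from a Jenkins job's
    /api/json endpoint, where each entry has `duration`
    (milliseconds) and `result` ("SUCCESS", "FAILURE", ...).
    """
    return sum(b["duration"] for b in builds if b["result"] == "SUCCESS") / 1000.0

# Illustrative sample data, not real build records.
sample_builds = [
    {"number": 101, "duration": 42000, "result": "SUCCESS"},
    {"number": 102, "duration": 18000, "result": "FAILURE"},
    {"number": 103, "duration": 30000, "result": "SUCCESS"},
]
print(total_validation_seconds(sample_builds))  # 72.0
```

Tracking this total over time shows how much validation adds to each pipeline run as new test cases land.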
IaC Drift Detection Metrics
IaC Drift is encountered when there are changes to the cloud infrastructure, post deployment, that did not initiate from the corresponding IaC Pipeline. This poses a potential security and compliance risk as our cloud infrastructure is not consistent with the state that is represented in the source of truth IaC Pipeline git repository. Box implemented a special service that monitors and detects when there is drift in our environment. This allows us to send alerts to security and the IaC Pipeline owners to remediate the drift.
Although the metrics we ultimately measure for IaC Drift are not necessarily DORA-specific, we need to measure various aspects of drift to understand whether there are any systemic issues in our environment. This area is still very much under investigation and will be included in future updates to our metrics collection process.
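One common way to surface drift (not necessarily the exact mechanism of the Box drift service) is to re-run `terraform plan -detailed-exitcode` against unchanged source, where exit code 2 means the plan found a diff. A minimal sketch of interpreting that exit code:

```python
def classify_plan_exit(code):
    """Interpret `terraform plan -detailed-exitcode` results.

    0 -> no changes (no drift), 2 -> diff present (possible drift),
    anything else -> the plan itself failed.
    """
    if code == 0:
        return "clean"
    if code == 2:
        return "drift"
    return "error"

print(classify_plan_exit(2))  # drift
```

A monitoring service can run this classification per pipeline on a schedule and alert security and the pipeline owners whenever "drift" is returned.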
IaC Metrics Implementation and Dashboards
IaC Pipeline Dashboards
The following image depicts the various points in our IaC Framework where we focused on instrumentation:
The following IaC flow durations are currently being measured:
- Time-to-plan: time from GitHub Enterprise Pull Request (PR) Opened to first Atlantis Terragrunt Plan done
- Time-to-approval: time from first Atlantis Terragrunt Plan to PR Approved. This reflects all pre-PR-approval planning attempts and time to get PR approval
- Time-to-first-deploy: time from PR Approved to first Atlantis Apply done. This reflects time for the PR requester to decide to start an Atlantis Apply (deployment)
- Time-to-last-deploy: time from first Atlantis Apply to last Atlantis Apply. This reflects time taken for the entire deployment in the PR
- Time-to-merge: time from last Atlantis Apply to PR Merged. This reflects time for the PR requester to decide to merge the PR after the deployment is done
- Overall time: time from start of Time-to-plan to end of Time-to-merge
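The durations above can be derived from event timestamps recorded along the PR lifecycle. A minimal sketch, where the event names and timestamps are illustrative rather than our actual schema:

```python
from datetime import datetime

# Hypothetical event timestamps for one PR, in lifecycle order.
events = {
    "pr_opened":   datetime(2022, 5, 1, 9, 0),
    "first_plan":  datetime(2022, 5, 1, 9, 5),
    "pr_approved": datetime(2022, 5, 1, 10, 0),
    "first_apply": datetime(2022, 5, 1, 10, 10),
    "last_apply":  datetime(2022, 5, 1, 10, 30),
    "pr_merged":   datetime(2022, 5, 1, 10, 35),
}

def minutes_between(a, b):
    return (events[b] - events[a]).total_seconds() / 60

durations = {
    "time_to_plan":         minutes_between("pr_opened", "first_plan"),
    "time_to_approval":     minutes_between("first_plan", "pr_approved"),
    "time_to_first_deploy": minutes_between("pr_approved", "first_apply"),
    "time_to_last_deploy":  minutes_between("first_apply", "last_apply"),
    "time_to_merge":        minutes_between("last_apply", "pr_merged"),
    "overall":              minutes_between("pr_opened", "pr_merged"),
}
print(durations)
```

Computing these per PR, then averaging per pipeline and across all pipelines, yields the lead-time view the dashboards report.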
IaC Validation Dashboards
The following diagram illustrates the high level process we use for exporting metrics produced by the static, component, and end-to-end test cases. The Docker image that runs the actual tests will forward defined metrics to our Dashboard Tool (via tool APIs) for general consumption by key stakeholders.
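As a sketch of the forwarding step, the test container could serialize each metric into a JSON payload and POST it to the dashboard tool's ingest API. The endpoint, metric name, and field names below are assumptions for illustration, not the real API.

```python
import json
import time

def build_metric_payload(name, value, tags):
    """Package a single test-framework metric for a dashboard ingest API."""
    return {
        "metric": name,
        "value": value,
        "timestamp": int(time.time()),
        "tags": tags,
    }

payload = build_metric_payload(
    "iac.validation.duration_seconds", 72.0,
    {"category": "component", "team": "cloud-infra"},
)
body = json.dumps(payload).encode()

# In the real container this body would be POSTed to the dashboard
# tool's API endpoint (e.g. via urllib.request with a JSON
# Content-Type header); we only build the payload here.
print(body.decode())
```

Tagging each sample with category and team is what lets the dashboards slice metrics per stakeholder group.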
The following IaC Validation metrics are currently being considered. This list is not final and will evolve as we continue to define and create new IaC test cases:
- Percentage of Test Cases completed by category, by team: this will serve as a general metric for determining the percentage of test cases implemented, based on an overall goal. Unlike software, infrastructure does not have code coverage (per se). However, we can pre-determine the types of tests we need to implement and track the percentage of tests actually completed over time.
- Time to complete Test Cases by category (Static, Component, E2E): this will provide us an indication of how our test frameworks are scaling as we add new IaC pipelines and associated test cases
- Time to complete all pre-deployment Static and Component Test Cases: this metric provides an aggregate view of the impact of our overall test framework on deployment velocity. It’s important to note that we have excluded E2E test time from this metric (as those tests run post-deployment).
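The completion-percentage metric above reduces to a simple ratio of implemented test cases against a planned target. A sketch with made-up counts:

```python
def completion_pct(implemented, planned):
    """Percentage of planned test cases actually implemented."""
    return 0.0 if planned == 0 else 100.0 * implemented / planned

# Illustrative (implemented, planned) counts per test category.
by_category = {
    "static":    (18, 20),
    "component": (9, 15),
    "e2e":       (4, 10),
}
for category, (done, planned) in by_category.items():
    print(f"{category}: {completion_pct(done, planned):.0f}%")
```

The same calculation applies per team by keying the counts on team instead of (or in addition to) category.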
IaC Drift Detection Dashboards
The following IaC Drift Detection metrics are currently being considered. This list is not final and will evolve as we continue to evaluate the stability of our infrastructure (post deployment):
- Number of IaC drifts, by pipeline, detected over time (i.e., running-infrastructure deltas from GHE): this metric is important as it provides a sense of how stable our infrastructure is, free of unexpected post-deployment modifications that bypass our IaC pipelines.
- Ratio of Drifted resources over Total resources: this metric is similar to the drift detected over time, but provides a roll-up of all IaC pipeline drift.
- Time to complete drift detection scanning: this will provide us an indication of how our test framework is scaling as we add new IaC pipelines and associated test cases
- Percentage of MIGs not running the latest images: this will provide a view of the percentage of our compute fleet that is out of compliance (with regards to maintaining updated CVE updates).
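The drift-ratio roll-up described above is a straightforward aggregation across pipelines. A sketch with illustrative per-pipeline counts of (drifted, total) resources:

```python
def drift_ratio(drifted, total):
    """Ratio of drifted resources to total managed resources."""
    return 0.0 if total == 0 else drifted / total

# Hypothetical per-pipeline (drifted, total) resource counts.
pipelines = {"network": (2, 120), "compute": (5, 300), "storage": (0, 80)}
drifted = sum(d for d, _ in pipelines.values())
total = sum(t for _, t in pipelines.values())
print(f"{drift_ratio(drifted, total):.2%}")  # 1.40%
```

A rising ratio here would point to a systemic bypass of the IaC pipelines rather than an isolated incident in one pipeline.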
Focus on Actionable Results
It is important to emphasize that we do not collect these metrics to create competition or to penalize teams whose metrics differ from other teams’. The overall goal is to find opportunities to improve the processes, quality, and technology decisions of our IaC pipelines. Otherwise, we risk creating an environment where “gaming” the metrics becomes the norm, defeating the entire purpose of collecting them. Therefore, caution and continuous positive reinforcement are needed to ensure engineering teams and managers do not view the metrics as a measure of success or failure by any specific team.
This concludes our blog series on Box Infrastructure as Code. We hope you learned some useful information that will help you and your teams define or refine your approach to Infrastructure as Code.
Interested in learning more about Box? We are hiring. Check out our careers page!