Testing Infrastructure as Code

Naveen
The Mindbody Dev Report
4 min read · Oct 4, 2019

“How do I test infrastructure that was built using code?” That was the question we had when we started building our Kubernetes platform.

Our goal was: “Continuously deliver infrastructure (Kubernetes, Istio, DNS, Prometheus) almost every day, with a high level of confidence.”

To achieve that goal, we had to write automated tests to ensure that the deployed state matched the state we wanted.

Our platform is built on Cloud Native technologies such as Kubernetes (k8s), Prometheus, and other similar projects.

Because k8s and Prometheus have first-class support for `go`, we use `go` for testing our infrastructure. For deploying the infrastructure we use Pulumi, a platform for building cloud infrastructure using modern programming languages, and we write that infrastructure code in TypeScript.

Delivery of IaC and Testing

The steps involved in our delivery pipeline are:

  • Pulumi preview — The pulumi preview step gives us insight into the changes a deployment would make (the desired state diffed against the existing state), which gives us confidence in the changes we are about to deploy.
  • Approval — Approvers within the team review the pulumi preview output and then approve the changes. This is a manual step, there to make sure that no unexpected changes reach the environment.
  • Pulumi up — After the manual approval step, pulumi up runs against the environment.

`pulumi up` creates or updates resources in a stack. The new desired goal state for the target stack is computed by running the current Pulumi program and observing all resource allocations to produce a resource graph. This goal state is then compared against the existing state to determine what create, read, update, and/or delete operations must take place to achieve the desired goal state, in the most minimally disruptive way. This command records a full transactional snapshot of the stack’s new state afterward so that the stack may be updated incrementally again later on.

  • Run integration tests — Our tests are integration tests that execute after the infrastructure has been deployed and validate that its actual state matches the expected state. For example, an Istio CVE required the Envoy proxies in every namespace to be updated to a patched version; without repeatable automated tests, it would have been hard to ensure compliance. Another example is verifying High Availability for Prometheus and Prometheus Alertmanager, because our monitoring depends on them.
  • Slack notifications — A Slack notification is sent out when a run succeeds or fails.

We deliver code to multiple environments: alpha (an integration environment for the platform engineering team; if there is a deployment or test failure, we would likely catch it in alpha first, where it doesn’t affect our customers), development, staging, and production. As with most projects, there are differences in the features and versions we deliver across these environments, so we had to make sure the tests were independent for each environment.

Code layout structure

Here is an example of the tree structure of our tests code folder. We built the tests with Go modules.

tree
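
The folder structure appeared as a screenshot in the original post; a representative layout, with hypothetical file names and assuming the main go.mod sits one level above tests (consistent with the replace path shown later), might look like:

```
.
├── go.mod
└── tests
    ├── common
    │   ├── go.mod
    │   └── k8s.go
    ├── alpha
    │   └── nodepool_test.go
    ├── dev
    │   └── nodepool_test.go
    └── stage
        └── nodepool_test.go
```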

Go Modules

The common package within tests holds the code shared across every environment. The alpha, dev, and stage folders contain the environment-specific tests. The common folder is a local Go module that the environment-specific tests reference.

Here is an example of the common package go.mod

common go.mod
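
The go.mod itself appeared as an image; judging from the module path in the replace directive quoted below, it likely looked something like this (the Go version is an assumption):

```
module mindbody.com/aws-arcus/kitchensink/common

go 1.13
```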

We then reference the above module from the main go.mod as a module outside of VCS on the local filesystem, as described in https://github.com/golang/go/wiki/Modules#can-i-work-entirely-outside-of-vcs-on-my-local-filesystem

Here is an example of the main go.mod, which is in the root tests folder.

main go.mod
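
Reconstructing from the text, a minimal main go.mod with the local replace might look like the following (the module name and the pseudo-version are placeholders; the replace line is quoted from the post):

```
module mindbody.com/aws-arcus/kitchensink

go 1.13

require mindbody.com/aws-arcus/kitchensink/common v0.0.0

replace mindbody.com/aws-arcus/kitchensink/common => ./tests/common
```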

The magic is really the `replace mindbody.com/aws-arcus/kitchensink/common => ./tests/common` line. The replace directive makes the go tool resolve the module from the local folder instead of fetching it from the module URL.

Tests

We try to write tests for most of the features that we would otherwise end up validating manually as part of a pull request.

Here are some of the tests.

k8s

init code
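
The init code appeared as an image; here is a minimal sketch of what the text below describes, using client-go (the package layout, the KUBECONFIG environment variable, and the exact signatures are assumptions):

```go
package common

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

var clientset *kubernetes.Clientset

// init builds a Kubernetes clientset from the kubeconfig at the path
// given in the KUBECONFIG environment variable (an assumed convention).
func init() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset, err = kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
}

// CheckNodePool reports whether the node pool selected by the given label
// contains the expected number of nodes.
func CheckNodePool(label string, count int) (bool, error) {
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(),
		metav1.ListOptions{LabelSelector: label})
	if err != nil {
		return false, err
	}
	return len(nodes.Items) == count, nil
}
```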

In the above code, the init function loads the kubeconfig from a path, and the CheckNodePool function looks up a node pool by label and checks its node count.

Here are the tests for our alpha environment, which validate the node pool count using the common package code.

test code for alpha
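
The alpha test was also shown as an image; a sketch of such a test, with a hypothetical label selector and node count, could look like:

```go
package alpha

import (
	"testing"

	"mindbody.com/aws-arcus/kitchensink/common"
)

// TestNodePoolCount verifies that the alpha node pool has the expected
// number of nodes. The label selector and count are illustrative values.
func TestNodePoolCount(t *testing.T) {
	ok, err := common.CheckNodePool("agentpool=default", 3)
	if err != nil {
		t.Fatalf("failed to check node pool: %v", err)
	}
	if !ok {
		t.Error("node pool does not have the expected node count")
	}
}
```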

Prometheus

Prometheus has its own challenges in testing: there is a Go API, but the Prometheus API does not support authentication/authorization. The only option was to port-forward to it dynamically, so we used the https://github.com/justinbarrick/go-k8s-portforward package to do that.

prometheus tests
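
The Prometheus tests were shown as an image; the sketch below assumes the port-forwarder API from that package’s README (NewPortForwarder, Start, Stop, ListenPort) and the official Prometheus Go client, with an illustrative namespace, label selector, and HA query:

```go
package alpha

import (
	"context"
	"fmt"
	"testing"
	"time"

	portforward "github.com/justinbarrick/go-k8s-portforward"
	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestPrometheusHA port-forwards to a Prometheus pod and checks that at
// least two replicas are up, as a High Availability smoke test.
func TestPrometheusHA(t *testing.T) {
	pf, err := portforward.NewPortForwarder("monitoring", metav1.LabelSelector{
		MatchLabels: map[string]string{"app": "prometheus"},
	}, 9090)
	if err != nil {
		t.Fatalf("failed to create port forwarder: %v", err)
	}
	if err := pf.Start(); err != nil {
		t.Fatalf("failed to start port forwarding: %v", err)
	}
	defer pf.Stop()

	client, err := api.NewClient(api.Config{
		Address: fmt.Sprintf("http://localhost:%d", pf.ListenPort),
	})
	if err != nil {
		t.Fatalf("failed to create Prometheus client: %v", err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, _, err := promv1.NewAPI(client).Query(ctx,
		`count(up{job="prometheus"} == 1)`, time.Now())
	if err != nil {
		t.Fatalf("query failed: %v", err)
	}
	vector, ok := result.(model.Vector)
	if !ok || len(vector) == 0 {
		t.Fatal("unexpected empty query result")
	}
	if vector[0].Value < 2 {
		t.Errorf("expected at least 2 Prometheus replicas up, got %v", vector[0].Value)
	}
}
```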

Istio

Istio had its own challenges. We deployed the patch for the Istio CVE (https://istio.io/blog/2019/announcing-1.1.2/), and we had to make sure that the Envoy proxy was updated in every namespace that had Istio injected, not just the istio-system namespace.
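
A check along those lines, sketched with client-go (the sidecar container name, the image-tag comparison, and the expected version are assumptions, not the post’s actual code), might look like:

```go
package common

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// CheckEnvoyVersion walks every pod in every namespace and verifies that
// any istio-proxy sidecar runs an image with the expected (patched) tag.
func CheckEnvoyVersion(clientset *kubernetes.Clientset, expectedTag string) error {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			if c.Name == "istio-proxy" && !strings.HasSuffix(c.Image, ":"+expectedTag) {
				return fmt.Errorf("pod %s/%s runs %s, expected proxy tag %s",
					pod.Namespace, pod.Name, c.Image, expectedTag)
			}
		}
	}
	return nil
}
```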

What we couldn’t test

There were times when we couldn’t validate with tests because there wasn’t an API to verify against. One of the things we couldn’t test: we had to change the kubelet by setting `--protect-kernel-defaults=true` for kube-bench. There wasn’t an easy way to test this change with a direct integration test because there was no API to query, so we decided to rely on the kube-bench results as validation for this fix.

What did we learn

We learned that testing some of these systems doesn’t come with out-of-the-box solutions. Building the tests also helped us understand these systems better and gave us higher confidence to deploy changes often.

The next steps in the process are to address unit testing and anything else that would potentially give us the confidence to remove the manual gates.
