Test-Driven Development for Infrastructure

Rosemary Wang

12 min read · May 2, 2019

I’ve been answering the same question a lot lately, more specifically:

How do I do test-driven development (TDD) for infrastructure? It’s impossible!

My answer is usually:

Great question! It’s not perfect but there are some TDD techniques we can adapt for infrastructure. Here’s how.

I finally decided that my rather lengthy response and explanation should go in one place. Here, I’ll cover:

What is test-driven development (TDD)?
How can it help with software development?
How can it help with infrastructure?
What’s an example for infrastructure?

In the example, I’ll walk through the TDD of a complicated AWS S3 bucket.

What is Test-Driven Development (TDD)?

Test-driven development, or TDD, is an approach to software development that involves writing the tests for some functionality first and then implementing the functionality to pass the test. The expectation is that the first test will fail at the start and we build functionality to get it to pass. We continue running the tests and fixing our implementation until the test passes.

Let’s walk through an example. I want to implement a calculator that returns the absolute value of the sum of two numbers. The functionality is outlined as follows:

My calculator should return positive absolute value for two positives.
My calculator should return positive absolute value for two negatives.
My calculator should return positive absolute value for one negative and one positive.

In TDD, I write my first test as follows:

func TestCalculatorShouldReturn5AsSumOfTwoPositives(t *testing.T) {
    // Assuming testify's assert package: t first, then expected, then actual.
    assert.Equal(t, 5, CalculateSum(3, 2))
}

My function is pretty minimal. It pretty much just returns 0. I expect my test to fail.

func CalculateSum(x int, y int) int {
    return 0
}

I’ll add the functionality to CalculateSum, as follows.

func CalculateSum(x int, y int) int {
    return x + y
}

My tests should pass! Next, I’ll write a test for the second function (absolute value for two negatives).

func TestCalculatorShouldReturn5AsSumOfTwoNegatives(t *testing.T) {
    // Assuming testify's assert package: t first, then expected, then actual.
    assert.Equal(t, 5, CalculateSum(-3, -2))
}

When I run my tests, it will fail for function 2. Let me implement it in my calculator.

import "math"

func CalculateSum(x int, y int) int {
    // math.Abs works on float64, so convert on the way in and out.
    return int(math.Abs(float64(x + y)))
}

It will pass! When I write my test for function 3 (one negative, one positive number), I see that all three tests pass. This is because my logic already covered the third function.
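To see all three requirements side by side, here is a minimal, self-contained sketch that runs them as one table of cases. The sumCase type and the FailedCases helper are hypothetical names of mine, and I use a plain integer branch in place of math.Abs to avoid the float conversion. It sticks to the standard library so it runs without a test framework.

```go
package main

import "fmt"

// CalculateSum returns the absolute value of x + y.
// The integer branch below is a substitution for math.Abs.
func CalculateSum(x int, y int) int {
	sum := x + y
	if sum < 0 {
		return -sum
	}
	return sum
}

// sumCase is a hypothetical helper type: one row per requirement.
type sumCase struct {
	name string
	x, y int
	want int
}

// FailedCases runs every case and returns the names of those that fail.
func FailedCases(cases []sumCase) []string {
	var failed []string
	for _, c := range cases {
		if got := CalculateSum(c.x, c.y); got != c.want {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	cases := []sumCase{
		{"two positives", 3, 2, 5},
		{"two negatives", -3, -2, 5},
		{"one negative, one positive", -3, 2, 1},
	}
	fmt.Println("failed cases:", FailedCases(cases))
}
```

In a real project, the same table would live in a _test.go file and loop with t.Run, but the idea is the same: each row is one expectation written before the implementation.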

How can TDD help with software development?

There are many arguments for and against TDD. When I first started to do TDD, I had a lot of complaints:

My time is better spent building functionality over writing a bunch of tests.
It feels counterintuitive to write the tests first.
Testing is hard. Every time I tried writing tests, I ended up rewriting a bunch of my functional code to make it testable.

When I first started writing code with tests, I was eternally frustrated because I would write my functional code and spend two days trying to mock out everything. It got to a point that I really disliked testing. I started to learn TDD and pretty quickly, I noticed my code got much cleaner and my interfaces clearer. Now, I like the way I develop with TDD. That’s because:

It makes me think about what functionality I am supposed to implement.
Testing is easier because I make it testable from the beginning.
My code is cleaner because I know what functions I need to minimally address.
I take less time because I don’t have to remind myself what a function was supposed to do.
When I read someone’s code, I can read the tests first to figure out what their functions are supposed to be doing.

Of course, there are pros and cons to doing TDD. For me, using TDD is a personal choice. The next challenge was figuring out how to apply it to infrastructure.

How can TDD help with infrastructure?

The core principles of TDD in software development are to:

Only develop functionality that is needed.
Express the functionality in human expectation rather than code.
Organize the smallest unit of code to implement the functionality or logic.

Based on these concepts, infrastructure is pretty similar:

Only create / configure the infrastructure resources that are needed.
Express configuration declaratively, using tests as a reference.
Organize the smallest set of infrastructure resources to meet a security, resiliency, or operational requirement.

I think that applying TDD to infrastructure helps with overall testing approaches to infrastructure. Sometimes, engineers tell me they are discouraged by how difficult it is to test infrastructure. They may not have access to a sandbox cloud environment or infrastructure resources. As a result, changes to infrastructure are blindly pushed to production. The lack of “testability” for infrastructure relates to the testing pyramid, which posits that the higher a type of test sits in the pyramid, the more time and resources (and thus, cost) it takes to run.

Testing Pyramid. The types of tests at the top of the pyramid are more costly to run than the tests at the bottom.

In the case of TDD, we are writing as many unit tests as possible to rapidly check our logic without complicated integrations or interactions. In the case of infrastructure, what if we tested the content of the configuration for minimal functionality rather than the integration of multiple components? Writing the tests before implementation forces us to think about the best way to check for a small amount of configuration quickly, without expending additional resources to run a full integration or end-to-end test.

What’s an example for infrastructure?

Let’s TDD a request to create a complicated S3 bucket in AWS. This S3 bucket has the following requirements:

There should be a “MyBucketWriteUser” that can write anything into the bucket.
There should be a “MyBucketReadUser” that can read anything from the bucket.
Anyone with the “MyBucketRole” should have administrative access.
Deny everyone else. Bucket should not be publicly accessible.

Note: This example is hosted in its entirety on GitHub.

If I were not doing this with TDD, I would look up references to implement these policies, push them, get them created in AWS, and then inspect each of the policies. This runs into a few issues:

I have no clue if my policies are correct.
I have to create and check everything in AWS, running up my bill.
My security team doesn’t have a way of telling if my bucket is secured as we expect, other than examining AWS.

I think that TDD is best used for functionality with lots of logic, like code with if-else statements. In the case of my S3 bucket, there is a lot of logic embedded in the way bucket policies are handled. Not only am I looking for specific users but I’m also looking for very specific access control.

With TDD, I begin by figuring out what policies I want and write the tests to ensure those policies are followed by both AWS IAM and the S3 bucket. This example will be using the following:

Golang (Note: I had to write my own structures to unmarshal the AWS JSON. I couldn’t find a suitable AWS Golang library for my purposes.)
Terraform
AWS
Ruby & awspec

I start with a basic shell for my Terraform module, complete with a main.tf, *.tfvars, outputs.tf, and variables.tf. I also add a folder called policies, which contains my various AWS policies in JSON so I can access them in my tests.

Unit Tests

Recall that unit tests validate configuration and syntax. They’re inexpensive tests, so I’ll begin with them. I create a tests/unit directory, complete with a starter test file called policy_test.go.

> policies
    # bucket.json, eventually
> tests
    > unit
        policy_test.go
main.tf
mybucket.tfvars
outputs.tf
variables.tf

Let’s start by creating the unit tests to check my policy. My intent is to ensure that my bucket policy’s statements are correct. I have three statements to implement: write user, read user, and admin role. Thus, my test checks that all three statements exist.

After I’ve written the test, I’m going to write a “straw man” policy that will fail my test. My bucket.json file is pretty much empty. There is a dummy policy but not much more (see below).

When I run the test, it will fail! So let me edit the bucket policy to reflect the three statement IDs (Sids) I want. I run my test again.

The output of my first test passes!
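As a rough illustration of this first check, here is a minimal sketch that unmarshals a policy and reports which expected Sids are missing. The Statement and Policy structs, the MissingSids function, and the strawMan constant are hypothetical stand-ins of mine, not the code from the repository. The three Sid names are the ones used in this walkthrough.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Statement and Policy are hypothetical, minimal structures for
// unmarshalling a bucket policy; only the fields this check needs.
type Statement struct {
	Sid    string `json:"Sid"`
	Effect string `json:"Effect"`
}

type Policy struct {
	Version   string      `json:"Version"`
	Statement []Statement `json:"Statement"`
}

// MissingSids returns every expected statement ID that the policy lacks.
func MissingSids(policyJSON []byte, expected []string) ([]string, error) {
	var p Policy
	if err := json.Unmarshal(policyJSON, &p); err != nil {
		return nil, err
	}
	found := map[string]bool{}
	for _, s := range p.Statement {
		found[s.Sid] = true
	}
	var missing []string
	for _, sid := range expected {
		if !found[sid] {
			missing = append(missing, sid)
		}
	}
	return missing, nil
}

// strawMan stands in for a nearly empty policies/bucket.json (assumed content).
const strawMan = `{"Version": "2012-10-17", "Statement": [{"Sid": "Dummy", "Effect": "Deny"}]}`

func main() {
	expected := []string{"AllowWriteUser", "AllowReadUser", "AllowAdminRole"}
	missing, err := MissingSids([]byte(strawMan), expected)
	if err != nil {
		panic(err)
	}
	// With the straw man policy, all three Sids are reported as missing.
	fmt.Println("missing statements:", missing)
}
```

Once the policy lists all three Sids, the function returns an empty slice, which is the "test passes" state described above. In a real test, the JSON would be read from the policies folder instead of a constant.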

Next, I’ll start to work on the test checking each policy statement’s permissions. These tests will tell me if the policy statements have the correct access controls, users, and resources.

TestPolicyHasMyBucketWriteUserStatement checks for the correct policy I want, which is that MyBucketWriteUser can put objects into MyBucket. When I run my tests, I notice that they fail! I need to fix my AllowWriteUser policy statement (see below) to get the tests to pass.
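A check along these lines could be sketched as follows. Everything here is an assumption for illustration: the struct shapes, the CheckWriteStatement function, the account ID 123456789012, and the ARNs. It also assumes the policy writes Action and Resource as arrays and Principal as a single AWS ARN, which a real policy does not have to do.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical structures; they assume array-valued Action and Resource
// and a Principal of the form {"AWS": "<arn>"}.
type Statement struct {
	Sid       string            `json:"Sid"`
	Effect    string            `json:"Effect"`
	Principal map[string]string `json:"Principal"`
	Action    []string          `json:"Action"`
	Resource  []string          `json:"Resource"`
}

type Policy struct {
	Statement []Statement `json:"Statement"`
}

// CheckWriteStatement returns nil only when the AllowWriteUser statement
// allows exactly s3:PutObject for the given user on the given resource.
func CheckWriteStatement(policyJSON []byte, userARN, resourceARN string) error {
	var p Policy
	if err := json.Unmarshal(policyJSON, &p); err != nil {
		return err
	}
	for _, s := range p.Statement {
		if s.Sid != "AllowWriteUser" {
			continue
		}
		switch {
		case s.Effect != "Allow":
			return fmt.Errorf("effect is %q, want Allow", s.Effect)
		case s.Principal["AWS"] != userARN:
			return fmt.Errorf("principal is %q, want %q", s.Principal["AWS"], userARN)
		case len(s.Action) != 1 || s.Action[0] != "s3:PutObject":
			return fmt.Errorf("actions are %v, want [s3:PutObject]", s.Action)
		case len(s.Resource) != 1 || s.Resource[0] != resourceARN:
			return fmt.Errorf("resources are %v, want [%s]", s.Resource, resourceARN)
		}
		return nil
	}
	return errors.New("statement AllowWriteUser not found")
}

// Hypothetical values: the account ID and ARNs are placeholders.
const (
	writeUserARN = "arn:aws:iam::123456789012:user/MyBucketWriteUser"
	objectsARN   = "arn:aws:s3:::mybucket/*"
)

// tooBroad is a first-draft statement that grants every S3 action.
const tooBroad = `{"Statement": [{
  "Sid": "AllowWriteUser", "Effect": "Allow",
  "Principal": {"AWS": "arn:aws:iam::123456789012:user/MyBucketWriteUser"},
  "Action": ["s3:*"],
  "Resource": ["arn:aws:s3:::mybucket/*"]
}]}`

func main() {
	// The draft fails the check because it allows more than s3:PutObject.
	fmt.Println(CheckWriteStatement([]byte(tooBroad), writeUserARN, objectsARN))
}
```

Narrowing the draft's Action to s3:PutObject makes the function return nil. The point of the sketch is that the expectation (one user, one action, one resource) is written down before the policy is correct.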

I’ll repeat this process with the AllowReadUser and AllowAdminRole, meticulously writing the correct policies so that my tests pass. My final bucket policy will look something like this:

Awesome! I’ve written some unit tests to check my bucket policies.
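The fourth requirement (deny everyone else, no public access) can also be approached at the unit level, at least in part. The sketch below flags any Allow statement whose principal is a wildcard. The PublicAllowSids function, the struct shapes, and the sample policy (including the AllowEveryone Sid and the account ID) are hypothetical. It only inspects the policy document, so it says nothing about the bucket's separate public access settings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical structures. Principal is kept raw because a policy may
// write it as "*" or as an object such as {"AWS": "*"}.
type Statement struct {
	Sid       string          `json:"Sid"`
	Effect    string          `json:"Effect"`
	Principal json.RawMessage `json:"Principal"`
}

type Policy struct {
	Statement []Statement `json:"Statement"`
}

// isWildcard reports whether a principal is "*" or maps any key to "*".
func isWildcard(raw json.RawMessage) bool {
	var s string
	if json.Unmarshal(raw, &s) == nil {
		return s == "*"
	}
	var m map[string]interface{}
	if json.Unmarshal(raw, &m) != nil {
		return false
	}
	for _, v := range m {
		switch t := v.(type) {
		case string:
			if t == "*" {
				return true
			}
		case []interface{}:
			for _, e := range t {
				if e == "*" {
					return true
				}
			}
		}
	}
	return false
}

// PublicAllowSids returns the Sids of Allow statements open to everyone.
func PublicAllowSids(policyJSON []byte) ([]string, error) {
	var p Policy
	if err := json.Unmarshal(policyJSON, &p); err != nil {
		return nil, err
	}
	var public []string
	for _, s := range p.Statement {
		if s.Effect == "Allow" && isWildcard(s.Principal) {
			public = append(public, s.Sid)
		}
	}
	return public, nil
}

// sample is a hypothetical policy with one deliberately public statement.
const sample = `{"Statement": [
  {"Sid": "AllowReadUser", "Effect": "Allow",
   "Principal": {"AWS": "arn:aws:iam::123456789012:user/MyBucketReadUser"}},
  {"Sid": "AllowEveryone", "Effect": "Allow", "Principal": "*"}
]}`

func main() {
	public, err := PublicAllowSids([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println("publicly allowed statements:", public)
}
```

A Deny statement with a wildcard principal is deliberately not flagged, since that is a common way to express "deny everyone else".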

Contract Tests

Let’s traverse up the pyramid and write some contract tests. Why? Well, my bucket policy requires a MyBucketWriteUser, MyBucketReadUser, and MyBucketRole. These are not created as part of the bucket policy but are created by AWS IAM requests. The point of contract tests is to check that the output of one service matches the expected input to my service. I’m going to write some quick tests to make sure that the usernames of the principals in my AWS IAM specification match the ones I put in my bucket.

A contract test will help me evaluate if my bucket policy matches the IAM declarations.

In this case, assume that I’ve already created the Terraform files to create my IAM read user, write user, and admin role. As a result, I’m going to write a contract test that:

Triggers terraform plan for my IAM users and roles.
Checks the plan for the naming and permissions.
Matches the naming and permissions to those I’ve added to my bucket policy.

There are a few flaws in this plan, namely that terraform plan requires my AWS credentials in order for the provider to trigger. However, it’s a small price to pay to make sure that the output of my IAM users and roles matches the input to the bucket’s policy.
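The matching step of such a contract test can be sketched without touching AWS. In the sketch below, the plan output is a hard-coded, abbreviated string with hypothetical resource addresses. A real test would capture the text from running terraform plan, for example through os/exec. The MissingFromPlan function is my own illustrative helper.

```go
package main

import (
	"fmt"
	"strings"
)

// MissingFromPlan returns every principal name that does not appear,
// as a quoted value, in the plan output.
func MissingFromPlan(planOutput string, names []string) []string {
	var missing []string
	for _, n := range names {
		// Match the quoted name so one name cannot match as a prefix of another.
		if !strings.Contains(planOutput, `"`+n+`"`) {
			missing = append(missing, n)
		}
	}
	return missing
}

// samplePlan is an abbreviated, hypothetical stand-in for plan output.
const samplePlan = `
+ aws_iam_user.write_user
    name: "MyBucketWriteUser"
+ aws_iam_user.read_user
    name: "MyBucketReadUser"
+ aws_iam_role.bucket_admin_role
    name: "MyBucketAdminRole"
`

func main() {
	// Names as referenced by the bucket policy (the consumer side of the contract).
	fromPolicy := []string{"MyBucketWriteUser", "MyBucketReadUser", "MyBucketRole"}
	fmt.Println("not in plan:", MissingFromPlan(samplePlan, fromPolicy))
}
```

Any name the bucket policy expects but the IAM plan does not produce shows up in the returned slice, so a naming mismatch between the two sides is caught before anything is created.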

When I ran these tests, the last test TestIAMHasExpectedAdminRoleAndPolicy failed! The output revealed that I named my role incorrectly in my bucket policy.

$ go test ./tests/integration/...
+ aws_iam_role_policy_attachment.bucket_admin_role
                                      id:                    <computed>
                                      policy_arn:            "${aws_iam_policy.bucket_admin_role.arn}"
                                      role:                  "MyBucketAdminRole"
...
                            
                                " does not contain "MyBucketRole"

Thus, I correct the role’s name in my bucket policy and move forth with writing some integration tests to check that my bucket has been created.

Integration Tests

I tested the output of my IAM users to match the input to my bucket policy. Now, I can create them in AWS. For these integration tests, I deviate from my original Golang tests and swap to awspec in Ruby. Yes, I went polyglot. No single language has a useful tool for every kind of test, so I often swap languages to take advantage of a more useful testing tool. awspec uses Ruby’s RSpec to check the AWS components that have been created. I’ve found this useful not only for integration and component testing but also for security and compliance checking. Running this test on a regular schedule helps check for manual changes or deviations!

Basically, my awspec tests consist of simple checks against the IAM users, roles, and S3 bucket to be created. I always start with a should exist statement just to make sure that the component even exists. In the case of TDD, I expect them not to exist and the tests to fail.

After I’ve written these tests, I’m ready to create my Terraform file for my bucket.

At this point, I don’t have to put too much into my Terraform file because I’ve already tested some key logic encapsulated in my bucket policy. First, I run terraform plan to dry run my configuration. Then, I run a terraform apply to build my S3 bucket, IAM users, and IAM roles.

$ terraform apply -var-file=test.tfvars
...
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

I run my awspec tests again to make sure all of the policies and users have been properly attached.

$ cd tests/integration && bundle exec rake spec
..................
Finished in 3.77 seconds (files took 3.19 seconds to load)
18 examples, 0 failures

They passed!

(Manual) End-to-End Test

My final question is, “Does this bucket allow the right users and roles to access it?” We’ll see! I opt to do this manually since I’ve contract and integration tested already.

I start with my write user. In theory, MyBucketWriteUser should be able to put objects but not read objects.

## As MyBucketWriteUser
$ aws s3 mv test.txt s3://mybucket/test.txt
move: ./test.txt to s3://mybucket/test.txt
$ aws s3 cp s3://mybucket/test.txt ./hello.txt
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Notice that I could upload an object but I couldn’t read it from the bucket. Let’s use MyBucketReadUser to retrieve the object from the bucket.

## As MyBucketReadUser
$ aws s3 cp s3://mybucket/test.txt ./hello.txt
download: s3://mybucket/test.txt to ./hello.txt

I’m allowed to download my file that I uploaded, which is good! If I try to upload, I’ll get an error.

## As MyBucketReadUser
$ aws s3 mv test.txt s3://mybucket/test.txt
move failed: ./test.txt to s3://mybucket/test.txt An error occurred (AccessDenied) when calling the PutObject operation: Access Denied

Now, let me assume the MyBucketAdminRole. This should allow me to delete the file object in the bucket.

## As MyBucketAdminRole
$ aws s3 rm s3://mybucket/test.txt
delete: s3://mybucket/test.txt
$ aws s3 ls s3://mybucket/
# Returns empty

It works! I’ve confirmed that my bucket is available and my users can correctly access my bucket. Note that I tried this manually, since my integration tests should sufficiently cover most of my actions. If this were a repeated set of AWS resources with more components, I would use terratest to assume a role, create objects, list objects, and more with my different users and roles.

In Summary

Above is just one simple example of how I approach TDD for infrastructure. Admittedly, writing the tests for the example did take a chunk of time. I could have written the bucket policy and checked it manually, achieving the outcome in a few hours. However, TDD compelled me to write some unit tests first and run them locally. I actually only had to run terraform apply once, which was after the integration tests! In some ways, I saved some money on my AWS bill. I felt more confident deploying to AWS because I had checked:

The functionality of my bucket policy via my unit tests.
The contracts between my IAM users and my bucket.
The integration of my security policies.

Some caveats and important considerations I had while writing the tests…

I could have used terratest instead of awspec for integration tests. terratest is written in Golang and would have been more consistent in language with the rest of my tests, except that it currently doesn’t have a good way of checking for bucket policies. I do think terratest would be good for end-to-end testing.
I wouldn’t keep my bucket policy in a JSON document. In this example, I used a JSON file because (1) it had less set-up code for the test and (2) I wanted to show that unit tests could be agnostic of the infrastructure-as-code tool. For extensibility and scale, I usually use an aws_iam_policy_document declaration that comes with Terraform. It’s a little bit trickier to unit test but the ideas are pretty similar.
awspec (and other RSpec-like testing tools) can sometimes cross the lines between integration and contract tests. The testing pyramid can be pretty fluid for infrastructure. I use my own discretion to determine if certain resources would benefit from contract tests or just the integration tests.
Testing tools may not support every feature that might be applied to a public cloud resource. For example, I couldn’t add an awspec test to ensure the bucket’s public access was fully blocked. That isn’t covered in any of the unit, contract, or integration tests.
If I have a lot of components and want to go even further with integration testing, I use localstack. localstack mocks AWS components like buckets, databases, etc. on my local machine. It’s not my go-to integration testing tool because it doesn’t always support the mocks I need. Sometimes, I’ll actually spend more time debugging it than writing the test.
Speaking of time, I evaluate my return on investment for writing tests. Sometimes, I might not write end-to-end tests because integration tests sufficiently cover my functionality. Other times, I forgo the contract tests because the amount of time I spend figuring out how to test the contract outweighs the benefit. I constantly work to balance the types of tests I write, the time taken to write them, and the confidence they provide me before I deploy.

Overall, TDD does take more time and can be fairly tricky for infrastructure. However, it helps me to…

Divide the minimal configuration I want in a cleaner way.
Express the configuration I want declaratively.
Focus on testing most of my infrastructure locally, which reduces my feedback cycle and cost.

Curious to learn more? Take a look at the example for S3 bucket creation on GitHub and try this approach on some other use cases!