End-to-end testing of cloud infrastructure

sylwiabrant · Published in VirtusLab · Nov 8, 2022 · 9 min read

The rapid growth of cloud computing outpaced standardisation, leaving the DevOps approach somewhat in the dark. The “you build it, you run it” approach became the de facto standard in the industry. Unfortunately, it didn’t scale well and required specialised skills, and finding people with this expertise was challenging and expensive.

According to multiple State of the Cloud surveys, 91% of businesses use the public cloud, 76% of companies are already multi-cloud, and the trend is only upward. The amount of information we can retain is hitting its limits: emerging technologies, tools, frameworks and new methodologies keep increasing our cognitive load. It also means a lot of code is being actively produced and maintained. That’s why the right infrastructure testing strategy is more crucial today than ever. We believe we are doing everything we can to deliver the best possible standard. But are we?

In this article, we’ll introduce you to infrastructure testing, provide test cases, and offer advice on how to set up your tests to achieve sustainability and resilience in the cloud.

Why is infrastructure testing a new practice?

To assure quality, we write tests alongside all of our application code. So why is writing tests for infrastructure not yet standard practice? Thinking about edge cases for infrastructure requires not only vast knowledge of how to use it properly, but also an understanding of the underlying problems of the chosen solutions. As the Cloud Team at VirtusLab, we would like to share a few long-term observations and our approach to infrastructure testing.

This should help pave the way and give you some insights for upgrading your own infrastructure testing strategy, and show how your organisation can benefit from it.

These are our observations on end-to-end infrastructure testing:

  • Infrastructure tests take a long time and involve templating, which introduces additional complexity. We do not want random errors during resource provisioning.
  • Static code analysis tools find bugs in the configuration before a much longer test deployment, which saves a lot of time in case of a misconfiguration.
  • End-to-end (E2E) tests can produce resources that affect consecutive tests, so it’s important to provide the necessary sandboxing and reduce the potential blast radius.
  • New versions of Infrastructure as Code tooling (e.g. Terraform providers) and cloud APIs are released regularly, some containing breaking changes. It’s not always possible to foresee how they will affect dependent resources without prior testing.
  • Properly written tests, with design patterns in mind, can be easily reused regardless of the cloud provider.
  • Properly structured and verbose tests can reduce troubleshooting time from hours to minutes.

The right way to start the testing flow

Infrastructure is spelled out in configuration files, which come in many different formats such as YAML, JSON, etc. They all have ways of checking spelling and indentation.

The testing process can be as simple as generating the actual files from templates and evaluating the correctness of the outcome in terms of values and formatting, or as complicated as dry-running the configuration files together with their dependencies.

We can choose from a variety of linters and static analysis tools that take care of different aspects of code quality:

  • Simple linters — standalone or IDE-integrated, it doesn’t matter. They provide basic syntax error detection and ensure readability. Examples are the built-in terraform fmt and terraform validate.
  • More advanced static code analysis tools — these check for security vulnerabilities resulting from misconfiguration, e.g. checkov or tfsec for Terraform configuration files.
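To make this concrete, here is a minimal sketch of wiring those basic checks into a Go test that shells out to the Terraform CLI. The module path is a placeholder, and terraform is assumed to be on the PATH:

```go
// lint_test.go — a minimal sketch of running the basic pre-flight checks
// from a Go test. The module path is a placeholder.
package lint

import (
	"os/exec"
	"testing"
)

func TestTerraformStaticChecks(t *testing.T) {
	moduleDir := "../modules/key-vault" // placeholder path

	for _, args := range [][]string{
		{"fmt", "-check", "-recursive"}, // formatting drift fails the test
		{"init", "-backend=false"},      // required before validate
		{"validate"},                    // catches syntax and type errors
	} {
		cmd := exec.Command("terraform", args...)
		cmd.Dir = moduleDir
		if out, err := cmd.CombinedOutput(); err != nil {
			t.Fatalf("terraform %v failed: %v\n%s", args, err, out)
		}
	}
}
```

Running these checks takes seconds, which is exactly why they belong before any real deployment in the pipeline.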

Where the different parts actually meet

Everything looks good on paper, but things most often fail when dependencies on other components come into play. Two types of tests need to create real infrastructure: end-to-end tests and conformance tests. What’s the difference between them?

Let’s take a look.

In end-to-end infrastructure testing, we create real components directly via the cloud providers’ API. They are ephemeral in the context of a particular test. We then check if the actual configuration aligns with the desired one and if everything results in the expected behaviour.

The most common test cases consist of checking whether:

  • The created resources provide necessary connectivity between each other and between the users and the resource, e.g. setting and getting Key Vault secrets from a command line.
  • Secondary resources were created, e.g. managed identities, which are not really the “main course” in a module but definitely need to be in place to ensure we have access to the created resource.
  • Basic operations specific to that resource can be handled without error, e.g. CRUD operations on a database.
  • And, of course, if the created resource has the test configuration applied.

On the other hand, applications have specific requirements to conform to in terms of security, availability and connectivity. These can’t be tested thoroughly with E2E tests, because we would need to recreate, 1:1, the environment the application runs in to be sure. In these cases, conformance testing can be used. These tests run inside the runtime environment, which is Kubernetes in the context of this article. Some things look different from inside the cluster, e.g. the access policies for resources running in it differ from those for users and other managed identities. Moreover, with conformance testing we can mock failures of dependent resources and check how effective our disaster recovery tactics are.

In short, the difference is this: end-to-end tests create ephemeral resources from the outside, directly through the cloud provider’s API, and verify their configuration and behaviour, while conformance tests run inside the runtime environment and verify that the existing infrastructure meets the application’s expectations.

End-to-end tests in depth

E2E tests can be divided into given-when-then parts, like any other test. The only difference is what goes into each of those parts.

Our team at VirtusLab has created a custom E2E framework, written in Go, to encapsulate test logic for Terraform modules. It supports two modes:

  • Local development — the infrastructure is tested remotely, but using code from the local machine. The infrastructure won’t be deleted, so we can inspect it in the cloud. This is very useful for finding the root cause of a test failure, for example when secondary resources weren’t created because of a networking problem, or a provider was left misconfigured after an upgrade.
  • E2E tests in the continuous integration (CI) pipeline — carried out using code from the remote repository. The infrastructure is deleted at the end. This mode is suited for full automation.

In each of these modes, resource templates are rendered from the actual module directories, and we can pass configuration variables to test specific use cases. The rendered Terraform files are then applied to the real cloud environment by running the init, plan and apply steps. By driving everything from code, using libraries like Terratest wrapped in additional logic, we get a custom library in which all the necessary variables, including runtime values from Terraform outputs, are available in code and ready to use for calling actions on the resources. That way, we can compare the desired configuration with the actual remote state in the cloud. An additional advantage is that we can ensure the providers and the cloud API don’t inject any default values that would break our desired outcome.
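As a rough illustration of that flow, here is what such a test can look like with plain Terratest. The module path, variables and output names are placeholders, not our actual framework:

```go
// e2e_keyvault_test.go — a rough sketch of the init/plan/apply flow with
// Terratest. Module path, variable and output names are placeholders.
package e2e

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestKeyVaultModule(t *testing.T) {
	opts := &terraform.Options{
		TerraformDir: "../modules/key-vault", // placeholder module path
		Vars: map[string]interface{}{
			"name":     "e2e-test-kv", // test-specific configuration
			"location": "westeurope",
		},
	}

	// CI mode: always destroy the ephemeral infrastructure at the end.
	defer terraform.Destroy(t, opts)

	// Create the resources in the real cloud environment.
	terraform.InitAndApply(t, opts)

	// Runtime values from Terraform outputs are available in code and can
	// be used to call actions on the created resources.
	vaultURI := terraform.Output(t, opts, "vault_uri") // placeholder output
	t.Logf("created Key Vault at %s", vaultURI)

	// A follow-up plan with a detailed exit code doubles as an idempotency
	// and drift check: exit code 0 means remote state matches the config.
	if code := terraform.InitAndPlanWithExitCode(t, opts); code != 0 {
		t.Fatalf("unexpected changes after apply, plan exit code %d", code)
	}
}
```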

Testing scenarios differ for each resource. We can check network connectivity over different protocols, the success of CRUD operations, the existence of secondary resources and configurations, and so on. As infrastructure creation, updates and configuration changes all need time to propagate, resource and status polling are a must.

The simplest example is creating a Key Vault.

Let’s walk through it in given-when-then terms for a better understanding: given a rendered Key Vault module, when we apply it, then we verify the resulting behaviour, e.g. by round-tripping a secret.
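A sketch of that “then” step, assuming the Azure SDK for Go and a vault URI taken from a Terraform output (secret name, value and retry intervals are placeholders):

```go
// e2e_keyvault_assert.go — a rough sketch of the "then" phase: verify that
// the freshly created Key Vault accepts and returns secrets. The vault URI
// would come from a Terraform output; names and values are placeholders.
package e2e

import (
	"context"
	"fmt"
	"testing"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/security/keyvault/azsecrets"
	"github.com/gruntwork-io/terratest/modules/retry"
)

func assertSecretRoundTrip(t *testing.T, vaultURI string) {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		t.Fatal(err)
	}
	client, err := azsecrets.NewClient(vaultURI, cred, nil)
	if err != nil {
		t.Fatal(err)
	}

	name, value := "e2e-probe", "expected-value" // placeholders

	// Fresh resources need time to propagate, so poll instead of asserting once.
	retry.DoWithRetry(t, "set and get a Key Vault secret", 10, 15*time.Second,
		func() (string, error) {
			ctx := context.Background()
			if _, err := client.SetSecret(ctx, name,
				azsecrets.SetSecretParameters{Value: &value}, nil); err != nil {
				return "", err
			}
			resp, err := client.GetSecret(ctx, name, "", nil)
			if err != nil {
				return "", err
			}
			if resp.Value == nil || *resp.Value != value {
				return "", fmt.Errorf("secret came back with an unexpected value")
			}
			return "secret round-trip succeeded", nil
		})
}
```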

Does your application environment conform to the expectations?

In addition to end-to-end tests, conformance tests are run to check the infrastructure’s overall health and to troubleshoot problematic features at runtime. They usually run in Kubernetes, in isolation, in dedicated namespaces with test resources. Such tests must also be non-disruptive to the existing infrastructure.

The most popular tool for such tests is VMware Tanzu’s Sonobuoy.

As Sonobuoy is designed to run in-cluster tests, we extend its use to a variety of test types, such as in-cluster connectivity, authorisation and lifecycle management. This gives us a nice baseline for future customisation.

  • Some applications are deployed in the cluster but can be considered part of the infrastructure. These include, for example, monitoring and logging apps such as Splunk or Thanos, CI/CD solutions like ArgoCD or Jenkins, certificate-issuing authorities, or event webhooks you created for developers. These solutions need to be resilient, and conformance tests are ideal for testing their lifecycle (a minimal probe is sketched after this list).
  • During runtime, these resources create temporary subresources, whose existence needs to be checked for the ecosystem to function properly.
  • Internal and external cluster connectivity is often a tricky combination of many network rules, which must be thoroughly checked to ensure security. Connections to dependent infrastructure resources like image repositories and storage entities require both connectivity and permission check-ups.
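For example, a minimal lifecycle probe for such an in-cluster component can simply assert that its Deployment reports all of its replicas ready. A sketch, assuming client-go and placeholder names:

```go
// lifecycle_probe_test.go — a minimal sketch of a lifecycle probe for an
// in-cluster component. Namespace and deployment name are placeholders.
package conformance

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func TestComponentIsHealthy(t *testing.T) {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		t.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatal(err)
	}

	// Placeholder namespace/name: point these at the component under test.
	dep, err := client.AppsV1().Deployments("argocd").
		Get(context.Background(), "argocd-server", metav1.GetOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if dep.Status.ReadyReplicas != *dep.Spec.Replicas {
		t.Fatalf("only %d/%d replicas ready", dep.Status.ReadyReplicas, *dep.Spec.Replicas)
	}
}
```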

How to structure a conformance test

In the case of extended conformance tests, the infrastructure already exists, so we only need to either establish a working connection to a resource and trigger some logic, or create a secondary resource to do it for us. As with E2E tests, this can be nicely divided into phases: setup, test logic and teardown.

To make this more concrete, consider a test case that checks the ability to pull images from Azure Container Registry into an Azure Kubernetes Service cluster. We can divide this simple test into the following steps:

  1. In the Setup block, we create a separate namespace to limit the blast radius and create a special pod that will try to pull the image. Creating a separate namespace for a test case can greatly simplify removing all of the created resources and ensure we start without any old residue that can affect the test logic.
  2. If the pod reaches the running status, the image has been successfully downloaded, and we assume that the cluster components have the necessary permissions set in the repository. The test logic is placed in the middle part of the code.
  3. When the test ends, we clean up the namespace.

Now let’s put the whole test together.
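Below is a minimal, self-contained sketch of that test using client-go; the kubeconfig path, registry URL and image name are assumptions to adapt to your environment:

```go
// conformance_acr_pull_test.go — a minimal sketch of the ACR image-pull
// conformance test described above. Registry URL and image are placeholders.
package conformance

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func TestACRImagePull(t *testing.T) {
	ctx := context.Background()
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		t.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatal(err)
	}

	// Setup: a dedicated namespace limits the blast radius and simplifies cleanup.
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{GenerateName: "acr-pull-test-"}}
	ns, err = client.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{})
	if err != nil {
		t.Fatal(err)
	}
	// Teardown: deleting the namespace removes everything the test created.
	defer client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{})

	// Test logic: a pod that has to pull an image from the private registry.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "acr-pull-probe"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "probe",
				Image: "myregistry.azurecr.io/conformance/probe:latest", // placeholder
			}},
		},
	}
	if _, err := client.CoreV1().Pods(ns.Name).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		t.Fatal(err)
	}

	// If the pod reaches Running, the image pull (and thus the registry
	// permissions) worked.
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 3*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			p, err := client.CoreV1().Pods(ns.Name).Get(ctx, pod.Name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			return p.Status.Phase == corev1.PodRunning, nil
		})
	if err != nil {
		t.Fatalf("pod never reached Running: %v", err)
	}
}
```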

Putting it all together

Having gone through all these test types, we are ready to put together a solution that makes our infrastructure as resilient as possible. After every change to the code base, the CI pipeline should run all the test types sequentially, starting with linting and static code analysis and moving on to the E2E and conformance tests. When we are ready to deploy, the CD pipeline should run the non-disruptive conformance tests to ensure that the environment is still functioning properly after the update.

Closing words

Testing cloud infrastructure comes with many hardships, but it definitely pays off. Following the steps presented in this article should get you started. There are many design details specific to each infrastructure that need to be treated individually, but they all fall into one of the main categories to keep in mind.

For end-to-end tests, check:

  • Create, update and delete operations
  • Connectivity between infrastructure resources
  • Configurations and their propagation
  • Idempotency (re-applying the same configuration proposes no changes)
  • Infrastructure drift

For conformance tests, check:

  • Creating, updating and deleting workloads
  • Security policies
  • Authorisation/access control
  • Ingress and egress
  • Disaster recovery, availability and scaling capabilities
  • Utilities specific to the installed tools and workloads running in the cluster

Give it a try, and enjoy a more stable and resilient infrastructure!
