An approach to specification-based testing for asynchronous workloads in AWS
In this article we deep-dive into the techniques we use when engineering cloud-native solutions. The article assumes the reader is conversant with AWS cloud-native solution architectures, in particular Lambda, DynamoDB, S3, API Gateway and SQS, as well as the basic aims of automated software testing. We’ll walk the reader through how to build a test framework specifically for asynchronous event-driven workloads.
Specification-based testing, also known as "opaque-box testing" or "behaviour testing" (and formerly "black-box testing"), aims to assure what the software does — i.e. "does it work to specification?" — not how it does it. Therefore, when designing and building tests, the engineer should only look at the specification and should not require any knowledge of how the software is structured or built. This becomes a lot harder to achieve in modern software architectures that make extensive use of asynchronous processing, a concept explored in some depth by Werner Vogels in his 2022 re:Invent keynote¹.
Building a specification-based test suite for a synchronous REST API involves invoking each of the endpoints with a series of test payloads, capturing the actual responses and making assertions against the expected responses. With asynchronous solutions, however, there may be no REST API into which you can inject test payloads, or, if there is one, it may not return the result of the actual processing. Writing automated tests for these solutions can require more direct access to underlying cloud resources such as buckets, tables and queues. So how can this be done in the context of a specification-based test, which should only use the external interfaces? In this article we present an approach we have recently taken and discuss some of its pros and cons.
We use a variety of automated tests at various points in our CI/CD pipelines:
- Static code analysis — tools such as Sonar, Snyk, Prettier and ESLint scan the codebase on pre-commit, on PRs and post merge-to-main
- Unit tests — tools such as Jest, Playwright and Cucumber execute the code in the pipeline container environment
- Specification-based tests — used to exercise the solution after it has been deployed to a cloud non-production environment, using, as far as possible, only its external interfaces
- Integrated system tests — end-to-end tests with both happy path and alternate path test cases
- Performance tests — volume / load / stress with tools such as k6
- Accessibility tests — tests to make sure that the product can be used productively and comfortably by the widest possible range of people
Specification-based tests
With specification-based tests, the system under test is invoked through, and assertions are made against, the external interfaces of the solution: the test is agnostic of the internals and seeks to assure that the behaviour of the solution meets the expectations of its external consumers, both upstream and downstream. Put simply, the test looks at what the software does, not how it does it. An example of this is a synchronous REST API:
In this example the API is invoked with a specific payload in the request containing test data:
{
  "order_id": "ABC123",
  "activity": "PAYMENT SUBMITTED"
}
and assertions are made on the actual response against an expected response:
{
  "status": "PAYMENT PROCESSED"
}
Testing asynchronous workloads is not as simple, though. Typically we may need to read from or write to a bucket or queue directly, as this is the input event at the boundary which triggers processing, or the output event at the boundary at the end of processing. To do this we built a test harness: a collection of scripts or code which can be used to inject data, inspect output, or mock the responses of externally integrated systems that are not present in the test environment. The diagram below shows our test harness:
The Test Inspection API Gateway
This API Gateway exposes a REST API to be consumed by the Test Runner. In the IaC template it is configured to be deployed only into environments where these automated tests are to be run, for example Development and Build, but not Pre-Production or Production, as in the sketch below. If the Test Runner is also running in an AWS account you control, this API Gateway can be made private so the endpoints are not exposed publicly, although this does incur extra cost.
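In a CloudFormation template, for example, this conditional deployment might be expressed with a Condition; the parameter, environment values and resource names below are illustrative assumptions:

Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, build, preprod, prod]

Conditions:
  # Deploy the test harness only into environments where automated tests run
  DeployTestHarness: !Or
    - !Equals [!Ref Environment, dev]
    - !Equals [!Ref Environment, build]

Resources:
  TestInspectionApi:
    Type: AWS::ApiGateway::RestApi
    Condition: DeployTestHarness
    Properties:
      Name: test-inspection-api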
This API Gateway contains endpoints which the Test Runner consumes to inject a test message on to a queue or to inspect state within specific resources in the AWS account running the system under test. It is protected using an API Gateway resource policy which ensures endpoints can only be invoked when accompanied by an AWS SigV4 signature generated by a principal from the same AWS account, for instance the task role assigned to the ECS container hosting the Test Runner. The use of a resource policy means permissions don't need to be attached to the principal, and keeps all the test harness IaC together whilst following the principle of least privilege². An example of such a resource policy is shown below:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": { "AWS": "*" },
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:<REGION>:<ACCOUNT>:<APIID>/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalAccount": "<ACCOUNT>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:<REGION>:<ACCOUNT>:<APIID>/*"
    }
  ]
}
All but one of the endpoints are implemented as AWS service proxies, e.g. to s3:GetObject, s3:ListObjectsV2, s3:PutObject and sqs:SendMessage. A separate endpoint, which invokes a lambda, is implemented to allow non-sequential access to items that are sent on to a queue by the system under test, supporting assertions against items in the queue without being dependent on their order. This allows multiple tests to run concurrently, which is important when tests are automated in CI pipelines.
Injecting a test message on to a Queue
Where the system under test needs to be triggered by an incoming message, the Test Runner can use the API Gateway to inject a message on to that queue, as in the sketch below.
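A minimal TypeScript sketch of how the Test Runner might sign and send such requests, assuming the AWS SDK v3 signing packages; the host name and endpoint path here are hypothetical:

// SigV4-signing fetch helper for calls to the Test Inspection API.
import { SignatureV4 } from "@smithy/signature-v4";
import { HttpRequest } from "@smithy/protocol-http";
import { Sha256 } from "@aws-crypto/sha256-js";
import { defaultProvider } from "@aws-sdk/credential-provider-node";

const host = process.env.TEST_API_HOST!; // e.g. abc123.execute-api.eu-west-2.amazonaws.com

export async function signedFetch(
  path: string,
  method = "GET",
  body?: string,
  query?: Record<string, string>
): Promise<Response> {
  const request = new HttpRequest({
    method,
    protocol: "https:",
    hostname: host,
    path,
    query,
    headers: { "content-type": "application/json", host },
    body,
  });
  const signer = new SignatureV4({
    credentials: defaultProvider(), // resolves the ECS task role in CI
    region: process.env.AWS_REGION ?? "eu-west-2",
    service: "execute-api", // API Gateway's SigV4 service name
    sha256: Sha256,
  });
  const signed = await signer.sign(request);
  // fetch derives the Host header from the URL, so drop it from the signed header bag
  const { host: _, ...headers } = signed.headers;
  const qs = query ? `?${new URLSearchParams(query)}` : "";
  return fetch(`https://${host}${path}${qs}`, { method, headers, body });
}

// Injects a test message via a hypothetical service proxy for sqs:SendMessage.
export const injectTestMessage = (message: unknown) =>
  signedFetch("/test/messages", "POST", JSON.stringify(message));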
Asserting against an Object created in a Bucket
Where the expected output from the system under test is an object in a bucket and the object key is known, the Test Runner can assert both the presence and the content of the object, as in the sketch below. A lifecycle policy is set on the bucket to ensure these objects are regularly cleaned up.
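Since the object appears asynchronously, the Test Runner polls rather than asserting immediately. A sketch, reusing the signedFetch helper above and assuming a hypothetical proxy endpoint to s3:GetObject:

// Polls the inspection API until the expected object exists, then returns
// its content for assertion. Timeout and poll interval are illustrative.
import { signedFetch } from "./signed-fetch"; // helper from the previous sketch

export async function waitForObject(key: string, timeoutMs = 30_000): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const response = await signedFetch(`/test/objects/${encodeURIComponent(key)}`);
    if (response.status === 200) return response.text();
    if (response.status !== 404) throw new Error(`Unexpected status ${response.status}`);
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error(`Object ${key} was not created within ${timeoutMs}ms`);
}

A test can then parse the returned content and make ordinary Jest assertions against it.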
Asserting for the presence and content of a Message in a Queue
This pattern is used where the expected output from the system under test is a message in a queue and a specific attribute of that message (e.g. an id) is known. A dequeue lambda is deployed into the non-integrated environments to read messages off the queue and write them as objects into a specified bucket, as sketched after the list below. The object key is in the form <prefix>/<attribute-value>-<timestamp>-<message_id> where:
- <prefix> is a static string defined in an environment variable for the dequeue lambda
- <attribute> is the name of a root-level attribute of the SQS message, specified in an environment variable for the dequeue lambda, for example id
- <timestamp> is the timestamp in ISO 8601 format with millisecond precision, for example 2023-08-01T17:03:14.725Z
- <message_id> is the SQS message id retrieved from the message attributes; this is included to guarantee uniqueness of the key
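A minimal sketch of such a dequeue lambda in TypeScript; the environment variable names are illustrative assumptions:

// Writes each message from the SQS event source to the dequeue bucket
// under the key <prefix>/<attribute-value>-<timestamp>-<message_id>.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import type { SQSEvent } from "aws-lambda";

const s3 = new S3Client({});
const bucket = process.env.DEQUEUE_BUCKET!;    // target bucket
const prefix = process.env.KEY_PREFIX!;        // static <prefix> segment
const attribute = process.env.ATTRIBUTE_NAME!; // root-level attribute, e.g. "id"

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const body = JSON.parse(record.body);
    const timestamp = new Date().toISOString(); // ISO 8601 with milliseconds
    const key = `${prefix}/${body[attribute]}-${timestamp}-${record.messageId}`;
    await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: record.body }));
  }
};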
The Test Runner can find and retrieve one or more relevant objects to assert against, using API Gateway calls which proxy s3:ListObjectsV2 and s3:GetObject. The s3:ListObjectsV2 call can apply prefix filtering, for example to limit responses to objects whose keys begin with a specified <prefix>/<attribute-value>, and returns objects sorted by ascending key. This allows the Test Runner to efficiently find the output message corresponding to the test it initiated and then perform the test assertions, as in the sketch below.
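A sketch of this lookup from the Test Runner, again reusing the signedFetch helper and assuming hypothetical proxy endpoints for s3:ListObjectsV2 and s3:GetObject that return JSON:

// Finds the first dequeued message whose key starts with
// <prefix>/<attribute-value>- and retrieves its content for assertion.
import { signedFetch } from "./signed-fetch"; // helper from the earlier sketch

export async function findDequeuedMessage(prefix: string, attributeValue: string): Promise<string> {
  const list = await signedFetch("/test/objects", "GET", undefined, {
    prefix: `${prefix}/${attributeValue}-`, // values assumed URL-safe
  });
  const { Contents = [] } = await list.json(); // assumes the proxy maps the XML listing to JSON
  if (Contents.length === 0) throw new Error("No matching message object found");
  return (await signedFetch(`/test/objects/${encodeURIComponent(Contents[0].Key)}`)).text();
}

In practice this lookup would be wrapped in the same kind of polling loop as waitForObject above, since the dequeue lambda writes the object asynchronously.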
The dequeue bucket is configured with a lifecycle policy which expires objects 24 hours after creation.
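A sketch of such a lifecycle configuration (note that S3 evaluates expiration in whole days, rounding to the following midnight UTC, so "24 hours" is approximate):

{
  "Rules": [
    {
      "ID": "ExpireDequeuedMessages",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 1 }
    }
  ]
}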
Conclusion
Advantages of this approach:
- Cloud resources (i.e. infrastructure and services) can be conditionally deployed depending on the environment.
- It is easy to assert against the presence and content of specific messages.
Disadvantages of this approach:
- The complexity of additional cloud infrastructure to engineer and manage.
- This approach may not be suitable for performance testing: it introduces additional load and usage in the account that is unrelated to the system under test, and it adds latency to each request that is not representative of production.
The approach described here introduces additional infrastructure into the deployed solution in non-production environments to enable automated specification-based tests to be written for asynchronous workloads. The test infrastructure provides secure, selective access to queues, buckets and tables which may be inputs and/or outputs for individual processing flows. Access to this infrastructure is controlled both to ensure it can only be consumed by the Test Runner, and to ensure that the Test Runner only has access to the resources on the boundary that it requires, so as to preserve the intent of specification-based testing.
Depending on the system under test, this approach could be considered overkill; however, investments such as this in automated CI-pipeline tests pay off in the long term, as they are a great way of avoiding future regression issues.
Key considerations when setting up specification-based testing in this way include enabling debug-level logging and monitoring in the environments where the tests run, and tuning the timeouts and retry mechanisms the Test Runner uses when polling for asynchronous outputs.
[1] AWS re:Invent 2022 — keynote with Dr Werner Vogels (2022) YouTube. Available at: https://www.youtube.com/watch?v=RfvL_423a-I (Accessed: 04 March 2024).
[2] Gillis, A.S. (2023) What is the principle of least privilege?: Definition from TechTarget, Security. Available at: https://www.techtarget.com/searchsecurity/definition/principle-of-least-privilege-POLP (Accessed: 04 March 2024).
Disclaimer
Note: This article speaks only to my personal views / experiences and is not published on behalf of Deloitte LLP and associated firms, and does not constitute professional or legal advice.
All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.