Scalable Integration Testing Method

Zhenhua Cao
Oct 3, 2018

We do a lot of testing today: unit testing, integration testing, end-to-end testing. While comprehensive test coverage is good, it comes with additional costs, especially when the different types of testing are mostly isolated and use totally different toolchains. With nothing to reuse between them, a tremendous amount of effort is needed to build each type.

In this article, we will describe a method that unifies many of those testing phases with similar components and standards, at only a fraction of the cost of traditional methods.

scalable testing

Personally, I don’t like terms such as “End to End Testing” and “Regression Testing” that much, since they are vague by definition and people understand them very differently. Instead, let us categorize by one factor only: the number of subsystems involved in a test.

A “subsystem” is a commonly agreed, perceivable boundary within the whole system. It can be a service, like the user service, or an application, like the UI or the API Gateway.

If we define two categories, One Subsystem Integration Testing (ST1) and Cross Subsystems Integration Testing (STn), our test cases form the following testing pyramid:

              /\
             /  \
            /    \
           / STn  \    <---------+  How do we
          /        \             |  scale from
         /==========\            |  ST1 to STn?
        /            \           |
       /     ST1      \    <-----+
      /                \
     /==================\
    /                    \
   /      Unit Tests      \    <--- Not in the scope
  /                        \

The number of test cases that fall into the ST1 and STn categories is likely to follow an 80/20 distribution, while the cost of doing ST1 and STn is likely to follow a 20/80 distribution. STn looks quite cost-ineffective: a quarter as many cases cost four times as much to set up and run.
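To make the arithmetic explicit, here is a small sketch that treats the 80/20 and 20/80 splits as exact numbers, although they are only rough estimates. It compares the average cost per test case in each category; the variable and function names are made up for illustration.

```python
# Illustrative numbers only: the 80/20 and 20/80 splits are rough estimates.
percent_of_cases = {"ST1": 80, "STn": 20}
percent_of_cost = {"ST1": 20, "STn": 80}


def cost_per_case(category: str) -> float:
    """Share of total cost paid per unit share of test cases."""
    return percent_of_cost[category] / percent_of_cases[category]


# How much more expensive an average STn case is than an average ST1 case.
ratio = cost_per_case("STn") / cost_per_case("ST1")
print(f"ST1: {cost_per_case('ST1'):.2f}, STn: {cost_per_case('STn'):.2f}, ratio: {ratio:.0f}x")
```

Under these assumed numbers, an average STn case costs about 16 times as much as an average ST1 case, which is why reuse between the two categories matters.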

It is therefore important that, when building single-subsystem test cases, we build something reusable for cross-subsystem test cases. Otherwise we won’t be able to comfortably afford building the whole infrastructure for cross-subsystem tests from scratch.

When we zoom into a specific subsystem and look at its internal technical parts, interestingly, “integration within one subsystem” isn’t that different from “cross subsystems”. Even a single subsystem faces challenges similar to those of cross-subsystem testing.

These observations open a door for us. If we treat smaller-scale and larger-scale testing equally, with one approach that works equally well regardless of scale, then the same testing infrastructure and methodology that we use to build single-subsystem test cases can be used to test more complicated interactions between more subsystems (when we would like to).

The Scalable Method

Here are the key ideas behind the scalable method.

Test against a “complete” environment

Since the method doesn't distinguish ST1 and STn, the first prerequisite would be a complete environment.

The complete environment includes all subsystems (or as many of them as we are able to include).

Those subsystems are configured to connect to each other in a manner almost identical to production, with one exception: only one instance of each subsystem is necessary. This minimization makes it possible to fit the complete environment onto a laptop, which brings a lot of convenience and flexibility.

Even though there are many ways to implement the complete environment, today the most feasible solution is to use Docker.

When subsystems are dockerized, all we need is a lightweight `kube.yml` file to wire them together. We can then run a single command, `kubectl create -f kube.yml` or `kubectl delete -f kube.yml`, to create or destroy the complete environment.
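A test runner can wrap those two commands so the environment is always torn down, even when a test fails. The following is a minimal sketch rather than a prescribed implementation: `complete_environment` and `run_command` are hypothetical names, and the only `kubectl` invocations used are the two shown above.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def run_command(args: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(args, check=True)


@contextmanager
def complete_environment(
    manifest: str = "kube.yml",
    run: Callable[[List[str]], None] = run_command,
) -> Iterator[None]:
    """Create the environment on entry and always destroy it on exit."""
    run(["kubectl", "create", "-f", manifest])
    try:
        yield
    finally:
        # Runs even if the test body raised, so no environment is left behind.
        run(["kubectl", "delete", "-f", manifest])
```

A suite would then run inside `with complete_environment(): ...`. The `run` parameter exists only so the wrapper can be exercised without a real cluster.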

Use the empty state as the baseline

Many of us are familiar with the test fixture, which is pre-populated data that we produce manually. With a test fixture loaded before each test case runs, we can assume that certain business entities are already there, so when writing test cases we can reference an entity by a magic “id” and start from there.

The test fixture is what we are used to, but when we look at this approach closely, it reveals some essential limitations.

  • A fixture is fixed, and limited. There are always cases complicated enough that you have to introduce new fixtures. With more cases, the fixture set keeps growing until it hits the following “roofs”.
  • A fixture is a mental overhead that cannot scale. When the fixture set grows big enough, it becomes hard to memorize, and one has to frequently go back to the DB/UI to look up which entity is behind the magic number mentioned in a test case.
  • A fixture is a technical overhead that cannot scale. When the fixture set grows, the cleanup cost after each test case grows exponentially. In a typical system, restoring a non-empty baseline means that DB tables first need to be truncated and then have fixture data re-injected. Derived storage subsystems, including search servers, Redis queues, and in-memory stores, need to follow the exact same procedure. It is also important to point out that not all of the injected data will be used by every test case, which means significant time is spent on unnecessary I/O.

   DB             Search
+-------+        +-------+      the cost of "wipe-and-seed"
|       | derive |       |   |  grows exponentially with
|-------|  from  |-------|   |  the size of state
| state | <----- | state |   +--------------------
+-------+        +-------+

With the above limitations, it becomes clear that in order to build something “scalable”, we need a different solution.

The first strategic move is to start from an empty baseline. This ensures a constant cleanup cost.

With an empty baseline, initial environment provisioning is easy: just build each subsystem and ensure that when it starts it has no business data in its storage.

Each (storage) subsystem should provide a way to “reset” itself back to the empty baseline state. This method is used to clean up after a test case runs.
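As a minimal sketch of this idea, assume each storage subsystem is represented in the test code by an object with a `reset()` method. `InMemoryDb`, `InMemorySearchIndex`, and `run_isolated` are hypothetical stand-ins; in a real setup `reset()` would call whatever reset mechanism the subsystem exposes.

```python
from typing import Callable, Dict, Iterable, List, Protocol


class Resettable(Protocol):
    def reset(self) -> None: ...


class InMemoryDb:
    """Hypothetical stand-in for a primary store."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def reset(self) -> None:
        self.rows.clear()


class InMemorySearchIndex:
    """Hypothetical stand-in for a derived store."""

    def __init__(self) -> None:
        self.documents: Dict[str, dict] = {}

    def reset(self) -> None:
        self.documents.clear()


def run_isolated(test: Callable[[], None], stores: Iterable[Resettable]) -> None:
    """Run one test case, then return every store to the empty baseline."""
    try:
        test()
    finally:
        # Cleanup does not depend on how much data the test created.
        for store in stores:
            store.reset()
```

Because nothing is re-seeded afterwards, the cleanup step has the same shape for every test case, whatever data that case produced.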

For the sake of local development and complicated test cases, we need to figure out a manageable way of producing data, namely “seed” and “factory”, which we’ll elaborate on later.

Build subsystems with test friendliness in mind

+-----------+
|           | :inspect_port  <--------- Call to Inspect
| Subsystem |
|           | :reset_port    <--------- Call to Reset
+-----------+

Inspectable

In many cases, we need the ability to inspect the state of a subsystem in order to write a complex test case.

For example, say we are working on a test case that triggers an asynchronous data export job and expects the UI to notify the user with an inbox message when the export is ready.

Since the job is highly asynchronous, we would run into a wait cycle. The ideal time to stop waiting is when the downstream “data exporter” shows that its “pending queue size” becomes 0 and “processed queue size” becomes 1.

We can implement the waiting logic in our test case, provided that the “data exporter” offers a way to inspect its status and returns “pending queue size” and “processed queue size” as its key metrics.

This is where “Inspectable” kicks in: it requires the subsystem to provide this kind of support (which can also be very valuable for production debugging).
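The waiting logic could look roughly like the sketch below. It assumes the test has some `inspect` callable that returns the exporter's metrics as a dictionary; the key names `pending_queue_size` and `processed_queue_size` and the helper names are assumptions that mirror the two metrics mentioned above.

```python
import time
from typing import Callable, Dict


def wait_until(
    inspect: Callable[[], Dict[str, int]],
    done: Callable[[Dict[str, int]], bool],
    timeout: float = 10.0,
    interval: float = 0.1,
) -> Dict[str, int]:
    """Poll `inspect` until `done(status)` is true or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = inspect()
        if done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met, last status: {status}")
        time.sleep(interval)


def export_finished(status: Dict[str, int]) -> bool:
    """True once nothing is pending and exactly one job has been processed."""
    # Metric names are assumed for this sketch.
    return status["pending_queue_size"] == 0 and status["processed_queue_size"] == 1
```

Polling a status endpoint with a deadline keeps the test fast when the job finishes early and makes it fail clearly, rather than hang, when the job never finishes.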


Resettable

Let’s use the same example as above. After an export job, the “data exporter” is left in a state where “pending queue size” is 0 and “processed queue size” is 1. Since those metrics are global, this non-empty “processed queue size” becomes a potential problem when the next test runs.

In order to keep test cases isolated from each other, we sometimes need to reset the state of the system under test back to its empty baseline state.

Because specific technologies such as memory-based caching are likely to store something in memory that isn’t accessible from outside, we expect the subsystem to provide a way to reset its state through external calls when necessary.
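On the subsystem side, such support can be very small. The sketch below uses only Python's standard library and, as a simplification, serves both operations from a single port under the hypothetical paths `/_inspect` and `/_reset`. `ExporterState` and the metric names are likewise assumptions made for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ExporterState:
    """Hypothetical in-memory state of a "data exporter" subsystem."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.pending = 0
        self.processed = 0

    def inspect(self) -> dict:
        with self._lock:
            return {"pending_queue_size": self.pending,
                    "processed_queue_size": self.processed}

    def reset(self) -> None:
        with self._lock:
            self.pending = 0
            self.processed = 0


def make_handler(state: ExporterState):
    class TestSupportHandler(BaseHTTPRequestHandler):
        def _send_json(self, code: int, payload: dict) -> None:
            body = json.dumps(payload).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_GET(self) -> None:
            if self.path == "/_inspect":
                self._send_json(200, state.inspect())
            else:
                self._send_json(404, {"error": "not found"})

        def do_POST(self) -> None:
            if self.path == "/_reset":
                state.reset()
                self._send_json(200, state.inspect())
            else:
                self._send_json(404, {"error": "not found"})

        def log_message(self, format: str, *args) -> None:
            pass  # keep test output quiet

    return TestSupportHandler


def start_test_support_server(state: ExporterState, port: int = 0) -> ThreadingHTTPServer:
    """Serve /_inspect and /_reset in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Endpoints like these should only be reachable in test environments, since a reset call discards state.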

Reuse common test components

There are a few components that can be made reusable so developers can write tests more easily and quickly.

  • Factory (for data preparation)
  • Programmable Mock (for limiting the test boundary to a smaller scale)
  • Subsystem Specific Utils

We’ll explain the above reusable components one by one.


Factory

Data preparation is an essential step for any test case.

A known pattern named “Factory” has been widely used in many unit test frameworks.

The factory pattern hides lower level implementation details, by offering a higher level interface, so when writing cases, developers can say:

# language-less, just assume the language supports method calls and somehow allows passing an option map as a parameter
create("user", { "active": true })

# instead of
INSERT INTO `user` (`id`, `name`, `status`) VALUES (1, 'Mary', 'ACTIVE');

The factory pattern also frees people from creating dependent entities manually, so we can stay focused on the final goal. For example:

# the order factory should create the dependent payment and address automatically
create("order")

# instead of
INSERT INTO `payment` (`id`, `type`, `amount`) VALUES (1, 'VISA', 100);
INSERT INTO `address` (`id`, `line1`, `line2`) VALUES (1, '1 Main St', 'Apt 5');
INSERT INTO `order` (`id`, `payment_id`, `address_id`) VALUES (111, 1, 1);

Any implementation will do, as long as it provides an interface like the examples above.
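One possible shape for such an interface is sketched below in Python. It is an illustration under stated assumptions rather than a reference implementation: the registry, the `factory` decorator, the default attribute values, and the `created` list (which stands in for the system under test) are all hypothetical.

```python
import itertools
from typing import Any, Callable, Dict, List, Optional

Attributes = Dict[str, Any]

_ids = itertools.count(1)
_builders: Dict[str, Callable[[Attributes], Attributes]] = {}
created: List[Attributes] = []  # stand-in for the system under test


def factory(name: str):
    """Register a builder function under an entity name."""
    def register(builder: Callable[[Attributes], Attributes]):
        _builders[name] = builder
        return builder
    return register


def create(name: str, options: Optional[Attributes] = None) -> Attributes:
    """Build an entity from defaults plus overrides and record it."""
    attributes = _builders[name](dict(options or {}))
    entity = {"id": next(_ids), "kind": name, **attributes}
    created.append(entity)
    return entity


@factory("user")
def build_user(options: Attributes) -> Attributes:
    return {"name": "Mary", "active": True, **options}


@factory("payment")
def build_payment(options: Attributes) -> Attributes:
    return {"type": "VISA", "amount": 100, **options}


@factory("address")
def build_address(options: Attributes) -> Attributes:
    return {"line1": "1 Main St", "line2": "Apt 5", **options}


@factory("order")
def build_order(options: Attributes) -> Attributes:
    # Create dependencies only when the caller did not supply them.
    if "payment_id" not in options:
        options["payment_id"] = create("payment")["id"]
    if "address_id" not in options:
        options["address_id"] = create("address")["id"]
    return options
```

With this in place, `create("order")` produces a payment and an address first and links them to the order, while `create("user", {"active": False})` overrides a single default.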

For the actual implementation of a factory, there are a few choices:

  • by making API calls (performant and stable)
  • by simulating user clicks on UI (slow but stable)
  • by injecting into storage engine such as DB directly (very performant but also very brittle, since it is highly bound to implementation detail)

The factory developer should balance the need for each concrete factory carefully and make sensible decisions based on actual factory use cases.

Also worth noting: one useful use case of the factory is to provide seed data for local development.

A “seed” is basically a test suite that calls many factories but makes few assertions. The sole purpose of a seeding test suite is to produce a side effect: a system in a certain state, which can then be used for local development.
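A seeding suite can therefore be as plain as the sketch below. It assumes a factory function with the `create(name, options)` shape used in the earlier examples is passed in; `seed_local_shop` and the `user_id` attribute on orders are hypothetical.

```python
from typing import Any, Callable, Dict, List

Create = Callable[..., Dict[str, Any]]


def seed_local_shop(create: Create) -> List[Dict[str, Any]]:
    """Populate an empty environment with a small, browsable data set."""
    buyer = create("user", {"active": True})
    create("user", {"active": False})  # an inactive account to try filters by hand
    orders = [create("order", {"user_id": buyer["id"]}) for _ in range(3)]
    # The only check is that seeding ran to completion; no behaviour is asserted.
    assert len(orders) == 3
    return orders
```

Running it against an environment that starts from the empty baseline leaves behind two users and three orders for a developer to work with.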

Programmable Mock

A programmable mock is a kind of HTTP proxy that can be switched on and off dynamically and returns programmable responses.

# When mock is off, mock acts as a reverse proxy
+----------+   1   +----------+   2   +----------+
|          | ----> |          | ----> |          |
| Consumer |       |   Mock   |       | Provider |
|          | <---- |          | <---- |          |
+----------+   4   +----------+   3   +----------+

# When mock is on
+----------+   1   +----------+       +----------+
|          | ----> |          |       |          |
| Consumer |       |   Mock   |       | Provider |
|          | <---- |          |       |          |
+----------+   2   +----------+       +----------+

When the mock is on, we limit the test boundary to within only one subsystem.

In a test case, as a test precondition, we should be able to call the mock to turn the mock behavior on or off.

Below is a programmable mock server design.

A mock server is a proxy server which is placed in front of the actual service. All traffic to the actual service needs to go through the mock server.

By default the mock server proxies requests to the actual service, but its behavior is programmable through the following mock APIs:

# callers should always point to the mock server, which is preconfigured to proxy to the actual service
GET /users
Host: mock-user-service
# default relay to /users

# turn on mock
POST /_mock?url=/users
Host: mock-service
Body: {"response_code": "200", "response_headers": [], "response_body": "…user list…" }

# turn off mock
POST /_unmock?url=/users
Host: mock-service
Subsystem Specific Utils

Reusable utils for each subsystem have potential uses for both internal and external test scenarios.

For example:

  • a function that simulates login can be reused by all UI integration cases
  • a function checking background queue status can be reused by even cross subsystem test cases

Just try to provide utils as a library function, or a very straightforward HTTP endpoint.

Both would work well.


The scalable method comes with its own complexities and constant overhead.

The environment is inherently brittle, since it requires all components to be functional at all times, and the initial factory setup requires considerable upfront investment.

Keeping the overhead of setting up and running the full environment minimal requires careful optimization, which sets a bar for the engineering team.

The scalable method could offer significant cost advantages if a software organization can form general agreements about the testing practice in advance. But please keep in mind the challenges that come with the practice and proceed with caution.