Running All The Tests At Once… We Can Do It!

Higher development velocity with a radical approach to testing Elixir

Simon Zelazny
Whatnot Engineering
Jul 6, 2022 · 10 min read


[Cover photo: several combine harvesters lined up in a row on a wheat field. Photo by Darla Hueske on Unsplash]

We have a massively concurrent Elixir system at Whatnot. At any given moment, our Elixir-based Live Service will be running auctions, accepting bids, routing chat messages, connecting users via DMs, and processing payments for their awesome new merchandise.

A couple of months ago, we did not have a massively concurrent test suite to match. All our tests executed one by one, and as we added new features with their corresponding test cases, the time it took for the whole suite to complete crept toward the psychological breaking point where we’d be tempted to give up and say: “I’m just going to push the code and let the tests run in CI; I have other things to do.”

It was a very unsatisfying position, especially since we really like testing and feel that a good suite of large tests (a.k.a. black-box, end-to-end, or acceptance tests) does a wonderful job of documenting functionality and guarding against regressions. Our large test suite does not in any way modify the internals of the system. It only uses the public APIs of the service, as a client would, to exercise the user-facing functionality we provide.

When the Live Service is operating under nominal conditions, there is a variety of activities going on at the same time, so why should we settle for exercising features sequentially during local test runs? Let’s have our test suite execute on a system that’s running just like in production, with multiple livestreams going on at once, multiple auctions taking place, and chat messages flying all around! That’s what Elixir does best, isn’t it?

That’s the philosophical argument. There is also a practical one. Many of our tests rely on busy waiting: they verify that asynchronous processes on the server take place and push results onto the websocket. This means that a large chunk of our perceived test duration is spent doing nothing, simply waiting for clients to receive messages. This raises the question: why not do nothing in a massively parallel fashion?
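To make that waiting concrete: the examples below call a helper named wait_for_broadcast, and a minimal sketch of such a helper might look like the following. The implementation is an illustration only; in particular, the message format that the websocket clients forward to the test process is an assumption.

def wait_for_broadcast(clients, event_or_matcher, timeout \\ 5_000)

def wait_for_broadcast(clients, event_name, timeout) when is_binary(event_name) do
  wait_for_broadcast(clients, &match?(%{"event" => ^event_name}, &1), timeout)
end

def wait_for_broadcast(clients, matcher, timeout) when is_function(matcher, 1) do
  # wait until every client has received a message accepted by the matcher,
  # or give up once the shared deadline passes
  deadline = System.monotonic_time(:millisecond) + timeout
  Enum.map(clients, &await_matching(&1, matcher, deadline))
end

defp await_matching(client, matcher, deadline) do
  remaining = max(deadline - System.monotonic_time(:millisecond), 0)

  receive do
    # assumes each websocket client forwards server pushes to the test
    # process, tagged with its own handle
    {^client, event} ->
      if matcher.(event), do: {:ok, event}, else: await_matching(client, matcher, deadline)
  after
    remaining -> {:error, :timeout}
  end
end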

We set out to answer the question in the affirmative, and arrived at a pattern of strictly defined test cases that are isolated and logically concurrent, that is:

  1. They do not depend on any particular data existing or not existing in the system
  2. They have no hidden causal relationship to each other

The rest of this post will talk about how we achieved these postulates in our large test suite. Despite their brevity, these two concepts entail an approach to writing tests that’s a bit different from what you might be used to.

Let’s jump in!

Postulate 1: No data dependencies

This postulate is rather simple to grasp, and doesn’t require special support from the language or tooling. It is a maximalist interpretation of BDD (Behavior-Driven Development) or the AAA (Arrange, Act, Assert) pattern.

We want to: a) set up the exact environment the interaction requires (given), b) perform the interaction (when), and c) check that the results were as expected (then).

Let’s see an example of how we take given-when-then to the extreme:

# given
[{_, seller}, {_, buyer1}, {_, buyer2}] = users = given_users_connected(3)

%{livestream_id: _livestream_id, live_product_id: live_product_id, topic: topic} =
  given_livestream_with_users(users)

given_auction_started(users, topic, live_product_id)

# when
{:ok, _} = place_bid(buyer1, topic, live_product_id, 70)
{:ok, _} = place_bid(buyer2, topic, live_product_id, 80)

# then
assert [{:ok, _}, {:ok, _}, {:ok, _}] =
         wait_for_broadcast(
           [seller, buyer1, buyer2],
           # event name abridged
           &match?(%{"event" => _}, &1)
         )

First, given_users_connected(3) will:

  1. ensure that three completely new user accounts exist in the system,
  2. initialize websocket clients for those three accounts, and
  3. authenticate the clients and connect them to the service.

This also ensures that no other users will be taking part in the interactions that follow, because we simply do not possess the user ids, authentication details, and websocket handles of any other clients that might be connected at the same time. By convention, this function generates N users, of which the first has seller capabilities and the rest are buyers.
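For a flavor of what this given might look like, here is a minimal sketch; create_user/1 and connect_websocket/1 are hypothetical stand-ins for our real internal helpers:

def given_users_connected(n) when n >= 1 do
  # the first user gets seller capabilities, the rest are plain buyers
  [:seller | List.duplicate(:buyer, n - 1)]
  |> Enum.map(fn role ->
    # every invocation creates a brand-new account, so no two tests ever share a user
    user = create_user(role: role, email: "user#{System.unique_integer([:positive])}@example.com")
    client = connect_websocket(user)
    {user, client}
  end)
end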

Next up we have given_livestream_with_users(users). This function takes a list of users (produced by given_users_connected), has the seller start a new livestream via the system’s API, and then sets up a unique product to sell in that livestream. All the other users simply join the livestream and sit there. The function returns the ids of the livestream and the product, and the Phoenix PubSub topic on which livestream events will be processed.

Finally, we have given_auction_started(users, topic, live_product_id), which sets the stage for interactions during an ongoing auction: the seller starts the real-time auction for the product that was just created, and the buyers all wait for the “auction_started” websocket event. When this function returns, we know all our givens are in place, and we are ready to test interactions with the auction.
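A corresponding sketch of this given might look like the following; start_auction/3 is again a hypothetical stand-in, while the “auction_started” event comes straight from the flow described above:

def given_auction_started([{_seller, seller_client} | _] = users, topic, live_product_id) do
  # the seller kicks off the real-time auction for the freshly created product
  {:ok, _} = start_auction(seller_client, topic, live_product_id)

  # every participant waits for the "auction_started" event, so the auction
  # is guaranteed to be running when this function returns
  clients = for {_user, client} <- users, do: client
  true = Enum.all?(wait_for_broadcast(clients, "auction_started"), &match?({:ok, _}, &1))

  :ok
end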

The most important thing to note here is that nowhere do we manually create entities with hard-coded data: there is no user_id or email, no livestream title, nor any other literal identifier in the test. Instead, we bind the identifiers returned by previous givens and pass them on to the next ones.

The sequence could be seen as a nested function structure:

with_fresh_users(fn users ->
  with_fresh_livestream(users, fn livestream ->
    with_fresh_product(users, livestream, fn product ->
      with_started_auction(users, livestream, product, fn auction ->
        # WE HAVE EVERYTHING WE NEED, AND ONLY WE HAVE IT
      end)
    end)
  end)
end)

As long as there are no collisions in the internal functions that generate fresh data, we can be certain that, in the scope of the innermost function, we have access to the exhaustive and exclusive set of entities involved in the test.

This means that our scenarios can really go wild with the entities provided, with no chance of affecting other tests that are running at the same time.

An important implication of this is that we never use the setup block provided by ExUnit, as it creates an implicit data dependency between all tests in a module. It also creates a causal dependency, which we will illustrate next.

Postulate 2: No hidden causal relationships

Now that we know our tests neither depend on preexisting data nor leak their own data into other tests, we could be excused for assuming that this is all we need to run them all concurrently. In a sense, yes. But our system is not a monolithic whole, and interactions with external systems pose a higher hurdle to unlocking logical concurrency.

Setting up controlled responses (fakes, mocks, etc.) to external API services in our tests is a form of hidden causality. Why? Because later tests are influenced by the fake responses set up by earlier tests: there is information being shared across test cases, even if it’s not in the foreground of the interaction itself.

Let’s see an example. Actually, since we’re talking about concurrency and causality, it’s two examples:

a. Testing for an unsuccessful payment

# given (user+livestream setup abridged)
given_auction_started(users, topic, live_product_id)
expect_payment_unsuccessful_for(live_product_id, buyer_id)

# when
{:ok, _} = place_bid(buyer, topic, live_product_id, 1000)

# then
assert [{:ok, _}, {:ok, _}] = wait_for_broadcast([seller, buyer], "payment_failed")

b. Testing for a successful payment
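This case mirrors the one above, with the fake primed for success instead of failure. Roughly (the success event name shown here is a placeholder, not necessarily what the server actually broadcasts):

# given (user+livestream setup abridged)
given_auction_started(users, topic, live_product_id)
expect_payment_successful_for(live_product_id, buyer_id)

# when
{:ok, _} = place_bid(buyer, topic, live_product_id, 1000)

# then ("order_created" stands in for the real success event)
assert [{:ok, _}, {:ok, _}] = wait_for_broadcast([seller, buyer], "order_created")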

As you can see, these two tests are mirror images of each other. In one auction, the winning bid results in a failed payment, while in the other, the payment goes through. The rub here is that payments are processed by another system in the Whatnot infrastructure, called the Main Backend. Live Service makes a POST request to the Main Backend with bid information, and gets back a result that reflects whether the order was processed successfully or not.

The easy way to test external API interactions in the Elixir ecosystem is to set up a temporary fake HTTP handler, using Bypass or an inline Plug module directly in the test body. This approach is unsatisfactory for two reasons:

  1. It requires messing with our system’s Application config, because each new Bypass or Plug needs to listen on a new, free port. This forces us to tell Live Service to talk to the Main Backend API on that particular port, breaking the black-box testing abstraction (a typical setup of this kind is sketched after this list);
  2. If we want to avoid that by always using the same port, we are forced to have only one Bypass instance active at a time.
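For illustration, a one-shot Bypass setup typically looks something like this; the :live_service app name and :main_backend_url config key are made up for the example, but the shape is representative:

setup do
  bypass = Bypass.open()

  # global, shared state: every concurrently running test sees this change
  Application.put_env(:live_service, :main_backend_url, "http://localhost:#{bypass.port}")

  Bypass.expect(bypass, "POST", "/api/order", fn conn ->
    Plug.Conn.resp(conn, 500, ~s({"error": "failed"}))
  end)

  {:ok, bypass: bypass}
end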

If we accept the “gray-box” aspect of 1), we’ll end up in a situation where one test modifies the global Application config in order to set up a mock, thereby messing with other tests that might be using that same config. Those tests will end up making requests to mocks that return the wrong data, or that no longer even exist to reply to HTTP calls.

If we accept the “static port” aspect of 2), we are now forced to sequence our tests. If one test requires the mock to return a 500, and another needs the mock to return 200, the second will have to wait, then reset the mock, and then proceed.

Both of these approaches create a hidden causal force between the two tests, as each test case must somehow take into account the mock responses that were created in the other.

We cannot take either route, as they both lead us away from concurrency.

Persistent fakes with specific expectations

The way to cut this Gordian Knot is to give up on one-shot mocks, and embrace long-running Fakes as our method of testing API interactions. When our system starts up, it is configured to talk to a fake Main Backend. The fake Main Backend itself is explicitly started in test_helper.exs.

Whatnot.FakeMainBackend.start_link()

The way it works is simple: it is an Agent process that keeps some state. It is also a Plug, and implements all the HTTP endpoints that Live Service accesses on the Main Backend. It also has a functional API for setting up particular response data in advance.
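A sketch of how such a module could be laid out follows. This is an illustration rather than our actual code: how the HTTP listener is wired to this Plug and how the state is stored are assumptions.

defmodule Whatnot.FakeMainBackend do
  # A long-running fake: an Agent holding primed responses, plus a Plug that
  # implements the HTTP endpoints Live Service calls on the Main Backend.
  # (An HTTP listener, e.g. Plug.Cowboy, is pointed at this Plug in test.)
  @behaviour Plug

  @agent __MODULE__

  def start_link do
    Agent.start_link(fn -> %{} end, name: @agent)
  end

  @impl Plug
  def init(opts), do: opts

  @impl Plug
  def call(conn, opts) do
    # read the raw body once, then dispatch to an endpoint-specific handler
    {:ok, raw_body, conn} = Plug.Conn.read_body(conn)
    handle_call(conn, raw_body, opts)
  end

  # handle_call/3 clauses for each endpoint live here; the /api/order clause
  # is shown below
end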

Inside this fake Main Backend, we implement a handler for the payment POST call, and it looks like this:

defp handle_call(
       %Plug.Conn{request_path: "/api/order", method: "POST"} = conn,
       raw_body,
       _opts
     ) do
  %{"buyerId" => buyer_id, "productId" => product_id} = Jason.decode!(raw_body)

  case payment_successful?(@agent, product_id, buyer_id) do
    false ->
      Plug.Conn.resp(conn, 500, Jason.encode!(%{"error" => "failed"}))

    true ->
      # order_id generation is omitted from this excerpt
      Plug.Conn.resp(conn, 200, Jason.encode!(%{"id" => order_id}))
  end
end

Now, you can see that the fake server looks at the (product_id, buyer_id) tuple to determine how to respond to the API call at hand. This means that as long as product ids and user ids are unique across tests (and they are, based on our first postulate above), there is no need to coordinate the behavior of the fake. It’s sufficient for a test to set up the expected response of the fake Main Backend in its givens section, and everything will proceed in complete isolation.
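Keyed on that same {product_id, buyer_id} tuple, the expectation-setting API and the lookup can be as small as this (again a sketch; the state layout and the default behaviour are assumptions):

# priming functions called from a test's givens: record the desired outcome
# for one specific {product_id, buyer_id} pair
def expect_payment_unsuccessful_for(product_id, buyer_id) do
  Agent.update(@agent, &Map.put(&1, {product_id, buyer_id}, :failure))
end

def expect_payment_successful_for(product_id, buyer_id) do
  Agent.update(@agent, &Map.put(&1, {product_id, buyer_id}, :success))
end

# the lookup used by the /api/order handler above; defaulting to :success is
# an assumption made for this sketch
defp payment_successful?(agent, product_id, buyer_id) do
  Agent.get(agent, &Map.get(&1, {product_id, buyer_id}, :success)) == :success
end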

That means that our two test cases above, each expecting a different response from our fake, can run concurrently without stepping on each other’s toes:

# test 1
expect_payment_unsuccessful_for(unique_live_product_id0, unique_buyer_id0)

# test 2
expect_payment_successful_for(unique_live_product_id1, unique_buyer_id1)

Note that our fake still has to handle the API calls in sequence, as it relies on a single Agent state, but in practice this is not noticeable, and we could always move the data to an ETS table for more parallelism if desired.
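Should the single Agent ever become a bottleneck, the same expectations could be backed by a public ETS table instead; a sketch of the idea (the table name is made up):

# created once when the fake starts; :public lets every test process write
# expectations without funnelling through a single owning process
:ets.new(:fake_main_backend, [:set, :named_table, :public,
  read_concurrency: true, write_concurrency: true])

# priming an expectation becomes a plain insert...
:ets.insert(:fake_main_backend, {{product_id, buyer_id}, :failure})

# ...and the HTTP handler reads the outcome directly, with no Agent call
match?([{_, :failure}], :ets.lookup(:fake_main_backend, {product_id, buyer_id}))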

Support from the language: Running all the tests at once

We’ve gone through considerable effort to ensure that our test cases are independent in terms of data isolation and causality. We have made them logically concurrent. So the final question is: how do we actually execute the tests in parallel?

There is a bit of subtlety involved. You probably know that the ExUnit.Case macro allows us to declare that the current module can be run concurrently with respect to other modules, using the flag async: true.

But how do we declare the tests within a single module to be causally independent and capable of parallel execution?

Sometimes a programmer writes elegant code. And sometimes we have to write macros. The hacker’s way to defeat ExUnit’s module-level concurrency paradigm is to lean into the module-level concurrency paradigm, and declare every test in its own module. Like so:

defmodule Test.Large.AuctionTest do
  import ParallelTest

  p_test "successful payment on highest bid" do
    # test here
  end

  p_test "unsuccessful payment on highest bid" do
    # another test here
  end
end

Where ParallelTest is a dead-simple module generator:

defmodule ParallelTest do
  defmacro p_test(name, do: block) do
    quote do
      # some details omitted
      defmodule :"#{unquote(name)}" do
        use ExUnit.Case, async: true

        test "" do
          unquote(block)
        end
      end
    end
  end
end

Now, with every single test living in its own dedicated module, and all these modules declared as async: true for ExUnit’s test runner, we can simply sit back and let mix test run all our tests at once, with as much parallelism as our hardware can provide.
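One knob worth knowing about: ExUnit limits how many async modules run at the same time via its max_cases setting, which defaults to twice the number of online schedulers. With one module per test, raising it in test_helper.exs lets a mostly-waiting suite overlap more aggressively; the multiplier below is just an example, not a recommendation:

# test_helper.exs
ExUnit.start(max_cases: System.schedulers_online() * 8)

# the long-running fake from the previous section
Whatnot.FakeMainBackend.start_link()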

There are some developer usability caveats to generating fresh modules for every test, but our in-house Elixir magician Rafal Studnicki has done the work necessary to ensure that in future Elixir releases, test-per-module suites will be first-class citizens in ExUnit.

Summing it up

Thanks for reading this far! It’s been quite a long blog post, but the nature of concurrency is notoriously tricky, and the work of testing is notoriously laborious.

The engineering journey for our team here at Whatnot started with the notion that our large tests were taking too long, and ended in a renewed appreciation for both the theory and practice of concurrent programming in Elixir.

In terms of real-world results, here are the times of running the tests one at a time vs. all at once.

If you’re interested and motivated by this type of work, please reach out. We’re hiring!
