Stop Letting Your Tests “Make” Your Luck

Using random data generation to improve your testing methodologies

Asaf Manshary
6 min read · Feb 8, 2021


Writing and maintaining tests is tedious, manual work that consumes a fair amount of our time.

At Riskified, we wrote a small Scala testing framework that automatically generates random inputs. It makes our lives a bit easier, leaving us free to focus on what really matters in our tests.

We’ve created a library called “RandomObject” as part of our Scala infrastructure. It allows us to write maintainable tests quickly by focusing on a subset of fields in each test: if a certain field is irrelevant to a test case, randomly generated data is used for it.

In this blog post, we will go over the steps needed to write your own RandomObject and how to use it in your unit tests and in integration tests for Kafka consumers.

In tests that are based on randomly generated data, it might seem we are leaving things to luck, but we are actually doing the opposite. We are trying to reach full coverage deterministically while letting ‘lady luck’ partially choose our input, as Batman’s Two-Face does with his fixed coin.


First things first: just what is RandomObject? It is a library that generates random data based on a case class signature. Before writing it together step by step, let’s see how it can be used:

Let’s assume you want to verify that your serialization framework serializes and deserializes data without loss. Since you want to test the framework itself, you can use a test object — let’s name it Foo — and then validate that the values of Foo are preserved after a serialization round trip. This is a classic scenario for using our RandomObject:
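
A minimal sketch of such a round-trip test might look like this; Foo’s fields and the toy pipe-delimited serializer are illustrative stand-ins, not Riskified’s actual API:

```scala
import scala.util.Random

// Illustrative test object (field names are assumptions).
case class Foo(id: Long, name: String, tags: List[Int])

// Stand-in generator; the real RandomObject derives this automatically.
def randomFoo(): Foo =
  Foo(Random.nextLong(),
      Random.alphanumeric.take(8).mkString,
      List.fill(3)(Random.nextInt(100)))

// Toy serializer standing in for the framework under test.
def serialize(f: Foo): String = s"${f.id}|${f.name}|${f.tags.mkString(",")}"
def deserialize(s: String): Foo = s.split('|') match {
  case Array(id, name, tags) =>
    Foo(id.toLong, name, tags.split(',').map(_.toInt).toList)
}

// The actual assertion: values survive the round trip.
val input = randomFoo()
assert(deserialize(serialize(input)) == input, "round trip must preserve values")
```

The test never spells out concrete field values; any regression in the (de)serializer surfaces regardless of what the generator happened to pick.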

What about a more complicated test case?

We have taken a look at basic usage as part of unit tests, but you can also use it for other use cases, such as testing Kafka pipeline end-to-end.

You can use randomly generated input to check whether the services are communicating correctly. To do so, you need to generate multiple different inputs that will trigger the pipeline; doing this manually is tedious and time-consuming.

First, let’s prepare a helper function that generates random data, produces it to a certain topic, and returns the generated data as output, to be compared with the results later on. (In the example below, we use a generic Encoder type class that adds a JSON-encoding extension to a class.)
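
A sketch of such a helper; the Encoder trait and the `publish` callback are assumptions that stand in for the real JSON type class and the Kafka producer, which keeps the example runnable without a broker:

```scala
import scala.util.Random

// Assumed Encoder type class: adds a JSON encoding to a class.
trait Encoder[A] { def toJson(a: A): String }

// Hypothetical helper: generates a random event, publishes its JSON form to
// the given topic, and returns the generated value so the test can compare
// it with the pipeline's results later on.
def produceRandom[A](topic: String,
                     generate: () => A,
                     publish: (String, String) => Unit)
                    (implicit enc: Encoder[A]): A = {
  val event = generate()
  publish(topic, enc.toJson(event))
  event
}

// Demo with an illustrative event and a stubbed broker.
case class EventA(id: Long, name: String)
implicit val eventAEncoder: Encoder[EventA] =
  (a: EventA) => s"""{"id":${a.id},"name":"${a.name}"}"""

val sent = scala.collection.mutable.Buffer.empty[(String, String)]
val produced = produceRandom[EventA](
  "topic-a",
  () => EventA(Random.nextLong(), Random.alphanumeric.take(5).mkString),
  (topic, payload) => sent += ((topic, payload)))
```

Injecting `publish` as a function is a deliberate choice: in production it wraps the real producer, while in the sketch above an in-memory buffer records what was sent.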

Now, you can easily write test cases that publish your data to Kafka. For our example, we will use a naive test case: produce messages and read them back from the same topic.

In the example below, assertResults compares the results with the generated data, and readResultsFromKafkaTopic reads messages from the Kafka topic:
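
A self-contained sketch of the naive test case; `readResultsFromKafkaTopic` and `assertResults` are hypothetical stubs matching the helper names above, backed by an in-memory buffer instead of a real broker:

```scala
import scala.util.Random

// Illustrative event model.
case class EventA(id: Long)

// In-memory stand-in for a Kafka topic.
val topicBuffer = scala.collection.mutable.Buffer.empty[EventA]

def publishToKafka(topic: String, event: EventA): Unit = topicBuffer += event

// Stub: in real code this would poll a consumer until `expectedCount`
// messages arrive (or a timeout fires).
def readResultsFromKafkaTopic(topic: String, expectedCount: Int): Seq[EventA] =
  topicBuffer.take(expectedCount).toSeq

def assertResults(expected: Seq[EventA], actual: Seq[EventA]): Unit =
  assert(expected == actual, "consumed events must match produced events")

// The naive test case: produce random events, read them back, compare.
val generated = (1 to 5).map(_ => EventA(Random.nextLong()))
generated.foreach(publishToKafka("events", _))
val results = readResultsFromKafkaTopic("events", generated.size)
assertResults(generated, results)
```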

Now, let’s define our data models, EventA and EventB:
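
The real models are domain-specific; an illustrative pair might look like this (the sealed-trait hierarchy and field names are assumptions):

```scala
// Illustrative data models for the pipeline's input and output events.
sealed trait Event
case class EventA(id: Long, amount: BigDecimal) extends Event
case class EventB(id: Long, reason: String)     extends Event
```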

As you have probably noticed, we can now easily test our Kafka pipelines as a black-box with randomly generated data. This can be easily extended to test a whole data pipeline by inserting a message to the first topic and then finding the correlated message in the last topic of the pipeline.

Now that you know the benefits of using RandomObject, let’s write it together.

Writing our new random input generator

As I’ve mentioned before, we write in Scala, a statically typed language, which allows us to implement logic based on ‘type class derivation.’

In a nutshell, it’s a way to automatically generate a type class instance for a type, based on that type’s signature, when certain conditions are satisfied. In our case, we will implement a ‘RandomObject’ type class that leverages type class derivation to generate random data. Don’t worry, we’ll do it together :)

We will start by defining a type class called ‘RandomObject’ that will serve as a small ‘generator,’ and then provide instances for multiple types. The example below details the core of our ‘RandomObject’ implementation.

  1. Note that the base implementation is a mere trait (interface); you will need to add a concrete implementation for each type you want to support. Without an implementation for a certain type, compilation will fail and produce an error message.
  2. The implementations of each supported type will be part of RandomObject’s companion object.
  3. In this section, you can see multiple examples of implementations for certain types — for example, native types such as Long and Int, or the Scala type FiniteDuration. Don’t forget to add an implementation for any native or user-defined classes you want to use.
  4. You need to support Scala collections — List-like and Map-like generic types. For that, you can use the CanBuildFrom type class from ‘scala.collection.generic,’ which provides an API for creating generic collections. The sample provides an example of a random List-like generator, from which you can easily infer the Map-like implementation.
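
Putting points 1–4 together, a sketch of the core might look like this. Names are assumptions, and the collection instance is simplified to List (in Scala 2.12 it generalizes via CanBuildFrom, in 2.13 via scala.collection.Factory):

```scala
import scala.concurrent.duration._
import scala.util.Random

// 1: the base trait; every supported type needs a concrete instance.
trait RandomObject[A] {
  def generate(): A
}

// 2: instances live in the companion object, so they are always in
// implicit scope.
object RandomObject {
  // Convenience summoner: RandomObject.random[Foo]
  def random[A](implicit r: RandomObject[A]): A = r.generate()

  // 3: manual instances for native types and FiniteDuration.
  implicit val longRandom: RandomObject[Long]     = () => Random.nextLong()
  implicit val intRandom: RandomObject[Int]       = () => Random.nextInt()
  implicit val stringRandom: RandomObject[String] =
    () => Random.alphanumeric.take(10).mkString
  implicit val durationRandom: RandomObject[FiniteDuration] =
    () => Random.nextInt(1000).millis

  // 4: a List-like instance built from the element's instance.
  implicit def listRandom[A](implicit elem: RandomObject[A]): RandomObject[List[A]] =
    () => List.fill(Random.nextInt(5) + 1)(elem.generate())
}

// Usage: a random, non-empty list of random Ints.
val xs = RandomObject.random[List[Int]]
```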

What about more complicated classes?

Now that we’ve covered Scala native types, what about more complicated classes, such as case classes or traits?

This is a little more complex, as you need to inspect the class’s signature at compile time.

We use the Magnolia library, which does this rather elegantly; after testing both Shapeless and Magnolia, we saw faster compilation times with the latter.

In the following code, we relied on Magnolia’s derivation, which uses ‘combine’ and ‘dispatch’ method implementations:

  1. ‘combine’ creates an instance of the type class based on Magnolia’s ‘CaseClass’ API
  2. ‘dispatch’ chooses the ‘correct’ subtype to handle the input — relevant for sealed traits
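
A sketch of the derivation trait against Magnolia’s Scala 2 API (the pre-Scala-3 `magnolia` artifact); the RandomObject trait is repeated so the snippet stands alone:

```scala
import scala.language.experimental.macros
import scala.util.Random
import magnolia._

trait RandomObject[A] { def generate(): A }

trait RandomObjectDerivation {
  type Typeclass[T] = RandomObject[T]

  // combine: build an instance for a case class by generating every field
  // through that field's own Typeclass instance, then constructing T.
  def combine[T](ctx: CaseClass[Typeclass, T]): Typeclass[T] =
    () => ctx.construct(param => param.typeclass.generate())

  // dispatch: for a sealed trait, pick one subtype at random and delegate
  // to that subtype's instance.
  def dispatch[T](ctx: SealedTrait[Typeclass, T]): Typeclass[T] =
    () => {
      val subtypes = ctx.subtypes
      subtypes(Random.nextInt(subtypes.size)).typeclass.generate()
    }

  implicit def gen[T]: Typeclass[T] = macro Magnolia.gen[T]
}
```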

In the previous example, derivation is defined as a dedicated trait, so make sure RandomObject’s companion object extends this trait. You need a separate trait to ensure derived instances have a lower implicit-lookup priority than the manual instances defined directly in the companion object.

Don’t forget to equip the ‘RandomObject’ typeclass with the generic derivation solution we’ve just crafted:
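
A wiring sketch; the derivation trait body is stubbed with `???` here so the snippet stands alone (in the real code it is the Magnolia macro), but the priority behavior is the point — the manual Int instance in the companion body wins over the inherited derived one:

```scala
import scala.util.Random

trait RandomObject[A] { def generate(): A }

// Stand-in for the Magnolia-backed derivation trait.
trait RandomObjectDerivation {
  implicit def derived[T]: RandomObject[T] = ???
}

// The companion mixes in the derivation trait; instances defined directly
// in the companion body have higher implicit priority than inherited ones.
object RandomObject extends RandomObjectDerivation {
  def random[A](implicit r: RandomObject[A]): A = r.generate()

  implicit val intRandom: RandomObject[Int] = () => Random.nextInt()
}

// Resolves to `intRandom`, not the ??? stub, proving the priority ordering.
val n = RandomObject.random[Int]
```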

Error handling

We’re going to go slightly off track now and discuss how to write a meaningful compilation error.
The default error for a missing implicit instantiation is ‘could not find implicit value for parameter rnd: RandomObject[A],’ which is rather ambiguous because it doesn’t specify what exactly ‘A’ is.

To solve this, you can provide your own, much clearer message using the ‘implicitNotFound’ annotation: add it to the RandomObject trait itself:
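
A sketch of the annotation; the message text is an assumption (use whatever wording helps your team), and `${A}` is interpolated with the concrete type that is missing an instance:

```scala
import scala.annotation.implicitNotFound

@implicitNotFound("Could not find RandomObject[${A}]. Add an implicit RandomObject instance for ${A}, or make sure all of its fields are supported.")
trait RandomObject[A] { def generate(): A }

// Summoning an instance for an unsupported type now fails to compile with
// the message above. A supported type still works as usual:
implicit val intRandom: RandomObject[Int] = () => 42 // fixed value for demo
assert(implicitly[RandomObject[Int]].generate() == 42)
```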

This results in a better, more detailed error message. You can now easily know what specific type is missing and supply a RandomObject implementation for it.

What happens when we want partially randomized data?

Now that you have a random data generator, let’s look at one last example. Perhaps you would like to set an exact value for a subset of the input in a certain test case, while the rest remains random. Just like Two-Face and his fixed coin, we will ‘fix’ portions of the input. Consider the case class you want to test:
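
An illustrative shape for such a case class (the field names are assumptions inferred from the discussion):

```scala
// One field whose value matters to the test, plus an `ints` list that is
// irrelevant to this test case.
case class External(value: String, ints: List[Int])
```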

You want to check how your system interacts with an input with a certain value for an External instance, while ‘ints’ is irrelevant for this test case.

A ‘basic’ approach would look like this:
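
A sketch of the ‘basic’ approach, where every field must be spelled out by hand:

```scala
case class External(value: String, ints: List[Int])

// Every field is written out manually, including the irrelevant `ints`.
val input = External(value = "value-under-test", ints = List(1, 2, 3))
assert(input.value == "value-under-test")
```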

You need to maintain a value for the list of ‘ints’ in each test case that uses ‘External,’ even when it is not being used. With RandomObject it would look like this:
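
A sketch using a stand-in for the derived generator together with the case class’s copy method:

```scala
import scala.util.Random

case class External(value: String, ints: List[Int])

// Stand-in for the derived RandomObject generator.
def randomExternal(): External =
  External(Random.alphanumeric.take(8).mkString, List.fill(3)(Random.nextInt()))

// Only the field under test is pinned; the irrelevant `ints` stays random.
val input = randomExternal().copy(value = "expected-value")
assert(input.value == "expected-value")
```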

We have made progress, but we still have boilerplate to maintain. To reduce it further, I suggest a library named Quicklens. It provides a DSL to modify a field in a nested structure, and reduces boilerplate, allowing the test writer to focus only on what is relevant for each test case:
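
A sketch with Quicklens (`com.softwaremill.quicklens`); the nested structure and the generator stand-in are illustrative:

```scala
import com.softwaremill.quicklens._
import scala.util.Random

// Illustrative nested structure.
case class Inner(value: String)
case class External(inner: Inner, ints: List[Int])

// Stand-in for the derived RandomObject generator.
def randomExternal(): External =
  External(Inner(Random.alphanumeric.take(8).mkString),
           List.fill(3)(Random.nextInt()))

// Quicklens pins one nested field in a single line; everything else,
// including the irrelevant `ints`, stays random.
val input = randomExternal().modify(_.inner.value).setTo("expected-value")
assert(input.inner.value == "expected-value")
```

With plain `copy`, pinning `inner.value` would require nested copies (`x.copy(inner = x.inner.copy(value = ...))`); Quicklens collapses that boilerplate into one expression.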

Now each test case focuses on the values being tested, which means that updating a field in a case class or changing a certain flow requires adding or altering only the specs that are relevant to the change.

Conclusion and things to consider

In this post, we’ve written a random data generator for tests and discussed examples of its usage. At Riskified, this allowed us to write better tests faster and focus on the actual logic. A similar approach can be found in ScalaCheck’s Arbitrary, which comes as part of its ecosystem, but we wanted to stay decoupled from any specific testing framework. I highly recommend implementing such internal frameworks if you want your testing efforts to be efficient and short. Providing a robust testing infrastructure will allow your team to work and address issues faster and, most importantly (at least to me), will result in less maintenance-heavy boilerplate.