Making RAPIDS cuDF IO Readers and Writers More Reliable with Fuzz Testing

Published in RAPIDS AI, May 18, 2021

IO readers and writers contain some of the most complex code in RAPIDS cuDF. They also support a wide variety of parameters, which multiplies the number of combinations that have to be tested. As a result, IO readers and writers are hard to test exhaustively with ordinary unit tests on a per-PR basis. We therefore needed an additional testing framework that covers these readers and writers comprehensively with randomly generated data and parameters. To accomplish this, we implemented fuzz testing.

What is fuzz testing?

Fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to an API. In cuDF, the random data has different characteristics such as varying levels of nesting, cardinality, and the presence of null values. The program is then monitored for exceptions such as crashes, failing built-in code assertions, or potential memory leaks.
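To make the idea of "random data with different characteristics" concrete, here is a minimal sketch using only the Python standard library. It is not cuDF's data generator; the function names `random_column` and `random_table` and their arguments are hypothetical, chosen to mirror the characteristics named above (nesting, cardinality, nulls).

```python
import random


def random_column(rng, n_rows, cardinality, null_frequency, max_depth):
    """Build one column of random values, optionally nested in lists."""
    # Scalars are drawn from a fixed pool, so the column has at most
    # `cardinality` distinct non-null scalar values.
    pool = [rng.randint(-1000, 1000) for _ in range(cardinality)]

    def value(depth):
        if rng.random() < null_frequency:
            return None  # a null at any nesting level
        if depth > 0:
            return [value(depth - 1) for _ in range(rng.randint(0, 3))]
        return rng.choice(pool)

    return [value(max_depth) for _ in range(n_rows)]


def random_table(seed, n_rows=20, n_cols=3):
    """Build a dict of columns; the same seed always yields the same table."""
    rng = random.Random(seed)
    table = {}
    for i in range(n_cols):
        table[f"col_{i}"] = random_column(
            rng,
            n_rows,
            cardinality=rng.randint(1, 10),
            null_frequency=rng.choice([0.0, 0.1, 0.5]),
            max_depth=rng.randint(0, 2),
        )
    return table
```

Driving everything from one seeded `random.Random` instance is the design choice that matters here: it is what lets a failing input be regenerated later from nothing more than the seed.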

How does cuDF use fuzz testing?

cuDF currently has fuzz tests for all the IO APIs, i.e., the readers and writers of all formats. Each test runs indefinitely against a specific API with a wide variety of randomly generated parameter combinations, and ends only when a user manually interrupts the process. The cuDF fuzz testing framework also provides a way to reproduce test failures.
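The run-until-interrupted loop described above can be sketched in a few lines. This is an illustration of the pattern under our own assumptions, not the cuDF framework's code: `sample_params`, `fuzz_loop`, and the `max_iterations` argument are hypothetical names, and the parameter names in the usage example are only illustrative.

```python
import itertools
import random


def sample_params(rng, param_space):
    """Pick one random value for each parameter name."""
    return {name: rng.choice(list(values)) for name, values in param_space.items()}


def fuzz_loop(test_fn, param_space, max_iterations=None):
    """Call test_fn with fresh random parameters until interrupted.

    Returns a list of (seed, params, error) tuples, one per failed iteration.
    With max_iterations=None the loop only ends on Ctrl-C.
    """
    failures = []
    counter = itertools.count() if max_iterations is None else range(max_iterations)
    try:
        for _ in counter:
            seed = random.randrange(2**32)  # recorded so the run can be replayed
            rng = random.Random(seed)
            params = sample_params(rng, param_space)
            try:
                test_fn(rng, **params)
            except Exception as err:
                failures.append((seed, params, err))
    except KeyboardInterrupt:
        pass  # manual interruption ends the run
    return failures
```

A bounded run, for example `fuzz_loop(my_test, {"compression": ["snappy", None]}, max_iterations=100)`, is handy for trying the loop out; because the parameters are drawn from a generator seeded per iteration, `sample_params(random.Random(seed), param_space)` reproduces the parameters of any recorded failure.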

Results

In its early stages, fuzz testing our IO readers and writers has found 58 issues in total, of which 15 were either data corruption or unrecoverable errors. Of the 58 issues, 46 have been resolved so far as part of this effort, making cuDF IO readers and writers even more reliable.

How are IO fuzz tests in cuDF currently implemented?

To write a fuzz test for a specific API, we need to define a class dedicated to that API that inherits from the `IOFuzz` base class.
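The base-class pattern itself can be sketched without cuDF. The classes below are hypothetical stand-ins (`ToyIOFuzz`, `ToyCsvWriterFuzz`, and the `generate_input` method are names we made up for this illustration); they do not reproduce the real `IOFuzz` interface, only the idea that shared seeding logic lives in the base class and each API-specific subclass decides what input to generate.

```python
import random


class ToyIOFuzz:
    """Hypothetical stand-in for a fuzzing base class; not cuDF's IOFuzz."""

    def __init__(self, seed=None, max_rows=100):
        # Keep the seed on the instance so a failure can be reported with it.
        self.seed = random.randrange(2**32) if seed is None else seed
        self._rng = random.Random(self.seed)
        self.max_rows = max_rows

    def generate_input(self):
        """Subclasses return one randomly generated input for their API."""
        raise NotImplementedError


class ToyCsvWriterFuzz(ToyIOFuzz):
    """Generates random rows, some with null values, for a CSV-style writer."""

    def generate_input(self):
        n_rows = self._rng.randint(1, self.max_rows)
        return [
            {"id": i, "value": self._rng.choice([None, self._rng.random()])}
            for i in range(n_rows)
        ]
```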

For example, a class we currently use to fuzz test the parquet writer is defined as shown below:

Writing a fuzz test for the parquet writer without any params is as simple as:

If you would like to include parameter combinations in the fuzz test, you can do so by passing the desired parameters to `params`, like this:

The command to initiate fuzz tests for `parquet_writer_test` is:

The above command should produce an output similar to the following:

Fuzz testing of this single API will keep running continuously until it is manually aborted.
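For readers who want to see how a decorator can tie a data generator and a `params` dictionary to a test function, here is a small self-contained sketch. It is not the `pythonfuzz` implementation: `toy_fuzz`, its `data_handle` and `runs` arguments, and the example test are all hypothetical, and the loop is bounded by `runs` rather than running until aborted so that it terminates on its own.

```python
import functools
import random


def toy_fuzz(data_handle, params=None, runs=10):
    """Hypothetical decorator: feed generated data and random params to a test."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def runner(seed=0):
            rng = random.Random(seed)
            for _ in range(runs):
                data = data_handle(rng)
                # One random value per parameter name for this iteration.
                kwargs = {k: rng.choice(list(v)) for k, v in (params or {}).items()}
                test_fn(data, **kwargs)
            return runs  # number of iterations that completed

        return runner

    return decorator


def random_rows(rng):
    """Toy data generator: a short list of random integers."""
    return [rng.randint(0, 9) for _ in range(rng.randint(1, 5))]


@toy_fuzz(data_handle=random_rows, params={"chunk_size": [1, 2, 4]})
def chunking_roundtrip_test(rows, chunk_size):
    """Split rows into chunks and check that joining them restores the input."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    assert [x for chunk in chunks for x in chunk] == rows
```

Calling `chunking_roundtrip_test(seed=1)` runs ten iterations, each with freshly generated rows and a randomly chosen `chunk_size`.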

Test failures

Each test failure always produces two files:

`<timestamp>_crash.json`: This file contains the seed and all the parameters needed to reproduce the exact same error.
`<timestamp>_crash.log`: This file contains the complete stack trace of the error with which the test case failed.
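Writing such a pair of files takes only a few lines of standard-library Python. The sketch below shows one way to do it under our own assumptions; `save_crash` and the JSON keys `seed` and `params` are hypothetical and need not match the layout the cuDF framework actually writes. The timestamp format is also a choice made here to keep the file names portable across file systems.

```python
import json
import traceback
from datetime import datetime
from pathlib import Path


def save_crash(crash_dir, seed, params, error):
    """Write <timestamp>_crash.json and <timestamp>_crash.log for one failure.

    `params` must be JSON-serializable. Returns the two paths.
    """
    crash_dir = Path(crash_dir)
    crash_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S.%f")

    json_path = crash_dir / f"{stamp}_crash.json"
    log_path = crash_dir / f"{stamp}_crash.log"

    # Everything needed to rerun the failing case goes in the JSON file.
    json_path.write_text(json.dumps({"seed": seed, "params": params}, indent=2))
    # The full stack trace goes in the log file.
    log_path.write_text(
        "".join(traceback.format_exception(type(error), error, error.__traceback__))
    )
    return json_path, log_path
```

Keeping the replay data and the stack trace in separate files means a tool can load the JSON directly while a developer reads the log.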

A sample failure gets logged as follows:

The crash files are generated in a human-readable format so the developer can easily reproduce the failure on their dev setup:

2021-04-19 09:47:20.623960_crash.json & 2021-04-19 09:47:20.623960_crash.log:

Crashes of this kind can also be easily reproduced by passing the `regression` and `dir` parameters to the `pythonfuzz` decorator as shown below:

Then simply rerun the same test; this time, only the parameter combinations that resulted in test failures are run:
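The regression idea can be illustrated with a small stand-alone sketch: read every crash record in a directory and rerun the test with exactly the recorded seed and parameters. This is not how the `pythonfuzz` decorator is implemented; `replay_crashes` is a hypothetical helper, and it assumes each `*_crash.json` file holds a JSON object with `seed` and `params` keys.

```python
import json
from pathlib import Path


def replay_crashes(test_fn, crash_dir):
    """Rerun test_fn once per *_crash.json file in crash_dir.

    Returns the names of the crash files whose recorded case still fails.
    """
    still_failing = []
    for path in sorted(Path(crash_dir).glob("*_crash.json")):
        record = json.loads(path.read_text())
        try:
            test_fn(seed=record["seed"], **record["params"])
        except Exception:
            still_failing.append(path.name)
    return still_failing
```

An empty return value means every previously recorded failure now passes, which is a convenient check to run after a fix lands.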

Conclusion

Fuzz testing was recently introduced in cuDF. The fuzz testing framework is designed to support wide-ranging customization of data generation, parameter combinations, and logging, along with reproducibility. Try it out and let us know if you have any feature requests or problems with the framework by raising an issue on our GitHub repository. As always, code contributions are welcome too!