Fuzzing Fuzzers, Reducing Reducers and Shrinking Shrinkers

GLFuzz takes a graphics shader and generates a semantically equivalent shader by applying a whole load of transformations. If the equivalent shader triggers a bug in the graphics driver, for example making it crash:

or render a totally different image compared with the original shader:

then GLFuzz reduces the equivalent shader, greedily reversing transformations to try to find a much smaller shader that is still equivalent and still exposes the bug.

This business of randomly generating inputs and then reducing or shrinking bug-inducing inputs is used a lot in testing, by tools such as Csmith/C-Reduce for compiler testing, and property-based testing tools like QuickCheck and Hypothesis.

Of course, sometimes the testing tool, and not the system under test, is to blame.

In the context of GLFuzz

A bug in our shader transformation engine can lead to us generating a supposedly equivalent shader that is in fact not equivalent. Let’s call this a fuzzer bug. On the other hand, when GLFuzz generates a truly equivalent shader that triggers a graphics driver bug, the GLFuzz reducer might be erroneous — while shrinking the shader down to a small example, a buggy reduction step might lead to a smaller shader that is not equivalent to the original. Let’s call this a reducer bug.

Finding fuzzer bugs

We have two ways of approaching this: ahead of time testing, and in the wild testing.

Ahead of time

We use swiftshader, a software renderer from Google, as part of our continuous integration. We generate families of shaders that should be equivalent and flag up cases where swiftshader renders significantly different images. The problem is, swiftshader is not bug-free. At time of writing there is an outstanding non-termination bug in swiftshader that we trigger fairly easily, so that we have to have a blacklist of shaders that are known to be untestable using swiftshader.

In general, this method of testing a fuzzer depends on there being a trustworthy oracle for the inputs the fuzzer generates. The problem of the oracle being buggy is well-known, and can lead to what John Regehr calls a “virtuous cycle” where the fuzzer and oracle find bugs in one another until a fixpoint of reliability is reached. This nice paper describes how Csmith and Frama-C were used to test one another. (And also see this follow-up.)

In the wild

We test many graphics drivers from a lot of vendors with GLFuzz. Sometimes we find a particular equivalent shader that claims to find a bug in every driver under test. When this happens, we’re pretty sure that the shader is in fact not equivalent to its seed shader, indicating a bug in GLFuzz’s generator, so we investigate.

Both approaches to detecting fuzzer bugs are instances of “fuzzing the fuzzer”. The first case is classic testing using a reference implementation. The second case is an inverted form of differential testing, where we are suspicious of cases where different implementations agree that a result is unexpected.

Reducing fuzzer bugs

The problem with fuzzer bugs is that the shaders GLFuzz generates can be really large. It’s basically impossible to look at a 100K transformed shader and figure out what is wrong with it.

Thankfully, reducing fuzzer bugs is pretty easy because we already have a reduction engine to reduce shaders that expose bugs in drivers. What we do is reduce the suspicious shader with respect to one of the drivers under test (or using swiftshader). This leads to a minimal example exposing the cause of the discrepancy — a left-over transformation that has been applied in a semantics-changing manner. This usually points pretty directly to the erroneous transformation, and often gives an indication of what is wrong with the transformation.

Detecting reducer bugs

We run the reducer when we have found a shader S that should be equivalent to a shader T, and yet yields a different image when both shaders are rendered using graphics driver D.

A reducer bug manifests as a chain of shaders, S = S_1, S_2, S_3, …, of decreasing size, such that:

  • A prefix of shaders, S1, S2, …, Sk, are truly equivalent to T, but yield a different image to T when rendered via driver D;
  • A shader S{k+1} that is not equivalent to T, and legitimately leads to a different image to T when rendered via driver D;
  • A “doomed” suffix of shaders S{k+2}, S{k+3}, … Sn that inherit S{k+1}’s deficiency.

As users of the reducer, we quickly know that something is wrong: we look at the final shader Sn and see in an instant that it has some difference from T that is evidently going to change the image that is rendered. For example, Sn might be the same as T but omit a critical statement.

Reducing reducer bugs

The problem with a reducer bug is that while we have the chain of shaders S1, S2, …, we don’t know the location of the shader Sk that is to blame: we know that at some point in the chain the shaders stop being equivalent to T, but we don’t know where.

To investigate this, we have a little tool called ReducerBugPoint (name inspired by LLVM’s bugpoint tool)that works as follows. It takes two graphics drivers, D and D’, with D being the driver in which shader S exposes a bug. The tool keeps reversing transformations, shrinking S, so long as:

  • D renders an unexpected image, and
  • D’ renders the expected image — the image associated with shader T

If at some point we get a shader Sk such that D and D’ both render unexpected images, we’ve very likely triggered the reducer bug.

At this point we could compare “good” shader S{k-1} with “bad” shader Sk. But if these shaders are still really large (tens or hundreds of Kb) this may not shed much light on the problem.

Instead, we backtrack to S{k-1}, the previous “good” shader, and try a different reduction path, hopefully leading us to a smaller shader, Sl, with l > k, that triggers the reducer bug. In that case, we backtrack to shader S{l-1}, and try a different reduction path, etc.

Unless D and D’ suffer from overlapping bugs, this process is pretty reliable at kicking out the parts of the shader that do not trigger the reducer bug, until eventually we get a pair of very small shaders, Sx, S{x+1}, such that Sx is good, S{x+1} is bad, and the difference between them, induced by the buggy reducer, can be eyeballed.

In effect, ReducerBugPoint is a reducer for our reducer, which I really like!

But what if there is a bug in ReducerBugPoint?

Contact me if you want to do a PhD on the topic of automatic reduction of bugs in reducers for test case reducers.