First stop: AMD — bluescreen via WebGL, and more

[Part of a series of stories on GPU shader compiler bugs; previous story gave an introduction]

We’ll kick off our alphabetical tour of GPU designers with AMD, focusing on some issues we found testing a Radeon R9 Fury under Windows 10. We’re going to look at:

  • a bluescreen caused by visiting a page that uses WebGL, using the Chrome browser
  • a bad image rendered on desktop OpenGL, which appears to be caused by an OpenGL shader compiler bug
  • a bad image rendered under WebGL in Chrome, which appears to be caused by a Direct3D shader compiler bug

We have reported these issues to AMD.

Blue screen via WebGL

[Edit: AMD promptly addressed this issue, fixing the problem in their 17.1.1 drivers.]

The WebGL API allows your web browser to use your GPU for rendering, enabling high-end graphics in web pages. Having JavaScript code from a web page cause shaders to execute on your GPU is potentially a risky business. Because of this, WebGL has been carefully designed to minimise the potential for security issues. As one example, array accesses are restricted so that it is easy to statically show that all accesses will be within bounds.

We applied GLFuzz to test Chrome’s implementation of WebGL, running on top of AMD’s GPU drivers for Windows.

We were surprised to find that we could bluescreen our machine by visiting a web page. Check out this video:

Demonstration of a bluescreen caused by visiting a page that uses a WebGL shader, using the Chrome browser, under Windows 10 with an AMD GPU

We triggered this bluescreen using a large variant shader, which was derived from an original shader by applying many semantics-preserving injections. We haven’t yet tried reproducing the issue with a smaller shader. But either way: it shouldn’t be possible to bluescreen a machine by visiting a web page.

Let’s analyse the possible causes of the issue a little further. There are a number of components involved in running a WebGL shader. First, there’s Chrome’s implementation of WebGL. Next, there’s ANGLE — Almost Native Graphics Layer Engine — a cross-platform OpenGL ES implementation. Under Windows, ANGLE converts OpenGL API calls into corresponding Direct3D calls, the rationale being that Direct3D drivers tend to be more reliable than OpenGL drivers under Windows. Reliability is important for WebGL, so Chrome uses ANGLE by default. Then there’s AMD’s GPU driver. And finally there is Windows itself, which interacts with the GPU driver and is responsible for some aspects of Direct3D.

Looking at the bluescreen, we see this message:

If you’d like to know more, you can search online later for this error: THREAD_STUCK_IN_DEVICE_DRIVER

According to this Microsoft documentation, this error means that:

A device driver is spinning in an infinite loop, most likely waiting for hardware to become idle.
This usually indicates problem with the hardware itself, or with the device driver programming the hardware incorrectly. Frequently, this is the result of a bad video card or a bad display driver.

So clearly the issue is driver-related — it could be an issue with (the Direct3D component of) AMD’s driver, or with Windows, or both. It seems that even if Chrome or ANGLE was doing something dodgy, neither component should be able to trigger a bluescreen.

We have reported this issue to AMD, but with responsible disclosure in mind we’ll hold off providing the shader you need to reproduce the problem for now!

Bad image due to dead-by-construction code (desktop OpenGL)

This fragment shader, which we obtained from GLSLsandbox, leads to this image being rendered by our AMD GPU:

Original (and good) image rendered on AMD R9 Fury GPU

This looks as expected, and we see similar images rendered on other GPUs when using this shader.

GLFuzz creates this modified version of the fragment shader, variant.frag, with the following differences:

  • variant.frag declares a new uniform: uniform vec2 injectionSwitch;
  • The following chunks of code are injected into variant.frag (diff the original fragment shader with the variant to see where)

Injection 1:

if (injectionSwitch.x > injectionSwitch.y) {
    if (injectionSwitch.x > injectionSwitch.y)
        return;
    int unused = 0;
}

Injection 2:

if (injectionSwitch.x > injectionSwitch.y)
    discard;

Now, it’s clear that these injections could completely alter what the shader computes if injectionSwitch.x were larger than injectionSwitch.y. But when we launch the shader, we set injectionSwitch to (0.0, 1.0), so that each injectionSwitch.x > injectionSwitch.y condition is false. Since the conditions are false, our injections should have no impact on what the shader computes.
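To make the idea concrete, here is a minimal Python sketch of how a guarded block might be manufactured and spliced into a shader. This is not GLFuzz's actual code: the function names (guard_is_true, inject_dead_block, add_uniform) and the toy shader are hypothetical, chosen purely for illustration.

```python
# Hypothetical sketch of dead-by-construction injection; not GLFuzz's code.

UNIFORM_DECL = "uniform vec2 injectionSwitch;"
GUARD = "injectionSwitch.x > injectionSwitch.y"


def guard_is_true(injection_switch):
    """Evaluate the injected condition on the host for a given uniform value."""
    x, y = injection_switch
    return x > y


def inject_dead_block(lines, index, payload):
    """Return a copy of `lines` with `payload` inserted before `index`,
    wrapped in a guard that is false when injectionSwitch is (0.0, 1.0)."""
    block = ["if (" + GUARD + ") {"]
    block += ["    " + stmt for stmt in payload]
    block.append("}")
    return lines[:index] + block + lines[index:]


def add_uniform(lines):
    """Declare injectionSwitch after any leading '#...' or 'precision' lines."""
    if UNIFORM_DECL in lines:
        return list(lines)
    pos = 0
    while pos < len(lines) and lines[pos].lstrip().startswith(("#", "precision")):
        pos += 1
    return lines[:pos] + [UNIFORM_DECL] + lines[pos:]


# Toy shader, made up for this example.
original = [
    "precision mediump float;",
    "void main() {",
    "    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);",
    "}",
]
variant = add_uniform(inject_dead_block(original, 2, ["discard;"]))
print("\n".join(variant))
```

With the uniform set to (0.0, 1.0), guard_is_true returns False, which is exactly why the injected discard should never execute.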

However, we find that these changes lead to nothing being rendered. Setting the background to yellow and then shading a rectangle using the variant shader yields … the yellow background:

Curiously, Medium would not let me upload a blank image, hence the text

Removing either of the injections, or simplifying the first injection by removing the nested conditional or the declaration of unused, makes the issue go away.
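This kind of manual check can be sketched as a small loop: drop each injection in turn and see whether the bug still reproduces. The code below is an illustrative assumption, not part of GLFuzz. In particular, still_fails is a hypothetical stand-in for "render the shader with these injections and compare against the original image".

```python
def removable_injections(injections, still_fails):
    """Return the injections that can be dropped one at a time
    while the bug still reproduces."""
    removable = []
    for i, injection in enumerate(injections):
        remaining = injections[:i] + injections[i + 1:]
        if still_fails(remaining):
            removable.append(injection)
    return removable


def still_fails(active):
    # Stand-in predicate modelling the behaviour described above:
    # the bad image appears only when both injections are present.
    return {"injection1", "injection2"} <= set(active)


print(removable_injections(["injection1", "injection2"], still_fails))
```

An empty result means that every injection is needed to trigger the problem, which matches what we observed for this shader.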

Our guess at the cause of the issue is this: the variant shader has more complicated control flow, induced by our changes, and this complex control flow triggers a compiler bug that causes rendering to go wrong, despite the conditions of the injected if statements being false at runtime.

Instructions on how to reproduce the issue are here. Can you reproduce it? Let us know! (Here’s some info on our platform.)

Aside: EMI testing and dead-by-construction injection

As mentioned in my first post, our compiler testing approach was inspired by EMI (equivalence modulo inputs) testing. In its originally proposed form, EMI testing involves identifying code in a C program that is unreachable for a certain input I, then creating a variant program in which parts of the identified code are removed. The original and variant, after compilation, should yield identical results when executed on input I.
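For shaders, "identical results" means the original and variant should render the same image. A minimal sketch of such a check follows, assuming images are given as sequences of (r, g, b, a) tuples. The function name images_match and the tolerance parameter are assumptions made for illustration; this post does not describe how GLFuzz actually compares images.

```python
def images_match(original, variant, tolerance=0):
    """Return True if two images, given as equal-length sequences of
    (r, g, b, a) tuples, agree on every channel within `tolerance`."""
    if len(original) != len(variant):
        return False
    return all(
        abs(a - b) <= tolerance
        for p, q in zip(original, variant)
        for a, b in zip(p, q)
    )


YELLOW = (255, 255, 0, 255)
# Made-up pixel data standing in for the expected render.
good_image = [(10, 20, 30, 255), (40, 50, 60, 255)]
# Only the background shows through, as in the bad render above.
blank_image = [YELLOW, YELLOW]

print(images_match(good_image, good_image))
print(images_match(good_image, blank_image))
```

A mismatch between the two renders is the signal that something in the compilation or execution pipeline has gone wrong.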

We had the idea of injecting “dead-by-construction” code in our work on testing OpenCL compilers. We wanted to try EMI testing for OpenCL, and it was simpler for us to manufacture dead code than to identify it using profiling. The use of injectionSwitch above is an instance of dead-by-construction injection applied to OpenGL shader compiler testing.

Bad image due to an unreachable discard statement (WebGL)

This fragment shader renders this nice image via WebGL in Chrome, using our AMD GPU:

Again, the shader comes from GLSLsandbox, and this is the expected image.

We find (via GLFuzz) that changing:

if(d < 0.001) {
    break;
}

to:

if(d < 0.001) {
    break;
    discard;
}

causes nothing to be rendered, leading to a blank image. (I won’t bother pasting in the blank image!)

Clearly it’s a bit silly to write something right after a break, since it will be unreachable. But equally, it should not affect what the shader computes. The discard statement tells the fragment shader not to write any pixel, so we’d expect a blank image if discard were executed for every pixel, but that’s clearly not the case here.

To dig into this a bit more, we tried the example using desktop OpenGL. Here, adding an unreachable discard has no effect (good). Recall from above that Chrome’s WebGL uses ANGLE to translate OpenGL to Direct3D. With this in mind, we tried taking ANGLE rendering out of the browser, by writing a desktop application that uses ANGLE directly. In this setting, the unreachable discard does have the effect we observed using WebGL.

We then speculated that the issue might be an ANGLE bug. Perhaps the unreachable discard was causing an erroneous translation from GLSL (the OpenGL shading language) to HLSL (the shading language for Direct3D). But this doesn’t seem to be the case: we compared the ANGLE-generated HLSL code for the original shader with the ANGLE-generated HLSL code for the variant shader. All looks well here.
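A comparison like this can be done with any diff tool. As a rough sketch, here is how it might look using Python's standard difflib module. The two strings below are made-up stand-ins, not real ANGLE output, and changed_lines is a hypothetical helper name.

```python
import difflib


def changed_lines(original_text, variant_text):
    """Return only the added and removed lines between two shader texts."""
    diff = list(difflib.unified_diff(
        original_text.splitlines(),
        variant_text.splitlines(),
        fromfile="original.hlsl",
        tofile="variant.hlsl",
        lineterm="",
        n=0,  # no context lines, so only real changes remain
    ))
    # The first two entries are the '---' and '+++' file headers
    # (the list is empty when the texts are identical).
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Illustrative placeholders only; NOT actual ANGLE-generated HLSL.
original_hlsl = "if (d < 0.001)\n{\n    break;\n}\n"
variant_hlsl = "if (d < 0.001)\n{\n    break;\n    discard;\n}\n"

for line in changed_lines(original_hlsl, variant_hlsl):
    print(line)
```

If the translation were at fault, we would expect a check like this to reveal differences beyond the statement we added ourselves.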

We speculate, then, that this example triggers a problem related to compilation of HLSL shaders — either with Direct3D support in Windows, or in AMD’s GPU driver.

Again, full details for how to reproduce the issue are here — let us know if you reproduce it! (Details of our platform.)

Summary

I’ve illustrated that GLFuzz can trigger (what we believe are) shader compiler bugs by injecting code that is either statically or dynamically unreachable. We’ll see more of this in future posts.

We’ve also seen that a bluescreen can result from visiting a page that uses WebGL. In future posts I’ll show some more cases where graphics shaders can cause Bad Things to happen — display freezes, display chaos, and device restarts.

I hope these examples help illustrate that graphics rendering can be a risky business in the presence of compiler and driver bugs, and that GLFuzz has the potential to help GPU designers in finding and eliminating some of these bugs ahead of time.

Next stop: Apple