GPU Ray Tracing in One Weekend

Jeremy Cowles
8 min read · Mar 30, 2018


In January 2016, Peter Shirley released the e-book Ray Tracing in One Weekend, which is a quick and gratifying introduction to ray tracing and rendering. Recently I was inspired to implement the book on the GPU, to create a project that shows how to trace rays using compute shaders as simply as possible. Yes, fragment-shader-based ray marching is easier, but compute can actually scale up to a production render engine.

For simplicity, I chose to implement the project using Unity, which significantly lowers the barrier to entry for GPU compute. Unity also has an interactive editor, which I thought would make it fun and easy to set up new scenes and create interesting camera angles.

The intent is not to create a production path tracer, but the final result should be fast without complicating the code, and it should serve as an example of how to use the GPU efficiently. Let’s get started.

The Unity Game Loop

The first bit of business to take care of is the basic Unity scaffolding. Here’s the overview of the game loop:

  1. Trace rays using a ComputeShader.
  2. Accumulate results into a RenderTexture.
  3. Render this texture full-screen to the main Unity camera.

The compute kernel runs blocks of 8 x 8 threads for each work group, so the shader dispatch executes (texture.width / 8, texture.height / 8, 1) thread groups. In this way, each pixel of the render texture can be thought of as a single ray which will be traced through the scene. When the ray terminates, the final color value is accumulated into the render texture.
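
A minimal sketch of that dispatch, assuming an 8 x 8 thread group size declared in the kernel and illustrative field names (m_shader, m_accumulatedImage, the kernel name "RayTrace") that may differ from the actual project:

void Update() {
  // One thread group per 8x8 tile of the accumulation texture.
  int kernel = m_shader.FindKernel("RayTrace");
  m_shader.SetTexture(kernel, "_AccumulatedImage", m_accumulatedImage);
  m_shader.Dispatch(kernel,
                    m_accumulatedImage.width / 8,
                    m_accumulatedImage.height / 8,
                    1);
}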

The texture is shown on screen by using Unity’s post effect chain on the main camera: the function OnRenderImage copies the texture to the screen, resolving the final color.

void OnRenderImage(RenderTexture source, RenderTexture dest) {
  Graphics.Blit(m_accumulatedImage, dest, FullScreenResolve);
}

What needs to be resolved? This is a progressive path tracer, so results are accumulated and must be normalized. This happens in the FullScreenResolve shader. Each pixel in the accumulated image stores the sample count in the alpha channel and the color is accumulated into RGB. The full-screen shader writes out RGB divided by A, which gives the average color over all rays traced for that pixel.
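
As a rough sketch, the resolve amounts to a single divide in the full-screen fragment shader (names here are illustrative; the project’s actual shader may differ):

sampler2D _AccumulatedImage;

// RGB holds the accumulated color, A holds the sample count,
// so the resolved color is simply rgb / a.
float4 frag(float2 uv : TEXCOORD0) : SV_Target {
  float4 accum = tex2D(_AccumulatedImage, uv);
  return float4(accum.rgb / max(accum.a, 1.0), 1.0);
}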

Random Sampling

Part of the core algorithm requires selecting random directions for scattered rays, but there are two GPU-specific wrinkles to address. The first is the lack of a built-in random() function. There are many ways to generate pseudo-random numbers on a GPU, though they vary greatly in how random they actually are in practice. In the spirit of keeping things simple, I used Jorge Jimenez’s interleaved gradient noise, which generates a good signal for our purposes and really can’t get much simpler in terms of implementation:

float InterleavedGradientNoise(vec2 xy) {
  return frac(52.9829189f
              * frac(xy.x * 0.06711056f
                    + xy.y * 0.00583715f));
}

The next problem is choosing a point on a sphere with a uniform distribution. The book implements this as an infinite loop that picks random points in a cube and terminates when a point inside the unit sphere is found. This would lead to terrible divergence on the GPU, since each thread will likely terminate after a different number of iterations. There are various ways to approach this, but a fun solution is the Spherical Fibonacci sequence of Michael Sanger et al. This could be computed directly on the GPU, but instead I’ve opted to generate it on the CPU and bake it into a texture. Later, samples are chosen at random from the texture on the GPU, using the interleaved gradient noise function described above.

Visualization of 4096 Spherical Fibonacci Points
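
For reference, here is a minimal sketch of how such a point set can be generated on the CPU using the standard golden-angle construction; the project’s actual baking code and texture layout may differ:

// Generate n points roughly uniformly distributed on the unit sphere
// by walking a golden-angle spiral from pole to pole.
Vector3[] SphericalFibonacci(int n) {
  var points = new Vector3[n];
  float goldenAngle = Mathf.PI * (3f - Mathf.Sqrt(5f));  // ~2.39996 radians
  for (int i = 0; i < n; i++) {
    float z = 1f - (2f * i + 1f) / n;   // even spacing in z over (-1, 1)
    float r = Mathf.Sqrt(1f - z * z);   // radius of the circle at that z
    float theta = goldenAngle * i;      // rotate by the golden angle each step
    points[i] = new Vector3(Mathf.Cos(theta) * r, Mathf.Sin(theta) * r, z);
  }
  return points;
}

The points could then be written into a floating-point texture (or a ComputeBuffer) that the compute shader indexes with a value derived from the noise function.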

It’s easy to gloss over these functions, but it’s important to note that the quality of noise and the sample distribution are fundamental drivers of the final image quality:

Buggy Sampling, Image Would not Converge
Sampling Bugs Fixed + Spherical Fib

Tracing Rays on the GPU

Now that the Unity game loop is in place, we need to trace rays on the GPU. Shirley presents this as a recursive algorithm in which rays bounce until they hit the sky dome or run out of bounces:

  1. Initialize a ray with an origin and direction.
  2. Compute the closest intersection.
  3. Compute a scattered ray with a new origin and direction.
  4. Rinse & repeat until the ray terminates (recursive step).

Recursion is very bad for a GPU; for most intents and purposes, it’s not possible. Furthermore, blocks of threads on the GPU all execute the same instruction, so an early return on one thread means it must sit idle until all other threads in its thread group terminate (this is known as divergence).

Regardless of the performance implications, I wondered if I could sidestep the recursion by unrolling the code for a small, fixed number of bounces. While I got surprisingly decent performance, the shader compile times alone forced me to create a proper ray scheduler. This sounds fancy, but it’s not actually complicated at all.

The Ray Scheduler

Rather than storing a ray’s state on the stack, the scheduler computes all primary rays up-front and stores them in a ComputeBuffer (a StructuredBuffer<Ray> in HLSL). Now that all rays have persistent state that lives across compute invocations, the scheduler can process each ray incrementally: compute one bounce and return. The recursion is now a loop, where each iteration of the loop is a single invocation of the compute kernel.
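
As a sketch, the persistent ray state might look something like this (field names are illustrative, and the buffer is declared read-write here since each bounce updates the ray in place):

struct Ray {
  float3 origin;
  float3 direction;
  float3 color;    // accumulated attenuation along the path
  int bounces;     // bounces taken so far; -1 marks a terminated ray
};

RWStructuredBuffer<Ray> _Rays;   // persistent ray state across dispatches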

I use the word “scheduler” because the system is capable of allocating any number of rays and processing them in an arbitrary order. The simplest idea is still to store one ray per pixel, so for a 10x10 pixel texture it would allocate 100 rays and process them according to their 2D pixel coordinates. However, there is no requirement that they match; in fact, this can be leveraged to use the GPU more efficiently. In the final version, 8x more rays are allocated than pixels, which can be thought of as 8x supersampling. In addition, the order in which rays are processed is slightly more complex than simply mapping the 2D coordinate.

Timothy Lottes gave an inspirational talk at NVScene in which he described an algorithm that minimizes thread divergence for marched rays. Each thread (fragment shader invocation) processes a fixed number of bounces; if a ray terminates early, a new ray is pulled from the queue and continues processing the remaining bounces. A similar pattern is used here: each thread selects the next available ray; if the ray has terminated, the thread exits, otherwise the thread computes the next ray bounce. All completed rays are re-initialized to camera rays before dispatch.
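
Building on the Ray struct and _Rays buffer sketched above, the per-thread scheduling logic might look roughly like this; the book’s intersection and scatter code is hidden behind a placeholder, and the thread-to-ray mapping is simplified to one ray per pixel:

bool ScatterOneBounce(inout Ray ray) {
  // Placeholder for the book's intersection and scatter code: update the
  // ray's origin, direction, and color, and return false when the ray
  // hits the sky dome or runs out of bounces.
  ray.bounces += 1;
  return ray.bounces < 6;
}

uint _Width;

[numthreads(8, 8, 1)]
void TraceBounce(uint3 id : SV_DispatchThreadID) {
  // Simplified one-ray-per-pixel mapping; the real scheduler's mapping
  // with 8x supersampling is more involved, as described above.
  uint rayIndex = id.y * _Width + id.x;
  Ray ray = _Rays[rayIndex];

  // Terminated rays sit out this dispatch; they are re-initialized to
  // camera rays before the next one.
  if (ray.bounces < 0) return;

  // Otherwise advance the ray by exactly one bounce and write it back.
  if (!ScatterOneBounce(ray)) {
    ray.bounces = -1;
  }
  _Rays[rayIndex] = ray;
}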

The initial dispatch numbers still hold, but now the thread group Z dimension is set to supersamplingFactor * computeBouncesPerRay, which in this case is 8 * 6 giving exactly enough threads to process every ray with 6 bounces.
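
Reusing the illustrative names from the earlier dispatch sketch, the updated dispatch might look like this (the constants are assumptions based on the numbers above):

const int kSuperSampling = 8;
const int kBouncesPerRay = 6;

void DispatchTrace() {
  // Same X/Y tiling as before; the Z dimension supplies enough thread
  // groups for 8x supersampling times 6 bounces per ray.
  int kernel = m_shader.FindKernel("TraceBounce");
  m_shader.Dispatch(kernel,
                    m_accumulatedImage.width / 8,
                    m_accumulatedImage.height / 8,
                    kSuperSampling * kBouncesPerRay);
}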

Using Unity’s Camera

Shirley describes how to construct a camera from whole cloth; however, Unity is already doing this work for the main camera. Furthermore, we would like the ray traced camera to reflect exactly what we see in Unity, which leaves the door open to compositing ray traced and rasterized pixels together.

Scene vs. Game View in Unity Editor

The challenge is mapping from the Unity camera to primary rays. Implemented as a separate kernel, InitCameraRays() computes the signed normalized XY coordinate of the pixel and then uses the inverse projection and camera matrices to project the ray from NDC space back into camera-space and then world-space. I chose NDC because the ray start and end points are well known as vec3(ndc.xy, 0) and vec3(ndc.xy, 1).
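
As a small sketch, the signed NDC coordinate can be derived from the pixel coordinate like so (the function name is illustrative, and platform details such as a flipped Y axis are ignored):

// Signed normalized device coordinate of the pixel center, in [-1, 1].
float2 PixelToNdc(uint2 pixel, float2 resolution) {
  return (((float2)pixel + 0.5) / resolution) * 2.0 - 1.0;
}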

Depth of Field

Once the basic camera was set up, the rays needed to be adjusted to account for sampling a lens with a given aperture. The algorithm above needs to be tweaked slightly: rather than putting the ray end point at the far clip plane, the ray end is placed at the focal distance. Then the ray origin and ray end are perturbed as described in the book. Here’s the final camera setup:

// Setup focal plane in camera space.
vec4 focalPlane = vec4(0, 0, -_FocusDistance * 2, 1);
// To NDC space.
focalPlane = mul(_Projection, focalPlane);
focalPlane /= focalPlane.w;
// Ray Start / End in NDC space.
vec4 rayStart = vec4(ndc.xy, 0, 1);
vec4 rayEnd = vec4(ndc.xy, focalPlane.z, 1);
// Rays to camera space.
rayStart = mul(_ProjectionI, rayStart);
rayEnd = mul(_ProjectionI, rayEnd);
// Rays to world space.
rayStart = mul(_CameraI, rayStart / rayStart.w);
rayEnd = mul(_CameraI, rayEnd / rayEnd.w);
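
One way to realize the perturbation, in the spirit of the book’s defocus blur, is to offset the world-space ray origin by a random point on the aperture disk; a sketch with illustrative names (the project perturbs the ray end as well):

// Offset the ray origin within the lens (aperture) disk. Points at the
// focus distance stay sharp because the ray still aims at the end point;
// everything else blurs with its distance from the focal plane.
float3 SampleAperture(float3 rayStart, float3 camRight, float3 camUp,
                      float aperture, float2 rand01) {
  float r = 0.5 * aperture * sqrt(rand01.x);   // uniform over the disk
  float phi = 6.28318530718 * rand01.y;        // 2 * pi
  return rayStart + camRight * (r * cos(phi)) + camUp * (r * sin(phi));
}

The primary ray direction would then be recomputed from the perturbed start toward rayEnd.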

For fun, I also added a focus target gameObject, which drives the focal plane and allows the user to rack the focus interactively using the Unity Editor manipulators.

GameObject & Material Updates

The ray tracer only supports sphere intersections, so it doesn’t make sense to attempt to search for all Unity geometry types. Instead, I created a special MonoBehaviour, “RayTracedSphere”, which is used to manage objects that can be rendered.

The RayTracedSphere script is responsible for monitoring the object state and communicating changes to the ray tracer as well as producing a Sphere object for the RayTracer to consume.

Currently only the sphere transform, uniform scale, and albedo color are synchronized from Unity.
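
A minimal sketch of what such a component might look like; the Sphere layout, field names, and change notification shown here are illustrative rather than the project’s actual API:

using UnityEngine;

// Plain struct the ray tracer can upload to the GPU.
public struct Sphere {
  public Vector3 center;
  public float radius;
  public Color albedo;
}

public class RayTracedSphere : MonoBehaviour {
  public Color albedo = Color.white;

  public Sphere ToSphere() {
    return new Sphere {
      center = transform.position,
      // Uniform scale assumed; Unity's built-in sphere has a one-unit diameter.
      radius = transform.lossyScale.x * 0.5f,
      albedo = albedo,
    };
  }

  void Update() {
    // If the transform changed since last frame, tell the ray tracer to
    // invalidate its accumulated image.
    if (transform.hasChanged) {
      transform.hasChanged = false;
      // e.g. RayTracer.Instance.NotifySceneChanged();
    }
  }
}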

Reducing Flicker After Updates

The RayTracer script responds to NotifySceneChanged events by invalidating the currently accumulated image. My first implementation simply reset the image to black, which works and is correct, but is also visually jarring. Instead, the image is normalized: for every pixel rgba / a is stored back to the accumulated texture, resetting the sample count to one.

This alone is a great improvement, but instead of using a single sample, the last image is weighted as 15 samples (e.g. (rgba / a) * 15). Furthermore, a slight blur is added while the first 15 samples are computed; the sample count is used to implement a smooth transition.
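
As a sketch, the invalidation pass boils down to a per-pixel rescale (the temporary blur during the first samples is omitted here, and names are illustrative):

RWTexture2D<float4> _AccumulatedImage;

// Keep the current average color but pretend it was produced by 15
// samples, so new samples blend in smoothly instead of snapping to black.
[numthreads(8, 8, 1)]
void Invalidate(uint3 id : SV_DispatchThreadID) {
  float4 accum = _AccumulatedImage[id.xy];
  float4 average = accum / max(accum.a, 1.0);
  _AccumulatedImage[id.xy] = average * 15.0;
}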

Source

Here’s the source; it’s just begging for textures, materials, lights, and a triangle intersector. Have fun!

