Path tracing with Poplar

Path Tracing

The vast majority of modern approaches to high-quality rendering are based on Monte-Carlo (MC) path-tracing techniques that were originally proposed in (Kajiya, 1986) and have grown in sophistication ever since. The basic idea is that we can form an image by sampling a (very) large number of ray-traced paths through a scene, starting with primary rays at the camera.

Disney’s path tracing explainer.

Processor Architecture

But why implement a path tracer on an IPU? MC path tracing has some characteristics that play to the IPU’s strengths, and it also makes a good introductory example for Poplar programming. The algorithm is highly parallel, and Poplar has been designed to express highly parallel IPU programs as simply as possible. The IPU has also shown good performance compared to alternative processors on other Monte-Carlo workloads in probabilistic modelling. Here are some highlights of the IPU architecture:

  • Each tile has 6 hardware worker threads, so the IPU can execute 8832 programs with completely independent control flow.
  • Each tile has 624KiB private SRAM (897MiB aggregate per GC200).
  • Every tile/worker has its own hardware random number generator (HW-RNG) based on the xoshiro family (Blackman, 2018).
  • Tiles can exchange data between their private SRAM, all-to-all, at a bandwidth of 8 TB/s.
  • The Poplar C++ API makes it easy to describe highly parallel programs.

Implementation

Although a fully featured path tracer is a highly complex and sophisticated piece of software, the basic principle can be expressed in a few hundred lines of C++, and we will use one such open-source implementation, Smallpaint, as a starting point (Károly Zsolnai-Fehér, 2018).

Main loop structure in a typical path tracer.
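
As a minimal sketch (in the spirit of Smallpaint rather than its actual code), the main loop typically looks something like this; the stand-in types, the cameraRay/trace stubs, and the OpenMP pragma are illustrative assumptions:

```cpp
#include <vector>

// Minimal stand-in types and stubs; Smallpaint's real versions are richer.
struct Vec { float x = 0, y = 0, z = 0; };
struct Ray { Vec origin, dir; };

Ray cameraRay(int x, int y, int w, int h) { return {}; } // jittered primary ray (stub)
Vec trace(const Ray &r, int depth) { return {}; }        // radiance estimate for one path (stub)

// Typical structure: for every sample, loop over all pixels, trace one path per
// pixel, and accumulate its colour into the frame-buffer.
void render(std::vector<Vec> &frameBuffer, int width, int height, int samplesPerPixel) {
  for (int s = 0; s < samplesPerPixel; ++s) {
    #pragma omp parallel for schedule(dynamic) // CPU versions typically parallelise here
    for (int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x) {
        Ray primary = cameraRay(x, y, width, height);
        Vec c = trace(primary, 0);
        Vec &px = frameBuffer[y * width + x];
        px.x += c.x; px.y += c.y; px.z += c.z;
      }
    }
  }
  // The final image is the accumulated colour divided by samplesPerPixel.
}
```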

Poplar Programming Model

The IPU uses a BSP (bulk-synchronous parallel) graph programming model, so we can’t just take those C++ loops and compile them for the IPU. We first need to re-architect the program as a computational graph in Poplar.

The Compute Graph

Tensor data, Compute vertices, and tile-mapping.
An example Poplar compute graph for path tracing.
Constructing a Poplar compute graph in C++.
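
A minimal sketch of the per-tile graph-building pattern with the Poplar C++ API is shown below; the tensor shape and the vertex/field names ("RayGenVertex", "rays") are illustrative assumptions rather than the names used in the actual example repository.

```cpp
#include <poplar/Graph.hpp>

// Add one ray tensor and one ray-generation vertex per tile, both mapped to that
// tile so the data lives in its private SRAM. Assumes `rayGenCs` was created with
// graph.addComputeSet(...) and the codelets were added with graph.addCodelets(...).
void buildRayGen(poplar::Graph &graph, poplar::ComputeSet &rayGenCs,
                 unsigned numTiles, unsigned raysPerTile) {
  for (unsigned tile = 0; tile < numTiles; ++tile) {
    // A slab of ray data (3 floats per ray) mapped to this tile:
    auto rays = graph.addVariable(poplar::FLOAT, {raysPerTile * 3}, "rays");
    graph.setTileMapping(rays, tile);

    // One vertex per tile in the compute set, connected to the tile-local tensor:
    auto v = graph.addVertex(rayGenCs, "RayGenVertex");
    graph.connect(v["rays"], rays);
    graph.setTileMapping(v, tile);
  }
}
```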

Compute Kernels (Codelets)

A compute kernel in Poplar is called a codelet. Codelets define the calculations performed by a worker at each compute vertex. Vertex code is written in a cut-down form of C++ and compiled by a dedicated compiler (popc). These vertices can also be written in assembly for maximum performance, or optimised using inline assembly or intrinsics to access IPU-specific hardware functionality (see Graphcore’s Vertex Programming Guide for details). A sketch of the C++ codelet for the RayGen vertex we introduced above is here:

Code fragment for the C++ RayGen vertex (compute kernel/codelet).
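
A minimal sketch of what such a codelet can look like is below; the class name, field names, pinhole camera model, and the use of float (rather than float16) for the jitter input are illustrative assumptions.

```cpp
#include <poplar/Vertex.hpp>

using namespace poplar;

// Generates one jittered primary ray per pixel of this tile's frame-buffer region.
class RayGenVertex : public Vertex {
public:
  Input<Vector<float>> aaNoise;   // pre-generated anti-aliasing jitter (2 values per pixel)
  Output<Vector<float>> rays;     // packed ray directions (3 floats per pixel)
  unsigned startPixel;            // first pixel index of this tile's region
  unsigned imageWidth, imageHeight;
  float tanHalfFov;

  bool compute() {
    const unsigned numPixels = rays.size() / 3;
    for (unsigned i = 0; i < numPixels; ++i) {
      const unsigned p = startPixel + i;
      const float u = (p % imageWidth) + aaNoise[2 * i];
      const float v = (p / imageWidth) + aaNoise[2 * i + 1];
      // Simple pinhole camera: map the jittered pixel coordinate to a view-space direction.
      rays[3 * i + 0] = (2.f * u / imageWidth - 1.f) * tanHalfFov;
      rays[3 * i + 1] = (1.f - 2.f * v / imageHeight) * tanHalfFov;
      rays[3 * i + 2] = -1.f;
    }
    return true;
  }
};
```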

Programs and Execution

Once the graph is constructed, all that is left is to describe how it should be executed. The simplest Poplar program just executes a single compute set, but programs can also be grouped into sequences or combined using the more complex (and data-dependent) control flow that Poplar allows. This control flow is executed on device with no interaction from the host machine: the host only has to tell the device which program to run, so there is no 'kernel launch' overhead (other than sending a program ID to the device). Here is a code fragment that builds a program with a loop that repeats samplesPerPixel times:

Constructing a graph program (adding control flow to a Poplar compute graph).
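
A sketch of this pattern with poplar::program is below, assuming a random-generation program built by Poplibs and compute sets named roughly as in the earlier sketches; Repeat runs the whole body on-device samplesPerPixel times with no host involvement between iterations.

```cpp
#include <poplar/Graph.hpp>
#include <poplar/Program.hpp>

using namespace poplar::program;

Program buildRenderProgram(unsigned samplesPerPixel,
                           const Program &randGen,             // e.g. a Sequence built by poprand
                           const poplar::ComputeSet &rayGenCs,
                           const poplar::ComputeSet &pathTraceCs) {
  Sequence body;
  body.add(randGen);               // refresh the pre-generated random samples
  body.add(Execute(rayGenCs));     // generate jittered primary camera rays
  body.add(Execute(pathTraceCs));  // trace paths and accumulate into the frame-buffer
  return Repeat(samplesPerPixel, body);
}
```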

Porting the path tracer

The full source code is available as part of the Graphcore examples repository. We fork our own version of Smallpaint both to ease the transition to a Poplar program and to add a few enhancements. Key modifications to Smallpaint are listed here:

  • Instead of parallelising over single paths in the loop, break the image into tiles and parallelise over jobs that process each image tile.
  • Use a C implementation of xoshiro so the CPU RNG will match the IPU (this also happens to be faster than the original’s use of the C++ standard library Mersenne-Twister).
  • Remove the recursion in the inner loop: replace it with two iterative loops, one to trace paths and a following one to accumulate the colour contributions, as sketched after this list. (Whilst we could use recursion on the IPU, we would have to manage the stack size ourselves.)
  • Add a disc primitive.
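
To make the recursion removal concrete, here is a minimal sketch of the two-loop structure; the BounceInfo record, the sampleBounce stub, and maxDepth are illustrative assumptions rather than Smallpaint’s actual types.

```cpp
struct Vec3 { float x = 0, y = 0, z = 0; };
struct Ray { Vec3 origin, dir; };
struct BounceInfo { Vec3 emission, attenuation; };

// Fills `info` for the current hit, advances `ray` to the sampled next direction,
// and returns false when the path leaves the scene or is terminated (stub).
bool sampleBounce(Ray &ray, BounceInfo &info) { return false; }

constexpr int maxDepth = 8;

Vec3 tracePathIterative(Ray ray) {
  // Forward loop: trace up to maxDepth bounces, recording each one's contribution.
  BounceInfo bounces[maxDepth];
  int depth = 0;
  while (depth < maxDepth && sampleBounce(ray, bounces[depth]))
    ++depth;

  // Backward loop: fold the records into a colour exactly as the recursion would,
  // i.e. colour = emission + attenuation * (colour of the rest of the path).
  Vec3 colour{};
  for (int i = depth - 1; i >= 0; --i) {
    colour.x = bounces[i].emission.x + bounces[i].attenuation.x * colour.x;
    colour.y = bounces[i].emission.y + bounces[i].attenuation.y * colour.y;
    colour.z = bounces[i].emission.z + bounces[i].attenuation.z * colour.z;
  }
  return colour;
}
```
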
Schematic of the path tracer compute graph (as seen by a single core/tile).

Worker Utilisation

The random number generation vertices RandGenAntiAlias and RandGenUniform are actually program sequences constructed using Poplibs’ utilities, and Poplibs will automatically use as many worker threads as possible for those. PrimaryRayGen and PathTrace are custom C++ vertices. Most of the work is done in PathTrace, so there the frame-buffer tiles are split by rows among the six worker threads (so for perfect utilisation the image tiles should have a multiple of 6 rows). While PrimaryRayGen could also be split among workers, profiling using PopVision showed this would add complexity for negligible benefit, so rays are generated using one thread per tile. You can see ray_gen takes around 231K cycles versus around 24M cycles to execute the path_trace vertex in the profile here:

Screenshot from PopVision profiler.
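
The row-split among workers could look roughly like this in the graph-construction code; the tensor layout, vertex name, and field name are illustrative assumptions, and the real code may divide the work differently.

```cpp
#include <poplar/Graph.hpp>

// Share one tile's frame-buffer region between the six hardware workers by adding
// six PathTrace vertices to the same compute set, each connected to a slice of rows.
void addPathTraceVertices(poplar::Graph &graph, poplar::ComputeSet &cs,
                          const poplar::Tensor &tileFrameBuffer, // shape {rows, cols, 3}
                          unsigned tile, unsigned numWorkers = 6) {
  const unsigned rows = tileFrameBuffer.dim(0);
  const unsigned rowsPerWorker = rows / numWorkers;  // ideally rows is a multiple of 6
  for (unsigned w = 0; w < numWorkers; ++w) {
    const unsigned begin = w * rowsPerWorker;
    const unsigned end = (w == numWorkers - 1) ? rows : begin + rowsPerWorker;
    auto v = graph.addVertex(cs, "PathTrace");
    graph.connect(v["frameBuffer"], tileFrameBuffer.slice(begin, end, 0).flatten());
    graph.setTileMapping(v, tile);
  }
}
```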

Random number generation

Samples are generated using the HW-RNG via Poplibs’ utility functions and are generated in advance. For the anti-aliasing distribution created in RandGenAntiAlias, a choice of uniform, Gaussian, and truncated Gaussian is available: a direct reflection of those currently supported in Poplibs. Other distributions could be generated by custom codelets that access the HW-RNG themselves. Since we need to set a maximum path depth anyway, we can ensure we generate enough samples for the worst-case path.
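
As a rough illustration of how such samples can be requested from Poplibs, here is a sketch using the poprand library; the seed handling, distribution parameters, and names are illustrative assumptions, and the resulting tensors would then be connected to the ray-generation and path-tracing vertices.

```cpp
#include <poplar/Graph.hpp>
#include <poplar/Program.hpp>
#include <poprand/RandomGen.hpp>
#include <poprand/codelets.hpp>

void addRandomSampleGen(poplar::Graph &graph, const poplar::Tensor &reference,
                        poplar::program::Sequence &prog) {
  poprand::addCodelets(graph);
  // Uniform samples in [0, 1), with shape and tile-mapping cloned from `reference`:
  auto uniformSamples = poprand::uniform(graph, nullptr, 0u, reference, poplar::HALF,
                                         0.0, 1.0, prog, "rand_uniform");
  // Truncated Gaussian jitter for anti-aliasing (mean 0, stddev 0.3, clipped at 2 sigma):
  auto aaJitter = poprand::truncatedNormal(graph, nullptr, 0u, reference, poplar::HALF,
                                           0.0, 0.3, 2.0, prog, "rand_aa");
  (void)uniformSamples;
  (void)aaJitter; // ...connect these to the PrimaryRayGen / PathTrace vertices...
}
```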

Codelets (Kernels)

These are implemented in pure C++. No assembly or intrinsics are used, no special care has been taken to optimise the code for the IPU, and the AMP unit is not utilised.

Precision

The frame-buffer is float32 by default but can optionally be float16. In the latter case we limit the number of iterations of the repeat loop to avoid saturation: this is no limitation in practice because we will accumulate the final image on the host in higher precision anyway. Uniform random numbers are generated in float16, as are the primary camera rays in the PrimaryRayGen vertex (these are very close to normalised vectors, so precision issues are minimal).

Scene Data

The scenes used in this proof of concept are very simple and constructed from primitives rather than mesh geometry, which means it is possible to hold the entire scene in each tile’s memory with room to spare. We also inherit Smallpaint’s approach to the scene data-structure, whereby objects are stored in a list and intersections are tested against every object: no acceleration structure is used.
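
The closest-hit query in this scheme is then just a linear scan over the object list; a minimal sketch with hypothetical stand-in types (not Smallpaint’s actual classes):

```cpp
#include <limits>
#include <memory>
#include <vector>

struct Ray { /* origin and direction */ };

struct Object {
  virtual ~Object() = default;
  // Returns the ray parameter t of the closest intersection, or 0 if the ray misses.
  virtual float intersect(const Ray &ray) const = 0;
};

struct Hit {
  float t = std::numeric_limits<float>::infinity();
  const Object *object = nullptr;
};

// Brute force: with no acceleration structure, every object is tested for every ray.
Hit closestHit(const std::vector<std::unique_ptr<Object>> &scene, const Ray &ray) {
  Hit best;
  for (const auto &obj : scene) {
    const float t = obj->intersect(ray);
    if (t > 0.f && t < best.t) {
      best.t = t;
      best.object = obj.get();
    }
  }
  return best;
}
```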

Host Interaction

Because the IPU is an accelerator, not a stand-alone computer, it requires a host CPU to run an operating system and orchestrate the computation. The host control loop looks like this:

Host-side control loop.
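
A minimal sketch of the shape of that loop using the Poplar Engine API is below; the program IDs, the "fb-stream" device-to-host stream, and the pass structure are illustrative assumptions.

```cpp
#include <poplar/DeviceManager.hpp>
#include <poplar/Engine.hpp>
#include <poplar/Program.hpp>
#include <vector>

enum ProgramIds { RENDER, READ_FRAMEBUFFER }; // indices into the program list

void runOnIpu(poplar::Graph &graph, const std::vector<poplar::program::Program> &programs,
              poplar::Device &device, std::vector<float> &hostFrameBuffer, unsigned passes) {
  poplar::Engine engine(graph, programs);
  // Assumes the graph created a device-to-host FIFO with this handle:
  engine.connectStream("fb-stream", hostFrameBuffer.data(),
                       hostFrameBuffer.data() + hostFrameBuffer.size());
  engine.load(device);

  for (unsigned i = 0; i < passes; ++i) {
    engine.run(RENDER);            // on-device repeat loop over samplesPerPixel iterations
    engine.run(READ_FRAMEBUFFER);  // stream the accumulated frame-buffer back to the host
    // ...accumulate hostFrameBuffer into a higher-precision image and write it out...
  }
}
```
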
Schematic of the entire computation (host and IPU).
PopVision System Analyser.

Results

We now have everything we need to render some images on the IPU. As you can see, even the most basic Monte-Carlo path tracer models a lot of physical effects in a natural way: soft shadows, diffuse scattering, reflection/refraction, and caustics:

Spheres with caustics. 1272x936 image, 100,000 samples per pixel (no de-noising), path traced in 4 minutes and 50 seconds on a Graphcore M2000 (4 x GC200 IPUs).
Cornell-box style scene. 1920x2160 image rendered to 1 million samples per pixel using 16 IPUs.

Performance

We can see the results look nice, but how fast is it? Since we started with a CPU version we can compare performance against that (noting that neither implementation has been heavily optimised for its platform, and also that we would not implement a real path tracer in anything like this way on either CPU or IPU). The machines we compare are an IPU compute node, an M2000, which contains four GC200 IPUs in a 1U chassis, and a dual-processor CPU node containing two 24-core (48-thread) Intel Xeon Platinum 8168 processors. The CPU node is actually the IPU’s host machine in our setup (but note that the hosts are disaggregated in IPU systems).

Benchmarks for the two scenes.

References

Kajiya, J. T. (1986, August). The Rendering Equation. SIGGRAPH Comput. Graph., 20(4), 143–150.
