Path tracing with Poplar

Mark Pupilli
13 min read · Aug 25, 2021


Recent years have seen a surge in novel hardware for artificial intelligence (AI), primarily targeting deep learning (DL) training and inference. The vast majority of these are highly specialised processors, often dedicated to accelerating matrix multiplies or convolutions: essentially, they mimic GPUs but strip out all graphics and general purpose compute capability. Graphcore IPUs, on the other hand, whilst designed for AI, are not over-specialised for today’s machine learning fashion of large dense matrix multiplies and convolutions; they support dynamic sparse training, for example. Outside of deep learning, IPUs have already been used in a range of interesting applications such as structured grids for HPC simulations, Kalman filtering for particle physics, and loopy Gaussian belief propagation for computer vision. All of these non-DL applications were implemented using the C++ Poplar API, which provides direct access to the IPU hardware via a graph programming model.

In this post we will explore the Poplar/IPU programming model by implementing a toy path tracer, learning how to program the IPU directly in C++.

Path Tracing

The vast majority of modern approaches to high-quality rendering are based on Monte-Carlo (MC) path-tracing techniques that were originally proposed in (Kajiya, 1986) and have grown in sophistication ever since. The basic idea is that we can form an image by sampling a (very) large number of ray-traced paths through a scene, starting with primary rays at the camera.

Disney’s path tracing explainer.

Once the rays reach light sources, we can accumulate the lighting contributions back along the path, accounting for the materials hit along the way. We won’t go into the details here, but Disney, who use path tracing extensively, have a good video explaining the algorithm, and full details of every aspect can be found in (Pharr et al., 2004).

Processor Architecture

But why implement a path tracer on an IPU? MC path tracing has some characteristics that lean towards the IPU’s strengths and also make it a good introductory example for Poplar programming. The algorithm is highly parallel and Poplar has been designed to express highly parallel IPU programs as simply as possible. The IPU has also shown good performance compared to alternative processors on other Monte-Carlo workloads in probabilistic modelling. Here are some highlights of the IPU architecture:

  • The IPU is a highly parallel MIMD machine: each GC200 contains 1472 tiles (homogeneous cores).
  • Each tile has 6 hardware worker threads, so a single IPU can be executing 8832 programs with completely independent control flow.
  • Each tile has 624KiB of private SRAM (897MiB aggregate per GC200).
  • Every tile/worker has its own hardware random number generator (HW-RNG) based on the xoshiro family (Blackman & Vigna, 2018).
  • Tiles can exchange data between their private SRAMs, all-to-all, with a bandwidth of 8TB/sec.
  • The Poplar C++ API makes it easy to describe highly parallel programs.

Poplar maps programs to a bulk synchronous parallel (BSP) computation paradigm (Valiant, 1990), (Cheatham, 1996). In this scheme all cores/nodes compute, then synchronise at a barrier, then all exchange data, and repeat. This computation model is well suited for expressing massively parallel computation without the programmer needing to micro-manage synchronisation and race conditions. The IPU hardware architecture directly embodies this paradigm: tiles iterate between computing and then exchanging data (between themselves, or off chip) and they contain hardware synchronisation support for this. Inter-tile communication is achieved by Poplar’s sophisticated graph compiler scheduling all of the program’s communication in advance. This allows the on-chip communication to be protocol free: no bandwidth is lost to addressing or packet headers which translates to a huge inter-core communication bandwidth.

Note: each tile also has a special accumulating matrix product (AMP) unit that achieves the device’s peak floating-point throughput and accelerates the large matrix multiplies and convolutions in deep learning: we will not be using that unit in this programming example.

Implementation

Although a fully featured path tracer is a highly complex and sophisticated piece of software, the basic principle can be expressed in a few hundred lines of C++, and we will use one such open source implementation, Smallpaint, as a starting point (Zsolnai-Fehér, 2018).

Smallpaint consists of C++ code that traces rays one by one through a simple scene in a loop which is automatically parallelised on the CPU using OpenMP. The inner loop calls various functions to intersect rays with the scene, compute the scattering effects at the intersections, and so on. This is a code fragment (not from Smallpaint) summarising this kind of rendering loop:

Main loop structure in a typical path tracer.
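
Below is a minimal sketch of this kind of loop, assuming hypothetical Scene, Camera, Ray, and Vec3 types (not Smallpaint’s actual classes):

    // Illustrative CPU rendering loop: trace one path per pixel per sample.
    #pragma omp parallel for schedule(dynamic)
    for (unsigned p = 0; p < imageWidth * imageHeight; ++p) {
      Vec3 colour(0.f, 0.f, 0.f);
      for (unsigned s = 0; s < samplesPerPixel; ++s) {
        <<jitter the pixel coordinate for anti-aliasing>>
        Ray ray = camera.generateRay(p);
        colour += scene.trace(ray, maxPathDepth);
      }
      framebuffer[p] = colour / float(samplesPerPixel);
    }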

(Note that fragments between brackets <<>> in these snippets are code we omit for succinctness: everything else is C++, not pseudocode.) We would like to take this code but run these loops on the IPU, parallelised over many tiles (cores) across multiple IPU processors.

Poplar Programming Model

The IPU uses a BSP graph programming model, so we can’t just take those C++ loops and compile them for the IPU. We first need to re-architect the program as a computational graph in Poplar.

Poplar is a C++ API for describing and compiling computational graphs, but the Poplar SDK also contains a codelet (kernel) compiler called PopC and a number of libraries that build on Poplar (Poplibs). In the path tracer implementation we will write C++ codelets (kernels) to perform the core path tracing computations (i.e. those within scene.trace()), then parallelise these kernels over many thousands of worker threads using Poplar’s graph description API. We will also utilise some Poplibs functionality for accessing the HW-RNG. Let’s see how all these Poplar components interoperate.

The Compute Graph

Tensor data, Compute vertices, and tile-mapping.

Poplar is used to construct a compute graph consisting of vertices describing computations, connected by edges describing data flow. Data on these edges can be n-dimensional tensors of various floating point and integer datatypes, but in this application the majority of the data are vectors plus a few scalar parameters. We can build a computation graph in C++ by adding compute vertices to our program and connecting fields in those vertices to tensors, creating the data dependencies as we do so. For example, following the path tracer code fragment above, we might implement a vertex which consumes random numbers sampled from a distribution appropriate for anti-aliasing and generates a noisy camera ray for each pixel; these rays are then passed on to a path trace vertex. This sequential computation forms the linear compute graph below:

An example Poplar compute graph for path tracing.

The ellipses above are compute vertices, the edges are labelled with the data flow, and the dashed rectangles are compute sets (CS1, CS2, and CS3) which contain the vertices. A compute set is essentially a compute phase in the IPU’s BSP schedule. C++ code to build this part of the compute graph using the Poplar API could look like this:

Constructing a Poplar compute graph in C++.
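
A hedged sketch of this kind of graph construction follows; the tensor and field names are illustrative rather than the repository’s exact ones, and tile mapping is omitted (as noted below):

    #include <poplar/Graph.hpp>

    // Illustrative sketch: build the RayGen part of the compute graph.
    // (The RayGen codelet itself must have been registered with graph.addCodelets().)
    poplar::Tensor buildRayGenerator(poplar::Graph &graph,
                                     poplar::ComputeSet rayGenCS,
                                     const poplar::Tensor &aaNoise,
                                     unsigned numRayComponents) {
      // Output tensor carrying the generated camera rays: an edge in the
      // compute graph between the RayGen and PathTrace vertices.
      auto rays = graph.addVariable(poplar::FLOAT, {numRayComponents}, "camera-rays");

      // Add a RayGen vertex to its compute set (e.g. CS2 in the diagram) and
      // connect its fields to tensors, creating the data dependencies.
      auto v = graph.addVertex(rayGenCS, "RayGen");
      graph.connect(v["antiAliasNoise"], aaNoise);
      graph.connect(v["rays"], rays);

      // <<set the tile mapping for 'rays' and 'v' here>>
      return rays;
    }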

In this snippet we omit the fragment that sets tile mappings: in the real program many thousands of vertices are added to each compute set and mapped to different tiles so that we utilise all worker threads across the machine. You can see how this is done here.

Compute Kernels (Codelets)

A compute kernel in Poplar is called a codelet. Codelets define the calculations performed by a worker at each compute vertex. Vertex code is written in a cut-down form of C++ and compiled by a dedicated compiler (PopC). Vertices can also be written entirely in assembly for maximum performance, or optimised using inline assembly or intrinsics to access IPU-specific hardware functionality (see Graphcore’s Vertex Programming Guide for details). A sketch of the C++ codelet for the RayGen vertex we introduced above is here:

Code fragment for the C++ RayGen vertex (compute kernel/codelet).
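
A hedged sketch of what such a codelet looks like is shown below (the class and field names follow the illustrative graph-construction sketch above; the real vertex code is linked below):

    #include <poplar/Vertex.hpp>

    // Illustrative RayGen codelet: the class name and the Input/Output field
    // names must match the strings used when wiring up the graph.
    class RayGen : public poplar::Vertex {
    public:
      poplar::Input<poplar::Vector<float>> antiAliasNoise;
      poplar::Output<poplar::Vector<float>> rays;
      // <<other fields, e.g. the pixel coordinates of this core's image tile>>

      bool compute() {
        for (unsigned i = 0; i < rays.size(); ++i) {
          // <<perturb the pixel coordinate with anti-aliasing noise and
          //   compute the corresponding camera-ray component>>
          rays[i] = antiAliasNoise[i]; // placeholder for the real calculation
        }
        return true; // vertex completed successfully
      }
    };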

We can see a class that inherits from the Vertex base class implementing a compute() method that defines the computation performed at this vertex. Note the names of the class and its Input and Output member fields match the strings we used to wire up the compute graph in buildRayGenerator(). For brevity we have omitted fragments with details of other parameters and the computation itself (you can see the full vertex code here).

Programs and Execution

Once the graph is constructed, all that is left is to describe how it should be executed. The simplest form of Poplar program just executes a single compute set, but compute sets can also be grouped into sequences, repeat loops, and the more complex (and data dependent) control flow constructs that Poplar allows. This control flow is executed on the device with no interaction from the host machine. The host only has to tell the device which program to run, hence there is no ‘kernel launch’ overhead (other than sending a program ID to the device). Here is a code fragment that builds a program with a loop that repeats samplesPerPixel times:

Constructing a graph program (adding control flow to a Poplar compute graph).
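
A sketch of this kind of program construction is shown below (the compute-set names follow the earlier diagram; the exact structure in the repository may differ):

    #include <poplar/Program.hpp>

    using namespace poplar::program;

    // Execute the three compute sets in order, then repeat the whole
    // sequence once per sample.
    Sequence renderOnce;
    renderOnce.add(Execute(cs1)); // random number generation
    renderOnce.add(Execute(cs2)); // RayGen
    renderOnce.add(Execute(cs3)); // PathTrace
    auto render = Repeat(samplesPerPixel, renderOnce);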

Note that the control flow within a sequence is determined not by the order in which compute sets are added to the sequence but by the data dependencies in the graph: the corollary is that, where there are no data dependencies, Poplar will automatically schedule work to run in parallel if possible. The final piece needed for a complete program is setting up data streams to and from the device, but this is straightforward and we omit the description here (the full source is available in Graphcore’s GitHub examples repo).

Using the mechanisms described we can build highly parallel and intricate compute graphs. Full details of how to write graph programs can be found in Poplar’s documentation https://docs.graphcore.ai/.

Porting the path tracer

The full source code is available as part of the Graphcore examples repository. We fork our own version of Smallpaint, both to ease the transition to a Poplar program and to add a few enhancements. The key modifications to Smallpaint are listed here:

  • Use float instead of double throughout (IPU has no hardware support for double).
  • Instead of parallelising over single paths in the loop, break the image into tiles and parallelise over jobs that process each image tile.
  • Use a C implementation of xoshiro so the CPU RNG will match the IPU (this also happens to be faster than the original’s use of the C++ standard library Mersenne-Twister).
  • Remove the recursion in the inner loop and replace it with two iterative loops: one to trace paths and a following one to accumulate the colour contributions (a sketch of this structure follows the list). Whilst we could use recursion on the IPU, we would have to manage the stack size ourselves.
  • Add a disc primitive.
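
To illustrate the recursion change, here is a hedged sketch of the two-loop structure; Vec3, Ray, and HitInfo are hypothetical stand-ins for the actual types used in the PathTrace vertex:

    // First loop: trace the path forwards, recording what was hit at each bounce.
    HitInfo hits[maxPathDepth]; // HitInfo: hypothetical emission/attenuation record
    Ray ray = primaryRay;
    unsigned depth = 0;
    while (depth < maxPathDepth) {
      <<intersect 'ray' with the scene; break if nothing is hit>>
      <<record the hit's emission and material attenuation in hits[depth]>>
      <<sample a new scattered, reflected, or refracted direction into 'ray'>>
      depth += 1;
    }

    // Second loop: walk the stored path backwards, accumulating the colour
    // contribution (emission plus attenuated light from deeper bounces).
    Vec3 colour(0.f, 0.f, 0.f);
    for (unsigned d = depth; d-- > 0;) {
      colour = hits[d].emission + hits[d].attenuation * colour;
    }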

All the ray/path tracing is performed in two custom vertices: PrimaryRayGen and PathTrace. Here is a schematic of the complete computational graph for the path tracer for one tile:

Schematic of the path tracer compute graph (as seen by a single core/tile).

To avoid clutter we omit the compute sets and simply group the compute vertices by program. This graph is replicated across all cores to render different tiles of the image into local frame buffers. To achieve maximum utilisation of the IPU, the image is split into a number of tiles that is as close as possible to the number of cores in the device (e.g. 1472 for a single GC200). There are four programs the host can execute: an initialisation program sequence (Init); a repeat loop to execute the core path tracing operation, producing a fixed number of samples per pixel (Render); a sequence to reset the frame-buffers to zero (ResetFrameBuffers); and finally a program which streams the frame-buffers back to the host (ReadFrameBuffer).

The init program sets up some parameters. Not all are listed here but essentially it sets the HW-RNG state and writes the pixel coordinates that specify the tile to be rendered on this core. The pixel coordinates are then read by the PrimaryRayGen vertex which generates a camera ray for each pixel by adding anti-aliasing noise to the pixel coordinates it receives. Note there is no data flow arrow between SetHwSeeds and the random number generating vertices. This is because the HW-RNG state is set in registers on each core rather than being passed around as tensor data.

Worker Utilisation

The random number generation vertices RandGenAntiAlias and RandGenUniform are actually program sequences constructed using Poplibs’ utilities, and Poplibs will automatically use as many worker threads as possible for those operations. PrimaryRayGen and PathTrace are custom C++ vertices. Most of the work is done in PathTrace, so there the frame-buffer tiles are split by rows among the six worker threads (meaning that for perfect utilisation the image tiles should have a multiple of 6 rows). While PrimaryRayGen could also be split among workers, profiling with PopVision showed this would add complexity for negligible benefit, so rays are generated using one thread per tile. You can see ray_gen takes around 231K cycles versus around 24M cycles for the path_trace vertex in the profile here:

Screenshot from PopVision profiler.
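
One way to express the row split described above in Poplar is to add one PathTrace vertex per worker thread on the same tile, each connected to a slice of that tile’s frame-buffer rows. A minimal sketch with illustrative names (not necessarily how the repository does it):

    // Split this tile's frame-buffer rows across the 6 hardware workers by
    // adding one PathTrace vertex per worker: vertices in the same compute
    // set that are mapped to the same tile run on separate worker threads.
    const unsigned numWorkers = 6;
    const unsigned rowsPerWorker = tileImageHeight / numWorkers; // assumes a multiple of 6 rows
    for (unsigned w = 0; w < numWorkers; ++w) {
      auto v = graph.addVertex(pathTraceCS, "PathTrace");
      unsigned rowBegin = w * rowsPerWorker;
      unsigned rowEnd = rowBegin + rowsPerWorker;
      graph.connect(v["frameBuffer"], frameBuffer.slice(rowBegin, rowEnd, 0));
      graph.connect(v["cameraRays"], cameraRays.slice(rowBegin, rowEnd, 0));
      graph.setTileMapping(v, tile); // all six vertices live on this tile
    }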

Random number generation

Samples are generated in advance using the HW-RNG via Poplibs’ utility functions. For the anti-aliasing distribution created in RandGenAntiAlias, a choice of uniform, Gaussian, or truncated Gaussian is available: a direct reflection of those currently supported in Poplibs. Other distributions could be generated by custom codelets that access the HW-RNG themselves. Since we need to set a maximum path depth anyway, we can ensure we generate enough samples for the worst-case path.
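
A hedged sketch of how the anti-aliasing samples might be generated with Poplibs’ poprand library is shown below. The tensor names and distribution parameters are illustrative, and exact poprand signatures can vary across SDK versions; the repository wires this up with its own helper code.

    #include <poprand/RandomGen.hpp>
    #include <poprand/codelets.hpp>

    poprand::addCodelets(graph); // register poprand's codelets with the graph

    // Seed the per-tile HW-RNG once (masterSeed is a 2-element UNSIGNED_INT
    // tensor written from the host); this belongs in the Init program.
    poprand::setSeed(graph, masterSeed, /*seedModifier=*/0u, initProg, "set-seed");

    // Draw per-pixel anti-aliasing offsets inside the render sequence. The
    // 'reference' tensor determines the tile mapping of the output.
    auto aaNoise = poprand::truncatedNormal(graph, nullptr, 0u,
                                            /*reference=*/cameraRays, poplar::HALF,
                                            /*mean=*/0.0, /*stdDev=*/0.3,
                                            /*alpha=*/2.0, renderProg, "aa-noise");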

Codelets (Kernels)

These are implemented in plain C++. No assembly, intrinsics, or any other special care has been taken to optimise the code for the IPU, and the AMP unit is not utilised.

Precision

The frame-buffer is float32 by default but can optionally be float16. In the latter case we limit the number of iterations of the repeat loop to avoid saturation: this is not a limitation in practice because we accumulate the final image on the host in higher precision anyway. Uniform random numbers are generated in float16, as are the primary camera rays in the PrimaryRayGen vertex (these are very close to normalised vectors, so precision issues are minimal).

Scene Data

The scenes used in this proof of concept are very simple and constructed from primitives rather than mesh geometry, which means it is possible to hold the entire scene in each core’s memory with room to spare. We also inherit Smallpaint’s approach to the scene data structure, whereby objects are stored in a list and intersections are tested against every object: no acceleration structure is used.
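
In code, the closest-hit query is therefore just a linear scan over the object list; a simplified sketch with hypothetical Object and Ray types:

    #include <limits>

    // Closest-hit query: test the ray against every object in the scene's
    // list (assuming a container of object pointers; no acceleration structure).
    const Object *closest = nullptr;
    float closestT = std::numeric_limits<float>::max();
    for (const Object *obj : scene.objects) {
      float t = obj->intersect(ray); // assume t <= 0 means the ray misses
      if (t > 0.f && t < closestT) {
        closestT = t;
        closest = obj;
      }
    }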

Host Interaction

Because the IPU is an accelerator, not a stand-alone computer, it requires a host CPU to run an operating system and orchestrate the computation. The host control loop looks like this:

Host-side control loop.
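
A simplified sketch of this loop (the program indices, stream handle, and accumulateAndSave() helper are hypothetical; the real loop, including the asynchronous hostProcessing object described below, is in the repository):

    #include <poplar/Engine.hpp>
    #include <vector>

    // Hypothetical indices into the list of programs passed to the Engine.
    enum Programs { Init = 0, Render, ReadFrameBuffer, ResetFrameBuffers };

    std::vector<poplar::program::Program> programs = {initProg, renderProg, readProg, resetProg};
    poplar::Engine engine(graph, programs);
    engine.load(device);
    engine.connectStream("frame-buffer-read", hostBuffer.data()); // hypothetical stream handle

    engine.run(Init);
    for (unsigned step = 0; step < numSteps; ++step) {
      engine.run(Render);          // produce samplesPerPixel new samples on the IPU
      engine.run(ReadFrameBuffer); // stream the frame-buffers back to the host
      engine.run(ResetFrameBuffers);
      accumulateAndSave(hostBuffer, image); // done asynchronously in the real code
    }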

We have not introduced the poplar::Engine object but hopefully its role is obvious from the fragment: it provides functionality that allows the host to control execution of programs on the IPU device.

The only complicated part of the loop is that the host accumulates the image so far into its own float32 buffer and writes it to disk in parallel with the IPU device producing the next set of path samples: the object hostProcessing encapsulates this asynchronous behaviour in the obvious way (implementation here). This ensures the IPU is fully utilised at all times, avoids saturation if we use a float16 frame-buffer, and allows a progressive preview of long renders.

Here is a schematic of the entire program including host (blue) and IPU (pink) computation:

Schematic of the entire computation (host and IPU).

You can see an operation WriteUndef that we have not yet discussed: this explicitly tells Poplar that certain temporary tensor data is no longer live at that point, so the memory statically allocated for it can be re-used. (Poplar can automatically determine liveness in many, but not all, cases.)
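
For example, once a temporary tensor such as the per-sample camera rays has been consumed, it can be marked dead explicitly (the tensor name here is illustrative):

    // Tell Poplar that 'cameraRays' is no longer live after this point so its
    // memory can be reused for other tensors.
    renderProg.add(poplar::program::WriteUndef(cameraRays));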

PopVision System Analyser.

The full source code contains PVTI trace points which allow us to visualise the system behaviour in the PopVision system analyser. In the screenshot from the analyser we can see how the host-side computation save_images overlaps with the rendering on the IPU.

Results

We now have everything we need to render some images on the IPU. As you can see even the most basic Monte-Carlo path tracer models a lot of physical effects in a natural way: soft shadows, diffuse scattering, reflection/refraction, and caustics:

Spheres with caustics. 1272x936 image, 100,000 samples per pixel (no de-noising), path traced in 4 minutes and 50 seconds on a Graphcore M2000 (4 x GC200 IPUs).
Cornell-box style scene. 1920x2160 image rendered to 1 million samples per pixel using 16 IPUs.

Performance

We can see the results look nice, but how fast is it? Since we started with a CPU version we can compare performance with that (noting that neither implementation has been heavily optimised for either platform, and also that we would not implement a real path tracer in anything like this way on either CPU or IPU). The machines we will compare are an IPU compute node (an M2000, which contains four GC200 IPUs in a 1U chassis) and a dual-processor CPU node containing two 48-core Intel Xeon Platinum 8168 processors. The latter is actually the IPU’s host CPU machine in our setup (but note that the hosts are disaggregated in IPU systems).

Benchmarks for the two scenes.

The M2000 node can sample paths at a rate two orders of magnitude higher than the dual processor CPU node. This may come as no surprise because the M2000 has orders of magnitude more hardware threads. However, to have the same amount of thread parallelism in a system based on these CPUs you would need 368 nodes plus the hardware and software to orchestrate them. Using IPUs instead plus Poplar’s BSP graph programming model makes writing programs with this level of parallelism relatively simple and requires significantly less hardware than a CPU implementation. Finally, it is worth mentioning that even though this is a “toy” program, there are plenty of occasions in research and engineering when we need results from a toy problem fast, without spending weeks optimising it (and obfuscating the code in the process).

References

Kajiya, J. T. (1986, August). The Rendering Equation. SIGGRAPH Comput. Graph., 20(4), 143–150.

Pharr, M., Jakob, W., & Humphreys, G. (2004–2021). Physically Based Rendering: From Theory To Implementation. https://www.pbr-book.org/

Blackman, D., & Vigna, S. (2018). Scrambled linear pseudorandom number generators. arXiv preprint arXiv:1805.01407.

Valiant, L. G. (1990). A bridging model for parallel computation. Communications of the ACM, 33(8), 103–111.

Cheatham, T., Fahmy, A., Stefanescu, D., & Valiant, L. G. (1996). Bulk synchronous parallel computing — a paradigm for transportable software. Tools and Environments for Parallel and Distributed Systems, pp. 61–76.

Zsolnai-Fehér, K. (2018). Smallpaint: A Global Illumination Renderer. Retrieved from https://users.cg.tuwien.ac.at/zsolnai/gfx/smallpaint/
