Ray Ray Tracing Raises Ray Tracing Rate

Making Ray Tracing 8x Faster on a Laptop

Max Fitton
Distributed Computing with Ray

--

Computer graphics has improved rapidly over the past few decades, leading to increasingly realistic video games and movies. Ray tracing, which renders 3D scene descriptions into 2D images, is one of the most powerful techniques. While ray tracing generates the most detailed and realistic images, it consumes substantially more compute than cheaper techniques such as rasterization. We will explore how Ray, a distributed execution framework for Python applications, makes it easy to add parallel computation to a simple ray tracing application, reducing its execution time by 8x on my 8-core laptop.

State-of-the-art ray-tracing implementations use GPU-based tracing algorithms, whereas this blog post takes a CPU-based approach. However, it illustrates a technique that could be applied as easily to a cluster of GPUs as to my laptop.

And… there will be a lot of the word “ray” in this post, so I use capital-R “Ray” to mean the distributed systems framework and lowercase “ray tracing” to indicate the graphics technique.

Background

Before diving into implementation, let’s take a look at the two kinds of ray for readers who may not be familiar.

Ray tracing is a computer graphics technique that generates 2D images from 3D environments by imitating the way a camera captures photographs. However, while a physical camera takes in light to capture an image of its surroundings, a ray tracer runs the process in reverse. It sends rays of “light” out from its “camera,” through a 2D plane whose coordinates correspond to pixels in an image. These rays may then intersect with and reflect off objects in the scene. The color of each pixel is determined by the angle of the reflection relative to the light source, along with the shape and material properties of the object the ray strikes. When a ray reflects off an object and travels directly toward a light source, the corresponding pixel is more strongly affected by that light’s properties, such as its luminosity (intensity of light) and color. You can imagine reversing the trajectory of the ray that you have traced: following a ray from the light source, reflecting off an object, and terminating at the camera. Tracing feasible paths from the camera to light sources is more efficient than tracing light rays from each light source to the camera, because most rays leaving a light source never reach the camera.
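To make the geometry concrete, here is a minimal sketch (not taken from the implementation discussed below; names are illustrative) of the core test a ray tracer performs: does a ray fired from the camera hit an object, and at what distance?

```python
import numpy as np

def intersect_sphere(origin, direction, center, radius):
    """Return the distance t along the ray to the nearest sphere hit, or inf.

    Solves |origin + t*direction - center|^2 = radius^2, a quadratic in t;
    the smallest positive root is the visible intersection.
    """
    oc = origin - center
    a = np.dot(direction, direction)
    b = 2.0 * np.dot(direction, oc)
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - 4 * a * c
    if disc < 0:
        return np.inf  # the ray misses the sphere entirely
    sqrt_disc = np.sqrt(disc)
    for t in ((-b - sqrt_disc) / (2 * a), (-b + sqrt_disc) / (2 * a)):
        if t > 1e-9:  # ignore hits behind (or exactly at) the camera
            return t
    return np.inf

# A ray from a camera at the origin, pointing down the z-axis, toward a
# unit sphere centered 5 units away: it hits the near surface at t = 4.
t = intersect_sphere(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
                     np.array([0.0, 0.0, 5.0]), 1.0)
```

A full tracer repeats this test against every object, keeps the nearest hit, and then shades it using the surface normal and light directions as described above.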

A visualization of the ray tracing algorithm.

Because these ray computations are independent of one another, a program can freely trace rays in parallel. Ray is a Python framework developed by the RISELab at UC Berkeley that will allow us to accomplish that with minimal changes to our program.

Python has a global interpreter lock that prevents thread-level parallelism. Thus, to run code in parallel, you need to run multiple processes, each of which has its own memory space. But why use Ray over the built-in multiprocessing module? I want to share a few ways in which Ray makes our lives much easier. With multiprocessing, the programmer must:

  1. Figure out how to pass messages between processes (e.g. gRPC, message queues, etc.)
  2. Copy data between processes or figure out a shared memory scheme
  3. Deal with failures of processes gracefully

Ray lets you not worry about any of this. When using Ray, message passing occurs through Python function calls and method invocations, so you can program your application as though everything is running in one thread on one computer. Ray’s use of Apache Arrow enables data sharing between processes to save on copying costs without any thought from the user. Tasks (functions) and actors (classes) restart automatically if they fail.

Our Sample Implementation

The ray-tracer implementation that we are looking at is found here. Credits go to Cyrille Rossant. It is small, yet reasonably performant through its use of numpy. [1] It incorporates recursive ray “bounces” to produce more accurate lighting and reflectivity.

The original code calls trace_ray in a loop over all pixels on the screen, sending one ray through each pixel and recursively tracing its bounces to calculate a color value.
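In outline, the serial version looks roughly like this (the loop structure matches the description above, but `trace_ray`'s body here is a trivial stand-in, not the real recursive tracer):

```python
import numpy as np

WIDTH, HEIGHT = 400, 300

def trace_ray(x, y):
    # Stand-in for the real recursive tracer: returns an RGB color
    # for the ray sent through screen coordinates (x, y).
    return np.array([x / WIDTH, y / HEIGHT, 0.5])

img = np.zeros((HEIGHT, WIDTH, 3))
for j in range(HEIGHT):
    for i in range(WIDTH):
        # One ray per pixel, traced immediately, written immediately.
        img[j, i] = trace_ray(i, j)
```

Every iteration is independent of every other, which is exactly what makes this loop a good candidate for parallelism.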

Ultimately, to achieve the greatest possible speed-up with Ray for my 8-core laptop, we want to break this loop up into 8 chunks, letting one worker/core handle each chunk.

Changes to Utilize Ray

In Ray, the user must define the distributed parts of their program as either Python functions (tasks) or classes (actors). Given that we want to perform the recursive ray-tracing in Ray, I extract this logic to its own function.

By decorating this function with ray.remote, I enable it to run as a task in a distributed context.

In addition, the original version of the function calculates the result and then immediately sets the color intensity of the pixel in an image data structure. Now that we are kicking off many jobs in parallel, we need each call to return its result instead of writing to the pixel results list immediately. [2]

The required changes are:

  1. Trigger the jobs and collect the results, along with their corresponding coordinates. These results are not the actual return values of the traced rays. Rather, they refer to the location where the return value will be stored, and are used like Python futures. [3]
  2. Await the results, zip them to the coordinates, and set the image pixels in a loop.

Results

To help illustrate what the speed-up means, I’ve attached two images that took roughly 30 seconds each to render, with only the latter using Ray.

Generated from original raytracing script
Generated using Ray

Benchmarks

In benchmarking this program on my 8-core laptop, I found that for large inputs, Ray yielded approximately the 8x speed-up I expected.

For each version of the script, I ran 5 trials for each image size: 600x450, 1200x900, and 1600x1200. In each trial, I captured the user value from the Unix time command to reduce noise in the data due to changes in other processes’ usage on the machine. Though I didn’t include the results here, the sys usage time is slightly higher for the Ray version due to the system calls required to distribute the work, though the difference is marginal.

Overall, we get closer to achieving a full 8x speed-up on larger inputs where the work of tracing rays begins to eclipse work in other parts of the program.

Other Improvements Made Easy by Ray

Anti-Aliasing
Anti-aliasing is a technique for preventing artifacts, the image distortions that result from algorithmic imperfections. It does this by sending multiple slightly-offset rays into the scene per pixel and averaging the results to determine the color value of the pixel. In doing so, it removes imperfections caused by the limited sampling that ray tracing performs. Ray makes it easy to implement anti-aliasing in parallel as a wrapper around the core trace_rays_with_bounces function.
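The wrapper idea can be sketched like this (a minimal serial illustration with a stand-in tracer; in the parallel version the wrapper itself would be the remote task):

```python
import random
import numpy as np

def trace_ray(x, y):
    # Stand-in for the core per-pixel tracer.
    return np.array([float(x), float(y), 0.0])

def trace_antialiased(x, y, samples=4):
    """Average several slightly-offset rays through the same pixel."""
    colors = [trace_ray(x + random.uniform(-0.5, 0.5),
                        y + random.uniform(-0.5, 0.5))
              for _ in range(samples)]
    return np.mean(colors, axis=0)

color = trace_antialiased(10, 20)
```

Each jittered sample stays within the pixel's footprint, so the averaged color smooths edges without blurring across neighboring pixels.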

Easy Sharing of Changing State
While the independent nature of tracing individual rays highlights the simplicity of using Ray tasks, there are times when you want to share information between the ray computations. For instance, you may want to give each computation access to a bounding box hierarchy, a data structure that makes ray tracing much faster by reducing unnecessary collision checks with objects. It does so by partitioning the 3D space into segments that each contain only a subset of the scene’s objects, which can result in massive speed-ups. In real-time ray tracing, the bounding box hierarchy changes over time with the scene, and workers need a way to fetch an up-to-date copy; a Ray actor suits this use case.

Ray Tracing on a Cluster
Using the Ray cluster launcher, you can run your program on the cloud (with spot instance support because of Ray’s fault-tolerance). This means that if you run out of algorithmic optimizations, you can go faster by throwing more compute at the problem!
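For reference, a minimal launcher config might look roughly like this (provider, region, and user values are placeholders; consult the Ray cluster launcher documentation for the full schema):

```yaml
# cluster.yaml -- launch with: ray up cluster.yaml
cluster_name: raytracer
max_workers: 8
provider:
  type: aws
  region: us-west-2
auth:
  ssh_user: ubuntu
```

Once the cluster is up, the same script runs unchanged; Ray schedules tasks across all nodes instead of just local cores.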

Wrapping Up

I wrote this blog post because I think Ray can do for distributed systems what cornerstone libraries like TensorFlow and React have done for machine learning and interface-building respectively. Distributed computing grows in importance every year as Moore’s Law proves insufficient to meet growing computational needs (read more in Ion Stoica’s article “The Future of Computing is Distributed”). Ray’s goal is to democratize distributed computing and bring it beyond the purview of specialists. The Ray project already includes several “batteries-included” distributed libraries for reinforcement learning, HTTP serving, data munging in pandas, and more. Check it out!

Here is a repo with all the code used in this blog post, including the reference implementation, the Ray implementation, and the scripts I used for benchmarking and graphing.

Finally, if you enjoyed this, you can attend Ray Summit for more interesting content!

[1]: Incidentally, Ray supports zero-copy sharing of numpy ndarrays among workers, which can save a tremendous amount of memory.

[2]: You could also create a Ray actor (a Python class) to represent the image, then pass that actor into the trace_ray_with_bounces function.

[3]: A reasonable question here is: “What’s the deal with CHUNK_SIZE? Why didn’t you make a single task correspond to the tracing of a single ray?” When tasks become too small, performance decreases because the overhead of communication overwhelms the benefits of parallelism. For example, an image the size I generated, 1600 by 1200 pixels, would require launching more than a million sub-tasks. Even a small amount of overhead per task adds up quickly with that many tiny tasks. The ideal chunk size here is 1/8 of the total number of pixels, which results in 8 tasks distributed across 8 cores.
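The arithmetic behind that choice:

```python
WIDTH, HEIGHT, NUM_CORES = 1600, 1200, 8

total_pixels = WIDTH * HEIGHT             # 1,920,000 rays at one per pixel
chunk_size = total_pixels // NUM_CORES    # 240,000 rays per task
num_tasks = total_pixels // chunk_size    # 8 tasks, one per core
```

Eight chunks keep every core busy while paying the per-task overhead only eight times instead of nearly two million times.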

--


Software Engineer at Anyscale working on the future of distributed computing. Lover of travel, food, and learning.