How The Turing GPU Does Ray-tracing (Explanation For Humans)

Nerdy N Gon
6 min read · Sep 4, 2019

Modern GPUs are incredible. Prior to the RTX series, video cards had to fake lighting because real lighting required too much math. The more math you throw at your processor, the slower each frame renders. GPUs have finally become fast enough to simulate the way light works in real time. But what did they do differently, and how do they do it?

Full diagram of the Turing architecture.

GPU architecture may look intimidating at first, but I believe there is a simple way to break it down. Before we get to GPCs, TPCs, and all that good stuff, we need to break down how ray-tracing works.

Modern Raytracing In A Nutshell

Raytracing

Raytracing works the same way real light works. Your computer monitor is made of pixels. The renderer shoots invisible lasers (rays) out through each pixel in a cone, as if the screen were an actual camera lens. Each ray bounces off surfaces as if it were an actual laser beam. Once a ray is finished bouncing, the renderer checks whether any object sits between the final point of collision and the light source. If nothing is in the way, the point is lit. If something is in the way, the point is shaded, as seen in the diagram below. Nvidia says the fastest Turing GPU can project 10 billion rays per second (10 Gigarays/s). Since the other two Turing variants have roughly 0.75x and 0.5x the number of cores, we can infer that they produce around 7.5 and 5 Gigarays/s.
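The lit-or-shadowed test above can be sketched in a few lines of Python. This is a toy example with hypothetical function names, not how the GPU actually implements it: one ray, spheres instead of polygons, and a single point light.

```python
import math

def ray_sphere_hit(origin, direction, center, radius):
    """Return distance t to the nearest hit, or None if the ray misses."""
    # Solve |origin + t*direction - center|^2 = radius^2 (a quadratic in t).
    # direction is assumed normalized, so the quadratic's leading term is 1.
    oc = [o - c for o, c in zip(origin, center)]
    b = 2 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4 * c
    if disc < 0:
        return None
    t = (-b - math.sqrt(disc)) / 2
    return t if t > 0 else None

def lit_or_shadowed(hit_point, light_pos, blockers):
    """Shadow ray: is anything between the hit point and the light?"""
    to_light = [l - p for l, p in zip(light_pos, hit_point)]
    dist = math.sqrt(sum(v * v for v in to_light))
    direction = [v / dist for v in to_light]
    # Nudge the start point off the surface so it doesn't shadow itself.
    start = [p + 1e-4 * d for p, d in zip(hit_point, direction)]
    for center, radius in blockers:
        t = ray_sphere_hit(start, direction, center, radius)
        if t is not None and t < dist:
            return "shadowed"
    return "lit"
```

A ray fired from the origin down the z-axis hits a sphere at (0, 0, 5) of radius 1 at distance 4; putting another sphere between that hit point and the light turns the result from "lit" to "shadowed".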

DLSS and the De-Noiser

As powerful as the GPU is, raytracing is still too demanding in most cases. Roughly 80% of the work is done in 20% of the time, and the remaining 20% of the work takes 80% of the time. Graphics engineers resolved this by only calculating that first 80% of the image in one fifth of the time, then using deep learning to guess each missing pixel's color based on the neighboring samples and previous frames.
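The "guess the missing pixels" idea can be sketched with a toy hole-filler. Real DLSS uses a trained neural network; this hypothetical version just averages whatever known neighbors a missing pixel has, which is enough to show the principle.

```python
def fill_missing(image):
    """image: 2D list of brightness floats, with None where no ray sample landed."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if image[y][x] is None:
                # Collect the up-to-8 surrounding pixels that were sampled.
                neighbors = [image[y + dy][x + dx]
                             for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                             if (dy or dx)
                             and 0 <= y + dy < h and 0 <= x + dx < w
                             and image[y + dy][x + dx] is not None]
                out[y][x] = sum(neighbors) / len(neighbors) if neighbors else 0.0
    return out
```

A hole surrounded by bright pixels gets filled with a bright guess; the sampled pixels pass through untouched.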

20% of the time is spent calculating the left image. The image is mostly complete, so deep learning algorithms "guess" what the remaining pixel colors are.

Dividing Up The Workload

Each color is a workload to be processed

Objects are first prepared for ray-tracing when exported from their modeling software. Polygons are divided into several chunks ahead of time. The rabbit in this case has a different color for each group of polygons. Although the rabbit is a simple shape, it has been subdivided several times to look smoother. The smoother an object, the more polygons are needed to represent it.

Octree Path Tracing

When a ray hits a specific point, your computer needs to determine which polygon it hit and when. This is called pathtracing. In the past, pathtracing required drawing a line and testing it against every individual polygon it might hit. Dense meshes with lots of polygons required tons and tons of sorting math based on the rays' and vertices' 3D coordinates. Octrees make everything easier.

The Turing architecture has entire cores dedicated to this process alone. The scene starts as objects inside an invisible bounding box. The pathtrace ray only needs to check whether it collides with that single box. The object's box then gets cut into smaller boxes (eight in 3D; the 2D diagram below shows four). The ray either intersects one of those boxes or it does not. The boxes the ray intersects get subdivided again, while the boxes that do NOT contain the ray are removed from processing. This process continues until a single polygon group is isolated.
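The narrowing process above can be sketched in 2D, matching the diagram: cut a box into 4 children, discard the children the ray misses, and recurse. All names here are hypothetical, and Turing's RT cores traverse this kind of hierarchy in fixed-function hardware rather than in code like this.

```python
def ray_hits_box(origin, direction, lo, hi):
    """Slab test: does the ray origin + t*direction (t >= 0) cross the box [lo, hi]?"""
    tmin, tmax = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if abs(d) < 1e-12:
            # Ray is parallel to this axis: it must already be inside the slab.
            if not (l <= o <= h):
                return False
        else:
            t1, t2 = (l - o) / d, (h - o) / d
            tmin = max(tmin, min(t1, t2))
            tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def isolate_leaf(origin, direction, lo, hi, depth):
    """Subdivide into 4 children, keep the first child the ray hits, recurse."""
    if depth == 0:
        return (lo, hi)
    mx = [(a + b) / 2 for a, b in zip(lo, hi)]
    for cx in ((lo[0], mx[0]), (mx[0], hi[0])):
        for cy in ((lo[1], mx[1]), (mx[1], hi[1])):
            clo, chi = (cx[0], cy[0]), (cx[1], cy[1])
            if ray_hits_box(origin, direction, clo, chi):
                return isolate_leaf(origin, direction, clo, chi, depth - 1)
    return None
```

Each level costs only a handful of yes/no box checks instead of testing every polygon, which is the whole point of the octree.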

This process is more binary and requires far less dynamic math. On top of that simplification, Turing has entire cores dedicated to octree pathtracing and raytracing.

Isolating boxes with dots in them
Shooting a ray onto a rabbit to figure out which polygon it hits only requires four binary (yes/no) checks.

Think of the GPU as a Large Office Building

Our GPU is very large. It is made up of several smaller processors. Each processor can be thought of as a person in a cubicle who has to do work to help process the rabbit in our example. Each grouping of cubicles has a team name. For now, only pay attention to the INT32, FP32, tensor cores, and RT cores. Fundamentally they all do the same thing: they each take numbers and do math calculations like a mini calculator. Now think of each grey box as a room. Every 4 rooms share the same supply box (L1 cache). The grey box room is the only part of the GPU you need to look at, because the engineers basically copy and pasted each grouping of cubicles over and over again.

The Streaming Multiprocessor (SM)

SM: A group of 4 rooms is called an SM (streaming multiprocessor). All 4 rooms share a pool of 96KB of L1 memory.

INT32: The job of this team is to calculate integer numbers. Integers are whole numbers. These cores compute with up to 32 bits. The more bits you have, the LARGER the numbers you can work with.

FP32: The job of this team is to calculate floating point numbers. Floating point numbers can have a fractional part (a decimal point). These cores also compute with up to 32 bits, but here the more bits you have, the MORE PRECISE the numbers you can work with. The tradeoff is between representing a wider range of numbers and representing them more precisely.
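The "more bits, more precise" idea is easy to see by round-tripping a number through 32-bit float storage. This is a toy Python check using the standard `struct` module, not anything GPU-specific:

```python
import struct

def to_float32(x):
    """Round-trip a Python float (64-bit) through 32-bit float storage."""
    return struct.unpack("f", struct.pack("f", x))[0]

# float32 keeps 24 significand bits, so 2**24 + 1 cannot be stored exactly:
# to_float32(16777217.0) comes back as 16777216.0, while smaller values survive.
```

With 32 bits, whole numbers past 16,777,216 start losing exactness; a 64-bit float would hold that value with no trouble, which is exactly the bits-versus-precision tradeoff.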

Tensor Cores: These are full processors that can do all sorts of math. If the FP32 and INT32 cores can be thought of as interns that can only be tasked with simple work, these guys can be thought of as more experienced workers who get larger cubicles because they have more responsibility. They are responsible for all the deep learning calculations. They can work with 4-bit or 8-bit integers, or 16-bit floating point numbers. The fewer bits they use, the more calculations they can do per second.

Larger numbers take longer
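The bit-width tradeoff the tensor cores exploit can be illustrated with a toy 8-bit quantizer. The function name and scheme here are hypothetical; tensor cores do matrix math in hardware, and this only shows what squeezing values into 8 bits costs in precision.

```python
def quantize_int8(values, scale):
    """Map floats onto the -128..127 int8 grid and back to floats."""
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return [x * scale for x in q]
```

With a scale of 0.01, the value 0.1234 snaps to 0.12 (fine detail lost) and anything above 1.27 clamps to the top of the 8-bit range, but every stored value now fits in a single byte, so far more of them can be crunched per second.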

RT Cores: These are full processors that can do all sorts of math. If the FP32, INT32, and tensor cores have larger cubicles, these guys can be thought of as workers with their own offices. They are responsible for all the raytrace calculations.

L1 Cache: The workload queue. All 4 rooms draw from the same 96KB pool, which can be split in 1 of 2 ways.

32KB can be shared between all rooms while 64KB is reserved for specific rooms

OR the opposite

64KB can be shared between all rooms while 32KB is reserved for specific rooms

LD/ST: Load/store units. Basically just the doors to the room.

TEX: Memory reserved specifically for texture data that can be loaded into the L1 cache.

Variants: Nvidia has created 4 variants of the Turing architecture, each with a different number of SMs and with their respective core counts per SM.
