Two-Pass Occlusion Culling
Occlusion Culling is an optimization technique that is used to improve performance by skipping the rendering of geometry that is occluded, or hidden, by other objects in the scene. There are many different occlusion culling implementations, each with its own set of problems, such as visible popping, data authoring, or performance issues. One of the most widely used techniques in the gaming industry today is a form of GPU-driven occlusion culling called Hierarchical Z-Buffer (HZB) Occlusion Culling.
Here is a brief overview of how HZB occlusion culling works: First, an HZB is built using a depth buffer. The geometry is then culled against the HZB using its bounding volumes. An HZB is a chain of MIP-maps that is generated by downsampling each MIP level to the next. This is done by taking the farthest depth value of each set of 4 texels to form each new texel: the maximum with a standard depth range, or the minimum with Reversed-Z. In addition to occlusion culling, an HZB can also be used for rendering Screen Space Reflections, Volumetric Fog, etc.
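The downsampling step can be illustrated on the CPU. The following is a minimal C++ sketch, not a GPU implementation: `DepthMip`, `downsample`, and `buildHzb` are hypothetical names introduced here, and the depth buffer is assumed to have power-of-two dimensions.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// One MIP level of the depth pyramid, stored row-major (hypothetical type).
struct DepthMip {
    int width;
    int height;
    std::vector<float> texels;

    float at(int x, int y) const {
        // Assumes power-of-two dimensions; the clamp only matters once
        // one axis has already shrunk to a single texel.
        x = std::min(x, width - 1);
        y = std::min(y, height - 1);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Reduce each 2x2 footprint to its farthest depth:
// max with a standard depth range, min with Reversed-Z.
DepthMip downsample(const DepthMip& src, bool reversedZ) {
    DepthMip dst;
    dst.width = std::max(1, src.width / 2);
    dst.height = std::max(1, src.height / 2);
    dst.texels.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            float a = src.at(2 * x, 2 * y);
            float b = src.at(2 * x + 1, 2 * y);
            float c = src.at(2 * x, 2 * y + 1);
            float d = src.at(2 * x + 1, 2 * y + 1);
            float farthest = reversedZ ? std::min({a, b, c, d})
                                       : std::max({a, b, c, d});
            dst.texels[static_cast<std::size_t>(y) * dst.width + x] = farthest;
        }
    }
    return dst;
}

// Build the full chain, from the depth buffer down to a 1x1 level.
std::vector<DepthMip> buildHzb(DepthMip base, bool reversedZ) {
    std::vector<DepthMip> chain;
    chain.push_back(std::move(base));
    while (chain.back().width > 1 || chain.back().height > 1) {
        chain.push_back(downsample(chain.back(), reversedZ));
    }
    return chain;
}
```

On the GPU, each level would typically be produced by a compute or fragment shader pass that reads the previous level; the reduction per texel is the same.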
There are a few ways to optimize the process of building an HZB in order to improve performance. One such method is to use Texture Gather or Sampler Reduction Modes. Both cut the number of texture fetches needed per output texel, which improves performance.
Testing against an HZB is typically done using Axis Aligned Bounding Boxes (AABBs) inside a compute shader. The result of this compute shader is usually a set of draw call arguments representing visible objects, which are then used to perform an indirect draw call.
To perform an HZB occlusion test, we need to choose a MIP level at which the AABB's screen-space footprint covers only a few texels, so that a handful of samples gives good coverage and keeps the culling conservative. This is typically done with a logarithmic formula based on the AABB's screen-space width and height in pixels. MIP level selection is slightly more complex if the HZB's width and height differ.
// MIP level selection example
int mipLevel = floor(log2(max(AABB.pixelWidth, AABB.pixelHeight)));
Problem
As we already know, in order to build an HZB, we need a depth buffer. The problem is: How do we obtain this depth buffer in the first place? One might assume that we need to render all of our geometry first in order to fill the depth buffer. However, if we do that, what is the point of HZB if all of the geometry has already been rendered? The purpose of occlusion culling is to skip geometry rendering as much as possible, so it needs to be done before the actual rendering process, not after.
Potential Solutions
One common solution to this problem is to render a small subset of geometry first, which is going to represent occluders. Occluders are typically chosen and manually authored by artists, and are usually large objects such as buildings, walls, or terrain. The rendering of occluder geometry is typically done only to the depth buffer, without fragment shader invocations (though it can also be rendered to GBuffers). This process is called Depth Prepass. Using the resulting depth buffer, we can build an incomplete but conservative HZB that can be used for the occlusion culling process. While this technique works well, it requires a significant amount of manual effort to keep it useful.
In most situations, we can assume that geometry that was visible in the previous frame is likely to be visible in the current frame as well. One way to make use of this knowledge is to reuse the depth buffer from the previous frame in order to generate an approximation of the depth buffer for the current frame. This technique is called Depth Buffer Reprojection, and it involves taking into account the new camera transform and the Velocity Vector Buffer from the previous frame. Depth reprojection can also be useful for building shadow map HZBs.
One advantage of depth reprojection is that it reduces the need for manual occluder selection, although using both of these techniques can yield even better results.
While depth reprojection has many benefits, it also has some significant limitations. One major problem is precision: since an HZB is built using an approximation of the depth buffer, the culling process can become non-conservative in some cases. This can result in problems such as visible popping, where objects appear to “pop” into view as the camera moves.
Solution
An alternative to using depth buffer reprojection is to use the same core assumption that we started with, but instead of reprojecting the depth buffer from the previous frame, we “reproject” all visible geometry from the previous frame. This technique is called Two-Pass Occlusion Culling. It is considered the state-of-the-art occlusion culling technique by many at the moment and is currently getting more popular due to Nanite.
As the name suggests, two-pass occlusion culling involves dividing the geometry pipeline into two passes. Both passes consist of a single compute shader dispatch, whose purpose is to fill indirect draw call arguments, followed by the execution of an indirect draw call, preferably by using Multi Draw Count Indirect if available.
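To make "filling indirect draw call arguments" concrete, here is a minimal CPU-side C++ sketch of the append pattern such a culling shader might use. The field layout of `IndirectDrawArgs` mirrors Vulkan's `VkDrawIndexedIndirectCommand`; the `DrawArgsBuffer` type and its `append` method are hypothetical, and `std::atomic` stands in for a GPU atomic counter.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same field order as Vulkan's VkDrawIndexedIndirectCommand.
struct IndirectDrawArgs {
    std::uint32_t indexCount;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  vertexOffset;
    std::uint32_t firstInstance;
};

// Hypothetical stand-in for the GPU buffers written by the culling shader.
struct DrawArgsBuffer {
    std::vector<IndirectDrawArgs> commands;  // sized for the worst case
    std::atomic<std::uint32_t> drawCount{0}; // read later as the draw count

    explicit DrawArgsBuffer(std::size_t maxDraws) : commands(maxDraws) {}

    // Each visible object reserves the next free slot and writes its command.
    void append(const IndirectDrawArgs& args) {
        std::uint32_t slot = drawCount.fetch_add(1, std::memory_order_relaxed);
        if (slot < commands.size()) {
            commands[slot] = args;
        }
    }
};
```

The counter is what makes the "count" variant of an indirect draw useful: the GPU reads the number of commands from a buffer, so the CPU never needs to know how many objects survived culling.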
First Pass
In the first pass, we only process the objects that were visible in the previous frame. To do this, we dispatch a compute shader with the same number of threads as the total number of objects in the scene. Each thread performs frustum culling and LOD selection on objects that were previously visible, while skipping non-visible objects entirely. The result of this compute shader is stored in a GPU buffer as indirect draw call arguments.
After the compute shader is dispatched, we execute an indirect draw call using the arguments from the previously mentioned GPU buffer. To track which objects were visible in the previous frame, we use another GPU buffer called the Visibility Buffer. Each element in this buffer corresponds to a single object in the scene, and multiple visibility bits can be packed into each buffer element to save space. The visibility buffer is initialized with either 1s or 0s at the beginning of the application.
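Packing visibility bits can be sketched as follows. This is a minimal single-threaded C++ illustration with a hypothetical `VisibilityBits` class that stores 32 objects per word; it is an assumption about one possible layout, not the layout of any particular engine.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit-packed visibility storage: one bit per object.
class VisibilityBits {
public:
    VisibilityBits(std::size_t objectCount, bool initiallyVisible)
        : words_((objectCount + 31) / 32, initiallyVisible ? 0xFFFFFFFFu : 0u) {}

    bool get(std::size_t objectIndex) const {
        return ((words_[objectIndex / 32] >> (objectIndex % 32)) & 1u) != 0u;
    }

    void set(std::size_t objectIndex, bool visible) {
        std::uint32_t mask = 1u << (objectIndex % 32);
        if (visible) {
            words_[objectIndex / 32] |= mask;
        } else {
            words_[objectIndex / 32] &= ~mask;
        }
    }

private:
    std::vector<std::uint32_t> words_;
};
```

In a compute shader, several threads may update bits in the same word at once, so the read-modify-write in `set` would need atomic OR/AND operations there.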
...
// Read object's visibility from the previous frame
bool visible = visibilityBuffer[drawIndex];
// [Optional] Check if previously visible object
// is frustum culled in the current frame
if (visible)
{
    bool frustumCulled = isFrustumCulled(...);
    visible = visible && !frustumCulled;
}
// Only objects that were visible in the
// previous frame should be drawn in the first pass
bool shouldDraw = visible;
if (shouldDraw)
{
    // [Optional] Select LOD
    ...
    // Fill indirect draw call arguments
    IndirectDrawArgs args;
    ...
    drawArgs[drawArgsIndex] = args;
}
After the first pass, we generate an HZB from the resulting depth buffer. Because this depth buffer contains only geometry that was actually rendered in the current frame, the resulting HZB is fully conservative, unlike one built from a reprojected depth buffer.
Second Pass
In the second pass, we again dispatch a compute shader with the same number of threads as the total number of objects in the scene. This time, however, each thread performs occlusion culling in addition to frustum culling and LOD selection, regardless of whether the object was previously visible or not. The result of this dispatch is a set of indirect draw call arguments representing objects that were found to be visible in this pass and were not drawn in the first pass. These draw arguments are stored in a GPU buffer and later executed as an indirect draw call.
To ensure that we don’t draw objects that were already drawn in the first pass, we need to skip drawing objects that had visibility of 1 in the previous frame. Additionally, we need to update the visibility buffer for each object for the next frame based on the results of frustum and occlusion culling.
...
// [Optional] Check if object is frustum culled in the current frame
bool frustumCulled = isFrustumCulled(...);
bool visible = !frustumCulled;
// Check if object is occlusion culled in the current frame
if (visible)
{
    bool occlusionCulled = isOcclusionCulled(...);
    visible = visible && !occlusionCulled;
}
// Only objects that are visible in the current frame
// and were not drawn in the first pass should be drawn in the second pass
bool shouldDraw = visible && !visibilityBuffer[drawIndex];
if (shouldDraw)
{
    // [Optional] Select LOD
    ...
    // Fill indirect draw call arguments
    IndirectDrawArgs args;
    ...
    drawArgs[drawArgsIndex] = args;
}
// Fill visibility buffer for the next frame
visibilityBuffer[drawIndex] = visible;
Example
As an example, let’s consider a scene with 5 objects and a moving camera. We will observe the two-pass occlusion culling process of a single frame.
In the first pass, we process all 5 objects in the scene. If an object had a visibility of 0 in the previous frame, we skip it. For objects that had a visibility of 1, we perform frustum culling and LOD selection. Let’s say that we had 3 visible objects in the previous frame, but since the camera is moving, one of the objects is no longer visible due to it being outside the camera’s frustum. As a result, we render only the two objects that are still visible and skip the third one.
In the second pass, we again process all 5 objects in the scene, but this time we don’t skip any of them, regardless of their previous visibility. We perform frustum culling, occlusion culling and LOD selection for all 5 objects. Let’s say that compared to the previous frame, one previously non-visible object enters the camera frustum and becomes visible in the current frame. This means that in this pass, we will draw only this additional object on top of the two already drawn objects from the first pass, resulting in a total of three rendered objects. We also update the visibility buffer appropriately for the next frame.
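The frame described above can be simulated on the CPU. In this minimal C++ sketch, `SceneObject`, `FrameResult`, and `cullTwoPass` are hypothetical names, and the `inFrustum` and `occluded` flags stand in for the results of the real frustum and HZB tests.

```cpp
#include <cstddef>
#include <vector>

// Per-object test results for the current frame (stand-ins for real tests).
struct SceneObject {
    bool inFrustum; // result of frustum culling this frame
    bool occluded;  // result of the HZB test this frame
};

struct FrameResult {
    std::vector<std::size_t> firstPassDraws;
    std::vector<std::size_t> secondPassDraws;
};

// Runs both passes and updates `visibility` for the next frame.
FrameResult cullTwoPass(const std::vector<SceneObject>& objects,
                        std::vector<bool>& visibility) {
    FrameResult result;

    // First pass: only objects visible last frame, frustum culled.
    for (std::size_t i = 0; i < objects.size(); ++i) {
        if (visibility[i] && objects[i].inFrustum) {
            result.firstPassDraws.push_back(i);
        }
    }

    // (The HZB would be built here from the first-pass depth buffer.)

    // Second pass: test every object, draw only what the first pass missed.
    for (std::size_t i = 0; i < objects.size(); ++i) {
        bool visible = objects[i].inFrustum && !objects[i].occluded;
        if (visible && !visibility[i]) {
            result.secondPassDraws.push_back(i);
        }
        visibility[i] = visible;
    }
    return result;
}
```

For the scene above, take objects 0 to 2 as previously visible, object 2 as having left the frustum, and object 3 as newly visible. The first pass then draws objects 0 and 1, the second pass draws object 3, and the visibility buffer for the next frame marks objects 0, 1, and 3.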
Conclusion
Two-pass occlusion culling is generally a very effective optimization technique, but it shows its limits when the view changes abruptly between frames, because the previous frame's visibility then says little about the current one. Such changes are uncommon and mostly occur at camera cuts, such as cutscene transitions. When they do occur, they may hurt performance for a single frame after the cut. One potential mitigation is to introduce an additional depth prepass.
This technique can also be used for meshlet and triangle occlusion culling, although triangle occlusion culling is usually not worth the effort. It can be used with Forward Rendering, Deferred Rendering, and Visibility Buffering (not to be confused with the previously mentioned visibility buffer). For very dense geometry, visibility buffering is likely to work best, which is why Epic decided to use it in Nanite. My implementation of two-pass occlusion culling can be found on my GitHub page.
