Efficient GPU Rendering for Dynamic Instances in Game Development

Kacper Szwajka
Jan 24, 2024


This article explores a custom rendering architecture designed to efficiently render procedurally generated geometry. Our focus is on setting up compute shaders and reducing batch counts by merging draws into fewer material passes, while keeping memory allocation for the whole system to a minimum. We’ll skip over the basics of indirect rendering and GPU instancing, as there are already many resources available on these topics.

All visible instances were rendered with the described architecture

Assumptions: We’re working with a system that aims to use the least amount of data possible for a fixed number of transformations, but with the flexibility to handle different types of game elements (referred to as ‘Prefabs’). The goal is to render everything as efficiently as possible, with most processes (like culling, LOD selection, and data compaction) being handled by the GPU.

Prototype

Think of a prototype as something akin to a prefab in game design. It’s a package that contains all the LOD (Level of Detail) levels, each with its submeshes and materials. This extra layer of abstraction gives us more control and lets us use some neat tricks that would be difficult or too labor-intensive with traditional Unity rendering.

Examples of Customization:

  • Custom Materials for Shadows: By skipping alpha clipping for vegetation shadows, or by drawing all mesh shadows with a single material, we can significantly improve batching.
  • Rendering Shadows Across LOD Thresholds: We can use the last LOD level mesh of a prototype to render shadows for all LOD distances. This approach reduces the vertex count and further improves batching.

First Attempt at Implementation

Inspired by articles like GPU-Driven Rendering Engines, my initial focus was on reducing batch count and optimizing material passes, essentially aiming to render the entire scene with one large command buffer.

This architecture worked in the following way:

  1. We need to know how many prototypes of each type we will have and allocate specific buffers for them. We combine all required prototypes and allocate one big transform-and-visibility buffer.
    We also mark each buffer’s start/end so we can later use a prefix-sum algorithm to read back how many prototypes of each type we have.
  2. We mark whether a specific prototype is visible (based on some external class) and then run a prefix sum + compaction to find out how many instances of each prototype we need to draw, as sketched below.
  3. Then we write this data into the batch buffer and from there into the command buffer.
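A minimal sketch of that compaction step, assuming the visibility flags and their exclusive prefix sum were produced by earlier dispatches (all names here are illustrative):

#define GPUI_THREADS 64 // assumed value; matches the macro used in the kernels later in this article

StructuredBuffer<uint> visibility;         // 1 = visible, 0 = culled
StructuredBuffer<uint> prefixSum;          // exclusive prefix sum of visibility
RWStructuredBuffer<uint> compactedIndices; // dense list of visible instance indexes
uint maxInstances;

[numthreads(GPUI_THREADS,1,1)]
void Compact (uint3 id : SV_DispatchThreadID)
{
    if(id.x >= maxInstances) return;
    if(visibility[id.x] == 0) return;

    // prefixSum[i] counts the visible instances before i, so it is exactly
    // the dense output slot for instance i.
    compactedIndices[prefixSum[id.x]] = id.x;
}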

As far as I know, this is a fairly standard method, and there are already tons of great papers about it.

While this method works for static environments, problems begin when we want to dynamically change the types of prototypes. My main use case was to pair this architecture with the dynamic TerrainInstancer I was working on. You can read more about it here:
https://medium.com/@kacper.szwajka842/gpu-run-time-procedural-placement-on-terrain-cc874e39bbfb
While we can predict how many transforms we will have, the number of each specific prototype is unknown and may change every time we update the instancer.
In the first attempt I allocated buffers for all available cases. For example, if I wanted to spawn 10 types of trees in 1,000 positions (with the type of tree picked randomly), I would need to allocate data for 10,000 trees and run all the prefix sum / compaction work over them. This solution scales terribly with an increasing number of prototypes.

Found Solution!

One of the highlights of this algorithm is its efficiency — specifically, how it keeps memory allocation constant, no matter how many prototypes are involved.
Let’s unpack this process step by step for a better understanding.

  1. Picking Prototype / Culling
    We pick a prototype index and set up a pick-prototype buffer. This buffer maps 1:1 to the transform buffer and stores the picked prototype index. For better memory usage we pack this pick-prototype buffer in the following way: 16-bit picked prototype, 14-bit LOD value, 2-bit visibility (GBuffer, shadow). A 16-bit value can address 65,536 entries, which will be our maximum number of prototype types in a batch group. The 14-bit LOD value and the 2-bit visibility flags are used later.
    Then we perform frustum culling (Hierarchical-Z occlusion culling could also be implemented here) and LOD picking. Culling needs to be performed separately for the GBuffer and shadow passes, which is why we store their visibility in two separate bits.
    It is also worth mentioning that we only support directional shadow culling. The picked LOD is stored in the 14-bit field as a value in the range 0.0–4.0, where 4 is the maximum LOD level we support. We keep it fractional as a transition value to support LOD cross-fades.
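    Here is a minimal sketch of how this bit layout could be packed and unpacked in HLSL (the function names are illustrative, not taken from the project):

// Bit layout per instance, matching the description above:
//   bits  0-15 : picked prototype index
//   bits 16-29 : LOD as a 14-bit fixed-point value in the range 0.0-4.0
//   bits 30-31 : visibility flags (bit 30 = GBuffer, bit 31 = shadow)
uint PackInstance(uint prototypeIndex, float lod, bool gbufferVisible, bool shadowVisible)
{
    uint lodBits = (uint)(saturate(lod / 4.0) * 16383.0); // quantize to 14 bits
    uint visBits = (gbufferVisible ? 1u : 0u) | (shadowVisible ? 2u : 0u);
    return (prototypeIndex & 0xFFFFu) | (lodBits << 16) | (visBits << 30);
}

void UnpackInstance(uint packedData, out uint prototypeIndex, out float lod,
                    out bool gbufferVisible, out bool shadowVisible)
{
    prototypeIndex = packedData & 0xFFFFu;
    lod            = ((packedData >> 16) & 0x3FFFu) / 16383.0 * 4.0;
    gbufferVisible = (packedData & (1u << 30)) != 0;
    shadowVisible  = (packedData & (1u << 31)) != 0;
}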
  2. Visibility Transfer
    Here, based on the picked LOD, we read from the “GPU_PROTOTYPE_LOD” buffer to find out which meshes need to be rendered at the current LOD level.
  3. Picking LOD Meshes for Prototypes
    Let’s say that in the first step we picked prototype 5. The compute shader then reads prototype 5’s parameters and, based on its bounding-box position and size, performs culling + LOD picking.
    Then, based on the picked LOD, it reads which meshes (batch indexes) need to be rendered and writes their indexes, as sketched below.
    Note: the size of this “instancePickBatchBuffer” depends on the maximum number of submeshes in a prototype LOD level. In my implementation, I assume we can store up to 4 submeshes for GBuffer rendering and 4 for the shadow pass (they may differ).
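    A rough sketch of how steps 2–3 could look as one kernel, assuming a hypothetical layout for the “GPU_PROTOTYPE_LOD” buffer and reusing UnpackInstance from the sketch above (the real buffer layout is not shown in this article):

#define MAX_LODS 5 // LOD values 0-4, per the packing above

// Hypothetical entry: which batches a given prototype LOD level renders.
struct PrototypeLod
{
    uint meshCount;   // number of submeshes at this LOD level
    uint4 batchIndex; // up to 4 submesh batch indexes (GBuffer pass)
};

StructuredBuffer<PrototypeLod> GPU_PROTOTYPE_LOD; // indexed [prototype * MAX_LODS + lod]
StructuredBuffer<uint> pickPrototypeBuffer;       // packed data from step 1
RWStructuredBuffer<uint> instancePickBatchBuffer; // 4 slots per instance
uint maxInstances;

[numthreads(GPUI_THREADS,1,1)]
void TransferVisibility (uint3 id : SV_DispatchThreadID)
{
    if(id.x >= maxInstances) return;

    uint prototypeIndex; float lod; bool gbufferVisible, shadowVisible;
    UnpackInstance(pickPrototypeBuffer[id.x], prototypeIndex, lod,
                   gbufferVisible, shadowVisible);

    PrototypeLod entry = GPU_PROTOTYPE_LOD[prototypeIndex * MAX_LODS + (uint)lod];
    for (uint i = 0; i < 4; i++)
    {
        // 0xFFFF marks an unused slot, so sorting pushes it to the end and
        // the batch-marking kernels below skip it as invalid.
        uint batch = (gbufferVisible && i < entry.meshCount) ? entry.batchIndex[i] : 0xFFFFu;
        instancePickBatchBuffer[id.x * 4 + i] = batch;
    }
}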
  4. Sorting
    After the last step, our “instancePickBatchBuffer” is in a very chaotic order, and we still don’t know how many meshes of each type we need to render. Because of that, we sort it.
  5. Batch Counting and Command Buffer Preparation
    Once sorting is done, we run a few steps to calculate each batch’s count and then transform this data into one big command buffer, which is used for the indirect draw call.
    - We dispatch a shader over the sorted buffer and search for each batch’s start and end. Here’s example code doing this:
[numthreads(GPUI_THREADS,1,1)]
void MarkLaneStart (uint3 id : SV_DispatchThreadID)
{
    if(id.x >= maxInstances) return;

    uint refIndex = batchIndexesRef[id.x];
    uint pickBatch = batchIndexes[refIndex];
    if(pickBatch >= 0xFFFF) return; // Invalid / culled slot

    // A batch starts at the very first instance, or wherever the batch
    // index differs from the previous entry in the sorted order. Reading
    // index id.x - 1 is explicitly guarded to avoid an underflowed lookup.
    bool isStart = (id.x == 0);
    if(!isStart)
    {
        isStart = pickBatch != batchIndexes[batchIndexesRef[id.x - 1]];
    }
    if(isStart)
    {
        batches[pickBatch].start = id.x;
        batchesVisibility[pickBatch] = 1;
    }
}

[numthreads(GPUI_THREADS,1,1)]
void MarkLaneEnd (uint3 id : SV_DispatchThreadID)
{
    if(id.x >= maxInstances) return;

    uint refIndex = batchIndexesRef[id.x];
    uint pickBatch = batchIndexes[refIndex];
    if(pickBatch >= 0xFFFF) return; // Invalid / culled slot

    // A batch ends at the very last instance, or wherever the batch index
    // differs from the next entry in the sorted order. Checking the bound
    // first guards the out-of-range read at index id.x + 1.
    bool isEnd = (id.x + 1 >= maxInstances);
    if(!isEnd)
    {
        isEnd = pickBatch != batchIndexes[batchIndexesRef[id.x + 1]];
    }
    if(isEnd)
    {
        batches[pickBatch].end = id.x;
    }
}

Then we loop over all batches, count their sizes, and write this data into the command buffer.
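A minimal sketch of that counting pass, reusing the batches and batchesVisibility buffers from the kernels above (the “DrawCommand” struct mirrors Unity’s GraphicsBuffer.IndirectDrawIndexedArgs; “commands” and “maxBatches” are illustrative names):

// Turn the marked start/end ranges into indirect draw arguments.
struct DrawCommand
{
    uint indexCountPerInstance;
    uint instanceCount;
    uint startIndex;
    uint baseVertexIndex;
    uint startInstance;
};

RWStructuredBuffer<DrawCommand> commands;
uint maxBatches;

[numthreads(GPUI_THREADS,1,1)]
void BuildCommands (uint3 id : SV_DispatchThreadID)
{
    if(id.x >= maxBatches) return;

    if(batchesVisibility[id.x] == 0)
    {
        commands[id.x].instanceCount = 0; // nothing survived culling
        return;
    }

    // After sorting, the instances of one batch are contiguous, so the
    // count is simply the distance between the marked start and end.
    commands[id.x].instanceCount = batches[id.x].end - batches[id.x].start + 1;
    commands[id.x].startInstance = batches[id.x].start;
    // indexCountPerInstance / startIndex / baseVertexIndex are written once
    // from mesh data on the CPU and left untouched here.
}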

Note: these two kernels could easily be merged into one, but I left them separate for easier reading. There is no measurable performance cost anyway, because 99.9% of the bottleneck is in the sorting pass.

Sorting on GPU

Currently, one of the biggest bottlenecks of the system is sorting. There are already a lot of great papers explaining this process and how it can be optimized.
I’m still testing new options, but probably the biggest change would be moving to DX12 and using the power of wave intrinsics and other memory-access tricks. I found this repo, which implements some of the best sorting algorithms and also tests their performance: https://github.com/b0nes164/ShaderOneSweep
My current implementation is based on bitonic sort: https://github.com/nobnak/GPUMergeSortForUnity

Transform Storage

As mentioned before, one of the biggest bottlenecks is memory bandwidth.
My first attempt was to optimize the transform matrix: instead of a float4x4, use a float4x3 as Unity does in BRG (https://blog.unity.com/engine-platform/batchrenderergroup-sample-high-frame-rate-on-budget-devices).
Another optimization is to not store the WorldToObject matrix at all and instead calculate it in the shader.
As far as I know, the WorldToObject matrix is mainly needed for lighting calculations, so we may be able to skip it entirely for some pipelines. A sketch of the in-shader reconstruction follows below.
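A minimal sketch of that reconstruction, assuming a plain TRS transform (no shear) with the rotation stored as a unit quaternion:

// Rotation matrix from a unit quaternion (x, y, z, w convention).
float3x3 QuatToMatrix(float4 q)
{
    float3x3 m;
    m[0] = float3(1 - 2*(q.y*q.y + q.z*q.z), 2*(q.x*q.y - q.z*q.w), 2*(q.x*q.z + q.y*q.w));
    m[1] = float3(2*(q.x*q.y + q.z*q.w), 1 - 2*(q.x*q.x + q.z*q.z), 2*(q.y*q.z - q.x*q.w));
    m[2] = float3(2*(q.x*q.z - q.y*q.w), 2*(q.y*q.z + q.x*q.w), 1 - 2*(q.x*q.x + q.y*q.y));
    return m;
}

// WorldToObject for a TRS transform: inverse(T*R*S) = S^-1 * R^T * T^-1,
// so no general 4x4 matrix inversion is needed in the shader.
float4x4 InverseTRS(float3 position, float4 rotation, float3 scale)
{
    float3x3 rT = transpose(QuatToMatrix(rotation));
    rT[0] /= scale.x; // S^-1 scales the rows of R^T
    rT[1] /= scale.y;
    rT[2] /= scale.z;
    float3 t = -mul(rT, position); // translation part of the inverse
    return float4x4(float4(rT[0], t.x),
                    float4(rT[1], t.y),
                    float4(rT[2], t.z),
                    float4(0, 0, 0, 1));
}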

Thanks to Jason Booth’s optimization solution, I managed to pack position, rotation, and scale into a float3 and a uint3: the position is stored in the float3, while the scale and rotation go into the uint3.
https://forum.unity.com/threads/gpu-driven-rendering.1519936/#post-9492490:~:text=The%20list%20of,and%20reconstruct%20it.
That was a real game changer. For example, storing 1,000,000 transforms as float4x4 takes ~64 MB, but with the final packing method it only takes ~24 MB.
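For illustration, here is one possible way to unpack such a uint3 (this exact bit layout is my assumption, not necessarily the one from the forum post):

// Assumed layout: rotation quaternion in data.x / data.y (two 16-bit
// halves per uint), non-uniform scale in data.z (3 x 10 bits).
float4 UnpackRotation(uint2 q)
{
    // Each component was quantized from [-1, 1] into 16 bits.
    float4 r;
    r.x = (q.x & 0xFFFFu) / 65535.0 * 2.0 - 1.0;
    r.y = (q.x >> 16)     / 65535.0 * 2.0 - 1.0;
    r.z = (q.y & 0xFFFFu) / 65535.0 * 2.0 - 1.0;
    r.w = (q.y >> 16)     / 65535.0 * 2.0 - 1.0;
    return normalize(r); // re-normalize to undo quantization error
}

float3 UnpackScale(uint s)
{
    const float maxScale = 8.0; // assumed upper bound for scale values
    float3 r;
    r.x = ((s >>  0) & 0x3FFu) / 1023.0 * maxScale;
    r.y = ((s >> 10) & 0x3FFu) / 1023.0 * maxScale;
    r.z = ((s >> 20) & 0x3FFu) / 1023.0 * maxScale;
    return r;
}

Position stays as a raw float3, so each instance takes 12 + 12 = 24 bytes, which is where the ~24 MB figure for 1,000,000 transforms comes from.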

Other memory optimization

Beyond scaling and rotation, other parameters like transform index selection and LOD crossfade values are also packed, further optimizing memory usage. The final buffers and their packing descriptions are as follows:

Drawing methods

If you use Unity, just remember to use “Graphics.RenderMeshIndirect()” instead of “Graphics.DrawMeshInstancedIndirect()”; the former is newer and has working features like motion vectors in HDRP. You can easily find more details about both on the Unity forum.

Can we do better?

  • Drawing Many Meshes Once: My implementation is done in the Unity engine, which, as far as I know, does not yet fully support a proper way to draw many different meshes with the same material in one call (maybe in the future). As RenderDoc shows, Unity splits each mesh into a different batch but only passes the data for these meshes once, which is still better than fully separate draw calls. Here’s one thread showing this:
    https://forum.unity.com/threads/gpu-driven-rendering-with-srp-no-drawproceduralindirectnow-for-commandbuffers.1301712/
  • Merging materials and using texture arrays: This is something I will try to improve in the next step. The idea is to merge material parameters into buffers and store their textures in a texture array; then, when drawing an object, we just choose which page from the array should be used (see the sketch after this list). The only problem is that implementing this in Unity in a somewhat “clean” way is quite hard. As far as I know, there is no direct API in Unity to force a texture to load/unload.
    For example, let’s say we gather all the materials that should be merged. Each material has 3 textures, so as a result we should get 3 texture arrays. If we then leave even one reference to an old material somewhere in code, its textures will almost certainly still be loaded into memory. If that happens, drawing the geometry itself will actually be cheaper, but we will practically double the memory consumption.
    I would love to see a “proper” way to do this.
  • Mesh Clusters: This technique has become more and more popular in the last few years. Probably the first major implementation can be found here: https://advances.realtimerendering.com/s2015/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf
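As a sketch of the texture-array idea from the second point above, each merged material could carry just a page index into a shared Texture2DArray (all names and the per-material buffer layout here are assumptions):

Texture2DArray _AlbedoArray;
SamplerState sampler_AlbedoArray;
StructuredBuffer<float> _MaterialPages; // array page per merged material

float4 SampleAlbedo(float2 uv, uint materialIndex)
{
    // Pick the page for this material and sample the shared array.
    float page = _MaterialPages[materialIndex];
    return _AlbedoArray.Sample(sampler_AlbedoArray, float3(uv, page));
}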
