Stalling a GPU

Let’s consider this piece of shader code:

half center = tex2D(_Tex, uv).g;
half left = tex2D(_Tex, uvleft).g;
half right = tex2D(_Tex, uvright).g;
half up = tex2D(_Tex, uvup).g;
half down = tex2D(_Tex, uvdown).g;
if (center > 0)
    return left + right + up + down;
else
    return 0;

We might be tempted to optimize this like so:

half center = tex2D(_Tex, uv).g;
// Take the derivatives outside the branch; they are undefined inside divergent flow control
float2 dx = ddx(uv);
float2 dy = ddy(uv);
UNITY_BRANCH
if (center > 0)
{
    // Explicit gradients let the sampler pick a mip level inside the branch
    half left = tex2Dgrad(_Tex, uvleft, dx, dy).g;
    half right = tex2Dgrad(_Tex, uvright, dx, dy).g;
    half up = tex2Dgrad(_Tex, uvup, dx, dy).g;
    half down = tex2Dgrad(_Tex, uvdown, dx, dy).g;
    return left + right + up + down;
}
return 0;

Assuming the compiler honors our UNITY_BRANCH hint (Unity’s macro for the HLSL [branch] attribute) and actually emits a branch, we’ve potentially saved 4 samples! However, we’ve introduced something far worse: a dependent texture read (DTR for short). It may not be obvious what a DTR is if you’re used to writing CPU code, so let’s rewrite our original example as CPU pseudo code to illustrate what is really happening.

var centerJob = LaunchTextureJob(_Tex, uv);
var leftJob = LaunchTextureJob(_Tex, uvleft);
var rightJob = LaunchTextureJob(_Tex, uvright);
var upJob = LaunchTextureJob(_Tex, uvup);
var downJob = LaunchTextureJob(_Tex, uvdown);
// if the center fetch is finished, we might be able to early out
when (centerJob.Done())
{
    if (centerJob.result <= 0)
    {
        return 0;
    }
}
// if not, we wait for all jobs to complete and return result
when (leftJob.Done() && rightJob.Done() && upJob.Done() && downJob.Done())
{
    return leftJob.result + rightJob.result + upJob.result + downJob.result;
}

Now with our branch added:

var centerJob = LaunchTextureJob(_Tex, uv);
// wait for the center job to complete
when (centerJob.Done())
{
    if (centerJob.result <= 0)
    {
        return 0;
    }
}
// now launch all the other jobs
var leftJob = LaunchTextureJob(_Tex, uvleft);
var rightJob = LaunchTextureJob(_Tex, uvright);
var upJob = LaunchTextureJob(_Tex, uvup);
var downJob = LaunchTextureJob(_Tex, uvdown);
when (leftJob.Done() && rightJob.Done() && upJob.Done() && downJob.Done())
{
    return leftJob.result + rightJob.result + upJob.result + downJob.result;
}

Texture fetches can happen in parallel on the GPU, so they are more like jobs being launched on other threads. GPU compilers will reorder your code to start these jobs as soon as possible. Introducing a branch that depends on the center sample’s return value forces the GPU to wait for that job to complete before it can decide whether it needs to fetch the other samples at all. In most cases, it’s just faster to start all of the fetches at once.
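To put rough numbers on this, here is a minimal Python sketch of the two pseudo code versions above. It is a toy model, not a measurement: the function names and the 100-cycle fetch latency are made up for illustration, and it assumes that fetches issued together overlap perfectly.

```python
def independent_latency(center, neighbors):
    """All five fetches are issued up front and overlap, so we wait for the slowest."""
    return max([center] + list(neighbors))


def dependent_latency(center, neighbors, center_value):
    """The neighbor fetches are only issued once the center value is known."""
    if center_value <= 0:
        return center  # early out: only the center fetch was issued
    return center + max(neighbors)


FETCH = 100  # hypothetical latency of one texture fetch, in cycles
neighbors = [FETCH] * 4

print(independent_latency(FETCH, neighbors))     # 100
print(dependent_latency(FETCH, neighbors, 1.0))  # 200
print(dependent_latency(FETCH, neighbors, 0.0))  # 100
```

In this model the branched version never finishes sooner than the original, even when it takes the early out. What it saves is issued fetches, not waiting time.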

A balancing act

The compiler will attempt to reorder your code to get any algorithmic work done while it waits for these texture fetches (assuming there are no dependencies). For instance, if you were also generating some noise on the GPU, it might compute this noise while waiting for the textures to be fetched, making the cost of generating the noise essentially free. Much like balancing the load of many cores, shaders can be designed to balance texture fetches and algorithmic work in parallel. In a shader with few texture fetches, generating noise on the GPU can be more expensive than using a texture, while in a shader with lots of texture fetches, it might be faster to generate the noise than to read it from a texture. Of course, this depends entirely on the GPU you are running on.
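The same kind of toy model shows why the noise can end up free. This is only a sketch under simplifying assumptions: the cycle counts are invented, and math that does not need the fetched texels is assumed to overlap the fetches perfectly.

```python
def shader_time(fetch_latency, independent_alu, dependent_alu):
    """Math that doesn't need the texels overlaps the fetches; the rest runs after."""
    return max(fetch_latency, independent_alu) + dependent_alu


FETCH = 200  # hypothetical cycles spent waiting on texture fetches
NOISE = 80   # hypothetical cycles of procedural noise (needs no texels)
BLEND = 20   # hypothetical cycles of blending (needs the texels)

print(shader_time(FETCH, 0, BLEND))      # 220, no noise at all
print(shader_time(FETCH, NOISE, BLEND))  # 220, the noise hides behind the fetches
print(shader_time(50, NOISE, BLEND))     # 100, with few fetches the noise is the bottleneck
```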

All samples are not created equal

Another thing to consider is the cost of a sample. In our case above, we sample a center pixel and the four pixels around it. This looks like 5 texture samples, but the real cost is much lower than that, because the data needed for the adjacent samples will usually already be in the texture cache. When you sample a texture on a GPU, it figures out which mip level to sample from, loads a small section of texture data and, if the texture is compressed, decompresses it into the cache. Repeated samples in close proximity like this are thus much faster than samples scattered all over the texture, or taken from different textures, because that work is already done.
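A rough way to picture this is to count how many small blocks of the texture a set of samples touches. The sketch below assumes 4x4 texel blocks and a cache that never evicts anything; both are simplifications chosen for illustration, and real GPU caches differ in their details.

```python
def count_block_loads(texels, block_size=4):
    """Count the distinct block_size x block_size blocks touched by (x, y) texels."""
    loaded = set()
    for x, y in texels:
        loaded.add((x // block_size, y // block_size))
    return len(loaded)


# Center texel plus its left, right, up and down neighbors
cross = [(9, 9), (8, 9), (10, 9), (9, 8), (9, 10)]
# Five samples spread across the texture
scattered = [(9, 9), (100, 3), (40, 77), (5, 200), (250, 250)]

print(count_block_loads(cross))      # 1
print(count_block_loads(scattered))  # 5
```

A cross that straddles a block boundary touches a few blocks instead of one, but that is still far less work than five unrelated loads.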

Sometimes the dependency is worth it

As an example of a dependency that pays off, consider the way Unity’s terrain shader works vs. MicroSplat. Unity’s terrain shader draws the terrain once for every 4 textures used on it, and during each pass, it samples a single control texture (splat map) and 4 sets of textures (diffuse, normal, etc). These can all be launched in parallel at the very beginning of the fragment shader. Once they are completed, blending is performed and the result returned.

MicroSplat takes a different approach. It runs in a single pass and samples all the control maps to get the weights for all of the textures used on the terrain. It then sorts these and takes the 4 highest weight values. It then samples only those 4 texture sets instead of all of the ones which might be on the terrain, and branches around any textures that have 0 weight as well.

This introduces a dependency on all of the control map textures being sampled before it can start sampling the actual terrain textures, because it won’t know which ones to sample until the weights from the control texture are returned and sorted. This makes MicroSplat slower than the Unity terrain shader in very simple shading cases, but at some level of complexity (more terrain textures, more complex sampling like triplanar or stochastic), this extra dependency allows for massive speed increases.
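To make the trade-off concrete, here is a small Python sketch that counts texture fetches for both approaches. It is not MicroSplat’s or Unity’s actual code: the function names are hypothetical, and it assumes each control map stores 4 weights and each terrain layer has 2 maps (diffuse and normal).

```python
import math


def multi_pass_fetches(layer_count, maps_per_layer=2):
    """One pass per 4 layers; each pass reads 1 control map plus all maps of its 4 layers."""
    passes = math.ceil(layer_count / 4)
    return passes * (1 + 4 * maps_per_layer)


def pick_top_layers(weights, max_layers=4):
    """Return (layer_index, weight) for the highest weighted layers, skipping zero weights."""
    ranked = sorted(enumerate(weights), key=lambda iw: iw[1], reverse=True)
    return [(i, w) for i, w in ranked[:max_layers] if w > 0]


def single_pass_fetches(weights, maps_per_layer=2):
    """Read every control map first, then only the maps of the selected layers."""
    control_maps = math.ceil(len(weights) / 4)  # assumes 4 weights per control map
    return control_maps + len(pick_top_layers(weights)) * maps_per_layer


# 16 layers on the terrain, but only 3 of them contribute to this pixel
weights = [0.0] * 16
weights[1], weights[3], weights[6] = 0.5, 0.3, 0.2

print(pick_top_layers(weights))      # [(1, 0.5), (3, 0.3), (6, 0.2)]
print(multi_pass_fetches(16))        # 36
print(single_pass_fetches(weights))  # 10
```

With only 4 layers, all of them visible, both functions return 9. The single pass approach then fetches just as much but still pays for the dependency, which matches the simple case where it loses.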

Further, because we are already paying the cost of that dependency, we know that we are not adding extra dependencies if we want to branch around texture samples based on their weights. Doing that on the Unity shader actually slows it down because of the extra dependency introduced.

Summary

  • Remember that a shader is more like a bunch of jobs than serial CPU code, even though the syntax looks like serial code.
  • Avoid introducing extra dependencies between those jobs, and understand the dependencies you already have, as they can hint at additional optimizations.
  • The cost of certain operations can be completely masked by this parallelism.
  • Not all texture samples cost the same.

Graphics Engineer, blog mainly about shader techniques used in my Unity assets available on the asset store

Jason Booth
