
Efficient Render Passes on Tile-Based Rendering Hardware

Shahbaz Youssefi
Android Developers
Aug 9, 2024

Overview

There are currently two major classes of GPU architectures: Immediate-Mode Rendering (IMR) and Tile-Based Rendering (TBR).

The IMR architecture is older, somewhat simpler, and more forgiving of inefficiently written applications, but it’s power hungry. Often found in desktop graphics cards, this architecture is known to provide high performance while consuming hundreds of watts of power.

The TBR architecture, on the other hand, can be very energy efficient, as it minimizes access to RAM, a major source of energy draw in typical rendering. Often found in mobile and battery-powered devices, these GPUs can consume as little as single-digit watts of power. However, this architecture’s performance depends heavily on correct application usage.

In comparison with IMR GPUs, TBR GPUs have some advantages (such as efficient multisampling) and disadvantages (such as inefficient geometry and tessellation shaders). For more information, see this blog post. Some GPU vendors produce hybrid architectures, and some manage to consume little power with IMR hardware on mobile devices, but in most GPU architectures used in mobile devices, it’s the TBR features that make low power consumption possible.

In this post, I’ll explain one of the most important features of TBR hardware, how it can be most efficiently used, how Vulkan makes it very easy to do that, and how OpenGL ES makes it so easy to ruin performance and what you can do to avoid that.

Efficient Rendering on TBR Hardware

Without going into too much detail, TBR hardware operates on the concept of “render passes”. Each render pass is a set of draw calls to the same “framebuffer” with no interruptions. For example, say a render pass in the application issues 1000 draw calls.

TBR hardware takes these 1000 draw calls, runs the pre-fragment shaders, and figures out where each triangle falls in the framebuffer. It then divides the framebuffer into small regions (called tiles) and replays the same 1000 draw calls in each of them separately (or rather, whichever triangles actually hit that tile).

The tile memory is effectively a cache that you can’t get unlucky with. Unlike CPU and many other caches, where bad access patterns can cause thrashing, the tile memory is a cache that is loaded and stored at most once per render pass. As such, it is highly efficient.

So, let’s put one tile into focus.

Memory accesses between RAM, Tile Memory and shader cores. The Tile Memory is a form of fast cache that is (optionally) loaded or cleared on render pass start and (optionally) stored at render pass end. The shader cores only access this memory for framebuffer attachment output and input (through input attachments, otherwise known as framebuffer fetch).

In the above diagram, there are a number of operations, each with a cost:

  • Fragment shader invocation: This is the real cost of the application’s draw calls. The fragment shader may also access RAM for texture sampling, etc., which is not shown in the diagram. While this cost is significant, it is irrelevant to this discussion.
  • Fragment shader attachment access: Color and depth/stencil data is found on the tile memory, access to which is lightning fast and consumes very little power. This cost is also irrelevant to this discussion.
  • Tile memory load: This costs time and energy, as accessing RAM is slow. Fortunately, TBR hardware has ways to avoid this cost:
    - Skip the load and leave the contents of the framebuffer on the tile memory undefined (for example because they are going to be completely overwritten)
    - Skip the load and clear the contents of the framebuffer on the tile memory directly
  • Tile memory store: This is at least as costly as load. TBR hardware has ways to avoid this cost too:
    - Skip the store and drop the contents of the framebuffer on the tile memory (for example because that data is no longer needed)
    - Skip the store because the render pass did not modify the values that were previously loaded

The most important takeaway from the above is:

  • Avoid load at all costs
  • Avoid store at all costs

This is trivial with Vulkan, but easier said than done with OpenGL. If you are on the fence about moving to Vulkan, the extra work of managing descriptor sets, command buffers, etc. will be well worth the tremendous gain from creating fewer render passes with the appropriate load and store ops.

Render Passes in Vulkan

Vulkan natively has a concept of render passes with load and store operations, mapping directly to the TBR features above. A render pass takes a set of attachments (some color, maybe a depth/stencil), each with a load op (corresponding to “Tile memory load” as described in the section above) and a store op (corresponding to “Tile memory store”). Inside the render pass, only a few calls are allowed; notably calls that set state, bind resources, and draw.

You can create render passes with VK_KHR_dynamic_rendering (modern approach) or VkRenderPass objects (original approach). Either way, you can configure the load and store operations of each render pass attachment directly.

Possible load ops are:

  • LOAD_OP_CLEAR: This means that the attachment is to be cleared when the render pass starts. This is very cheap, as it is done directly on tile memory.
  • LOAD_OP_LOAD: This means that the attachment contents are to be loaded from RAM. This is very slow.
  • LOAD_OP_DONT_CARE: This means that the attachment is not loaded from RAM, and its contents are initially garbage. This has no cost.

Possible store ops are:

  • STORE_OP_STORE: This means that the attachment contents are to be stored to RAM. This is very slow.
  • STORE_OP_DONT_CARE: This means that the attachment is not stored to RAM, and its contents are thrown away. This has no cost.
  • STORE_OP_NONE: This means that the attachment is not stored to RAM because the render pass never wrote to the attachment at all. This has no cost.

An ideal render pass could look like the following:

  • Use LOAD_OP_CLEAR on all attachments (very cheap)
  • Numerous draw calls
  • Use STORE_OP_STORE on the primary color attachment, and STORE_OP_DONT_CARE on ancillary attachments (such as depth/stencil, g-buffers, etc) (minimum store cost)
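
As a rough sketch, here is what that ideal render pass could look like with VK_KHR_dynamic_rendering (core in Vulkan 1.3). The image views, dimensions, and command buffer are hypothetical placeholders assumed to be created elsewhere:

```c
#include <vulkan/vulkan.h>

// Hypothetical placeholders, assumed to be created elsewhere.
extern VkCommandBuffer commandBuffer;
extern VkImageView colorView, depthView;
extern uint32_t width, height;

void recordIdealRenderPass(void)
{
    VkRenderingAttachmentInfo color = {
        .sType       = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO,
        .imageView   = colorView,
        .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        .loadOp      = VK_ATTACHMENT_LOAD_OP_CLEAR,   // very cheap: clear on tile memory
        .storeOp     = VK_ATTACHMENT_STORE_OP_STORE,  // the one store we pay for
        .clearValue  = {.color = {.float32 = {0.f, 0.f, 0.f, 1.f}}},
    };

    VkRenderingAttachmentInfo depth = {
        .sType       = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO,
        .imageView   = depthView,
        .imageLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
        .loadOp      = VK_ATTACHMENT_LOAD_OP_CLEAR,       // very cheap
        .storeOp     = VK_ATTACHMENT_STORE_OP_DONT_CARE,  // ancillary: drop the contents
        .clearValue  = {.depthStencil = {1.f, 0}},
    };

    VkRenderingInfo renderingInfo = {
        .sType                = VK_STRUCTURE_TYPE_RENDERING_INFO,
        .renderArea           = {.offset = {0, 0}, .extent = {width, height}},
        .layerCount           = 1,
        .colorAttachmentCount = 1,
        .pColorAttachments    = &color,
        .pDepthAttachment     = &depth,
    };

    vkCmdBeginRendering(commandBuffer, &renderingInfo);
    // ... numerous draw calls ...
    vkCmdEndRendering(commandBuffer);
}
```

With VkRenderPass objects, the same loadOp/storeOp fields appear in VkAttachmentDescription instead; the hardware behavior is identical.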

For multisampling, the ops are similar. See this blog post for further details regarding multisampling.

You can achieve highly efficient rendering on TBR hardware with Vulkan by keeping the render passes as few as possible and avoiding unnecessary load and store operations.

Mapping to OpenGL

Unfortunately, unlike Vulkan, OpenGL does not guide the application towards efficient rendering on TBR hardware. As such, mobile drivers have accumulated a number of heroics to reorder the stream of operations issued by applications, as otherwise their performance would be abysmal. These heroics are the source of most corner-case bugs you might have encountered in these drivers, and understandably so; they make the driver much more complicated.

Do yourself a favor and upgrade to Vulkan!

Still here? Alright, let’s see how we can make an OpenGL application issue calls that would lead to ideal render passes. The best way to understand that is actually by mapping a few key OpenGL calls to Vulkan concepts, as they match the hardware very well. So first, read the Render Passes in Vulkan section above!

Now let’s see how to do that with OpenGL.

Step 1: Do NOT Unnecessarily Break the Render Pass

This is extremely important, and the number one source of inefficiency in apps and heroics in drivers. What does it mean to break the render pass? Take the ideal render pass in the previous section: what happens if in between the numerous draw calls, an action is performed that cannot be encoded in the render pass?

Say out of the 1000 draw calls needed for the scene, you’ve issued 600 and now need a quick clear of a placeholder texture to sample from in the next draw call. You attach that texture to a temporary framebuffer, bind that framebuffer and clear it, then bind the original framebuffer again and issue the remaining 400 draw calls. Real applications (plural) do this!

But, the render pass cannot hold a clear command for an unrelated image (it can only do that for the render pass’s attachments). The result would be two render passes:

  • (original render pass’s load ops)
  • 600 draw calls
  • Render pass breaks: Use STORE_OP_STORE on all attachments (super expensive)
  • Clear a tiny texture
  • Use LOAD_OP_LOAD on all attachments (super expensive)
  • 400 draw calls
  • (original render pass’s store ops)

OpenGL drivers actually optimize this case, moving the clear call before the render pass to avoid the render pass break … if you’re lucky.

What causes a render pass to break? A number of things:

  • The obvious: things that need the work to get to the GPU right now, such as glFinish(), glReadPixels(), glClientWaitSync(), eglSwapBuffers(), etc.
  • Binding a different framebuffer (glBindFramebuffer()) or mutating the currently bound one (e.g. glFramebufferTexture2D()): This is the most common reason for render pass breaks. It is very important not to do this unnecessarily. Please!
  • Synchronization requirements: For example, glMapBufferRange() after writing to the buffer in the render pass, glDispatchCompute() writing to a resource that was used in the render pass, glGetQueryObjectuiv(GL_QUERY_RESULT) for a query used in the render pass, etc.
  • Other possibly surprising reasons, such as enabling depth write to a depth/stencil attachment that was previously in a read-only feedback loop (i.e. simultaneously used for depth/stencil testing and sampled in a texture)!

The best way to avoid render pass breaks is to model the OpenGL calls after the equivalent Vulkan application would have:

  • Separate non-render-pass calls from render pass calls and do them before the draw calls.
  • During the render pass, only bind things (NOT framebuffers), set state and issue draw calls. Nothing else!
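
Applied to the placeholder-texture example from earlier, the fix is simply to reorder the calls. The framebuffer handles and draw helpers below are hypothetical stand-ins for real application code:

```c
#include <GLES3/gl3.h>

// Hypothetical handles and helpers, standing in for real application code.
extern GLuint sceneFbo, tempFbo;
extern void drawFirst600(void);
extern void drawRemaining400(void);

void renderFrame(void)
{
    // GOOD: perform the unrelated clear BEFORE the render pass starts...
    glBindFramebuffer(GL_FRAMEBUFFER, tempFbo);
    glClearColor(0.f, 0.f, 0.f, 0.f);
    glClear(GL_COLOR_BUFFER_BIT);

    // ...so that all 1000 draw calls form one uninterrupted render pass.
    glBindFramebuffer(GL_FRAMEBUFFER, sceneFbo);
    drawFirst600();
    drawRemaining400();

    // BAD (what not to do): binding tempFbo for the clear between
    // drawFirst600() and drawRemaining400() would split the work into two
    // render passes, storing and then reloading every attachment in between.
}
```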

Step 2: Control the Load and Store Ops

OpenGL has its roots in IMR hardware, where load and store ops effectively don’t exist (other than LOAD_OP_CLEAR, of course). They are ignored in Vulkan implementations on IMR hardware today (again, other than LOAD_OP_CLEAR). As demonstrated above, however, they are very important for TBR hardware, and, unluckily for us, support for them was never directly added to OpenGL.

Instead, a combination of two separate calls controls the load and store ops of a render pass attachment. These calls must be made just before the render pass starts and just after it ends, and as we saw above, it is not at all obvious when that happens. Enter driver heroics to reorder app commands, of course.

The two calls are the following:

  • glClear() and family: When this call is made before the render pass starts, it causes the corresponding attachment’s load op to become LOAD_OP_CLEAR.
  • glInvalidateFramebuffer(): If this call is made before the render pass starts, it causes the corresponding attachment’s load op to become LOAD_OP_DONT_CARE. If this call is made after the render pass ends, the corresponding attachment’s store op may become STORE_OP_DONT_CARE (if the call is not made too late).

Because the glClear() call is made before the render pass starts, and because applications make that call in seemingly random places, mobile drivers go to great lengths to defer the clear such that if and when a render pass starts with such an attachment, its load op can be turned into LOAD_OP_CLEAR. This means that generally the application can clear the attachments much earlier than the render pass starts and still get this good load op. Beware, however, that scissored or masked clears and scissored render passes thwart all that.

For glInvalidateFramebuffer(), the driver tracks which subresources of the attachment have valid or invalid contents. When the call is made before the render pass starts, this can easily lead to the attachment’s load op becoming LOAD_OP_DONT_CARE. To get the store op to become STORE_OP_DONT_CARE, however, there is nothing the driver can do if the app makes the call at the wrong time.

To get the ideal render pass then, the application would have to make the calls as such:

  • glClear() or glClearBuffer*() or glInvalidateFramebuffer() (can be done earlier)
  • Numerous draw calls
  • glInvalidateFramebuffer() for ancillary attachments.

It is of the utmost importance for the glInvalidateFramebuffer() call to be made right after the last draw call. Anything else happening in between may make it too late for the driver to adjust the store op of the attachments. There is a slight difference for multisampling, explained in this blog post.
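
Putting the two calls together, a minimal sketch of that ideal sequence might look like the following, assuming a framebuffer with one color and one depth attachment (the handle and draw helper are hypothetical):

```c
#include <GLES3/gl3.h>

extern GLuint fbo;           // hypothetical: 1 color + 1 depth attachment
extern void drawScene(void); // hypothetical: the numerous draw calls

void renderIdealPass(void)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);

    // Before the render pass: clear everything -> LOAD_OP_CLEAR on both attachments.
    glClearColor(0.f, 0.f, 0.f, 1.f);
    glClearDepthf(1.f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    drawScene(); // numerous draw calls, nothing else in between

    // Immediately after the last draw call: drop the depth contents
    // -> STORE_OP_DONT_CARE on the depth attachment.
    const GLenum discard[] = {GL_DEPTH_ATTACHMENT};
    glInvalidateFramebuffer(GL_FRAMEBUFFER, 1, discard);
}
```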

Step 3: Get Help From ANGLE

Now you’ve gone through the trouble of implementing all that in your application or game, but how do you know it’s actually working? Sure, FPS is doubled and battery lasts much longer, but are all optimizations working as expected?

You can get help from a project called ANGLE (slated to be the future OpenGL ES driver on Android, already available in Android 15). ANGLE is an OpenGL layer on top of Vulkan (among other APIs, but that’s irrelevant here), which means that it’s an OpenGL driver that does all the same heroics as native drivers, except it produces Vulkan API calls and so is one driver that works on all GPUs.

There are two things about ANGLE that make it very handy in optimizing OpenGL applications for TBR hardware.

One is that its translation to Vulkan is user-visible. Since Vulkan render passes map perfectly to TBR hardware, by inspecting the generated Vulkan render passes you can determine whether your OpenGL code can be improved, and how. My favorite way of doing that is taking a Vulkan RenderDoc capture of an OpenGL application running over ANGLE.

  • Notice a LOAD_OP_LOAD that’s unnecessary? Clear the texture, or invalidate it!
  • Notice a STORE_OP_STORE that’s unnecessary? Put a glInvalidateFramebuffer() at the right place.
  • Is STORE_OP_STORE still there? That was not the right place!
  • Have more render passes than you expected? See next point.

The other is that it declares why a render pass has ended. In a RenderDoc capture, this shows up at the end of each render pass, which can be used to verify that the render pass break was intended. If it wasn’t intended, together with the API calls around the render pass break, the provided information can help you figure out what OpenGL call sequence caused it. For example, in this capture of a unit test, the render pass is broken due to a call to glReadPixels() (as hinted at by the following vkCmdCopyImageToBuffer call).

ANGLE can be instructed to include the OpenGL calls that lead to a given Vulkan call in the trace, which can make figuring things out easier. ANGLE is open source, feel free to inspect the code if that helps you understand the reason for render pass breaks more easily.

While on this subject, you might find it helpful that ANGLE issues performance warnings when it detects inefficient use of OpenGL (in other scenarios unrelated to render passes). These warnings are reported through the GL_KHR_debug extension’s callback mechanism, are logged and also show up in a RenderDoc capture. You might very well find other OpenGL pitfalls you have fallen into.
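
As an illustration, hooking up that callback on OpenGL ES 3.2 (where glDebugMessageCallback is core; on older versions the KHR-suffixed entry points would have to be loaded through eglGetProcAddress) could be sketched as:

```c
#include <stdio.h>
#include <GLES3/gl32.h>

// Callback invoked by the driver; ANGLE's performance warnings arrive
// with type GL_DEBUG_TYPE_PERFORMANCE.
static void GL_APIENTRY onGlMessage(GLenum source, GLenum type, GLuint id,
                                    GLenum severity, GLsizei length,
                                    const GLchar *message, const void *userParam)
{
    if (type == GL_DEBUG_TYPE_PERFORMANCE)
        printf("GL perf warning: %.*s\n", (int)length, message);
}

// Call once after context creation (a debug context may be required
// for the driver to generate messages).
void enableGlPerfWarnings(void)
{
    glEnable(GL_DEBUG_OUTPUT);
    glDebugMessageCallback(onGlMessage, NULL);
}
```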

Conclusion

Vulkan may seem complicated at first, but it does one thing very well: it maps to the hardware. While an OpenGL application may be shorter in lines of code, it is in a way more complicated to write a good OpenGL application, especially for TBR hardware, because of OpenGL’s lack of structure.

Whether you end up upgrading to Vulkan or staying with OpenGL, you’d do very well to learn Vulkan. If nothing else, learning Vulkan will help you write better OpenGL code.

If your future holds more OpenGL code, I sincerely hope that the above knowledge helps you produce OpenGL code that is not too much slower than the Vulkan equivalent would be. And don’t hesitate to try ANGLE to improve your performance; those who have done so have found great success with it!


I am the tech lead of ANGLE's Vulkan backend and a participant in the Vulkan Working Group at Khronos.