Multisampled Anti-aliasing For Almost Free — On Tile-Based Rendering Hardware

Shahbaz Youssefi
Published in Android Developers
9 min read, May 9, 2024

Overview

Anti-aliasing (AA) is an important technique to improve the quality of rendered graphics. Numerous algorithms have been developed over the years:

  • Some rely on post-processing aliased images (such as FXAA): These techniques are fast, but produce low-quality images
  • Some rely on shading multiple samples per pixel (SSAA): These techniques are expensive due to the high number of fragment shader invocations
  • More recent techniques (such as TAA) spread the cost of SSAA over multiple frames, bringing the per-frame cost down to that of single-sampled rendering at the expense of code complexity
Anti-aliasing in Action. Left: Aliased scene. Right: Anti-aliased scene.

While TAA and the like are gaining popularity, MSAA has long been the compromise between performance and complexity. In this method, fragment shaders are run once per pixel, but coverage tests, depth tests, etc. are performed per sample. This method can still be expensive on Immediate-Mode Rendering (IMR) architectures due to the additional memory and bandwidth consumed by the multisampled images.

However, GPUs with a Tile-Based Rendering (TBR) architecture handle MSAA so well that it can be nearly free if done right. This article describes how that can be achieved. Analysis of top OpenGL ES games on Android shows MSAA is rarely used, and when it is, its usage is suboptimal. Visuals in Android games can be dramatically improved by following the advice in this blog post, and practically for free!

The first section below demonstrates how to do this on the hardware level. The sections that follow point out the necessary API pieces in Vulkan and OpenGL ES to achieve this.

On the Hardware Level

Without going into too much detail, TBR hardware operates on the concept of “render passes”. Each render pass is a set of draw calls to the same “framebuffer” with no interruptions. For example, say a render pass in the application issues 1000 draw calls.

TBR hardware takes these 1000 draw calls, runs the pre-fragment shaders and figures out where each triangle falls in the framebuffer. It then divides the framebuffer into small regions (called tiles) and replays the same 1000 draw calls in each of them separately (or rather, only the triangles that actually hit that tile).
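To make the binning step concrete, here is a minimal sketch in C of how a triangle's screen-space bounding box maps to the tiles it may touch. The 16x16 tile size and all names (TILE_SIZE, tiles_along, tiles_for_bbox) are assumptions made for illustration; real hardware chooses its own tile size and uses more precise tests than a bounding box.

```c
#include <assert.h>

/* Hypothetical tile size in pixels; real hardware picks its own. */
enum { TILE_SIZE = 16 };

/* Inclusive range of tile indices touched along each axis. */
struct tile_range { int x0, y0, x1, y1; };

/* Number of tiles needed to cover `pixels` pixels along one axis. */
static int tiles_along(int pixels)
{
    assert(pixels > 0);
    return (pixels + TILE_SIZE - 1) / TILE_SIZE;
}

static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Tiles touched by the pixel-space box [min_x, max_x] x [min_y, max_y]
 * on a framebuffer of fb_w x fb_h pixels. The box is assumed to overlap
 * the framebuffer at least partially. */
static struct tile_range tiles_for_bbox(int min_x, int min_y,
                                        int max_x, int max_y,
                                        int fb_w, int fb_h)
{
    assert(fb_w > 0 && fb_h > 0);
    assert(min_x <= max_x && min_y <= max_y);
    assert(max_x >= 0 && max_y >= 0 && min_x < fb_w && min_y < fb_h);

    struct tile_range r;
    r.x0 = clamp_int(min_x, 0, fb_w - 1) / TILE_SIZE;
    r.y0 = clamp_int(min_y, 0, fb_h - 1) / TILE_SIZE;
    r.x1 = clamp_int(max_x, 0, fb_w - 1) / TILE_SIZE;
    r.y1 = clamp_int(max_y, 0, fb_h - 1) / TILE_SIZE;
    return r;
}
```

Under these assumptions, a 1920x1080 framebuffer is split into 120 x 68 tiles, and a triangle whose bounding box spans pixels (10, 10) to (40, 20) is replayed in only 6 of them.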

The tile memory is effectively a cache that you can’t get unlucky with. Unlike CPU and many other caches, where bad access patterns can cause thrashing, the tile memory is a cache that is loaded and stored at most once per render pass. As such, it is highly efficient.

So, let’s put one tile into focus.

Memory accesses between RAM, Tile Memory and shader cores. The Tile Memory is a form of fast cache that is (optionally) loaded or cleared on render pass start and (optionally) stored at render pass end. The shader cores only access this memory for framebuffer attachment output and input (through input attachments, otherwise known as framebuffer fetch).

In the above diagram, there are a number of operations, each with a cost:

  • Fragment shader invocation: This is the real cost of the application’s draw calls. The fragment shader may also access RAM for texture sampling etc., not shown in the diagram. While this cost is significant, it is irrelevant to this discussion.
  • Fragment shader attachment access: Color, depth and stencil data live in the tile memory, access to which is lightning fast. This cost is also irrelevant to this discussion.
  • Tile memory load: This costs time and energy, as accessing RAM is slow. Fortunately, TBR hardware has ways to avoid this cost:
    - Skip the load and leave the contents of the framebuffer on the tile memory undefined (for example because they are going to be completely overwritten)
    - Skip the load and clear the contents of the framebuffer on the tile memory directly
  • Tile memory store: This costs even more than load. TBR hardware has ways to avoid this cost too:
    - Skip the store and drop the contents of the framebuffer on the tile memory (for example because that data is no longer needed)
    - Skip the store because the render pass did not modify the values that were previously loaded

The most important takeaway from the above is:

  • Avoid load at all costs
  • Avoid store at all costs

With that in mind, here is how MSAA is done on the hardware level with practically the same cost as single-sampled rendering:

  • Allocate space for MSAA data only on the tile memory
  • Do NOT load MSAA data
  • Render into MSAA framebuffer on the tile memory
  • “Resolve” the MSAA data into single-sampled data on the tile memory
  • Do NOT store MSAA data
  • Store only the resolved single-sampled data

For comparison, the equivalent single-sampled rendering would be:

  • Do NOT load data
  • Render into framebuffer on the tile memory
  • Store data

Looking more closely, the following can be observed:

  • MSAA data never leaves the tile memory. There is no RAM access cost for MSAA data.
  • MSAA data does not take up space in RAM
  • No data is loaded into the tile memory in either case
  • The same amount of data is stored in RAM in both cases

The only additional cost of MSAA, then, is the on-tile coverage tests, depth tests, etc., which is dwarfed by everything else.
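The comparison above can be sketched as a rough RAM traffic model in C. This is only an illustration under simplifying assumptions: it counts color data alone, ignores compression and depth/stencil, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved between RAM and tile memory over one render pass. */
struct pass_traffic {
    uint64_t loaded_bytes;
    uint64_t stored_bytes;
};

static uint64_t image_bytes(uint32_t w, uint32_t h, uint32_t bytes_per_pixel)
{
    assert(w > 0 && h > 0 && bytes_per_pixel > 0);
    return (uint64_t)w * h * bytes_per_pixel;
}

/* Single-sampled: no load, store the color buffer once. */
static struct pass_traffic single_sampled(uint32_t w, uint32_t h, uint32_t bpp)
{
    struct pass_traffic t = { 0, image_bytes(w, h, bpp) };
    return t;
}

/* Ideal MSAA: samples stay on the tile, only the resolved image is stored,
 * so the sample count does not appear in the result at all. */
static struct pass_traffic msaa_resolved_on_tile(uint32_t w, uint32_t h,
                                                 uint32_t bpp, uint32_t samples)
{
    assert(samples > 1);
    struct pass_traffic t = { 0, image_bytes(w, h, bpp) };
    return t;
}

/* MSAA done wrong: the multisampled image is stored as well. */
static struct pass_traffic msaa_also_stored(uint32_t w, uint32_t h,
                                            uint32_t bpp, uint32_t samples)
{
    assert(samples > 1);
    uint64_t resolved = image_bytes(w, h, bpp);
    struct pass_traffic t = { 0, resolved * samples + resolved };
    return t;
}
```

For a 1920x1080 RGBA8 color buffer, this model gives about 8.3 MB stored per pass in the first two cases and about 41.5 MB in the third at 4x MSAA.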

If you can implement that in your program, you should be able to get MSAA rendering at no memory cost and practically no GPU time and energy cost. For once, you can have your cake and eat it too! Just don’t go overboard with the sample count, because the tile memory is still limited. 4xMSAA is the best choice on today’s hardware.
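As a rough illustration of why the sample count matters, the sketch below estimates how many pixels fit in a fixed tile memory budget. The function name is hypothetical and the numbers used afterwards are made up; actual tile memory sizes and layouts are hardware-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pixels whose samples fit in tile memory.
 * bytes_per_sample covers every attachment kept on the tile,
 * for example 4 bytes of color plus 4 bytes of depth/stencil. */
static uint32_t pixels_per_tile(uint32_t tile_memory_bytes,
                                uint32_t bytes_per_sample,
                                uint32_t samples)
{
    assert(bytes_per_sample > 0 && samples > 0);
    return tile_memory_bytes / (bytes_per_sample * samples);
}
```

With an assumed 32 KiB budget and 8 bytes per sample, a tile holds 4096 pixels when single-sampled, 1024 at 4x and only 256 at 16x. In this simplified model, a higher sample count leaves room for fewer pixels per tile, so the framebuffer is split into more tiles.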

In Vulkan

Vulkan makes it very easy to make the above happen, as it’s practically structured with the above mode of rendering in mind. All you need to do is:

  • Allocate your MSAA image with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT, on memory that has VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT
    - The image will not be allocated in RAM if no load or store is ever done to it
  • Do NOT use VK_ATTACHMENT_LOAD_OP_LOAD for MSAA attachments
  • Do NOT use VK_ATTACHMENT_STORE_OP_STORE for MSAA attachments
  • Use a resolve attachment for any MSAA attachment for which you need the data after the render pass
    - Use VK_ATTACHMENT_LOAD_OP_DONT_CARE and VK_ATTACHMENT_STORE_OP_STORE for this attachment

The above directly translates to the free MSAA rendering recipe outlined in the previous section.
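For the first step in the list above, the application has to pick a memory type for the MSAA image. Below is a minimal sketch in C of that selection, written without the Vulkan headers so that it stands alone. The MEM_* constants mirror the values of VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT and VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, the function names are hypothetical, and falling back to device-local memory when no lazily allocated type exists is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as the corresponding VK_MEMORY_PROPERTY_* bits. */
#define MEM_DEVICE_LOCAL     0x01u
#define MEM_LAZILY_ALLOCATED 0x10u

#define NO_MEMORY_TYPE UINT32_MAX

/* First memory type that is allowed for the image and has all `wanted`
 * property bits, or NO_MEMORY_TYPE if there is none. */
static uint32_t find_memory_type(const uint32_t *property_flags,
                                 uint32_t type_count,
                                 uint32_t allowed_type_bits,
                                 uint32_t wanted)
{
    assert(property_flags != 0 && type_count <= 32);
    for (uint32_t i = 0; i < type_count; ++i) {
        int allowed = (int)((allowed_type_bits >> i) & 1u);
        if (allowed && (property_flags[i] & wanted) == wanted)
            return i;
    }
    return NO_MEMORY_TYPE;
}

/* Prefer lazily allocated memory for the MSAA image, else device-local. */
static uint32_t pick_msaa_memory_type(const uint32_t *property_flags,
                                      uint32_t type_count,
                                      uint32_t allowed_type_bits)
{
    uint32_t i = find_memory_type(property_flags, type_count,
                                  allowed_type_bits, MEM_LAZILY_ALLOCATED);
    if (i != NO_MEMORY_TYPE)
        return i;
    return find_memory_type(property_flags, type_count,
                            allowed_type_bits, MEM_DEVICE_LOCAL);
}
```

In a real application, property_flags[i] would come from the propertyFlags of memoryTypes[i] in VkPhysicalDeviceMemoryProperties, and allowed_type_bits from the memoryTypeBits field of the image's VkMemoryRequirements.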

This is even easier with the VK_EXT_multisampled_render_to_single_sampled extension, where supported: multisampled rendering can be done on a single-sampled attachment, with the driver taking care of all the above details.

For reference, please see this modification to the “hello-vk” sample: https://github.com/android/ndk-samples/pull/995. In particular, one commit there shows how a single-sampled application can be quickly turned into a multisampled one using the VK_EXT_multisampled_render_to_single_sampled extension, and another shows the same with resolve attachments.

In terms of numbers, with locked GPU clocks on a Pixel 6 with a recent ARM GPU driver, the render passes in different modes take approximately 650us when single-sampled and 800us when multisampled with either implementation (about 23% slower, so not completely free). GPU memory usage is identical in both cases. For comparison, when using resolve attachments, if the store op of the multisampled color attachments is VK_ATTACHMENT_STORE_OP_STORE, the render pass takes approximately 4300us and GPU memory usage is significantly increased. That’s a slowdown of more than 5x from using the wrong store op!
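Since one wrong store op can be this expensive, it may be worth checking attachment descriptions in debug builds. The following is a minimal sketch in C of such a check. To stay self-contained it does not include the Vulkan headers: struct attachment_info and the load_op and store_op enums are hypothetical stand-ins for the samples, loadOp and storeOp fields of VkAttachmentDescription.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for VK_ATTACHMENT_LOAD_OP_* and VK_ATTACHMENT_STORE_OP_*. */
enum load_op  { LOAD_OP_LOAD, LOAD_OP_CLEAR, LOAD_OP_DONT_CARE };
enum store_op { STORE_OP_STORE, STORE_OP_DONT_CARE };

/* Stand-in for the relevant fields of VkAttachmentDescription. */
struct attachment_info {
    unsigned samples;      /* 1 = single-sampled, 4 = 4xMSAA, ... */
    enum load_op load;
    enum store_op store;
};

/* A multisampled attachment follows the recipe if it is never loaded
 * from RAM and never stored to RAM. Single-sampled attachments
 * (including resolve attachments) are not restricted by this check. */
static bool follows_msaa_recipe(const struct attachment_info *a)
{
    assert(a != NULL && a->samples >= 1);
    if (a->samples == 1)
        return true;
    return a->load != LOAD_OP_LOAD && a->store != STORE_OP_STORE;
}

/* Number of attachments in a render pass that break the recipe. */
static size_t count_recipe_violations(const struct attachment_info *atts,
                                      size_t count)
{
    size_t violations = 0;
    for (size_t i = 0; i < count; ++i) {
        if (!follows_msaa_recipe(&atts[i]))
            ++violations;
    }
    return violations;
}
```

A render pass with a cleared 4x color attachment that is not stored, plus a single-sampled resolve attachment that is stored, passes this check; switching the multisampled attachment to a store op of STORE is flagged.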

In OpenGL ES

In contrast with Vulkan, OpenGL ES does not make it clear how to best utilize TBR hardware. As a result, numerous applications are riddled with inefficiencies. With the knowledge of the ideal render pass in the sections above, however, an OpenGL ES application can also perform efficient rendering.

Before getting into the details, you should know about the GL_EXT_multisampled_render_to_texture extension, which allows multisampled rendering to a single-sampled texture and lets the driver do all the above automatically. If this extension is available, it’s the best way to get MSAA rendering for nearly free. It is enough to use glRenderbufferStorageMultisampleEXT() or glFramebufferTexture2DMultisampleEXT() with this extension to turn single-sampling into MSAA.
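Checking for the extension is the first step. Here is a minimal sketch in C of a whole-token search through a space-separated extension string; has_extension is a hypothetical helper, not part of OpenGL ES. Whole-token matching matters here because a plain substring search for GL_EXT_multisampled_render_to_texture would also match GL_EXT_multisampled_render_to_texture2.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `name` appears as a whole space-separated token
 * in `extensions`. */
static bool has_extension(const char *extensions, const char *name)
{
    assert(extensions != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        bool starts_token = (p == extensions) || (p[-1] == ' ');
        bool ends_token = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token)
            return true;
        ++p;  /* keep searching after this partial match */
    }
    return false;
}
```

The string to search would typically be the result of glGetString(GL_EXTENSIONS). When the check passes, glFramebufferTexture2DMultisampleEXT() takes the same arguments as glFramebufferTexture2D() followed by a sample count.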

Now, let’s see what OpenGL ES API calls can be used to create the ideal render pass without that extension.

Single Render Pass

The most important thing is to make sure the render pass is not split into many. Avoiding render pass splits is very important even for single-sampled rendering. This is actually quite tricky with OpenGL ES, and drivers do their best to reorder the application’s calls to keep the number of render passes to a minimum.

However, applications can help by having the render pass contain nothing but:

  • Bind programs, textures, other resources (not framebuffers)
  • Set rendering state
  • Draw

Changing framebuffers or their attachments, sync primitives, glReadPixels, glFlush, glFinish, glMemoryBarrier, resource write-after-read, read-after-write or write-after-write, glGenerateMipmap, glCopyTexSubImage2D, glBlitFramebuffer, etc. are examples of things that can cause a render pass to finish prematurely.
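As a toy illustration of how such calls fragment a frame, the sketch below counts render passes in a recorded command sequence. It is a deliberately pessimistic model with hypothetical names: it treats every listed interruption as a split, whereas real drivers may reorder or defer work to avoid some of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified command kinds recorded by an application. */
enum gl_cmd {
    CMD_BIND_RESOURCE,     /* programs, textures, buffers */
    CMD_SET_STATE,
    CMD_DRAW,
    CMD_BIND_FRAMEBUFFER,
    CMD_FLUSH,
    CMD_READ_PIXELS
};

static bool breaks_render_pass(enum gl_cmd cmd)
{
    return cmd == CMD_BIND_FRAMEBUFFER || cmd == CMD_FLUSH ||
           cmd == CMD_READ_PIXELS;
}

/* Counts maximal runs of draws that no pass-breaking command interrupts. */
static size_t count_render_passes(const enum gl_cmd *cmds, size_t count)
{
    assert(cmds != NULL || count == 0);

    size_t passes = 0;
    bool in_pass = false;
    for (size_t i = 0; i < count; ++i) {
        if (cmds[i] == CMD_DRAW) {
            if (!in_pass) {
                ++passes;
                in_pass = true;
            }
        } else if (breaks_render_pass(cmds[i])) {
            in_pass = false;
        }
    }
    return passes;
}
```

In this model, three draws separated only by state changes and resource binds form one render pass, while the same three draws separated by a flush and a pixel readback form three.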

Load

To avoid loading data from RAM onto the tile memory, the application can either clear the contents (with glClear()) or let the driver know the contents of the attachment are not needed. The latter is a very important function for TBR hardware that’s unfortunately severely underutilized:

const GLenum discards[N] = {GL_COLOR_ATTACHMENT0, …};
glInvalidateFramebuffer(GL_DRAW_FRAMEBUFFER, N, discards);

The above must be done before the render pass starts (i.e. before the first draw of the render pass) if the framebuffer is not otherwise cleared and old data doesn’t need to be retained. This is also useful for single-sampled rendering.

Store

The key to avoiding storing data to RAM is also glInvalidateFramebuffer(). Even without MSAA rendering, this can be used for example to discard the contents of the depth/stencil attachment after the last pass that uses it.

const GLenum discards[N] = {GL_COLOR_ATTACHMENT0, …};
glInvalidateFramebuffer(GL_DRAW_FRAMEBUFFER, N, discards);

It is important to note that this must be done right after the render pass is finished. If it’s done any later, it may be too late for the driver to be able to modify the render pass’s store operation accordingly.

Resolve

Invalidating the contents of the MSAA color attachments alone is not useful; all rendered data will be lost! Before that happens, any data that needs to be kept must be resolved into a single-sampled attachment. In OpenGL ES, this is done with glBlitFramebuffer():

glBindFramebuffer(GL_READ_FRAMEBUFFER, msaaFramebuffer);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFramebuffer);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
    GL_COLOR_BUFFER_BIT, GL_NEAREST);

Note that because glBlitFramebuffer() broadcasts the color data into every color attachment of the draw framebuffer, there should be only one color buffer in each framebuffer used for resolve. To resolve multiple attachments, use multiple framebuffers. Depth/stencil data can be resolved similarly with GL_DEPTH_BUFFER_BIT and GL_STENCIL_BUFFER_BIT.
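To show what a color resolve computes, here is a minimal CPU-side sketch in C that averages the samples of one RGBA8 pixel. The function name and memory layout are hypothetical, and the plain average is an assumption: the actual resolve filter is up to the implementation, and this sketch ignores sRGB encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Resolves one RGBA8 pixel by averaging its samples per channel,
 * rounding to nearest. `in` holds samples * 4 bytes laid out as
 * RGBA RGBA ...; `out` receives the 4 resolved bytes. */
static void resolve_pixel_rgba8(const uint8_t *in, unsigned samples,
                                uint8_t out[4])
{
    assert(in != 0 && out != 0 && samples > 0);

    for (unsigned channel = 0; channel < 4; ++channel) {
        unsigned sum = 0;
        for (unsigned s = 0; s < samples; ++s)
            sum += in[s * 4 + channel];
        out[channel] = (uint8_t)((sum + samples / 2) / samples);
    }
}
```

For a 4x pixel on a triangle edge where two samples are opaque white and two are opaque black, this yields mid-gray (128, 128, 128, 255), which is the smoothing that anti-aliasing is after.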

The Complete Picture

Here is all the above in action:

// MSAA framebuffer setup
glBindRenderbuffer(GL_RENDERBUFFER, msaaColor0);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_RGBA8,
    width, height);
glBindRenderbuffer(GL_RENDERBUFFER, msaaColor1);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_RGBA8,
    width, height);

glBindFramebuffer(GL_FRAMEBUFFER, msaaFramebuffer);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
    GL_RENDERBUFFER, msaaColor0);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
    GL_RENDERBUFFER, msaaColor1);

// Resolve framebuffers setup
glBindTexture(GL_TEXTURE_2D, resolveColor0);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, width, height);
glBindFramebuffer(GL_FRAMEBUFFER, resolveFramebuffer0);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
    GL_TEXTURE_2D, resolveColor0, 0);

glBindTexture(GL_TEXTURE_2D, resolveColor1);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, width, height);
glBindFramebuffer(GL_FRAMEBUFFER, resolveFramebuffer1);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
    GL_TEXTURE_2D, resolveColor1, 0);

// Start with no load. Alternatively, you can clear the framebuffer.
// GL_FRAMEBUFFER binds msaaFramebuffer for both drawing and reading.
const GLenum discards[] = {GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1};
glBindFramebuffer(GL_FRAMEBUFFER, msaaFramebuffer);
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discards);

// Draw after draw after draw ...

// Resolve the first attachment (if needed).
// msaaFramebuffer stays bound as the read framebuffer.
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFramebuffer0);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
    GL_COLOR_BUFFER_BIT, GL_NEAREST);

// Resolve the second attachment (if needed)
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFramebuffer1);
glReadBuffer(GL_COLOR_ATTACHMENT1);
glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
    GL_COLOR_BUFFER_BIT, GL_NEAREST);

// Invalidate the MSAA contents (still accessible as the read framebuffer)
glInvalidateFramebuffer(GL_READ_FRAMEBUFFER, 2, discards);

Note again that it is of utmost importance not to perform the resolve and invalidate operations too late; they must be done right after the render pass is finished.

It is also worth noting that if rendering to a multisampled window surface, the driver does the above automatically as well, but only on swap. Usage of a multisampled window surface can be limiting in this way.

For reference, please see this modification to the “hello-gl2” sample: https://github.com/android/ndk-samples/pull/996. In particular, one commit there shows how a single-sampled application can be quickly turned into a multisampled one using the GL_EXT_multisampled_render_to_texture extension, and another shows the same with glBlitFramebuffer().

With locked GPU clocks on a Pixel 6 with a recent ARM GPU driver, performance and memory usage are similar between single-sampled rendering and GL_EXT_multisampled_render_to_texture. However, when using real multisampled images with glBlitFramebuffer() and glInvalidateFramebuffer(), performance is as slow as if the glInvalidateFramebuffer() call were never made. This shows that optimizing this pattern is tricky for some GL drivers, and so GL_EXT_multisampled_render_to_texture remains the best way to do multisampling. With ANGLE as the OpenGL ES driver (which translates to Vulkan), the performance of the above demo is comparable to GL_EXT_multisampled_render_to_texture.

Conclusion

In this article, we’ve seen one area where TBR hardware particularly shines. When done right, multisampling adds very little overhead on such hardware. Luckily, the cost of multisampling done wrong is so high that it is very easy to spot. So don’t fear multisampling on TBR hardware; just avoid the pitfalls!

I hope that with the above knowledge we can see higher quality rendering in mobile games without sacrificing FPS or battery life.


I am the tech lead of ANGLE's Vulkan backend and a participant in the Vulkan Working Group at Khronos.