Engine Internals: Optimizing Our Renderer for Metal and iOS
This year at Nordic Game Conference I gave a talk about optimizing our renderer for tile-based deferred renderers and Metal on iOS. I received some really positive feedback on the talk and decided to write a companion blog post on the topic. You can find the slides from the presentation at [Presentation].
So, here we are. I’m going to start off with an overview of our renderer architecture followed by a description of tile-based deferred rendering as well as the tools and processes we utilize for optimization. Finally, I will go through some of the optimization tips and tricks we employ in our renderer.
Our renderer is multi-layered, with different layers exposing different optimization opportunities.
The highest layer is built on top of our engine component system, which I have described in detail in an earlier blog post. This layer defines concepts such as models, lights and cameras, which are crucial for representing a scene.
The components on the first layer are responsible for creating draw commands that are then given to the next layer in the form of draw lists. Rendering of the scene is orchestrated by the camera component, which defines three major draw lists for different types of geometry: opaque, alpha tested and alpha blended. The geometry is rendered in this order for a reason, which I will describe in more detail when discussing tile-based deferred rendering.
The second layer of the renderer is responsible for translating the draw calls that are passed to it into low-level calls of the third and final layer. The second layer has the ability to perform some optimizations such as merging render passes.
The final layer is a thin wrapper for the graphics API itself. This layer is architected around Metal, allowing us to take advantage of Metal’s structural optimizations at a higher level. It also maps really well to other low-level graphics APIs such as Vulkan and DX12.
Our rendering is performed in roughly seven steps.
- Initialization is performed at the start of the application or when loading the 3D world.
- Shadow pass is the first pass that is performed for each frame. During this pass we render the scene from the point of view of the light to create the shadow map. Shadow mapping is only used on higher-end hardware while the lower-end devices fall back to using shadow blobs.
- The scene is rendered into a 16-bit floating point RGBA texture using a PBR lighting model.
- Tone mapping, color correction and sRGB color space conversion are performed on the linear lighting from the previous step.
- Debug information is rendered on top of the lit and post-processed scene.
- UI elements are rendered on top of the final scene render.
- Finally we perform cleanup when the game state changes or the application exits.
After employing an optimization technique described later in this article we’re able to render the scene in two passes: one pass for lighting and a second one for computing the shadow maps.
Tile-Based Deferred Rendering and Metal
iOS devices are all based on PowerVR chips, which perform rendering by splitting the screen into tiles. This limits the amount of memory required for rendering the full view. Rendering is also performed on all the draw commands in one go, meaning that the graphics API needs to first collect the command stream for the whole frame before sending it to the GPU for processing.
Rendering starts with the GPU doing vertex processing and tiling. This process results in what’s called a parameter buffer containing transformed geometry as well as information about which tiles contain which triangles.
The GPU then splits the screen into tiles and processes one or more of these at the same time. It loads only the data from the parameter buffer that is needed for rendering the individual tile. This allows for a much smaller amount of memory to reside on the GPU chip itself.
The first thing the rasterizer does after creating fragments from the polygons in the tile is hidden surface removal. This is done with an implicit early-z pass, which you might be familiar with from immediate mode renderers such as your typical desktop GPU. The GPU only shades fragments that are visible based on this z-test. The caveat is that it can only do this for opaque geometry. This means that it’s really important to render opaque geometry first, followed by alpha tested and alpha blended geometry: the GPU stops doing early z-tests the moment it encounters triangles that are transparent, rely on alpha testing, use discard in the shader or modify the depth value in the shader.
The new PowerVR chips are optimized for 16-bit floating point math. Starting from the Rogue series, the chips also have separated 16-bit floating point ALUs from the 32-bit ones reducing energy consumption and providing even better throughput. On this hardware it’s also virtually free to convert values between 32-bit and 16-bit.
The Imagination Technologies blog [ImgTecBlog] also has some really good posts about what their technology does under the hood.
Note: Apple has actually announced that they will be ending their partnership with Imagination Technologies and will develop their own GPU to replace PowerVR.
Metal is a graphics API that is not quite as low-level as Vulkan or DX12 but significantly lower-level than OpenGL or DirectX 9. The core idea behind Metal is the use of immutable state in rendering. This moves render state validation away from frame rendering and into the loading stages.
Metal also provides direct visibility and control of the rendering pipeline. It, for instance, allows you to define what content to load into tile-local memory and what to store back when the GPU is finished with a pass.
Since iOS 10 Metal also supports memoryless render targets that have no RAM backing. These surfaces exist only in tile-local memory and only for the duration of a single pass. The benefit of this feature is that no load/store bandwidth is used when using the surface. Depth textures are a good use case for memoryless render targets as they are usually not used as textures for future passes (unless they are shadow maps, of course).
One of the biggest benefits of Metal is the updating of the feature set with iOS updates, which also typically have good adoption rates. Since iOS updates tend to become available for older hardware as well, this enables us to use new Metal optimization features on older generation hardware and reduces the amount of special cases needed when developing the renderer.
Metal Render Flow
The process of rendering a single frame with Metal consists of seven steps.
- Acquire a command buffer from a command queue. Command queues typically have a limited number of command buffers available that can be reused when they are not being operated on. A single command buffer contains a stream of commands for the GPU to execute.
- Acquire a drawable. A drawable defines the final render target that can be presented. There is only a limited number of drawables available because they are typically very large (bitmaps the size of the device screen).
- Create a command encoder that you will use to encode draw commands into the command buffer for the GPU to execute. A single encoder defines a single render pass rendering into a single set of render targets.
- Use the command encoder to encode all the scene render commands into the command buffer.
- End encoding draw calls. Note that commands will not yet be sent to the GPU at this point.
- Register a callback to be notified when the commands in the command buffer have been executed on the GPU.
- Commit the command buffer. The commands are finally sent to the GPU for processing.
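As a rough sketch, the steps above map to host code like the following. This is shown using Apple’s metal-cpp C++ bindings rather than the Objective-C API; `queue`, `layer` and `renderPassDesc` are assumed to have been created during initialization, and the draw calls themselves are elided.

```cpp
// 1. Acquire a command buffer from the command queue.
MTL::CommandBuffer* commandBuffer = queue->commandBuffer();

// 2. Acquire a drawable defining the final presentable render target.
CA::MetalDrawable* drawable = layer->nextDrawable();
renderPassDesc->colorAttachments()->object(0)->setTexture(drawable->texture());

// 3. Create a command encoder; one encoder defines one render pass.
MTL::RenderCommandEncoder* encoder =
    commandBuffer->renderCommandEncoder(renderPassDesc);

// 4. Encode the scene draw calls (elided), e.g.:
// encoder->setRenderPipelineState(...); encoder->drawPrimitives(...);

// 5. End encoding; nothing has been sent to the GPU yet.
encoder->endEncoding();

// 6. Register a callback for when the GPU has executed the buffer.
commandBuffer->addCompletedHandler([](MTL::CommandBuffer*) {
    // e.g. signal that this frame's dynamic buffers can be reused
});

// 7. Present the drawable and commit; the commands are now sent to the GPU.
commandBuffer->presentDrawable(drawable);
commandBuffer->commit();
```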
More on Metal
Check out [AppleMetal] for more in-depth information about Metal and how to use it.
Tools and Processes
Optimization work generally requires building tools and processes into the code in order to get more information about its performance and bottlenecks. Platform SDKs also typically provide tools for collecting vital information from the target device. I’m now going to go through a system that we have built into our engine as well as a couple of really helpful tools provided by Xcode.
Especially when optimizing rendering code it’s important to advance in small steps and limit the amount of moving parts when comparing the before with the after. Often seemingly unconnected changes can lead you astray when analyzing the results.
We have built a regression testing framework that allows us to script tests and play them back in a deterministic manner. These tests allow us to compare results before and after we have made modifications to the code.
The game code supports this framework by allowing recording of gameplay sessions that can then be played back. Our gameplay is fully deterministic and as long as we provide the random number generators with the same seed we are guaranteed to get matching game states at specific ticks. Although there will be minor fluctuations in frame times, we can still compare certain areas of the profiler timelines to see if our optimization endeavors have been successful.
The game code also supports a pause feature, allowing us to pause the game and capture a frame for further analysis. In addition to the pause feature we also have the option to create fully static scenes in our game-agnostic viewer application. This feature uses the regression testing framework and allows scripting the creation of models and effects and freezing them at specific times. The static scene creation system lets us set up a specific frame and continue rendering it for a longer period of time. This, in turn, allows us to use timeline tools to smooth out the minor variations in render time between seemingly matching frames.
Metal System Trace
The Metal system trace allows us to inspect how individual frames are scheduled on the GPU and how the rendering work flows from the CPU to the final displayed surface on the GPU.
The tool is split into multiple tracks, each of which specifies a certain type of work. The top tracks called “Metal Application” show when the generation of the command buffers begins and ends on the CPU.
The track “Shader Compilation” shows work related to shader compilation. This work should always be pushed to the loading time, so the track should be empty when actually rendering gameplay.
The “Graphics Driver Activity” track displays the CPU work for processing the command buffers and sending rendering operations to the GPU.
The “Vertex” track displays the vertex transformation, vertex shader execution and tiling work. Blocks on this track are typically much shorter than the matching “Fragment” ones, which depict the work of rasterization, fragment shader execution and render target read/writes.
The “Display” track indicates when a specific presented drawable is displayed. If each of the blocks fits between individual lines on the “Vsync” track you are running at 60 FPS.
Finally, if you use features like the built-in sRGB render targets, you will see work on the “Scaler” track.
Shader Debugger and Frame Capture
One of the tools that we use the most is the frame capture tool. This allows us to view all the individual draw calls contributing to the finished frame as well as the associated parameters such as load/store bandwidth, bound textures etc.
The frame capture view displays a tree of command buffers, render command encoders and individual Metal calls contributing to the captured frame. You can see the individual draw calls by opening up the render command encoder. The panel also displays the time it takes to execute the command encoder or draw call in question, as well as a general FPS counter, so you can get an overall view of the application performance.
The frame capture tool also allows you to view the shaders being executed for a draw call. The shader panel displays the percentage of time that is spent on a specific line. It does not allow expanding function calls, however, limiting its usability somewhat.
Note: GPUs generally don’t support function calls. Functions are inlined, so you don’t have to worry about their potential performance hit.
One of the best features of frame capture and the shader debugger is the ability to rebuild a modified shader directly in the tool.
Rebuilding will also execute all the draw calls for the frame again and show you the updated draw call timings. This lets you make small modifications to the shader and immediately see their effect. Sadly, rebuilding has proven somewhat unstable, but the feature is quite useful when it works.
More on Metal Tools
To find out more about the optimization tools found in the Xcode toolset, check out the presentation [WWDC15] held at WWDC in 2015.
Tips: The Big Things
Finally, let’s start talking about some of the tips and tricks for optimizing for Metal and tile-based renderers. I will start with the big things and then move on to finer-grained shader optimizations.
Render passes in Metal are created with individual render command encoders. They have a relatively high overhead associated with them, as all related input data needs to be copied from RAM to tile-local memory and back again when the pass is finished.
In case you are just continuing to render to the same render targets you should avoid creating new command encoders. If you, however, need to do post-processing on a previous pass, things will get a bit more tricky.
I will now cover a method that we utilize to merge two passes where the first one renders the scene and the second one does post-processing on it.
The naive implementation of HDR PBR involves first rendering the scene into an intermediate 16-bit floating point texture. This texture will contain the scene lighting in HDR.
The next step is to do a post-processing pass where we perform tone mapping, color correction and finally convert the color into the sRGB color space for output.
When using Metal on iOS we use programmable blending to merge the “Draw scene” and “Tone map” passes. This is a feature available on all new PowerVR GPUs and allows us to read/write framebuffer color values in the fragment shader. In Metal this happens by passing a structure to the fragment shader that looks something like this:
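The original listing is not reproduced here, but a sketch of such a structure and its use looks roughly like this (the exact field names and the `toneMap()` helper are illustrative):

```metal
// Fragment I/O structure for programmable blending. The [[color(n)]]
// attribute binds a member to a color attachment; on tile-based GPUs the
// same structure can also be read as fragment shader input.
struct FragmentValues
{
    half4 hdrColor   [[color(0)]]; // 16-bit float HDR scene color
    half4 finalColor [[color(1)]]; // tone-mapped output color
};

fragment FragmentValues toneMapFragment(FragmentValues in)
{
    FragmentValues out = in;
    // Read the HDR value written by the scene pass and post-process it.
    out.finalColor = half4(toneMap(in.hdrColor.rgb), 1.0h);
    return out;
}
```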
Our process uses two color textures. We bind the 16-bit floating point texture to color zero and output a value for this target when rendering the scene. For post-processing we simply render a fullscreen quad with a shader that reads the color value from color zero and outputs the processed pixel color into color one.
This approach does not allow us to perform convolutions on the intermediate texture, however. It only allows us to touch the color currently present at the pixel we are outputting to.
Our shaders also make use of a new Metal feature called function constants. These are constants that are given values when you request the shader function from Metal.
Function constants can be used like any constants in shader code allowing you to exclude code such as processing color when only rendering depth values for shadow mapping. The code excluded this way will never get executed and is actually removed from the function returned, allowing Metal to further optimize the code.
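A sketch of how such a constant is declared and used in MSL (the helper and parameter names are illustrative):

```metal
// Specialized when the function is retrieved from the MTLLibrary via
// MTLFunctionConstantValues.
constant int MaxVertexBoneCountValue [[function_constant(0)]];

// Hypothetical skinning helper: blends a position with up to
// MaxVertexBoneCountValue bone transforms.
float3 skinPosition(float3 position,
                    constant float4x4* bones,
                    thread const float* weights,
                    thread const uint* indices)
{
    float3 result = float3(0.0);
    // The loop bound is a compile-time constant after specialization,
    // so the compiler can fully unroll the loop.
    for (int i = 0; i < MaxVertexBoneCountValue; ++i)
        result += weights[i] * (bones[indices[i]] * float4(position, 1.0)).xyz;
    return result;
}
```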
For instance, our skinning code uses a function constant called MaxVertexBoneCountValue that is defined when we retrieve the function from Metal. When set to the value 2, for instance, it limits the execution of the skinning loop to 2 iterations, which can then be unrolled.
Rendering in a tile-based deferred renderer always happens one command buffer at a time. This means that the GPU can only start processing the render commands when it receives the full command buffer. Delaying processing in this way allows the GPU to perform automatic hidden surface removal as described earlier.
This also means that the command buffer you have just prepared will be unusable while the GPU is processing commands from it. If you have created a command queue with only a single command buffer, your code will have to wait until that command buffer becomes available again.
Furthermore, it takes some time for the CPU to know that the GPU is finished and vice versa. Because iOS devices are based on a unified memory architecture this means that the CPU will also have to take care not to access memory that is currently being accessed by the GPU.
Triple buffering is a method that remedies this issue. It is based on creating enough command buffers, and matching buffers for per-frame dynamic data, that the CPU can keep the GPU fed with enough work for it not to starve. Optimally, the CPU will also never have to wait for a free command buffer.
In practice it always takes some time for the GPU to start processing a frame that has been prepared by the CPU. Due to this communication latency the GPU is quite often still using the command buffer prepared for the first frame while the CPU is already starting to prepare the third frame.
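The bookkeeping behind this can be sketched with a small platform-independent simulation (plain C++, no Metal; the names and the queue-based model are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Illustrative triple-buffering bookkeeping: the CPU may only start
// encoding a frame when one of the three slots is free, i.e. when fewer
// than kMaxFramesInFlight frames are pending on the "GPU".
constexpr std::size_t kMaxFramesInFlight = 3;

struct FrameScheduler {
    std::deque<uint64_t> inFlight; // frames submitted but not yet completed

    bool canEncode() const { return inFlight.size() < kMaxFramesInFlight; }

    // CPU side: pick the per-frame buffer slot and submit the frame.
    int submit(uint64_t frame) {
        assert(canEncode()); // otherwise the CPU would have to wait here
        inFlight.push_back(frame);
        return static_cast<int>(frame % kMaxFramesInFlight); // buffer slot
    }

    // GPU side: the completed-handler frees the oldest slot.
    void complete() {
        assert(!inFlight.empty());
        inFlight.pop_front();
    }
};
```

With three slots the CPU can encode frames 0, 1 and 2 back to back, and only has to wait if the GPU has not yet completed frame 0 when frame 3 is due.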
Now that we have covered some of the big optimization tips and tricks, let’s take a moment to enjoy the small things in life: optimizing shaders. For in-depth discussion on the topic, check out the presentation [WWDC16] at WWDC 2016.
The GPU shader cores use registers to store intermediate values. GPUs are different from CPUs in that they use almost no stack memory at all. They tend to have minuscule amounts of stack due to the large number of threads that need to share it.
Using fewer registers means more threads can be queued for execution at any given time. This comes from the fact that any threads waiting on memory fetches, for instance, need to fit their register values into the register file so they can continue execution when the requested data arrives.
Registers on the new PowerVR chips are also 16 bits wide. This, of course, means that using 16-bit math whenever possible will halve the register usage of your shader, leading to significantly improved occupancy. This in turn leads to shorter wait times on the GPU during pending texture fetches, for instance.
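In practice this mostly means preferring `half` types in fragment shaders wherever the precision suffices, for example (the `VertexOut` layout and lighting term are illustrative):

```metal
struct VertexOut
{
    float4 position [[position]];
    float2 uv;
    half3  lighting;
};

// Color math rarely needs 32-bit precision: doing it in half uses one
// 16-bit register per component instead of two.
fragment half4 litFragment(VertexOut in [[stage_in]],
                           texture2d<half> albedoMap [[texture(0)]],
                           sampler smp [[sampler(0)]])
{
    half4 albedo = albedoMap.sample(smp, in.uv);
    // Conversions between 32-bit and 16-bit values are virtually free.
    return half4(albedo.rgb * in.lighting, albedo.a);
}
```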
Reduce Input Data
Naturally it’s important to reduce the amount of data that the shader needs to access. This generally lowers its bandwidth requirements and register usage.
In our case, for instance, we use 3x4 matrices instead of full 4x4 ones for skinning. We actually do three 4-vector dot products, because Metal does not have a packed 3x4 matrix format that we could use to stay compatible with the interfaces of our OpenGL renderer.
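A sketch of the three-dot-product formulation in MSL (names illustrative):

```metal
// Each bone is stored as the three rows of a 3x4 matrix (one float4 per
// row), so transforming a skinned position is three dot products instead
// of a full 4x4 matrix multiply.
struct BoneTransform
{
    float4 row0; // m00 m01 m02 m03
    float4 row1; // m10 m11 m12 m13
    float4 row2; // m20 m21 m22 m23
};

float3 transformPosition(BoneTransform bone, float3 position)
{
    float4 p = float4(position, 1.0);
    return float3(dot(bone.row0, p),
                  dot(bone.row1, p),
                  dot(bone.row2, p));
}
```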
I also recommend using compression whenever possible and packing multiple textures into individual channels of a single texture. Note, however, that when using sRGB textures, the RGB components are sRGB encoded but the alpha component is linear.
When writing shaders it’s important to pay attention to address spaces. There are two major address spaces: device and constant.
The device address space should be used for large amounts of data with little reuse. One good example of data like this is vertex data. You should always use data types with ranges smaller than or equal to 32-bit signed integers. The modern PowerVR chips have addressing instructions that take a 32-bit signed offset; if the value does not fit into this range the GPU has to compute the address manually.
The constant address space should be used for small amounts of data that have heavy reuse. Typically you will want to define all scene/model constants using this space. The constant space allows prefetching values into constant registers.
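A minimal sketch of the two address spaces in a vertex function:

```metal
// Vertex data: large, little reuse -> device address space.
// Per-frame constants: small, heavy reuse -> constant address space,
// which allows prefetching into constant registers.
struct VertexIn  { packed_float3 position; packed_float2 uv; };
struct FrameData { float4x4 viewProjection; };

vertex float4 simpleVertex(const device VertexIn* vertices [[buffer(0)]],
                           constant FrameData& frame       [[buffer(1)]],
                           uint vid                        [[vertex_id]])
{
    VertexIn v = vertices[vid];
    return frame.viewProjection * float4(float3(v.position), 1.0);
}
```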
GPUs also attempt to vectorize memory reads.
If two values that are used together are separated by other data in memory, the GPU has to fetch them with two separate reads, while laying them out next to each other allows it to fetch both with a single read.
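The effect can be illustrated with field offsets in plain C++ (MSL structs behave the same way, since the language is C++-based; the `a`/`c` field names follow the example above):

```cpp
#include <cassert>
#include <cstddef>

// `a` and `c` are read together in the shader, but `b` sits between them,
// so fetching both takes two separate reads.
struct Scattered { float a; float b; float c; };

// Reordering places `a` and `c` in adjacent memory, letting the GPU fetch
// both with a single vectorized (8-byte) read.
struct Adjacent { float a; float c; float b; };
```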
The shader compiler generally attempts to schedule texture sampling operations as early as possible during shader execution. Sampling cannot be moved across dynamic flow control, but it can be moved across flow control based on constants and function constants.
Finally, I wanted to share a couple of words about environment mapping in our engine. We are using cube maps instead of cylindrical environment maps due to their superior sampling speed.
We also use RGBA textures with 8 bits per channel instead of 16-bit ones. This saves bandwidth and uses less memory in general. To fit HDR color values into 8-bit textures we use a custom compression scheme.
At pre-processing time we first compute the maximum of the RGB color components. We then store the RGB color divided by this maximum in the 8-bit RGB channels and the maximum divided by 8 in the alpha channel. At runtime we restore an approximation of the original RGB value by multiplying the stored RGB color by alpha times 8.
This approach works well for strong white lights but removes precision when there is large variation between channels. We have, however, not noticed significant degradation in image quality with the kinds of environment maps we are using.
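A sketch of this encode/decode scheme in plain C++ (the quantization models the 8-bit texture channels; function names are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Encode an HDR RGB color (components in [0, 8]) into four 8-bit channels:
// RGB divided by the maximum component, with that maximum divided by 8
// stored in the alpha channel.
std::array<unsigned char, 4> encodeHdr(float r, float g, float b)
{
    float m = std::max({r, g, b, 1e-6f});
    auto q = [](float v) { // quantize [0, 1] to an 8-bit channel
        return static_cast<unsigned char>(
            std::round(std::clamp(v, 0.0f, 1.0f) * 255.0f));
    };
    return { q(r / m), q(g / m), q(b / m), q(m / 8.0f) };
}

// Decode: multiply the stored RGB by alpha times 8.
std::array<float, 3> decodeHdr(const std::array<unsigned char, 4>& c)
{
    float scale = (c[3] / 255.0f) * 8.0f;
    return { (c[0] / 255.0f) * scale,
             (c[1] / 255.0f) * scale,
             (c[2] / 255.0f) * scale };
}
```

Note that only the per-channel ratios and the shared maximum survive quantization, which is why large variation between channels loses precision.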
- [Presentation] Optimizing Our Renderer for Metal on iOS: https://www.slideshare.net/TimoHeinpurola/optimizing-our-renderer-for-metal-on-ios
- [ImgTecBlog] Imagination Technologies Blog: https://www.imgtec.com/blog
- [AppleMetal] Metal Programming Guide: https://developer.apple.com/library/content/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html
- [WWDC15] General Metal Optimization: https://developer.apple.com/videos/play/wwdc2015/610/
- [WWDC16] Metal Shader Optimization: https://developer.apple.com/videos/play/wwdc2016/606/
In this article I covered the presentation I gave at Nordic Game Conference in 2017 about optimizing our renderer for Metal and iOS. I first gave an overview of the high-level architecture of our renderer and then introduced tile-based deferred rendering and how to develop for that kind of hardware with Metal. Finally I explained some of the tools and processes that we use for optimizing our rendering systems and introduced some important high- and low-level optimization tips and tricks.