Implementing Hello Triangle with Mantle
The specification and drivers for Vulkan, Khronos’ next-generation graphics API, are expected to be released later this year. However, we can get a head start on learning how this new API will work by studying the project that it evolved from: Mantle.
The problem with APIs like OpenGL and DirectX 11 is that they are not a good abstraction of modern graphics hardware, resulting in very complex drivers that attempt to guess what the application actually wants to accomplish. Mantle is a project by AMD that alleviates these problems by offering a low level API and a simplified driver that doesn't get in the way.
There is no public SDK available for Mantle, but AMD has released a programming guide and there are various games and demos available that use Mantle, like the Star Swarm Stress Test. The programming guide can be used to reconstruct a header. The demos load the function entry points from mantle64.dll, which means that it’s straightforward to produce a trace of Mantle calls by creating our own “proxy” DLL with implementations looking like this:
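As a rough, portable sketch of that idea (not the actual proxy code): each intercepted function records the call and then forwards it unchanged to the real entry point. The names `tracedCreateThing`, `realCreateThing` and `g_trace` are hypothetical stand-ins. A real proxy would export the genuine Mantle function names and resolve the forwarding targets from the original DLL.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical signature standing in for an entry point of the real DLL.
using CreateThingFn = int (*)(int);

// Trace of intercepted calls, in call order.
static std::vector<std::string> g_trace;

// Pretend "real" implementation; in an actual proxy this lives in the original DLL.
static int realCreateThing(int flags) { return flags + 1; }

// Pointer that a real proxy would fill in after loading the original DLL.
static CreateThingFn g_realCreateThing = &realCreateThing;

// Proxy implementation: record the call, then forward it unchanged.
int tracedCreateThing(int flags) {
    assert(g_realCreateThing != nullptr);
    g_trace.push_back("CreateThing(flags=" + std::to_string(flags) + ")");
    return g_realCreateThing(flags);
}
```

Because the proxy only logs and forwards, the application behaves as before while every call ends up in the trace.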
Luckily, the Star Swarm demo starts by rendering a basic splash screen with 2 triangles, resulting in a simple trace that shows the basics of using Mantle. With that information and some help from other people, I've managed to write a “Hello Triangle” demo in C++, which takes about 560 lines of code. In the next sections I will walk through this code and describe what it takes to render a per-vertex colored triangle in Mantle. Some experience with APIs like DirectX or OpenGL helps to fully understand this post.
The first thing we need to do is to load the function entry points from mantle64.dll. I've set up a helper function that initializes function pointers, much like GLEW for OpenGL, which you can find in my mantle.h.
Mantle has an optional validation layer that checks function calls and arguments and gives us warnings when we’re doing something wrong. To receive these, we have to register a debug callback:
As you can see, Mantle functions have the gr prefix. Many Vulkan functions will simply have the same name with gr replaced by vk.
Devices and queues
Mantle is initialized by calling grInitAndEnumerateGpus with some information that describes our application, like the API version we’re targeting, the engine name and the application name. This information could be used by the driver to optimize for certain applications without resorting to hacks like checking executable names. Many parameters for functions in Mantle are wrapped into structs, which can make the code a bit verbose, but it has the advantages of easy argument reuse and clarity.
In exchange for that info, Mantle provides us with a list of GPUs that are supported. We’ll simply go with the first GPU and check if it supports the Windows API extension. This extension will provide the connection with the window manager.
To use the GPU with Mantle, we have to acquire a device handle by calling grCreateDevice. Mantle improves upon the OpenGL extension system by requiring us to explicitly specify which extensions we want to use. Function calls will fail if you forget to specify an extension, making it much harder to accidentally use extensions.
Because Mantle is all about minimal driver overhead, it’ll let you happily crash the graphics driver if you don’t know what you’re doing. That’s why it’s a good idea to enable the validation layer while developing. The warnings and errors will be printed using the debug callback we registered.
To give you an idea of how extensive the validation layer is, let’s remove the grGetExtensionSupport call and see what happens after we call grCreateDevice.
There’s one other thing that needs to be specified in the grCreateDevice call: command queues. Similar to OpenCL, Mantle requires you to create a queue that work is submitted to. That work is then asynchronously executed by the graphics card. This is great for multithreaded applications, because each thread can have its own queue to submit work independently. There are two types of queues available by default:
- Universal queue
- Compute queue
The universal queue allows you to submit graphics and compute workloads and the compute queue only allows compute workloads. Because we’ll only consider the graphics side of Mantle, we tell Mantle we want one universal queue for our device.
After creating the device we can request a handle to the queue:
Render target image
To draw things to a window, we’ll need to create an image. Images can be used as input for drawing, like a texture, but they can also be used as output or target. To show the image in a window when we've finished rendering a frame, it also needs to be presentable.
The easiest way to acquire a presentable image is the… grWsiWinCreatePresentableImage function. The WsiWin prefix indicates that this function is from the Windows extension.
The image will have RGBA (RGB + alpha) channels and each channel is stored as a normalized unsigned byte. We indicate that the image will be used as color target, meaning that the color output of rendering operations will be stored in it. The width and height are the same as the window’s.
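To make "normalized unsigned byte" concrete, here is a small sketch of the usual mapping between a float in [0, 1] and a byte in 0..255. The helper names `toUnorm8` and `fromUnorm8` are made up for illustration. The GPU does this conversion itself when writing to the target.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Convert a float to a normalized unsigned byte; values outside [0, 1] are clamped.
std::uint8_t toUnorm8(float value) {
    assert(!std::isnan(value));
    const float clamped = std::min(std::max(value, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}

// Convert a normalized unsigned byte back to a float in [0, 1].
float fromUnorm8(std::uint8_t value) {
    return static_cast<float>(value) / 255.0f;
}
```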
Observe that the call not only gives us a handle to the image, but also a handle for the GPU memory it’s stored in. We’ll see why we need that memory handle in the next section.
To actually use the image we've created, we first need to transition its state from uninitialized to presentable in a window. Such state transitions are accomplished with the grCmdPrepareImages call. Notice the Cmd prefix, which means that this call represents a command. To execute commands, you first have to wrap them in a package called a command buffer.
Building a command buffer looks like this:
The command buffer is emptied each time grBeginCommand is called, after which you can add new commands to it. You can then execute the same command buffer many times. The concept is quite similar to the good old display lists feature in OpenGL.
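The record-once, execute-many behavior can be illustrated with a toy model. This is only a conceptual sketch: `ToyCommandBuffer` is a hypothetical class that says nothing about how Mantle implements command buffers internally.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model of a command buffer; not how Mantle implements it.
class ToyCommandBuffer {
public:
    // Start recording; any previously recorded commands are discarded.
    void begin() {
        commands_.clear();
        recording_ = true;
    }

    // Append a command; only valid between begin() and end().
    void record(std::function<void()> command) {
        assert(recording_);
        commands_.push_back(std::move(command));
    }

    // Stop recording; the buffer can now be executed.
    void end() { recording_ = false; }

    // Run every recorded command in order; may be called repeatedly.
    void execute() const {
        assert(!recording_);
        for (const auto& command : commands_) command();
    }

private:
    std::vector<std::function<void()>> commands_;
    bool recording_ = false;
};
```

Executing the same buffer twice runs its commands twice, while calling begin() again starts from an empty buffer.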
Let’s add the command to transition the image:
Images can have different mipmap levels and aspects, so you need to state exactly which part you want to transition. Our image is simple and has just a color aspect and no mipmap levels or arrays.
To execute this transition, submit the command buffer to the queue:
The submit function takes an array of command buffers that will be executed sequentially. It also takes a list of all the memory regions on the GPU that are referenced in the commands. The reason for this is that the graphics card needs to ensure that the memory is swapped in. If you don’t specify all referenced memory, you’ll get undefined behavior.
This is another one of those tasks that is normally handled by the driver, but it can be useful for advanced applications that want to control exactly which resources really need to be in VRAM, which is a relatively scarce resource.
Note that grQueueSubmit doesn't take memory handle objects directly; they need to be wrapped in memory reference structures:
Now that the image is ready, we need to wrap it in a color target view object so that we can later tell Mantle to draw to it.
To render things with Mantle, we have to configure the components of the graphics pipeline that we need. The optional stages have a dashed line in the image below, like the geometry shader and tessellation. For brevity, we will only consider the stages that are relevant for the demo here.
The stages that we’ll consider are:
- Input assembler (IA): prepares the vertex data input, optionally using an index buffer
- Vertex shader (VS): processes each vertex to produce a position onscreen and other properties used by the pixel shader
- Rasterizer (RS): interpolates the vertices to produce fragments, pixels that are part of a primitive, like a triangle
- Pixel shader (PS): produces a color for each fragment for one or more color targets (we’ll have 1, the image discussed above)
- Color blender (CB): Processes overlapping fragments to produce a final color for a pixel using the alpha channels
Most of these stages are known as the fixed-function part of the pipeline (green). They’re not programmable, but you can configure their behavior to some extent.
Like with other objects, creation of the graphics pipeline requires filling a structure with data. We’ll start with the fixed-function stage configuration.
The input assembler needs to know what kind of primitives will be rendered to group vertices together that belong to the same primitive (e.g. triangle). Interesting to note here is that Mantle supports rendering of quads natively (GR_TOPOLOGY_QUAD_LIST and even GR_TOPOLOGY_QUAD_STRIP), unlike modern OpenGL and DirectX.
Depth clipping is disabled in the rasterizer state, because the demo does not involve any depth testing.
The color blender state describes the color format of the target, which matches the presentable image we created earlier. It also describes what logic operation is used to store a color in the target. It could for example be XOR’d with the value that is currently stored. We’ll simply overwrite it with the new color by specifying the copy operation.
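The difference between the copy and XOR operations can be sketched on packed 32-bit pixel values. The `LogicOp` enum and `applyLogicOp` function are hypothetical names for illustration, not Mantle identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative subset of logic operations; the names are hypothetical.
enum class LogicOp { Copy, Xor };

// Combine the incoming (source) value with the stored (destination) value.
std::uint32_t applyLogicOp(LogicOp op, std::uint32_t source, std::uint32_t destination) {
    switch (op) {
        case LogicOp::Copy: return source;               // overwrite the stored value
        case LogicOp::Xor:  return source ^ destination; // bitwise XOR with the stored value
    }
    assert(false && "unknown logic op");
    return destination;
}
```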
We've now configured the fixed-function stages. That leaves us with the vertex shader and pixel shader stages. These stages are fully programmable by compiling shader programs to the AMD Intermediate Language. The shader analyzer bundled with CodeXL is able to compile HLSL programs to this byte code format. It’s a little tricky to use, which is why I've written a Python script that extracts the binary code from the output.
A vertex shader for Mantle looks like this:
You can see that this works a little differently from vertex inputs in DirectX and OpenGL. Vertex attributes need to be explicitly loaded from memory buffers using the built-in vertex ID variable. If an index buffer is used, the vertex ID will be based on that. Each of these needs to be manually assigned to a register, which we’ll later reference to provide the vertex data.
The pixel shader works exactly the same as a normal DirectX 11 shader. It takes input from the vertex shader and writes a color to a target.
After these shaders have been compiled to AMD IL byte code, they need to be loaded into Mantle:
The loadShader call here is a helper function that returns a C++ vector with the shader bytes. The type of the shader does not need to be specified, because it’s encoded in the binary format, see the AMD IL Reference.
Now we can specify the handle to the vertex shader in the pipeline. The dynamic memory view mapping feature is not used.
The pixel shader is specified in the same way:
To be able to pass data to the shaders later on, we need to specify the types of resources accessed by each shader using descriptor slots. Each descriptor slot maps to a register and describes the type of resource that is accessed.
- Unused: Self-explanatory
- Shader resource: Readable buffer (texture like)
- UAV: Unordered Access View, can be used for arbitrary reads/writes
- Sampler: Texture sampler
- Next descriptor set: Nested descriptor set
Remember that the vertex shader has 2 buffer inputs at registers 0 and 1. They need to be described using descriptor slots:
The pixel shader doesn't read or write any resources through descriptors, but the number of descriptor slots is required to be the same. That’s where the unused descriptor type comes in handy:
Finally, all that hard work pays off:
Allocating and assigning memory to objects
Unfortunately, we’re not completely done with the pipeline yet. I mentioned earlier that Mantle leaves much of the memory management responsibilities up to your application. One of those responsibilities is explicitly allocating memory for some objects, and the pipeline is one of those that require it.
First, we need to ask the pipeline object about its memory requirements:
The memory requirements specify how much memory the object will need, the alignment requirement and in which memory heaps it should preferably reside. If an object doesn't require manual memory allocation, like a command buffer, the reported size will be 0.
Mantle offers many types of memory heaps that have properties like CPU accessibility, GPU read/write speed, heap size and page size. To allocate memory for the pipeline, we’ll query the properties of the heap it desires.
In my case, the pipeline requires 320 bytes and the heap indicates that it has a page size of 64KB. Since memory needs to be allocated by page, in a real application you would allocate page-sized blocks from these heaps and share each block between several objects. We’re going to be lazy and allocate a page-sized memory buffer for every object.
The required size for the object here is rounded up to the nearest multiple of the page size. The memory priority indicates how hard the GPU should try to prevent this memory from being swapped out. It can be set to a level as low as GR_MEMORY_PRIORITY_UNUSED, which is essentially a promise that swapping out that memory will not cause a performance hit.
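The rounding itself is plain integer arithmetic. A minimal sketch, with `roundUpToPage` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Round `size` up to the nearest multiple of `pageSize`.
std::uint64_t roundUpToPage(std::uint64_t size, std::uint64_t pageSize) {
    assert(pageSize > 0);
    return ((size + pageSize - 1) / pageSize) * pageSize;
}
```

With the numbers mentioned above, a 320-byte pipeline on a heap with 64KB pages ends up occupying one full page of 65536 bytes.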
Finish by binding the memory to the object and creating a memory reference structure. Remember that memory used in command buffers needs to be explicitly referenced, so we’ll need this memory reference in command buffers that involve the graphics pipeline.
This process is the same for every object that requires explicit memory allocation, which is why I've abstracted it into a helper function.
Descriptor set and vertex data
Now that we've created a graphics pipeline with descriptor slots telling Mantle what kind of inputs the vertex shader has, we’re going to actually specify the data.
Each vertex has a float4 position attribute and float4 color attribute. I've decided to combine these into a single array without interleaving. Mantle uses normalized device coordinates, as you would expect (the vertex shader does not transform the position in any way).
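A non-interleaved layout of this kind could look as follows. The concrete coordinates and colors are assumptions chosen for illustration, and `positionOf` and `colorOf` are hypothetical helpers that show where each attribute lives in the array.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kVertexCount = 3;
constexpr std::size_t kFloatsPerAttribute = 4;

// All positions first, then all colors (not interleaved). Values are illustrative.
const float kVertexData[] = {
    // float4 positions in normalized device coordinates
     0.0f,  0.5f, 0.0f, 1.0f,
     0.5f, -0.5f, 0.0f, 1.0f,
    -0.5f, -0.5f, 0.0f, 1.0f,
    // float4 colors (RGBA)
    1.0f, 0.0f, 0.0f, 1.0f,
    0.0f, 1.0f, 0.0f, 1.0f,
    0.0f, 0.0f, 1.0f, 1.0f,
};

static_assert(sizeof(kVertexData) / sizeof(float) == 2 * kVertexCount * kFloatsPerAttribute,
              "expected one float4 position and one float4 color per vertex");

// Pointer to the 4 position floats of the given vertex.
const float* positionOf(std::size_t vertex) {
    assert(vertex < kVertexCount);
    return kVertexData + vertex * kFloatsPerAttribute;
}

// Pointer to the 4 color floats of the given vertex (stored after all positions).
const float* colorOf(std::size_t vertex) {
    assert(vertex < kVertexCount);
    return kVertexData + (kVertexCount + vertex) * kFloatsPerAttribute;
}
```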
To use this data, we have to allocate GPU memory to hold it, much like a vertex buffer in DirectX/OpenGL. This time we’ll have to look for a suitable heap ourselves. Because we want to copy the data in the array to it at some point, we need to look for a memory heap that supports CPU access. Searching for a heap with that property looks like this:
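A minimal sketch of such a search, using a hypothetical `HeapInfo` struct rather than the real Mantle types. In the actual program the heap properties would come from querying Mantle. Here they are simply passed in as a list.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical heap description; the real Mantle structure has more fields.
struct HeapInfo {
    std::uint64_t pageSize;  // allocation granularity in bytes
    bool cpuVisible;         // true if the CPU can map memory from this heap
};

// Returns the index of the first CPU-visible heap, or -1 if there is none.
int findCpuVisibleHeap(const std::vector<HeapInfo>& heaps) {
    for (std::size_t i = 0; i < heaps.size(); ++i) {
        if (heaps[i].cpuVisible) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Taking the first match keeps the demo simple. A real application would likely also weigh properties such as GPU read/write speed when several heaps qualify.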
Allocating the memory works exactly the same as we've seen before. We’ll also create a reference object for it, because we’re inevitably going to use this vertex data memory in a command buffer at some point.
To copy the data to the GPU memory we allocated, we can map it:
Allocated memory is in a data transfer state by default. To make it readable by the vertex shader, we need to transition its state to shader read only with a command buffer by using the grCmdPrepareMemoryRegions command. It’s similar to the grCmdPrepareImages command we've seen before. Remember that we need to explicitly reference the affected memory again when submitting the command buffer.
The data is now loaded into GPU memory and ready to be read by shaders, but we haven’t actually bound it to the shader inputs yet. That is accomplished by creating a descriptor set. A descriptor set describes inputs that are bound to descriptor slots, as shown below.
Just like the pipeline object, we also have to allocate and bind memory to descriptor set objects.
To modify a descriptor set, there are special begin and end calls, similar to those of a command buffer. The difference is that these won’t clear the existing bindings.
To link the GPU memory to the shader buffer input, we need to attach a memory view. A memory view describes a region of GPU memory that provides the data of the buffer.
The position buffer is bound to slot 0 and its data comes from the first 12 floats in the memory, with 4 floats per vertex (indicated by channel format).
The color buffer (slot 1) is attached to memory in exactly the same way, but with a different offset into the allocated memory:
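The offsets of the two views follow directly from the array layout. The sketch below computes them with a hypothetical `BufferView` struct, which is not the Mantle memory view structure, and assumes the attributes are stored back to back as float4 elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a buffer view; not the Mantle structure.
struct BufferView {
    std::size_t offset;  // byte offset into the allocation
    std::size_t range;   // size of the viewed region in bytes
    std::size_t stride;  // bytes between consecutive elements
};

// View for attribute number `attributeIndex` when attributes are stored
// back to back, each as `vertexCount` float4 elements.
BufferView float4AttributeView(std::size_t attributeIndex, std::size_t vertexCount) {
    assert(vertexCount > 0);
    const std::size_t stride = 4 * sizeof(float);
    const std::size_t range = vertexCount * stride;
    return BufferView{attributeIndex * range, range, stride};
}
```

For three vertices, the position view (attribute 0) starts at offset 0 and the color view (attribute 1) starts right after the 12 position floats.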
We now have a descriptor set that describes where the buffers specified in the vertex shader get their data from.
Dynamic state objects
There’s one more thing we need to do before we can get to the actual rendering commands: specifying the dynamic state. The following types of state need to be described:
- MSAA: multisample anti-aliasing configuration
- Viewport: rendering region, depth range, scissors
- Color blend state: how colors are mixed based on alpha channels
- Depth stencil state: depth and stencil operations
- Raster state: fill mode and face culling
Unlike DirectX and OpenGL where most of these have default states, Mantle requires you to specify all of it explicitly. These are known as dynamic state, because they can change at any time without having to recreate the graphics pipeline.
By setting the samples count to 1, MSAA is effectively disabled. You need to bind a special color target to use multisampling.
The viewport state describes the area of the color target to render to, which we’ll set to equal the window size. Scissors can be used to discard regions of the output as the name indicates.
The color blend state tells the GPU how to combine the color and alpha of a render operation and the current value of the pixels in the color target to produce a new pixel value. You can read more about these here.
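As an illustration of what such a configuration computes, here is the classic "source over" blend, where the source is weighted by its alpha and the destination by one minus that alpha. This is one common configuration and not necessarily the one used in the demo. `Color` and `blendSourceOver` are hypothetical names.

```cpp
#include <cassert>

struct Color {
    float r, g, b, a;
};

// Weight the source by its alpha and the destination by (1 - source alpha).
// The same factors are applied to the alpha channel in this sketch.
Color blendSourceOver(const Color& source, const Color& destination) {
    assert(source.a >= 0.0f && source.a <= 1.0f);
    const float s = source.a;
    const float d = 1.0f - s;
    return Color{source.r * s + destination.r * d,
                 source.g * s + destination.g * d,
                 source.b * s + destination.b * d,
                 source.a * s + destination.a * d};
}
```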
Stencil and depth operations are both disabled, but Mantle requires specifying a valid configuration nonetheless.
The raster state controls how the rasterizer draws primitives. For example, a fill mode of GR_FILL_WIREFRAME produces the following result:
Face culling (cull mode) can be used to not render primitives that aren't facing the camera or vice-versa. This is often used as an optimization to not waste time drawing backsides of objects. The front face option specifies how to determine if a primitive is facing the camera, in this case by checking if the vertices have a clockwise order onscreen, like our 2D triangle.
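Whether three vertices appear in clockwise order can be decided from the sign of the triangle's signed area. The sketch below assumes 2D coordinates with the y axis pointing up. `Point2`, `signedArea2` and `isClockwise` are hypothetical names, and the GPU performs the equivalent test itself.

```cpp
struct Point2 {
    float x, y;
};

// Twice the signed area of triangle (a, b, c). With the y axis pointing up,
// a negative value means the vertices are in clockwise order.
float signedArea2(Point2 a, Point2 b, Point2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// True if the vertices a, b, c are listed in clockwise order (y axis up).
bool isClockwise(Point2 a, Point2 b, Point2 c) {
    return signedArea2(a, b, c) < 0.0f;
}
```

A triangle listed as top, bottom right, bottom left is clockwise under this test. Reversing the vertex order flips the result, which is what makes a primitive back-facing.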
By creating these objects that define the dynamic state, we've finally finished the preparation work.
More command buffers
Unlike before, we’re now going to create a bunch of command buffers that won’t be executed immediately. These command buffers will contain all of the steps for rendering a single frame. The next section will cover actually executing them.
Remember that we created an image to render to, which was initialized to the presentable in a window state. Let’s start the frame by clearing that image to black using grCmdClearColorImage. That command requires the image to be in the clear state, so we start with a transition:
The clear command also takes a range of parts that are to be cleared:
To use the image for rendering in the next step, another transition is needed to get it in the render target state.
We could continue using a single command buffer for all of the commands in a single frame, but imagine that you want to draw something other than a triangle at some point. If the clear commands are in a separate command buffer, we can reuse it for that new operation.
I promise that it’s finally time to really draw the triangle now. Let’s create a new command buffer to hold the commands.
Remember that in the “Command buffers” section, a color target view was created from the presentable image. Just like a memory view for the vertex data, this acts like a “pointer” to the image. Using the grCmdBindTargets command, Mantle is told to store the rendered pixels there.
Next, set up the painstakingly crafted dynamic state:
Set the pipeline, with its fixed-function configuration and programmable shaders, that will be used to control drawing.
Bind the memory views to the descriptor slots of the shaders:
We have now told Mantle everything it needs to know about the input, operations and output for rendering. That leaves us with just one step:
The arguments specify that 3 vertices need to be drawn, starting at vertex 0. The last two arguments are used for instanced rendering. If you aren't using that, set the instance index to 0 and instance count to 1, as shown here. Since the primitive type (triangle list) is a property of the pipeline, it’s not specified as part of the draw command, unlike glDrawArrays in OpenGL.
An important property of command buffers that I haven’t explained yet is that all of the state set with Bind commands is scoped to a command buffer. That means that if you created a separate command buffer with just grCmdDraw, nothing would happen because no pipeline is bound there. This is very useful, because it means that your drawing operations will never be messed up by unexpected global state, because there is none!
The image is still in the render target state at this point, so it can’t be presented to the window yet. That requires yet another transition:
That covers all the operations for drawing a single frame. To summarize:
- Transition the image from presentable to clearable
- Clear the image to black
- Transition the image from clearable to usable as render target
- Bind the pipeline, descriptors and dynamic state
- Draw the triangle
- Transition the image from usable as render target to presentable
That’s quite a bit of work for one triangle, but take into account that going from a single triangle to many more is as simple as providing more vertex data and increasing the number in grCmdDraw.
It’s about time we actually created a window to show the resulting image in. I’m going to use the SDL library for that, seeing as we've done enough manual labor already.
Rendering a frame is now as simple as submitting the command buffers to the universal queue:
Make sure to specify all of the memory that is used to draw a frame: the presentable image, the pipeline object, the descriptor set and the memory that contains the vertex data. If you forget something, don’t be surprised if the GPU decides to swap out your vertex data in production. Maintain a good overview of your resource usage and use the validation layer to spot anything you missed. Be extra careful with memory views, because the validation layer does not verify that the vertex data memory is referenced in the case of my driver.
To present the final image to the window, call grWsiWinQueuePresent:
Before we conclude, there’s one more thing you may want to add to your application. Because the execution of all these command buffers for a frame is asynchronous, Mantle offers many synchronization objects like semaphores and fences. A fence is an object that is used to signal that a command buffer has finished execution. Let’s use a fence to make sure the previous frame has finished rendering before rendering a new one.
Create a fence before entering the main loop and change the grQueueSubmit call to store the execution status in this fence.
Now add the grWaitForFences call at the beginning of the frame to wait for the last grQueueSubmit call to complete. If there hasn't been one yet (the case for the first frame), it will return immediately.
The last two parameters specify if we want to wait for all fences to complete and the timeout in seconds. The full code for the main loop is:
Phew, it sure took a lot of work to draw just that triangle! Let’s retrace all the steps it took to get there:
- Selecting a Mantle compatible device and requesting a universal queue
- Creating an image to render to and initializing it
- Creating the graphics pipeline by configuring the fixed-function state, attaching shaders to the programmable vertex and pixel stages and describing the type of shader inputs using descriptor slots
- Allocating memory for the vertex data and copying the data to it
- Creating a descriptor set that links the vertex data memory to the position and color buffers using memory views
- Creating objects to configure the dynamic state
- Creating 3 command buffers to: clear the image, draw the triangle and make the image presentable to the window
- Writing the main loop that waits for the previous frame to finish rendering, executes the 3 command buffers and presents the image to the window
It’s clear that Mantle forces you to do much of the heavy lifting that was previously done by the driver for APIs such as DirectX and OpenGL. So what exactly are the advantages?
- Work like creating pipelines, constructing command buffers and drawing all support multithreading
- Very explicit control over memory and resource management, taking away the overhead of the driver second-guessing your intentions
- Greatly simplified shader compiler that only needs to support byte code, reducing driver complexity = fewer bugs
- Extensive validation layer that can be fully disabled in production for yet more performance gains
- No confusing global state or hidden performance gotchas
Vulkan will carry over all of these advantages, so it’s something to be very excited about. Still, an API as low level as this is not for everyone. It takes a lot of work to actually achieve the performance gains, so I think that most people who are currently using DirectX 11 and OpenGL to create games and demos should stick to those APIs. I expect that the investment will only be worth it for developers of advanced engines like Unreal Engine and Unity.
In case you haven’t been scared off yet, there’s much more to play with in Mantle. The command buffer functionality is much more powerful than shown here. For example, it’s possible to create control structures like while loops with grCmdWhile and grCmdEndWhile. We've also skipped entirely over the compute part, which can interface with graphics in sophisticated ways. Some of the things you could try next are:
- Indexed rendering by storing integer indices in GPU memory and binding them with grCmdBindIndexData
- Providing a matrix to the shader and rendering a model in 3D
- Using images and samplers for texturing
- Creating an image as depth target and using depth testing
You can learn to use most of these by reading through the Mantle programming guide and by experimenting. If you like Rust, you should also check out this project, which aims to be the first wrapper for Mantle!