Devlog #002: Graphics.DrawMeshInstancedIndirect

Bagoum
Nov 5

Unity’s Graphics.DrawMeshInstancedIndirect (from here on out, just DMII) is an absolute necessity for making danmaku games in Unity. At the same time, nobody seems to know what it is or how it works. As one of the confused, I'm somewhat hesitant to publish this, but hopefully it can help future me as well as other Unity randos be a little less lost when working with this odd API.

As you can probably guess, this article is maximally technical and maximally Unity-specific.

Strap in: this is a long ride.

Note that while this piece is oriented towards 2D games, the coding pattern isn’t much different for 3D (although your shaders may be a bit more complex).

All the resources for this post can be found in this GitHub repo, licensed under CC0 (effectively public domain).

0. What is DMII?

DMII is used when you have many renderable things using the same mesh and material with minor variations (color, position, rotation, size, etc). In practice, a “variation” is something that can be reduced to a small number of floats. In danmaku games, all projectiles of a single type share a mesh/material pair, and thus can use DMII. DMII allows you to render all of these projectiles with one draw call instead of several thousand. If you’re not using GameObjects, then DMII and its simpler sister, Graphics.DrawMeshInstanced, are the only practical ways to render a lot of things.

To use DMII, we call the function with a mesh, a material, and a MaterialPropertyBlock that contains all the variations we want to apply. We then have to read all the variations and apply them within the material's shader.
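In skeleton form, a single DMII call looks something like this (a sketch only; the buffer setup is elided here, and the full working version is built up over sections 2–4):

// Sketch: 'mesh', 'material', 'positionCB', and 'argsCB' are assumed to
// have been built already; the rest of this post covers how.
var pb = new MaterialPropertyBlock();
pb.SetBuffer("positionBuffer", positionCB); // one per-instance data array
Graphics.DrawMeshInstancedIndirect(mesh, 0, material,
    new Bounds(Vector3.zero, Vector3.one * 1000f), argsCB, 0, pb);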

DMII is an example of GPU instancing, but it’s not the kind of GPU instancing that people normally refer to. Most instancing you see on the internet or in default Unity material is built for automatic instancing — i.e. instancing the renderers on several hundred GameObjects. DMII is a more open framework that can be used without GameObjects or renderers, but on the flipside it requires you to provide everything that a renderer normally provides. Unless you’ve used DMII, you don’t know DMII’s variation of GPU instancing.

1. The Shader

Shaders are one of the least friendly aspects of Unity. Each shader is written in at least two languages and has a bunch of hardcoded requirements and incomprehensible boilerplate. This said, using DMII requires fiddling extensively with shaders.

As stated, DMII is quite different from other methods of GPU instancing. Shaders are the most egregious example of this. All you need to do to enable instancing on a normal shader is add a keyword. But DMII shaders differ in their basic coding style and functionality.

The shader code below is standard 2D sprite boilerplate, with DMII support added. I’ll go through the stuff unique to our DMII shader.

Shader "DMIIShader" {
Properties {
_MainTex("Texture", 2D) = "white" {}
}
SubShader {
Tags {
"RenderType" = "Transparent"
"IgnoreProjector" = "True"
"Queue" = "Transparent"
}
Cull Off
Lighting Off
ZWrite Off
Blend SrcAlpha OneMinusSrcAlpha
Pass {
CGPROGRAM
#pragma vertex vert
#pragma fragment frag
#pragma multi_compile_instancing
#include "UnityCG.cginc"
#pragma instancing_options procedural:setup

struct vertex {
float4 loc : POSITION;
float2 uv : TEXCOORD0;
UNITY_VERTEX_INPUT_INSTANCE_ID
};
struct fragment {
float4 loc : SV_POSITION;
float2 uv : TEXCOORD0;
};

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
StructuredBuffer<float2> positionBuffer;
StructuredBuffer<float2> directionBuffer;
#endif

void setup() {
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
float2 position = positionBuffer[unity_InstanceID];
float2 direction = directionBuffer[unity_InstanceID];

unity_ObjectToWorld = float4x4(
direction.x, -direction.y, 0, position.x,
direction.y, direction.x, 0, position.y,
0, 0, 1, 0,
0, 0, 0, 1
);
#endif
}

sampler2D _MainTex;

fragment vert(vertex v) {
UNITY_SETUP_INSTANCE_ID(v);
fragment f;
f.loc = UnityObjectToClipPos(v.loc);
f.uv = v.uv;
//f.uv = TRANSFORM_TEX(v.uv, _MainTex);
return f;
}

float4 frag(fragment f) : SV_Target{
float4 c = tex2D(_MainTex, f.uv);
return c;
}
ENDCG
}
}
}

Here’s the step-by-step:

#pragma vertex vert
#pragma fragment frag
#pragma multi_compile_instancing
#include "UnityCG.cginc"
#pragma instancing_options procedural:setup

The above code declares our vertex, fragment, and instancing setup functions. The two key lines for instancing are #pragma multi_compile_instancing and #pragma instancing_options procedural:setup.

struct vertex {
    float4 loc : POSITION;
    float2 uv : TEXCOORD0;
    UNITY_VERTEX_INPUT_INSTANCE_ID
};

This is a pared-down vertex descriptor with one extra feature: the strange line UNITY_VERTEX_INPUT_INSTANCE_ID. Declaring this is, to my knowledge, the only way to get at the per-instance data we pass through the MaterialPropertyBlock.

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
StructuredBuffer<float2> positionBuffer;
StructuredBuffer<float2> directionBuffer;
#endif

These are the arrays of data from the MaterialPropertyBlock that we want to access in the shader. We can index into them using unity_InstanceID (which only exists if we declare UNITY_VERTEX_INPUT_INSTANCE_ID). If we want to draw a thousand objects at different positions, we need a position array; if we want to draw a thousand objects with different rotations, we need a direction array. We can declare any arrays we want here and read them in any way we want, as long as the script code actually provides them.
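For example, the C# side binds a compute buffer under each of these exact names (a fragment of the manager we'll build in section 4; pb is the MaterialPropertyBlock):

pb.SetBuffer(Shader.PropertyToID("positionBuffer"), posCB);
pb.SetBuffer(Shader.PropertyToID("directionBuffer"), dirCB);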

void setup() {
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    float2 position = positionBuffer[unity_InstanceID];
    float2 direction = directionBuffer[unity_InstanceID];

    unity_ObjectToWorld = float4x4(
        direction.x, -direction.y, 0, position.x,
        direction.y, direction.x, 0, position.y,
        0, 0, 1, 0,
        0, 0, 0, 1
    );
#endif
}

This is mostly boilerplate. All it does is set up the object-to-world (model) matrix for each instance so Unity knows where to render it. Normally the renderer constructs this for you, but DMII forces you to build the model matrix manually.

Warning: Unity also serves a unity_WorldToObject matrix, which is supposed to be the inverse of unity_ObjectToWorld, so technically we should rewrite that too. At the same time, WorldToObject isn't, to my knowledge, used anywhere in the default rendering path; it's only invoked by ObjSpaceLightDir and ObjSpaceViewDir. If you use those functions, you'll probably need to invert this matrix as well.

The way you assign position and direction to the ObjectToWorld matrix may differ. In my case, I don’t use the Z-axis and therefore don’t add position.z (which would require float3 instead of float2), and I only use Z-rotation, which means every direction can be expressed as a single angle. Note that the direction vector passed in by the manager below is premultiplied by scale, so the upper-left 2x2 block of this matrix is a scaled rotation matrix. I precalculate the direction vector because my project has CPU code which requires direction as a normalized vector, but you could pass the angle as a float and do the cos/sin calculation within the shader setup function (which you should do if possible; math like this is much faster on the GPU).

fragment vert(vertex v) {
    UNITY_SETUP_INSTANCE_ID(v);
    fragment f;
    f.loc = UnityObjectToClipPos(v.loc);
    f.uv = v.uv;
    //f.uv = TRANSFORM_TEX(v.uv, _MainTex);
    return f;
}

The only special line here is UNITY_SETUP_INSTANCE_ID(v). This allows us to access unity_InstanceID within the vertex shader, so we can query the data arrays and apply effects unrelated to position or direction. We'll add an effect like that later in this post.

TRANSFORM_TEX handles texture tiling and offset, the fields you can see on the material in the inspector; using it also requires declaring float4 _MainTex_ST; in the CG block. If you don't need tiling or offset, you can simply copy the UV value through.

2. Mesh and Material

In 2D, you generally work with sprites, so getting meshes may be a bit of a strange ask. The conversion ultimately isn’t too difficult. I find it convenient to store the mesh/material information together in a struct:

public readonly struct RenderInfo {
    private static readonly int MainTexPropertyId = Shader.PropertyToID("_MainTex");
    public readonly Mesh mesh;
    public readonly Material mat;

    public RenderInfo(Mesh m, Material material) {
        mesh = m;
        mat = material;
    }

    public static RenderInfo FromSprite(Material baseMaterial, Sprite s) {
        var renderMaterial = UnityEngine.Object.Instantiate(baseMaterial);
        renderMaterial.enableInstancing = true;
        renderMaterial.SetTexture(MainTexPropertyId, s.texture);
        Mesh m = new Mesh {
            vertices = s.vertices.Select(v => (Vector3)v).ToArray(),
            triangles = s.triangles.Select(t => (int)t).ToArray(),
            uv = s.uv
        };
        return new RenderInfo(m, renderMaterial);
    }
}

Note that this function creates a copy of the material. In my project, different object types use different textures, but use the same basic material, so I create one material in my Assets and duplicate it for each object type.
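For example, a manager handling multiple object types might build one RenderInfo per type at startup (sprite names here are hypothetical):

var circleRI = RenderInfo.FromSprite(baseMaterial, circleSprite);
var starRI = RenderInfo.FromSprite(baseMaterial, starSprite);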

3. ComputeBufferPool

The data arrays in the shader need to be provided as ComputeBuffers on a MaterialPropertyBlock. Compute buffers work strangely, and I couldn't find any documentation on the precise way they interact with DMII. Here is my understanding of what goes on:

  • The Unity frame begins.
  • You make some data updates in Update.
  • Camera.onPreCull is triggered. This is probably where you put your rendering functions. In these functions:
    — You create a compute buffer.
    — You submit this compute buffer to a DMII call, which doesn’t do anything immediately.
    — You repeat the last two steps a bunch of times.
  • Sometime after Camera.onPreCull is triggered, all DMII calls are resolved (sent to the GPU, or the like).
  • The Unity frame ends.

The most important observation is that compute buffers are not sent to the GPU immediately. This means that once you submit a compute buffer for a rendering task, you cannot modify it until the next frame, or else the rendering task won’t work properly. At the same time, compute buffers must be disposed manually after use.

Because of this, we need to create a pooling object to manage our compute buffers across frames. Our object manager will create a pooling object for each type of compute buffer it requires.

public class ComputeBufferPool : IDisposable {
    private readonly int count;
    private readonly int stride;
    private readonly ComputeBufferType cbt;
    private readonly Stack<ComputeBuffer> free = new Stack<ComputeBuffer>();
    private readonly Stack<ComputeBuffer> active = new Stack<ComputeBuffer>();

    public ComputeBufferPool(int batchSize, int stride, ComputeBufferType typ) {
        count = batchSize;
        this.stride = stride;
        cbt = typ;
    }
    public ComputeBuffer Rent() {
        ComputeBuffer cb;
        if (free.Count > 0) {
            cb = free.Pop();
        } else {
            cb = new ComputeBuffer(count, stride, cbt);
        }
        active.Push(cb);
        return cb;
    }
    public void Flush() {
        while (active.Count > 0) {
            free.Push(active.Pop());
        }
    }
    public void Dispose() {
        while (active.Count > 0) {
            active.Pop().Dispose();
        }
        while (free.Count > 0) {
            free.Pop().Dispose();
        }
    }
}

To use this pooling object, we call Flush right before our rendering code, then Rent compute buffers one by one as we need them. Rented buffers are reclaimed for reuse by the next frame's Flush call. Finally, we need the Dispose method (called in our manager's OnDestroy) to manually destroy the compute buffers, or else Unity will get angry.
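Here's the per-frame shape of that lifecycle (a sketch using the names from section 4's manager):

v2CBP.Flush();                           // reclaim last frame's rented buffers
var posCB = v2CBP.Rent();                // reuse a pooled buffer, or create one
posCB.SetData(posArr, 0, 0, run);        // upload this frame's data
pb.SetBuffer(positionPropertyId, posCB); // bind it; don't touch it again this frame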

4. An Object Manager

4.1 An Object

DMII is usually used when your “objects” are some kind of code abstraction without a GameObject. Here’s an unimpressive class that describes some features of an object we want to render to screen.

public class FObject {
    private static readonly Random r = new Random();
    public Vector2 position;
    public readonly float scale;
    private readonly Vector2 velocity;
    public float rotation;
    private readonly float rotationRate;
    public float time;

    public FObject() {
        position = new Vector2((float)r.NextDouble() * 10f - 5f, (float)r.NextDouble() * 8f - 4f);
        velocity = new Vector2((float)r.NextDouble() * 0.4f - 0.2f, (float)r.NextDouble() * 0.4f - 0.2f);
        rotation = (float)r.NextDouble();
        rotationRate = (float)r.NextDouble() * 0.6f - 0.2f;
        scale = 0.6f + (float)r.NextDouble() * 0.8f;
        time = (float)r.NextDouble() * 6f;
    }

    public void DoUpdate(float dT) {
        position += velocity * dT;
        rotation += rotationRate * dT;
        time += dT;
    }
}

Note that we don’t define a sprite on the object. This is because the material texture is shared among all instances of a single DMII call.

Also note that we have an update function for this object that takes a deltaTime. This update function will be called by the object manager, which will query Time.deltaTime only once per frame for efficiency.

FObject here is an example, but it’s likely that you’ll have some kind of related setup. In my project, I store data in a linked list, where the nodes are (pooled) class objects that contain structs of data. Linked lists are useful if you need an ordered enumerable data structure that supports arbitrary removal. Regardless of whether you go for arrays or linked lists or whatever, your objects probably need a reference-type wrapper at some point, because mutable lists of structs are… not a good idea.
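As a concrete illustration of that last point, C# won't even let you mutate a struct sitting in a List, because the indexer returns a copy:

struct Projectile { public Vector2 position; }
var projectiles = new List<Projectile> { new Projectile() };
// projectiles[0].position = Vector2.one; // compile error CS1612: the
// indexer returns a copy, so the write would be silently lost.
var p = projectiles[0];   // copy the struct out...
p.position = Vector2.one; // ...mutate the copy...
projectiles[0] = p;       // ...and write the whole thing back.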

4.2 A Manager

The manager itself is a lot of boilerplate, but it’s all important boilerplate. Let’s step through writing a manager.

public class ManyObjectHolder : MonoBehaviour {
    private static readonly int positionPropertyId = Shader.PropertyToID("positionBuffer");
    private static readonly int directionPropertyId = Shader.PropertyToID("directionBuffer");
    private static readonly int timePropertyId = Shader.PropertyToID("timeBuffer");

    private MaterialPropertyBlock pb;
    private static readonly ComputeBufferPool fCBP = new ComputeBufferPool(batchSize, 4, ComputeBufferType.Default);
    private static readonly ComputeBufferPool v2CBP = new ComputeBufferPool(batchSize, 8, ComputeBufferType.Default);
    private static readonly ComputeBufferPool argsCBP = new ComputeBufferPool(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
    private readonly Vector2[] posArr = new Vector2[batchSize];
    private readonly Vector2[] dirArr = new Vector2[batchSize];
    private readonly float[] timeArr = new float[batchSize];
    private readonly uint[] args = new uint[] { 0, 0, 0, 0, 0 };
    private const int batchSize = 7;
    public int instanceCount;

    public Sprite sprite;
    public Material baseMaterial;
    private RenderInfo ri;
    public string layerRenderName;
    private int layerRender;
    private FObject[] objects;
    ...

The first boilerplate we need is the shader property ID of each of the data arrays we declared. I’ve requested a third array, timeBuffer, which doesn't yet exist in the shader. This is actually fine, as providing extra data to the shader won't break it.

Next, we need to declare a MaterialPropertyBlock object which passes information to the shader, as well as the compute buffer pools we need. There are three pools: fCBP for float (size 4), v2CBP for float2/Vector2 (size 8), and argsCBP for the special bufferWithArgs argument. (See part 6 for a caveat on sizing.)

We won’t copy information directly from the objects into the compute buffers when it’s time to render, as this isn’t efficient. Instead, we’ll first copy them into intermediate arrays, which can be copied into compute buffers efficiently.

If we want to reuse compute buffers, they always need to be the same length. This length is our batch size: the maximum number of objects that will be dumped into a single DMII call. I use a batch size of about 1000 in my project, and I’m not sure what the maximum is.

While we can’t use sorting layers, we still have to render our objects to a specific camera culling layer. In a multi-camera setup, we can also use the culling layer to block rendering on cameras that don’t have a matching culling mask.

private void Start() {
    pb = new MaterialPropertyBlock();
    layerRender = LayerMask.NameToLayer(layerRenderName);
    ri = RenderInfo.FromSprite(baseMaterial, sprite);
    Camera.onPreCull += RenderMe;
    objects = new FObject[instanceCount];
    for (int ii = 0; ii < instanceCount; ++ii) {
        objects[ii] = new FObject();
    }
}

private void Update() {
    float dT = Time.deltaTime;
    for (int ii = 0; ii < instanceCount; ++ii) {
        objects[ii].DoUpdate(dT);
    }
}

private void OnDestroy() {
    Debug.Log("Cleaning up compute buffers");
    // Unhook the callback so the destroyed object isn't invoked next frame.
    Camera.onPreCull -= RenderMe;
    fCBP.Dispose();
    v2CBP.Dispose();
    argsCBP.Dispose();
}

Initialization isn’t too complicated: we build our rendering information and our objects, and attach our rendering function (below) to Camera.onPreCull, which is, to my knowledge, the standard place to do DMII work. Update is self-explanatory. In OnDestroy, don't forget to unhook the callback and dispose your compute buffers!

private void RenderMe(Camera c) {
    if (!Application.isPlaying) { return; }
    fCBP.Flush();
    v2CBP.Flush();
    argsCBP.Flush();
    args[0] = ri.mesh.GetIndexCount(0);
    for (int done = 0; done < instanceCount; done += batchSize) {
        int run = Math.Min(instanceCount - done, batchSize);
        args[1] = (uint)run;
        for (int batchInd = 0; batchInd < run; ++batchInd) {
            var obj = objects[done + batchInd];
            posArr[batchInd] = obj.position;
            dirArr[batchInd] = new Vector2(Mathf.Cos(obj.rotation) * obj.scale, Mathf.Sin(obj.rotation) * obj.scale);
            timeArr[batchInd] = obj.time;
        }
        var posCB = v2CBP.Rent();
        var dirCB = v2CBP.Rent();
        var timeCB = fCBP.Rent();
        posCB.SetData(posArr, 0, 0, run);
        dirCB.SetData(dirArr, 0, 0, run);
        timeCB.SetData(timeArr, 0, 0, run);
        pb.SetBuffer(positionPropertyId, posCB);
        pb.SetBuffer(directionPropertyId, dirCB);
        pb.SetBuffer(timePropertyId, timeCB);
        var argsCB = argsCBP.Rent();
        argsCB.SetData(args);
        CallRender(c, argsCB);
    }
}

The rendering setup is fairly methodical. First, we flush our compute buffers, which are no longer tied to the previous frame’s rendering calls. Then, we iterate over our object instances in groups of up to batchSize, storing the size of the group in args[1]. Within each group, we iteratively copy data from the objects into the intermediate arrays. Then we create compute buffers, copy the intermediate arrays into them, and attach the compute buffers to the property block. We have to awkwardly do the same with the argument buffer as well. Finally, we invoke the actual DMII call in a separate function (below).

private void CallRender(Camera c, ComputeBuffer argsBuffer) {
    Graphics.DrawMeshInstancedIndirect(ri.mesh, 0, ri.mat,
        bounds: new Bounds(Vector3.zero, Vector3.one * 1000f),
        bufferWithArgs: argsBuffer,
        argsOffset: 0,
        properties: pb,
        castShadows: ShadowCastingMode.Off,
        receiveShadows: false,
        layer: layerRender,
        camera: c);
}

This is the core of DMII. It requires: a mesh; a submesh index (if you don’t know what that means, it’s 0); a material; a Bounds object that delineates the drawing space (in my testing it doesn’t seem to do anything); an argument buffer containing the number of things to draw, along with a bunch of other numbers that are also 0 if you don’t know what they mean; the MaterialPropertyBlock we modified before calling this function; some shadow information; the target camera layer; and the target camera (or null for all cameras).
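For reference, the five uints in the args buffer follow the standard indirect-draw argument layout. The manager above only sets the first two and leaves the rest at zero, which is correct for a single-submesh mesh starting at index 0:

args[0] = ri.mesh.GetIndexCount(0); // index count per instance
args[1] = (uint)run;                // instance count
args[2] = ri.mesh.GetIndexStart(0); // start index location (0 here)
args[3] = ri.mesh.GetBaseVertex(0); // base vertex location (0 here)
args[4] = 0;                        // start instance location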

With this, our basic model is complete, and we can now render a bunch of moving objects to screen with super efficiency.

5. Adding a Feature: Fade-In Time

Remember timeBuffer? Let's make some use of it by having the objects fade in over time.

First, we create a shader variable for the time over which the object should fade in. We use a shader variable instead of a buffer because this value is shared. (shader and shared are respellings of each other!)

Properties {
    _MainTex("Texture", 2D) = "white" {}
    _FadeInT("Fade in time", Float) = 10 // New
}
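If you'd rather set this from a script than from the material inspector, the usual property API works (the value here is arbitrary):

ri.mat.SetFloat("_FadeInT", 2f);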

Then, we declare timeBuffer along with the other two data arrays:

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
StructuredBuffer<float2> positionBuffer;
StructuredBuffer<float2> directionBuffer;
StructuredBuffer<float> timeBuffer; // New
#endif

Next, we need to decide whether to do our calculations in the vertex or fragment shader. We’ll do it in the fragment shader to show the extra boilerplate. If we want access to unity_InstanceID in the fragment shader, we need to add a few lines:

struct fragment {
    float4 loc : SV_POSITION;
    float2 uv : TEXCOORD0;
    UNITY_VERTEX_INPUT_INSTANCE_ID // New
};
fragment vert(vertex v) {
    fragment f;
    UNITY_SETUP_INSTANCE_ID(v);
    UNITY_TRANSFER_INSTANCE_ID(v, f); // New
    f.loc = UnityObjectToClipPos(v.loc);
    f.uv = v.uv;
    //f.uv = TRANSFORM_TEX(v.uv, _MainTex);
    return f;
}
float4 frag(fragment f) : SV_Target {
    UNITY_SETUP_INSTANCE_ID(f); // New
    float4 c = tex2D(_MainTex, f.uv);
    return c;
}

Why does it work like this? No idea.

Finally, we can do the actual fade-in. Once all the boilerplate is out of the way, this is remarkably simple. The “normal” way to do fade-in would be c.a *= smoothstep(0.0, _FadeInT, _Time.y) (_Time.y is the scene time in seconds). The only difference for instancing is that time is no longer a shader variable; we instead need to get it from the timeBuffer data array.

float _FadeInT; // New

float4 frag(fragment f) : SV_Target {
    UNITY_SETUP_INSTANCE_ID(f);
    float4 c = tex2D(_MainTex, f.uv);
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED // New
    c.a *= smoothstep(0.0, _FadeInT, timeBuffer[unity_InstanceID]); // New
    #endif // New
    return c;
}

(Note that this is something that you really should do in the vertex shader; I only do it in the fragment shader here to show ID transferring.)

And here’s the result: moving, rotating sprites that fade in over time. Since we randomized the starting time, some of them start off somewhat opaque.

6. Annoying Details

To my knowledge, it’s not possible to assign a sorting layer to DMII calls. This means that you probably need separate cameras for your DMII calls, since you can’t sort them with non-DMII objects. In my setup, I have six (!) cameras: a ground layer camera, a “LowDirectRender” camera for DMII calls, a middle camera for most standard objects, a “HighDirectRender” camera for other DMII calls, a top camera for effects, high-priority objects, and post-processing, and a UI camera.

DMII has some important ordering rules. First, DMII calls are ordered by render queue: materials with lower render queue values render first. Within the same render queue value, different materials are ordered by their time of creation. This is a really strange behavior, and you should absolutely circumvent it by making sure you never use DMII on two materials with the same render queue value. Multiple DMII calls on the same material are ordered by the call order in your scripts (thankfully). Within each DMII call, instance ID 0 renders first, and IDs increment from there (thankfully).
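One way to guarantee distinct queue values is to offset each instantiated material's queue when you create it in RenderInfo.FromSprite (renderPriority here is a hypothetical per-type index):

renderMaterial.renderQueue = baseMaterial.renderQueue + renderPriority;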

There’s also one issue of formal correctness. In HLSL, float is 4 bytes and float2 is 8 bytes; that’s why we declare our fCBP and v2CBP with strides of 4 and 8 respectively. However, it's not "guaranteed" that C# float and Vector2 are also 4 and 8 bytes, which means that copying data into the compute buffer might technically be incorrect. To be maximally sure, you may want to add runtime checks on sizeof(float) and (unsafe) sizeof(Vector2) to make sure they're the expected sizes.
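A startup assertion along these lines would catch a mismatch; sizeof(float) is legal in safe code, and Marshal.SizeOf avoids the unsafe context for Vector2:

using System.Runtime.InteropServices;
...
Debug.Assert(sizeof(float) == 4, "float is not 4 bytes");
Debug.Assert(Marshal.SizeOf(typeof(Vector2)) == 8, "Vector2 is not 8 bytes");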

The bufferWithArgs argument is no longer required in the Unity 2020 alpha; you can instead use DrawMeshInstancedProcedural, which is exactly the same as DMII but takes a count number instead of an awkward args buffer. The alpha also has some nice mesh functions which will show up in one of my next devlogs.
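Under the same assumptions as CallRender above, the replacement call should look roughly like this:

Graphics.DrawMeshInstancedProcedural(ri.mesh, 0, ri.mat,
    bounds: new Bounds(Vector3.zero, Vector3.one * 1000f),
    count: run,
    properties: pb,
    castShadows: ShadowCastingMode.Off,
    receiveShadows: false,
    layer: layerRender,
    camera: c);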

You might be wondering why I use #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED when the shader probably wouldn't handle non-instanced drawing anyways. For whatever reason, I had a lot of problems getting instancing to work the first time around when I tried dropping this tag. So now I use it as a habit. Since it's a compile-time directive, it's not slowing anything down anyways.

Conclusion

DMII is literally magic. I use it heavily and it’s super-effective. For example, consider this almost empty scene, which requires 10 rendering calls (for post-processing effects):

And here’s a scene with about 14000 moving circles (recolored from one sprite using the technique discussed in my first devlog), that requires an astounding 14 more draw calls:

All of these objects are pure code abstractions: no GameObjects! (I suspect that I should look into ECS soon…)

That’s all for this devlog. Again, all the resources for this post can be found in this GitHub repo, licensed under CC0 (effectively public domain).
