GPU Particle Research — Bevy Hanabi, Part 1

Sou1gh0st
10 min read · Mar 16, 2024


Introduction

Around Chinese New Year 2024, Lakr, Cyan and I released an open-source application, Firework (https://github.com/Lakr233/Firework), which emits fireworks from the Dock onto the screen. To make the fireworks brighter we used Metal's HDR features, which ruled out CAEmitterLayer, so we built the particle system from scratch on top of a simple GPU particle simulator and renderer written by Cyan (https://gist.github.com/unixzii/aeefe8edbd6a685cb3e230b5b30841db).

To simulate fireworks, we needed features such as particle lifetime, spawners, and sub-spawners. I wasn't experienced with GPU particles at the time, so I simulated the lifetime and spawners on the CPU and only used the GPU for the physics simulation. Because of this, we had to sync data between the CPU and GPU every frame, which caused a number of performance issues.

I was curious how modern game engines implement high-performance GPU particle simulators, so I planned to look into Bevy Hanabi, Unity, and Unreal Engine. In this post, I will share my research on Bevy Hanabi.

Core Concepts of Bevy Hanabi

JIT Shader

Bevy Hanabi dynamically generates the particle simulation code based on the user's particle spawner configuration. For example, the first stage of the GPU particle simulator is vfx_init. If we look at its shader, we can see that it is incomplete; it's just a template:

#import bevy_hanabi::vfx_common::{
    ForceFieldSource, IndirectBuffer, RenderIndirect, SimParams, Spawner,
    seed, tau, pcg_hash, to_float01, frand, frand2, frand3, frand4,
    rand_uniform, proj
}

struct Particle {
    {{ATTRIBUTES}}
}

struct ParticleBuffer {
    particles: array<Particle>,
}

{{PROPERTIES}}

@group(0) @binding(0) var<uniform> sim_params : SimParams;
@group(1) @binding(0) var<storage, read_write> particle_buffer : ParticleBuffer;
@group(1) @binding(1) var<storage, read_write> indirect_buffer : IndirectBuffer;
{{PROPERTIES_BINDING}}
@group(2) @binding(0) var<storage, read_write> spawner : Spawner; // NOTE - same group as update
@group(3) @binding(0) var<storage, read_write> render_indirect : RenderIndirect;

{{INIT_EXTRA}}

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_invocation_id: vec3<u32>) {
    var index = global_invocation_id.x;

    // Cap to max number of dead particles, copied from dead_count at the end of the
    // previous iteration, and constant during this pass (unlike dead_count).
    if (index >= render_indirect.max_spawn) {
        return;
    }

    // Cap to the actual number of spawning requested by CPU, since compute shaders run
    // in workgroup_size(64) so more threads than needed are launched (rounded up to 64).
    let spawn_count : u32 = u32(spawner.spawn);
    if (index >= spawn_count) {
        return;
    }

    // Recycle a dead particle
    let dead_index = atomicSub(&render_indirect.dead_count, 1u) - 1u;
    index = indirect_buffer.indices[3u * dead_index + 2u];

    // Update PRNG seed
    seed = pcg_hash(index ^ spawner.seed);

    // Spawner transform
    let transform = transpose(
        mat4x4(
            spawner.transform[0],
            spawner.transform[1],
            spawner.transform[2],
            vec4<f32>(0.0, 0.0, 0.0, 1.0)
        )
    );

    // Initialize new particle
    var particle = Particle();
    {{INIT_CODE}}

    {{SIMULATION_SPACE_TRANSFORM_PARTICLE}}

    // Count as alive
    atomicAdd(&render_indirect.alive_count, 1u);

    // Always write into ping, read from pong
    let ping = render_indirect.ping;

    // Add to alive list
    let indirect_index = atomicAdd(&render_indirect.instance_count, 1u);
    indirect_buffer.indices[3u * indirect_index + ping] = index;

    // Write back spawned particle
    particle_buffer.particles[index] = particle;
}

We can find several mustache-style placeholder regions in the code:

// ...
struct Particle {
    {{ATTRIBUTES}}
}

// ...

{{PROPERTIES}}

// ...
{{PROPERTIES_BINDING}}
@group(2) @binding(0) var<storage, read_write> spawner : Spawner; // NOTE - same group as update
@group(3) @binding(0) var<storage, read_write> render_indirect : RenderIndirect;

{{INIT_EXTRA}}

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_invocation_id: vec3<u32>) {
    // ...

    // Initialize new particle
    var particle = Particle();
    {{INIT_CODE}}

    {{SIMULATION_SPACE_TRANSFORM_PARTICLE}}

    // ...
}

Each particle effect has its own shaders and pipeline configuration in Bevy Hanabi. For the firework example, located at examples/firework.rs, the spawner configuration looks like this:

let effect = EffectAsset::new(
    32768,
    Spawner::burst(2500.0.into(), 2.0.into()),
    writer.finish(),
)
.with_name("firework")
.init(init_pos)
.init(init_vel)
.init(init_age)
.init(init_lifetime)
.update(update_drag)
.update(update_accel)
.render(ColorOverLifetimeModifier {
    gradient: color_gradient1,
})
.render(SizeOverLifetimeModifier {
    gradient: size_gradient1,
    screen_space_size: false,
});

let effect1 = effects.add(effect);

commands.spawn((
    Name::new("firework"),
    ParticleEffectBundle {
        effect: ParticleEffect::new(effect1),
        transform: Transform::IDENTITY,
        ..Default::default()
    },
));
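
The modifiers referenced above (init_pos, init_vel, init_age, init_lifetime, update_drag, update_accel) are built with Hanabi's expression writer before writer.finish() is called. The sketch below shows roughly what they look like for the firework effect, matching the values we will see in the generated shader (sphere radius 2, speed 60..80, age 0..0.2, lifetime 0.8..1.2, drag 5, acceleration (0, -8, 0)); exact modifier and method names may differ between bevy_hanabi versions.

// Sketch of the expression/modifier setup assumed by the snippet above.
let writer = ExprWriter::new();

// Spawn on the surface of a sphere of radius 2.
let init_pos = SetPositionSphereModifier {
    center: writer.lit(Vec3::ZERO).expr(),
    radius: writer.lit(2.).expr(),
    dimension: ShapeDimension::Surface,
};

// Radial speed: 60 + rand() * 20, matching the generated init code.
let init_vel = SetVelocitySphereModifier {
    center: writer.lit(Vec3::ZERO).expr(),
    speed: (writer.rand(ScalarType::Float) * writer.lit(20.) + writer.lit(60.)).expr(),
};

// Random initial age and lifetime.
let init_age = SetAttributeModifier::new(
    Attribute::AGE,
    writer.lit(0.).uniform(writer.lit(0.2)).expr(),
);
let init_lifetime = SetAttributeModifier::new(
    Attribute::LIFETIME,
    writer.lit(0.8).uniform(writer.lit(1.2)).expr(),
);

// Linear drag plus a constant downward acceleration.
let update_drag = LinearDragModifier::new(writer.lit(5.).expr());
let update_accel = AccelModifier::new(writer.lit(Vec3::new(0., -8., 0.)).expr());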

Bevy Hanabi generates code based on this configuration and substitutes it into the init shader template:

// Configure the init shader template, and make sure a corresponding shader
// asset exists
let init_shader_source = PARTICLES_INIT_SHADER_TEMPLATE
    .replace("{{ATTRIBUTES}}", &attributes_code)
    .replace("{{INIT_CODE}}", &init_code)
    .replace("{{INIT_EXTRA}}", &init_extra)
    .replace("{{PROPERTIES}}", &properties_code)
    .replace("{{PROPERTIES_BINDING}}", &properties_binding_code)
    .replace(
        "{{SIMULATION_SPACE_TRANSFORM_PARTICLE}}",
        &init_sim_space_transform_code,
    );
trace!("Configured init shader:\n{}", init_shader_source);

And the final shader looks like this:

#import bevy_hanabi::vfx_common::{
    ForceFieldSource, IndirectBuffer, RenderIndirect, SimParams, Spawner,
    seed, tau, pcg_hash, to_float01, frand, frand2, frand3, frand4,
    rand_uniform, proj
}

struct Particle {
    position: vec3<f32>,
    age: f32,
    velocity: vec3<f32>,
    lifetime: f32,
}

struct ParticleBuffer {
    particles: array<Particle>,
}

@group(0) @binding(0) var<uniform> sim_params : SimParams;
@group(1) @binding(0) var<storage, read_write> particle_buffer : ParticleBuffer;
@group(1) @binding(1) var<storage, read_write> indirect_buffer : IndirectBuffer;
// (no properties)
@group(2) @binding(0) var<storage, read_write> spawner : Spawner; // NOTE - same group as update
@group(3) @binding(0) var<storage, read_write> render_indirect : RenderIndirect;

fn set_position_sphere_AE31FA577EAADA90(particle: ptr<function, Particle>) {
    // Sphere center
    let c = vec3<f32>(0.,0.,0.);

    // Sphere radius
    let r = 2.;

    // Spawn randomly along the sphere surface using Archimedes's theorem
    let theta = frand() * tau;
    let z = frand() * 2. - 1.;
    let phi = acos(z);
    let sinphi = sin(phi);
    let x = sinphi * cos(theta);
    let y = sinphi * sin(theta);
    let dir = vec3<f32>(x, y, z);
    (*particle).position = c + r * dir;
}

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_invocation_id: vec3<u32>) {
    var index = global_invocation_id.x;

    // Cap to max number of dead particles, copied from dead_count at the end of the
    // previous iteration, and constant during this pass (unlike dead_count).
    if (index >= render_indirect.max_spawn) {
        return;
    }

    // Cap to the actual number of spawning requested by CPU, since compute shaders run
    // in workgroup_size(64) so more threads than needed are launched (rounded up to 64).
    let spawn_count : u32 = u32(spawner.spawn);
    if (index >= spawn_count) {
        return;
    }

    // Recycle a dead particle
    // dead_count is initially equal to the maximum particle capacity
    let dead_index = atomicSub(&render_indirect.dead_count, 1u) - 1u;
    index = indirect_buffer.indices[3u * dead_index + 2u]; // layer 2 holds the dead indices

    // Update PRNG seed
    seed = pcg_hash(index ^ spawner.seed);

    // Spawner transform
    let transform = transpose(
        mat4x4(
            spawner.transform[0],
            spawner.transform[1],
            spawner.transform[2],
            vec4<f32>(0.0, 0.0, 0.0, 1.0)
        )
    );

    // Initialize new particle
    var particle = Particle();
    set_position_sphere_AE31FA577EAADA90(&particle);
    let var0 = frand();
    particle.velocity = normalize(particle.position - (vec3<f32>(0.,0.,0.))) * (((var0) * (20.)) + (60.));
    particle.age = rand_uniform(0., 0.2);
    particle.lifetime = rand_uniform(0.8, 1.2);

    particle.position += transform[3].xyz;

    // Count as alive
    atomicAdd(&render_indirect.alive_count, 1u);

    // Always write into ping, read from pong
    let ping = render_indirect.ping;

    // Add to alive list
    let indirect_index = atomicAdd(&render_indirect.instance_count, 1u);
    indirect_buffer.indices[3u * indirect_index + ping] = index;

    // Write back spawned particle
    particle_buffer.particles[index] = particle;
}

By using JIT shaders, Bevy Hanabi can specialize the compute shader code for each effect, which improves performance.

Compute Shader Passes

Particle simulation in Bevy Hanabi uses three compute shader passes:

Init

  • Spawn and initialize new particles if needed; otherwise the pass bails out at the beginning. (We will cover the ping-pong double-buffering mechanism later.) Note that alive_count in render_indirect is incremented for every emitted particle; it is used later in the Indirect Pass to calculate the number of thread groups needed to update the alive particles.
// vfx_init.wgsl

@group(3) @binding(0) var<storage, read_write> render_indirect : RenderIndirect;

// Recycle a dead particle
// dead_count is initially equal to the maximum particle capacity
let dead_index = atomicSub(&render_indirect.dead_count, 1u) - 1u;
index = indirect_buffer.indices[3u * dead_index + 2u]; // layer 2 holds the dead indices

// ...

// Initialize new particle
var particle = Particle();
// ...
// Count as alive
atomicAdd(&render_indirect.alive_count, 1u);

// Always write into ping, read from pong
let ping = render_indirect.ping;

// Add to alive list
let indirect_index = atomicAdd(&render_indirect.instance_count, 1u);
indirect_buffer.indices[3u * indirect_index + ping] = index;

// Write back spawned particle
particle_buffer.particles[index] = particle;
  • The workgroup count is based on how many particles should be emitted this frame; note that the spawn count is ticked on the CPU side (a sketch of that tick follows the snippet below).
// src/render/mod.rs

// Dispatch init compute jobs
for batch in self.effect_query.iter_manual(world) {
    num_batches += 1;

    if let Some(init_pipeline) =
        pipeline_cache.get_compute_pipeline(batch.init_pipeline_id)
    {
        // Do not dispatch any init work if there's nothing to spawn this frame
        let spawn_count = batch.spawn_count;
        if spawn_count == 0 {
            continue;
        }

        const WORKGROUP_SIZE: u32 = 64;
        let workgroup_count = (spawn_count + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE;
        // ...
        compute_pass.dispatch_workgroups(workgroup_count, 1, 1);
        // ...
    }
}
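
The CPU-side tick essentially turns the effect's spawn configuration into an integer number of particles to emit this frame, carrying any fractional remainder over to the next frame. The snippet below is a simplified, hypothetical illustration of that idea (the SpawnerState type and tick signature are made up for this post, not Hanabi's actual API):

// Hypothetical CPU-side spawner tick, illustrating how a fractional spawn
// rate becomes the integer spawn_count used to size the init dispatch.
struct SpawnerState {
    rate: f32,        // particles per second requested by the effect
    accumulated: f32, // fractional particles carried over from previous frames
}

impl SpawnerState {
    fn tick(&mut self, dt: f32) -> u32 {
        self.accumulated += self.rate * dt;
        let spawn_count = self.accumulated.floor();
        self.accumulated -= spawn_count;
        spawn_count as u32
    }
}

// Example: at 100 particles/s and 60 FPS, tick() returns 1 or 2 each frame,
// and the leftover fraction is spent on later frames.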

Indirect

  • Calculate the number of thread groups to dispatch for the update pass. Here render_indirect_buffer is the same underlying buffer as the render_indirect binding used in the Init Pass, just bound to a different group and viewed as a raw u32 array; dispatch_indirect_buffer receives the workgroup counts that the update pass is later dispatched with indirectly.
  • Note that the number of threads launched is rounded up to a multiple of 64 (the workgroup size), while the actual number of particles to update is capped by alive_count. For example, with 130 alive particles, (130 + 63) >> 6 = 3 workgroups are dispatched.
// vfx_indirect.wgsl
@group(0) @binding(0) var<storage, read_write> render_indirect_buffer : array<u32>;
@group(0) @binding(1) var<storage, read_write> dispatch_indirect_buffer : array<u32>;

// Calculate the number of thread groups to dispatch for the update pass, which is
// the number of alive particles rounded up to 64 (workgroup_size).
let alive_count = render_indirect_buffer[ri_base + RI_OFFSET_ALIVE_COUNT];
dispatch_indirect_buffer[di_base + DI_OFFSET_X] = (alive_count + 63u) >> 6u;

// Update max_update from current value of alive_count, so that the update pass
// coming next can cap its threads to this value, while also atomically modifying
// alive_count itself for next frame.
render_indirect_buffer[ri_base + RI_OFFSET_MAX_UPDATE] = alive_count;
  • Reset the instance_count so we can determine how many particles are actually alive after the Update Pass.
// Clear the rendering instance count, which will be upgraded by the update pass
// with the particles actually alive at the end of their update (after aged).
render_indirect_buffer[ri_base + RI_OFFSET_INSTANCE_COUNT] = 0u;
  • Swap the ping-pong buffers. This is the key to the double-buffering mechanism that lets simulation and rendering work on separate index lists (note that the compute passes and the render pass still run serially; the point is that some buffer writes will not affect rendering within the same frame). We will cover this mechanism later.
// Swap ping/pong buffers
let ping = render_indirect_buffer[ri_base + RI_OFFSET_PING];
let pong = 1u - ping;
render_indirect_buffer[ri_base + RI_OFFSET_PING] = pong;

// Copy the new pong into the dispatch buffer, which will be used during rendering
// to determine where to read particle indices.
dispatch_indirect_buffer[di_base + DI_OFFSET_PONG] = pong;

Update

  • Retrieve the particle from the buffer and update its properties (the update code is dynamically generated from the user's configuration):
// vfx_update.wgsl

// Always write into ping, read from pong
let ping = render_indirect.ping;
let pong = 1u - ping;

let index = indirect_buffer.indices[3u * thread_index + pong];

var particle: Particle = particle_buffer.particles[index];

// the following code is dynamically generated
particle.age = particle.age + sim_params.delta_time;
var is_alive = particle.age < particle.lifetime;
particle.velocity *= max(0., (1.) - ((5.) * (sim_params.delta_time)));
particle.velocity += (vec3<f32>(-0.,-8.,-0.)) * sim_params.delta_time;
particle.position += particle.velocity * sim_params.delta_time;
// ...
particle_buffer.particles[index] = particle;
  • Check whether the particle is still alive after this frame. If it is, write its index to an alive region; otherwise write it to the dead region. We will cover the regions of the index buffer later.
// vfx_update.wgsl

is_alive = is_alive && (particle.age < particle.lifetime);
// ...
// Check if alive
if (!is_alive) {
    // Save dead index
    let dead_index = atomicAdd(&render_indirect.dead_count, 1u);
    indirect_buffer.indices[3u * dead_index + 2u] = index;
    // Also increment copy of dead count, which was updated in dispatch indirect
    // pass just before, and need to remain correct after this pass
    atomicAdd(&render_indirect.max_spawn, 1u);
    atomicSub(&render_indirect.alive_count, 1u);
} else {
    // Increment alive particle count and write indirection index for later rendering
    let indirect_index = atomicAdd(&render_indirect.instance_count, 1u);
    indirect_buffer.indices[3u * indirect_index + ping] = index;
}

These passes are dispatched before the rendering passes every frame, and the Init Pass is skipped when there are no particles to emit.

Buffer Management

To store particles in GPU buffers and keep simulation and rendering from interfering with each other's data, a ParticleBuffer stores the data of all particles, and an IndexBuffer with three regions indexes the alive and dead particles.

ParticleBuffer

The layout of the ParticleBuffer is derived from the user's configuration; for the firework emitter, it is defined as follows:

// vfx_init.wgsl

struct Particle {
    position: vec3<f32>,
    age: f32,
    velocity: vec3<f32>,
    lifetime: f32,
}

struct ParticleBuffer {
    particles: array<Particle>,
}

// The buffer is created based on the capacity of particles on the CPU side
@group(1) @binding(0) var<storage, read_write> particle_buffer : ParticleBuffer;
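
As a back-of-the-envelope check of what this costs, each particle occupies the aligned size of the Particle struct, and the buffer holds capacity of them. The sketch below works this out for the firework layout using WGSL storage-buffer alignment rules (vec3&lt;f32&gt; is 12 bytes aligned to 16); Hanabi's actual allocation may add extra padding or pooling on top of this.

// Rough sizing of the particle storage buffer for the firework effect.
//
//   position: vec3<f32>  -> offset  0, size 12
//   age:      f32        -> offset 12, size  4
//   velocity: vec3<f32>  -> offset 16, size 12
//   lifetime: f32        -> offset 28, size  4
//   => Particle stride = 32 bytes
const PARTICLE_STRIDE: u64 = 32;
const CAPACITY: u64 = 32_768; // from EffectAsset::new(32768, ...)

fn main() {
    let buffer_size = PARTICLE_STRIDE * CAPACITY;
    // 1_048_576 bytes, i.e. exactly 1 MiB for 32768 particles
    println!("particle buffer: {} bytes", buffer_size);
}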

IndexBuffer

The index buffer is a bit more involved since it has three regions: two for alive particle indices and one for dead particle indices.

At the beginning of the simulation, the dead region is filled with particle indices in reverse order, and dead_count equals the particle capacity (the maximum number of particles in the simulation). Each time we spawn a particle and add it to an alive region, we pop an index from the dead region using dead_count. For example, with a capacity of 3, dead_count starts at 3, instance_count starts at 0, and newly spawned particles are written into alive region 0.
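
To make this concrete, here is a small illustration (not Hanabi's code) of how such an index buffer could be laid out and initialized. The buffer interleaves the three regions with a stride of 3, which is why the shaders index it as indices[3u * i + region]:

// Illustration only: layout and initialization of the interleaved index buffer.
//   indices[3*i + 0] -> alive region 0
//   indices[3*i + 1] -> alive region 1
//   indices[3*i + 2] -> dead region
fn init_index_buffer(capacity: u32) -> Vec<u32> {
    let mut indices = vec![0u32; (capacity * 3) as usize];
    for i in 0..capacity {
        // Fill the dead region with particle indices in reverse order, so that
        // popping from the end of the dead "stack" (dead_count - 1) hands out
        // particle 0 first, then 1, 2, ...
        indices[(3 * i + 2) as usize] = capacity - 1 - i;
    }
    indices
}

// With capacity = 3 the dead region holds [2, 1, 0], dead_count starts at 3,
// and instance_count starts at 0, as in the example above.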

With this layout in mind, let's go over the code in the Init Pass again; now we can see how it works:

// Recycle a dead particle
// dead_count is initially equal to the maximum particle capacity
let dead_index = atomicSub(&render_indirect.dead_count, 1u) - 1u;
// layer 2 holds the dead indices
index = indirect_buffer.indices[3u * dead_index + 2u];

// Always write into ping, read from pong
let ping = render_indirect.ping;

// Add to alive list
let indirect_index = atomicAdd(&render_indirect.instance_count, 1u);
indirect_buffer.indices[3u * indirect_index + ping] = index;

// Write back spawned particle
particle_buffer.particles[index] = particle;

Ping-pong Buffering

The last question is: what are ping and pong, and why do we need two alive regions? Because the Update Pass both consumes an alive list and builds the next one, writing the new list over the one being read would corrupt it; hence the rule "always write into ping, read from pong". The Init Pass writes new particle indices into the ping region, the Indirect Pass swaps the values of ping and pong, and the Update Pass then reads from pong (the region that was just written) while writing survivors into the new ping, which is also the region the renderer draws from. The sketch below walks through this bookkeeping for one frame.
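
This is a minimal illustration using plain Rust variables in place of the GPU atomics (not Hanabi's code; the pass order follows the shader comments quoted above):

// Toy model of the ping/pong fields touched by the three passes in one frame.
struct RenderIndirect { ping: u32 }
struct DispatchIndirect { pong: u32 }

fn simulate_frame(ri: &mut RenderIndirect, di: &mut DispatchIndirect) {
    // Init pass: new particle indices are appended to the current ping region.
    let init_write_region = ri.ping;

    // Indirect pass: swap ping/pong and tell the renderer which region to read.
    let pong = 1 - ri.ping;
    ri.ping = pong;
    di.pong = pong;

    // Update pass: read the current alive list (including the indices init just
    // wrote) from pong, and write the survivors into ping.
    let update_read_region = 1 - ri.ping;
    let update_write_region = ri.ping;
    assert_eq!(update_read_region, init_write_region);
    assert_eq!(update_write_region, di.pong); // the renderer draws what update wrote
}

fn main() {
    let mut ri = RenderIndirect { ping: 0 };
    let mut di = DispatchIndirect { pong: 1 };
    simulate_frame(&mut ri, &mut di); // init writes region 0, render reads region 1
    simulate_frame(&mut ri, &mut di); // next frame the regions swap roles
}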

What’s Next

We'll discuss the pipeline design of Bevy Hanabi in upcoming articles. Thanks for reading, and have a nice day!
