WebGPU for Fintech Monte Carlo

GPU radix sort for empirical cumulative distribution generation

NTTP
Jun 27, 2024
Nice hex head bolts. Photo by Nana Dua on Unsplash

With the great speedups found for the “lite” web version of our MCarloRisk3D application, gained from CPU multi-thread parallelism (enabled by the excellent parallel.js library, which wraps Web Worker threads), we thought we would do a little research to see whether the performance bottleneck in our Monte Carlo algorithm

(bottleneck = sorting to generate empirical cumulative price/probability distributions)

could be offloaded to the WebGPU environment for processing by all of these amazingly fast GPUs that people have in their computers these days.
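
For context on why sorting is the hot spot: once the simulated prices for a given forecast day are sorted, the empirical cumulative distribution is essentially free, since the fraction of samples at or below the i-th sorted price is just (i + 1) / N. A minimal sketch of that idea (the function and variable names here are ours, for illustration only):

// Illustrative only: read percentiles straight off a sorted price array.
// The fraction of simulated prices at or below sortedPrices[i] is (i + 1) / N,
// so the empirical CDF is just the sorted array indexed by rank.
function percentile(sortedPrices, p /* 0..1 */) {
  const i = Math.min(sortedPrices.length - 1, Math.floor(p * sortedPrices.length))
  return sortedPrices[i]
}

// e.g. a 5%..95% price band for one forecast day:
// const lo = percentile(sortedDayPrices, 0.05)
// const hi = percentile(sortedDayPrices, 0.95)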

It turns out, the answer is yes. We found a great WebGPU-based library providing a radix sort routine, which is sufficient for our purposes compared to the general-purpose array .sort() in vanilla JavaScript.

Radish sort? Photo by Jo Lanta on Unsplash

Radix sort has certain limitations and use cases versus more general sorts, and we will let you look those up in the above references. For our case (positive floating-point prices, with 4-byte floats being of sufficient accuracy), it seems to work perfectly: the bit patterns of positive IEEE 754 floats increase monotonically with their numeric values, so they can be radix sorted as if they were 32-bit unsigned integers.
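
A quick way to convince yourself of that bit-pattern property in the browser console (our own toy example, not from the library):

// Toy check: positive 32-bit floats keep their ordering when their raw
// bit patterns are compared as unsigned integers.
const f32 = new Float32Array([101.25, 99.9, 100.0])
const u32 = new Uint32Array(f32.buffer) // reinterpret the same bytes

const byBits  = [...u32.keys()].sort((a, b) => u32[a] - u32[b])
const byValue = [...f32.keys()].sort((a, b) => f32[a] - f32[b])

console.log(byBits, byValue) // both log the same order: [1, 2, 0]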

We followed the excellent example code provided by the originators of this WebGPU version of radix sort, added it to our parallel implementation of Monte Carlo price drift/diffusion (which allows for asynchronous sorting), and then added one critical missing piece: code that pulls the sorted data back from GPU memory to CPU memory so it can be further processed by the main application, which is written in vanilla JS. We found this read-back example somewhere in the Google Chrome WebGPU docs; it may or may not be optimized.


// Input array should be a Float32Array (or the unsigned int equivalent).
// Of course, if we have US dollar prices and use uints, we would want to
// multiply each price by 1000 to avoid losing the pennies, then divide by
// 1000 after the sort (with a bit of proper rounding on the tenth of a
// penny if desired).
async function runGPUSort(device, inputArray) {

  const keys = inputArray // new Uint32Array(settings.element_count)

  const [keysBuffer, keysBufferCPU] = create_buffers(device, new Float32Array(keys))

  // console.log("keysBuffer", keysBuffer)

  let valuesBuffer = null

  // Can't re-use the kernel at this level, because the keysBuffer and count
  // passed in are custom for each run.

  const kernel = new RadixSortKernel({
    device,
    keys: keysBuffer,
    values: valuesBuffer,
    count: inputArray.length,
    bit_count: 32,
    workgroup_size: { x: 16, y: 16 }, // could be tuned?
    check_order: false,
    local_shuffle: false,        // final 2 params could be tuned
    avoid_bank_conflicts: false, // for different array sizes?
  })

  // console.log("kernel = ", kernel)

  const encoder = device.createCommandEncoder()
  const pass = encoder.beginComputePass(/*query.descriptor*/)
  kernel.dispatch(pass)
  pass.end()

  device.queue.submit([encoder.finish()])

  await device.queue.onSubmittedWorkDone()

  // console.log("gpu sort is done")

  // Encode commands for copying buffer to buffer.
  // For optimizing, we may want to put these into
  // the above command encoder?

  // This portion is from a Google WebGPU example.

  const copyEncoder = device.createCommandEncoder()

  copyEncoder.copyBufferToBuffer(
    keysBuffer /* source buffer */,
    0 /* source offset */,
    keysBufferCPU /* destination buffer */,
    0 /* destination offset */,
    inputArray.length * 4 /* size in bytes... we are using 4-byte floats */
  );

  // Submit copy commands.
  const copyCommands = copyEncoder.finish();
  device.queue.submit([copyCommands]);

  // Read the buffer back to a CPU-side JS typed array. We copy the data out
  // of the mapped range and unmap, so the returned array stays valid and the
  // staging buffer is released.
  await keysBufferCPU.mapAsync(GPUMapMode.READ);
  const copyArrayBuffer = keysBufferCPU.getMappedRange();
  const keysBufferCPUFloat32 = new Float32Array(copyArrayBuffer.slice(0));
  keysBufferCPU.unmap();

  /*
  for (let i = 0; i < 10; i++) {
    console.log(i, keysBufferCPUFloat32[i])
  }
  */

  return keysBufferCPUFloat32

}
// The non-trivial data buffer (array) creation routine,
// also from the radix sort sample code.

// Create a GPUBuffer with data from a Uint32Array (originally).
// For Monte Carlo price generation, we are using a Float32Array.
// Also create a second buffer to read data back from the GPU.
const create_buffers = (device, data, usage = 0) => {

  // Transfer data to the GPU.
  const dataBuffer = device.createBuffer({
    size: data.length * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | usage,
    mappedAtCreation: true
  })
  new data.constructor(dataBuffer.getMappedRange()).set(data)
  dataBuffer.unmap() // the dataBuffer is available to the GPU only from this point

  // Create a buffer to read back data from the GPU.
  const dataBufferMapped = device.createBuffer({
    size: data.length * 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  })

  return [dataBuffer, dataBufferMapped]
}

The main routine that we call here, RadixSortKernel, is the GPU code itself; it can be found in the GitHub repository referenced above.

We leave out details of GPU device init, as that is also in the example.
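
For reference, a minimal device setup looks roughly like the following (error handling trimmed; our actual init follows the library example rather than this sketch):

// Minimal WebGPU device initialization sketch.
// navigator.gpu exists only in browsers with WebGPU enabled.
async function initGPUDevice() {
  if (!navigator.gpu) {
    throw new Error("WebGPU is not available in this browser")
  }
  const adapter = await navigator.gpu.requestAdapter()
  if (!adapter) {
    throw new Error("No suitable GPU adapter found")
  }
  return await adapter.requestDevice()
}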

When we invoke the async runGPUSort function, we chain .then() on the returned promise rather than awaiting it, so our code doesn’t pause while the GPU works. While await does have its moments, this was not one of them.

runGPUSort(this.GPUDevice, GPU_tmp_finalpricedist).then((output) => {
  afterSortI(output) // output is the sorted array
})

The internals of RadixSortKernel are “non trivial” as the saying goes, but so is Generative A.I., right? And there are all sorts of “thought leaders” coming out of the woodwork on that topic these days. We are just using RadixSortKernel as a pre-built module here. Where are the radix sort thought leaders?! (Oh yeah, we linked a couple above, and they are thought leaders in the vero implied rather than the fugazi express sense. Meaning, they have a lot of useful things to say about it.)

Ours is a reference example and not performance-optimized. For example, GPU algorithms can often benefit from tuning the GPU thread counts and workgroup structure for different data sizes. They can sometimes gain an advantage from batching calls to the GPU, so that the application sends, say, several arrays or blocks of data at a time to GPU memory and sorts or otherwise processes them en masse with a single GPU compute call (a sketch of this batching idea follows below). Many GPU SDKs also have nice features that allow data transfer to and from GPU memory while compute operations run simultaneously. All of this is in hope of achieving faster overall throughput. After all, AMD uses the term “stream processors” as one of its sub-brands, which conjures the picture of data constantly streaming in and out of the GPU (hopefully “adding value” to the data, a phrase that is “taught in business schools,” while doing it).
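
Here is one untested sketch of that batching idea, built on the same RadixSortKernel interface shown above, with the optional kernel parameters left at whatever defaults the library provides (our assumption). Several arrays share one command encoder and one queue submission, amortizing the submit overhead:

// Untested sketch: sort several arrays with a single queue submission.
// Buffers and kernels are still created per array, but all dispatches and
// read-back copies share one command encoder and one submit call.
async function runGPUSortBatch(device, arrays) {
  const jobs = arrays.map((arr) => {
    const [keysBuffer, keysBufferCPU] = create_buffers(device, new Float32Array(arr))
    const kernel = new RadixSortKernel({
      device,
      keys: keysBuffer,
      count: arr.length,
      bit_count: 32,
      workgroup_size: { x: 16, y: 16 },
    })
    return { keysBuffer, keysBufferCPU, kernel, byteLength: arr.length * 4 }
  })

  const encoder = device.createCommandEncoder()
  const pass = encoder.beginComputePass()
  jobs.forEach((job) => job.kernel.dispatch(pass))
  pass.end()
  jobs.forEach((job) =>
    encoder.copyBufferToBuffer(job.keysBuffer, 0, job.keysBufferCPU, 0, job.byteLength)
  )
  device.queue.submit([encoder.finish()])
  await device.queue.onSubmittedWorkDone()

  // Map each staging buffer and copy the sorted results out.
  return Promise.all(
    jobs.map(async (job) => {
      await job.keysBufferCPU.mapAsync(GPUMapMode.READ)
      const result = new Float32Array(job.keysBufferCPU.getMappedRange().slice(0))
      job.keysBufferCPU.unmap()
      return result
    })
  )
}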

However, we have not yet done this level of performance tuning on our test code, as this was mainly an experiment to see whether we could get a WebGPU sort working at all in the context of our Monte Carlo application, and whether there were any performance gains from this GPU sort. Hence, even more speedup may be possible than what we report here.

There is also the radix sort compute kernel itself, which we are using as a black box and which might have optimization opportunities within it… though the low-hanging fruit is likely at the vanilla JS / WebGPU interface.

Not all browsers support WebGPU by default, but many have settings that can be switched on to try out this newer web API; check your browser documentation for details. Chrome seems to have WebGPU enabled by default these days, so we will report speedup numbers for the Chrome browser.
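
A natural pattern for this situation is a runtime check: use the GPU sort when WebGPU is available and fall back to the plain JS sort otherwise. A hypothetical sketch (not our exact code):

// Hypothetical fallback: use the GPU radix sort when WebGPU is available,
// otherwise use the numeric sort built into typed arrays.
async function sortPrices(device, prices) {
  if (navigator.gpu && device) {
    return runGPUSort(device, prices)
  }
  return Float32Array.from(prices).sort() // typed-array sort() is numeric, ascending
}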

These are coarse metrics. For more precise numbers, one would want to run several tests on each setup and analyze the variance among them before drawing firmer conclusions.

The test system: an 8-core Intel i9 with a fairly low-end (these days) GPU, an Intel UHD Graphics 630 with 1536 MB.

500,000 Monte Carlo iterations, 100-day forward forecast (= 100 sorts)
Single-thread CPU: 28 seconds
Multi-thread CPU: 9 seconds [via parallel.js; see prior article]
Multi-thread CPU + GPU sort: 7 seconds

1,000,000 Monte Carlo iterations, 100-day forward forecast
Single-thread CPU: 58 seconds
Multi-thread CPU: 19 seconds
Multi-thread CPU + GPU sort: 12 seconds

2,000,000 Monte Carlo iterations, 100-day forward forecast
Single-thread CPU: 136 seconds
Multi-thread CPU: 39 seconds
Multi-thread CPU + GPU sort: 18 seconds

We see about a 7.5x performance gain versus a single-thread CPU run (136 s / 18 s), and roughly a 2x gain versus multi-thread CPU (39 s / 18 s ≈ 2.2) on this particular machine for the largest problem size. The GPU sort shows even greater relative gains as the problem size increases, even with the added cost of data transfer to and from the GPU, similar to the retail advertising quip: “The more you spend, the more you save!”

Enticing new GPU switch in our web app mcr3d.diffent.com.

Why all these iterations?

The more Monte Carlo iterations, the more closely the generated price/probability/time envelope will correspond to the underlying theory, though not necessarily to reality. For the latter, model backtesting and tuning are required.
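
For context, each iteration walks a simulated price path forward one day at a time with a deterministic drift term plus a random diffusion shock; at each forecast day, the prices from all iterations are then sorted into that day’s empirical distribution. A generic, textbook-style step is shown below (not necessarily the exact model MCarloRisk3D uses):

// Generic one-day drift/diffusion step for a single simulated path.
// mu = daily drift, sigma = daily volatility, z = a standard normal draw.
// This is the standard geometric Brownian motion update, shown only as an
// illustration of what one "iteration" does.
function stepPrice(price, mu, sigma, z) {
  return price * Math.exp(mu - 0.5 * sigma * sigma + sigma * z)
}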

A nice, smooth (time, price, probability) surface generated from 2M Monte Carlo iterations on the GPU from our web application MCarloRisk3DLite, at mcr3d.diffent.com. Bulk backtest shown (withholding 100 days of reality and building a model into that withheld data region).

Observation

WebGPU could use a simpler, higher-level application SDK on top of it, much like parallel.js provides over Web Workers. For example, creating data buffers and transferring data back and forth to WebGPU are a bit low-level compared to most JavaScript code, as is setting up GPU execution pipelines. JavaScript’s linear code execution is a pipeline too, but the interpreter / JIT compiler handles that for you. This granularity of WebGPU is important for various low-level optimizations, but it is not needed for all use cases. Hence, instead of making yet another web framework, another cross-platform mobile development environment, a new programming language, or another package manager (how many are there, anyway?), interested coders are encouraged to build a nice, compact, easy-to-use SDK on top of WebGPU! Note: the pay will likely be low or zero; glory is fleeting. And yet…
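
To make the wish concrete, the kind of surface we have in mind would hide the buffer and pass plumbing behind something like this (a purely hypothetical wrapper over the runGPUSort shown above; the class and its name are invented for illustration):

// Purely hypothetical: what a friendlier layer over WebGPU sorting might
// feel like. It simply wraps device setup plus the runGPUSort function
// shown earlier in this article.
class SimpleGPUSort {
  static async create() {
    if (!navigator.gpu) throw new Error("WebGPU not available")
    const adapter = await navigator.gpu.requestAdapter()
    const sorter = new SimpleGPUSort()
    sorter.device = await adapter.requestDevice()
    return sorter
  }

  async sort(float32Data) {
    return runGPUSort(this.device, float32Data) // buffers handled internally
  }
}

// Usage:
// const sorter = await SimpleGPUSort.create()
// const sorted = await sorter.sort(prices)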

Note

We haven’t launched the GPU version of mcr3d.diffent.com to the public just yet. Watch this story for updates!

Update Jun 27 2024

The WebGPU-enabled version of MCarloRisk3D is launched! mcr3d.diffent.com
