Parallel Javascript for Fintech Monte Carlo

Parallel.js, for the win!

NTTP
9 min read · Jun 18, 2024
Apparently, an AMD Ryzen 7 has 8 cores. Would be useful for running our parallel computation web site that we describe here. “Diffused in USA”… I wonder if A.I. generated images from Stable Diffusion and similar will say that same thing, in some sort of hidden stamp pixel watermark? Photo by Olivier Collet on Unsplash

Taking a break from the slow-running Excel macros of our stochastic volatility model trials, we thought we would add some parallelism to the web version of our MCR3D stock & crypto risk analysis application to speed up the compute time for larger model setups.

This project had a dual purpose, as many of our projects do. In this case, the two purposes were: 1) make an application faster, and 2) try to find a decent browser-based library that might simplify the task of multi-threaded coding for node.js and web Javascript. We already have a certain amount of parallelism in our macOS and iOS versions of this app, and so we started digging to find a way that we could transfer some of this parallelism to the “lite” browser version.

Based on some articles I have read here recently (posh? posh!), the use of the Oxford comma before the “and 2)” is probably not necessary, since not only do we have an and, but also a 2) to indicate and separate. However, we will continue with our risk reduction use of the comma in a belt & suspenders sense. Belt & braces?

The simplest way that we have found (so far) to achieve this was the parallel.js library, which provides an easier-to-use interface on top of WebWorkers (easier versus the plain WebWorkers API), with each WebWorker handling one CPU thread of computation. It turns out that the biggest performance hit in our monte carlo random walk + aggregation process was the portion of the aggregation that does an ordinary numeric sort to get the cumulative price/probability distribution estimate at each day forward.

Numeric sorts are used for empirical Cumulative Distribution Function [CDF] construction in our app using the following logic: if we have a data set of, say, 1000 points, we sort that data and pick the 10th point from the bottom; that particular data value is approximately the 1st percentile of the CDF. If we pick the 20th point from the bottom, that data value is approximately the 2nd percentile. Et cetera. This is much easier than trying to figure out which distribution type you have out of a myriad of analytic (“formula-defined”) probability distribution types, then fitting that distribution to the data via complicated and possibly slow maximum likelihood formulations, and only then being able to estimate percentiles.
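To make that logic concrete, here is a tiny illustrative helper (not code from our app; the names are invented) that pulls an approximate percentile out of a sample the same way:

function empiricalPercentile(values, p) {
  // numeric sort (the default JS sort is lexicographic, hence the comparator)
  const sorted = Array.from(values).sort((a, b) => a - b)
  // index p percent of the way up from the bottom of the sorted sample
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))
  return sorted[index]
}

// With 1000 data points, empiricalPercentile(data, 1) lands roughly on the
// 10th-from-bottom value, i.e. an estimate of the 1st percentile of the CDF.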

The performance hit from the sort was measured by running the code through the browser performance tools and putting some millisecond-resolution time checks in the code. So for example, if we set up an MC study with 1 million points and 100 days forward, we sort 1M (different) values 100 times: 100 sorts total. These sorts are independent from one another algorithmically and so can be parallelized with some slight adjustments to the code.
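The time checks themselves were nothing fancy; roughly this pattern (illustrative only; pricesForDayN is a stand-in name):

const t0 = performance.now() // or Date.now() for plain millisecond resolution
pricesForDayN.sort((a, b) => a - b) // in-place numeric sort of ~1M values
console.log("one sort took", (performance.now() - t0).toFixed(1), "ms")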

We first tried the .spawn( ) method of parallel.js to spawn off a thread that does an individual sort, since it seemed to be the quickest to implement in our code.

Snapshot from parallel.js.org showing the .spawn function usage. The first (anonymous) function of the spawn is run on another thread, and the function of the .then is run back on the main thread. Wow, how much simpler than WebWorker setups! All one file! Inline code! parallel.js does the setting up for you, internally.
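For readers who don’t want to click through, the spawn pattern looks roughly like this (paraphrased, not a verbatim copy of the parallel.js.org snippet):

// The function given to .spawn() is serialized and run on a worker thread;
// the .then() callback runs back on the main thread with the result.
// Note: because the spawned function is stringified, it cannot rely on
// variables captured from outer scope; its inputs arrive via the data argument.
const p = new Parallel([3, 1, 2])
p.spawn(function (data) {
  return data.sort(function (a, b) { return a - b }) // off the main thread
}).then(function (sorted) {
  console.log(sorted) // [1, 2, 3], back on the main thread
})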

We did get some performance improvement out of that, so then we tried the slightly more involved .map( ) method (a standard computer sciencey “map” operation run in parallel, where each element of an array is mapped to the same indexed element of an output array).

Snapshot from parallel.js.org showing the .map function usage. For some reason, leet coders and “technical interviewers” seem obsessed with Fibonacci numbers. That old Golden Ratio calling their names, like the Lorelei of the Rhine. Watch out for the rocks, sailors! The Greeks I guess called them Sirens, but Tomato / Tomahto. Or I guess Tomate / ντομάτα.

For this, we queued up the arrays to sort into a containing array and sent that containing array to the .map( ) function of parallel.js, getting an “array of sorted arrays” back and processing those sorted arrays as we do in the serial case.

Annotated bit of our code that does the parallel processing of the many sorts needed
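A stripped-down sketch of the same pattern (hypothetical variable names, not the exact code in the snapshot above):

// Queue the per-day arrays to sort into one containing array, send that to
// .map(), and get an "array of sorted arrays" back in the same order.
const arraysToSort = [pricesDay1, pricesDay2, pricesDay3] // stand-in day arrays

new Parallel(arraysToSort)
  .map(function (oneDay) {
    return oneDay.sort(function (a, b) { return a - b }) // runs on a worker
  })
  .then(function (sortedArrays) {
    // back on the main thread: process each sorted array as in the serial case
    sortedArrays.forEach(function (sortedDay) { /* aggregate into the CDF estimate */ })
  })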

This seemed to give more performance improvement versus the individual thread spawns, but sometimes hit out-of-memory issues with larger problem setups when using the Chrome browser. Reaching the memory limits of web workers, perhaps? To attempt to partly alleviate this, and as part of an experiment with sending the data to the WebWorkers via an ArrayBuffer (where data is transferred, i.e. moved, rather than copied serially to the worker thread), we packed our input array, which was an ordinary Javascript array of numbers, into a Float32Array before sending it to the worker thread. A Float32Array is similar, or maybe even identical internally, to a C++ float array: ordinary single-precision 4-byte values rather than the default double-precision 8-byte floating point values that JS uses.

FYI: The version of parallel.js that we are using doesn’t seem to support ArrayBuffer transfer of data to WebWorkers directly. Maybe one of the many forks of it out there does? If you know, let us all know in the comments!
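For reference, the packing step is just a typed-array construction, and with the raw WebWorker API (not through parallel.js) the buffer transfer is a one-liner. A sketch, assuming ordinaryNumberArray and worker already exist:

// Pack 64-bit JS numbers into 32-bit floats: half the memory, and half the
// bytes to copy or serialize when handing data to a worker.
const packed = new Float32Array(ordinaryNumberArray)

// Raw WebWorker transfer: the underlying buffer is moved, not copied.
worker.postMessage(packed, [packed.buffer])
// After this call, packed is detached on the main thread; the worker owns it.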

This switch to Float32Arrays cuts memory use in half for the worker threads, and also — important if serialization (to strings) or data copy operations occur when sending data to worker threads — cuts the data transfer size in half. Our experimentation with ArrayBuffer sends of large arrays to the worker threads was inconclusive. Did they not give better performance because they actually weren’t any faster, or did we set it up wrong? Not sure yet. ArrayBuffer transfers of data to worker threads purportedly have some runtime advantage when using WebWorkers (see above links), so this may be worth another look later. Or perhaps, for our 2M*(4 bytes per float) test data sizes, the speed difference from using ArrayBuffer data transfer was negligible.

Another item on our to-do list for this project is to try the SharedArrayBuffer mechanism to allow bi-directional, low-runtime-cost data sharing between the main thread and the worker threads. This would be ideal, since the standard Javascript array sort is “in place”, in the same array that you send to the sort routine. Or rather, the array that you invoke the sort routine from, if you like to think object-oriented. However, SharedArrayBuffer requires some additional setup to address browser security concerns, so we also leave this as a future enhancement. It does not seem that the main github repository of the parallel.js library supports SharedArrayBuffer transfer at the moment, but please correct us if we are wrong.
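For the record, the security setup in question is browser cross-origin isolation: SharedArrayBuffer is only exposed when the server sends the headers noted in the sketch below. This is a rough sketch of plain-browser usage, not something wired into parallel.js:

// Required response headers for the page:
//   Cross-Origin-Opener-Policy: same-origin
//   Cross-Origin-Embedder-Policy: require-corp
if (crossOriginIsolated) {
  // one shared backing store, viewable from the main thread and from workers
  const shared = new SharedArrayBuffer(2_000_000 * Float32Array.BYTES_PER_ELEMENT)
  const view = new Float32Array(shared)
  // e.g. view.set(packedInput) here on the main thread, then:
  // worker.postMessage(shared) // the worker builds its own view on the SAME memory
}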

For re-synching the output data after all the separate thread operations were done, we used a simple poll loop via setTimeout to trigger the final aggregation and monte carlo results processing code once all WebWorkers have sent back their individual sorted results. Crude but effective, and it could be restructured more elegantly later:

const waitUntilDone = () => {
  setTimeout(() => {
    console.log("checking this.nQsortsDone = ", this.nQsortsDone)
    // keep polling every 200 ms until every per-day sort has reported back
    if (this.nQsortsDone !== ndays) {
      waitUntilDone()
    } else {
      // all worker results are in: log the total elapsed time, then aggregate
      const totalParallelMapTime = Date.now() - this.parallelMapStartTime
      console.log("mc total parallel map time", totalParallelMapTime)
      afterQsortsDone()
    }
  }, 200)
}
waitUntilDone()
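If we do restructure it later, one tidier option is to hide the same counter check behind a Promise so the caller can simply await it (a sketch only; the helper name and arguments are placeholders):

function whenAllSortsDone(getCount, expected, intervalMs = 100) {
  return new Promise((resolve) => {
    const check = () =>
      getCount() >= expected ? resolve() : setTimeout(check, intervalMs)
    check()
  })
}

// hypothetical usage from an async context:
// await whenAllSortsDone(() => this.nQsortsDone, ndays)
// afterQsortsDone()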

Adding a UI switch to control this feature put the little spiral of icing on the Hostess Cupcake of this project —

From Wikipedia: https://en.wikipedia.org/wiki/File:Hostess-Cupcake-Whole.jpg

— and with that, we could quickly do A/B timing tests on various browsers and runtime counts… also checking that output results were reasonable.

New multi CPU switch on the right. Snapshot from the Monte Carlo tab of mcr3d.diffent.com

We set the max threads available simultaneously (via the parallel.js API) to the max number of threads that the browser allows via:

window.navigator.hardwareConcurrency

minus 1, to give the main thread some breathing room to manage all the sub-threads and spool data out to and back in from those threads. Not to mention that maybe we don’t want to lock up your computer by using all of its CPU cores. On our test system, the CPU core count was 8.
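In code, the worker-count setup is roughly this (a sketch; dataToProcess is a stand-in, and maxWorkers is the parallel.js option name as we understand it, so check your fork):

// Leave one core free for the main thread; fall back to a small default if
// the browser doesn't report hardwareConcurrency at all.
const nCores = window.navigator.hardwareConcurrency || 4
const nWorkers = Math.max(1, nCores - 1) // 7 on our 8-core test system

const p = new Parallel(dataToProcess, { maxWorkers: nWorkers })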

So here we present some speed-up numbers: not great, not terrible. 8-core Intel i9. There is overhead to the WebWorkers (data transfer and other), as is normally the case with CPU threading mechanisms, and we haven’t fully parallelized the algorithm we use for the forecast envelope generation; we just parallelized the biggest performance hit, the sorting. In our native code versions of this app for iOS and macOS, we used the Grand Central Dispatch (GCD) of those runtime environments to parallelize a larger chunk of the code surrounding the sort routine. This is easier to do with GCD than with parallel.js, because one merely sends a code block — or “closure” I guess is the newer, more hip term — to the dispatcher:

“Blocks in Objective-C have the ability to capture values from their enclosing scope, and this makes them similar to closures which are really lambdas that capture parameters from outside scopes such as in Swift.”

Above from: https://www.steveclarkapps.com/using-blocks-obj-c

Block, closure, lambda: Whew. Where the heck does “lambda” come from anyway? Why not alpha (alpha = a = anonymous function)? I think I remember lambda from some A.I. course in LISP maybe, back before A.I. was really A.I. They may be saying this 10 years from now too: “Well, back in the Twenties, they had this strange code that they called A.I. that cost a fortune to run, and all it did was compute probabilities, and everybody believed what it said… can you believe that? Man, things were so wacky back then!” [The word “wacky” is likely to make a comeback in the Thirties, this author predicts. It may take on an alternate spelling such as zwacky (silent zed).]

This block mechanism automatically captures needed variables and their data from earlier-executed code in enclosing scopes (as the above quote mentions), so that the parallelizing coder doesn’t need to take an inventory of needed variables and send them manually to the sub-threads. This results in amazing (coding) simplicity of parallelization with GCD. There are of course non-simple issues with parallel code, but at least one step of coding complexity is removed from parallelization efforts.

Intel 8 core i9 2.4GHz Timing Results

2M points, 100 days forward (involving 100 sorts of 2 million points each)

Chrome serial baseline: 135 seconds

Chrome parallel: 42 seconds (3.2x speedup)

Safari serial baseline: 92 seconds

Safari parallel: 52 seconds (1.8x speedup)

The comparable task on the macOS native code MCarloRisk3D app (running in parallel mode) took about 10 seconds.

Regarding our Excel macro from our prior articles… we are not even going to bother testing it at this level. It will be very slow doing a 2 million path monte carlo.

Our new parallel implementation can easily run a study for 2M points in a browser. Note the smooth empirical probability distribution during this bulk backtest of a forward-looking, price/probability-envelope-based risk model for BTC (Bitcoin). As an added note, this is an untuned model, but it seems to have captured the lower-envelope BTC behavior pretty well within the last 100 days. The blue curve of reality here just blips past the magenta 5th percentile risk trace and does not violate the 1% red curve on the bottom of the envelope at all. The reality blue curve does seem to blip past the 5% estimate a bit more often than 5% of the time, so this model may have underestimated risk a little and could maybe use some tuning. But not much tuning.
Baseline 5000-trial monte carlo run for the same BTC backtest. Note the coarser probability density function (PDF)… though not too bad when you compare the curves on the yellow plot at the top (this snapshot versus the prior one), which is just the cumulative form of the bell-shaped curve here. The cumulative form is the integration (over 𝜕(price), not 𝜕(time)) of the bell curve shown here. Since this is a discrete data situation, it is more of a sum than a true continuous integration. We know that integration is a low pass filter, and so a lot of the jittery noise gets filtered out.

To do list

For the exhaustive backtest under the Validate tab of this web app, we had to shut off this parallel run mode (which is where it is most needed, <ahem, cough>) until we can figure out how to get the multiple runs of parallelism sync’d up properly so they don’t overload the system and try to run something like 100*100 sorts of large arrays all at once… because: “It would be bad,” as one of the Ghostbusters once said.

Further reading

For more information on the types of analysis that this lite web app and related native apps on various app stores can do, please see our earlier articles, starting with this one:

Check out our follow-on article, which describes using WebGPU to speed up this algorithm:

https://medium.com/@nttp/webgpu-for-fintech-monte-carlo-31049de186cb
