I feel the need, the need for speed
Menyhert describes the new Aladdin Economic Scenario Simulator and the High-Performance Computing (HPC) techniques his team is using to build an interactive framework that lets users run tens of thousands of portfolio simulations in seconds.
By: Menyhert Hegedűs, Director on the Portfolio Simulation team in BlackRock’s Analytics & Modeling team
Analyzing the value and risk of financial instruments is a key component of finance. To do this well we need models that are often extremely complex and, therefore, need a lot of compute power. Analyzing a 10Y structured note may require evaluating billions of random numbers and millions of cashflows before one gets to a single price in a single economic scenario.
The good news is that this can now often be solved in under a second with proper engineering using High Performance Computing (HPC) techniques.
But you may ask the question: “OK, this sounds cool, but what is the business benefit?”
- Better models. Model research can be faster as model iterations and testing takes less time, which translates into better designed and tested models that ultimately save money for the firm. I remember working on my diploma thesis focused on a complex graph algorithm that took a whole day to run. The lengthy runtime made it extremely difficult to iterate. Luckily I was able to finish on time, but the experience stayed with me.
- Interactive apps. If you are a portfolio manager thinking about whether to do a certain complex trade under changing market conditions, you cannot wait for hours to understand its value. Having the ability to produce near real-time analytics for complex products opens a completely new set of client opportunities that can ultimately make money for the firm. The new Aladdin Economic Scenario Simulator is an example of this — by bringing the techniques described in this post to bear, we were able to build an interactive framework for users to run tens of thousands of portfolio simulations in seconds.
- The cost. If your algorithm requires an hour to finish and you have to run it a million times a day, it is going to cost you a fortune. Monte-Carlo simulations for pricing structured products are a good example. But imagine you could make it 10x faster while only marginally raising the hardware cost. It is usually possible to optimize code to run 10x faster on an Azure VM which costs only twice as much as a regular VM. You can also realize savings if your software is cross-platform. For example, Windows machines may cost less than RHEL machines on Azure. And our code is cross-platform…
- Intellectual stimulation. Sure, this is not a direct business benefit, but happier and smarter engineers make better products. It is a fun journey from the engineering point of view. I promise — you will learn a lot. But there is no free lunch here: to make it work you must really get down into the gory, low-level details of how the computer works.
Use the right programming language
We had to pick a language which fits our requirements to code these heavy models. There are many programming languages, each with its strengths and weaknesses. What’s good for a web app is not necessarily good for a backend server.
Having taken a hard look at the various options, we decided to go with a combination of C/C++ and Lua for the following reasons:
- C++ offers direct access to the hardware
- C++ is high-level enough (object-oriented)
- C++ is fast. Of course, assembly could be even faster, but then we would never finish coding anything on time and maintenance would be a nightmare. Nevertheless, we are keeping our minds open to coding some hot spots in assembly in the future, if really really needed.
- Lua is a very simple language and has LuaJIT, which is considerably faster than any other JIT compiler even though it was written by just one person (partly in assembly!) as a side project. So Lua could theoretically be a good choice on its own, but the language has only a few features compared with C++.
Given all these, we decided to use Lua as a scripting engine for greater flexibility on top of core functionality coded in C++. For example, you can express pricing logic or differential equations in Lua to transform the results produced by our C++ engine. I will delve into the details of this in an upcoming blog post.
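In the meantime, here is a minimal, hypothetical sketch of what driving Lua from C++ looks like through the Lua C API; the payoff function and its formula are made up for illustration, and error handling is omitted.
Calling a Lua payoff from C++ (sketch)
// Links against Lua or LuaJIT; lua.hpp wraps the C headers for C++.
#include <lua.hpp>
#include <cstdio>
int main()
{
    lua_State* L = luaL_newstate();
    luaL_openlibs(L);
    // Hypothetical user script: transform a simulated price into a payoff.
    luaL_dostring(L, "function payoff(s) return math.max(s - 100.0, 0.0) end");
    lua_getglobal(L, "payoff");    // push the Lua function
    lua_pushnumber(L, 123.45);     // push the argument (e.g. a simulated price)
    lua_pcall(L, 1, 1, 0);         // call with 1 argument, expect 1 result
    double result = lua_tonumber(L, -1);
    lua_pop(L, 1);
    std::printf("payoff = %f\n", result);
    lua_close(L);
    return 0;
}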
Also-rans:
- Node.js has the V8 JavaScript engine, which claims to be fast for numerical computations thanks to its JIT compiler, but C++ is still faster, and using JS objects and modern ECMAScript features is considerably slower. Even though I am a big fan of Node, it’s not yet ready for the high-performance arena.
- Python is generally slow. Numba can JIT-compile certain functions and supports GPUs as well, but it is still just a library. We may use NumPy/TensorFlow or other libraries that are primarily written in C/C++ and exposed to Python, but these are not flexible enough.
- Java has a JIT compiler as well, but autogenerated code will never be as fast as hand-tweaked C functions, and Java was fundamentally designed for ease of development rather than raw performance.
There are relative newcomer languages like Julia or Rust which are claimed to be fast, but they have relatively small developer communities and tool sets. We will keep monitoring their evolution to reassess as necessary.
Thus, I believe the right paradigm as of 2022 to develop computation-heavy mathematical models — not just in finance, but in general — is that you code the hotspots in C/C++ and CUDA and use C++ as a high-level language to control your workflow. Then you build a nice, user-friendly C++ API which can be used to generate language bindings for Python, Java, Node, C#, or anything you want/like. Incidentally, this is the approach taken by most well-known machine learning frameworks. TensorFlow is written in C++ and has a Python API to build neural nets easily. PyTorch is slightly different, but basic mathematical operations like convolutions are written in C/C++ (it comes with Intel MKL-DNN for CPU and cuDNN for GPU by default) with those operations wired together via Python to form more complex operations like Self-Attention/LSTM/etc.
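As a purely hypothetical illustration of that layering, the C++ core can be exposed through a small, flat C-style surface that binding generators (pybind11, SWIG, JNI, N-API, etc.) can wrap easily; the names below are made up.
Binding-friendly API surface (sketch)
// Hypothetical flat API over the C++ engine; a thin surface like this is easy
// to wrap from Python, Java, Node or C#.
extern "C" {
    typedef struct SimHandle SimHandle;   // opaque handle to the C++ engine

    SimHandle* sim_create(int num_paths, int num_steps, unsigned long long seed);
    int        sim_run(SimHandle* h, const double* initial_state, int state_size);
    int        sim_get_results(const SimHandle* h, double* out, int out_size);
    void       sim_destroy(SimHandle* h);
}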
High Performance Computing techniques
In this section I am going to describe the approaches we used in our computation engine to simulate markets including equities, rates, commodities, etc. Using these, our engine can simulate all the stocks in the S&P 500 in under a second.
Do not reinvent the wheel! Use the right libraries in the right way
There are many libraries built by hardware vendors to perform numerical computations such as matrix math, FFT, convolutions, etc. To name a few: Nvidia cuBLAS, cuFFT, cuDNN, Intel MKL, AMD Core Math Library (ACML), AMD Optimizing CPU Libraries (AOCL), etc.
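For example, rather than hand-rolling a matrix multiply, call the vendor's BLAS. A minimal sketch using the standard CBLAS interface, here assuming Intel MKL (other CBLAS-compatible libraries expose the same call through <cblas.h>):
Matrix multiply via a vendor BLAS (sketch)
// C = A * B using the CBLAS interface (row-major layout).
#include <mkl.h>
#include <vector>
void multiply(const std::vector<double>& A,   // m x k, row-major
              const std::vector<double>& B,   // k x n, row-major
              std::vector<double>& C,         // m x n, row-major
              int m, int n, int k)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A.data(), k,
                     B.data(), n,
                0.0, C.data(), n);
}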
Memory access and cache awareness are key
On many occasions, memory access is the bottleneck. The CPU can execute instructions very quickly once the data is in the registers. Simple operations like matrix transpose and matrix multiplication illustrate this well: doing the additions and multiplications is much faster than reading the data from RAM. Mathematicians optimize the number of multiplications, but engineers optimize memory access.
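A classic illustration: in a row-major matrix multiply, swapping the two inner loops turns the strided walk over B into a sequential one, and the same arithmetic typically runs several times faster purely because of better cache behavior (in practice you would call a BLAS, as discussed in the previous section).
Loop order and cache behavior (sketch)
// Both versions assume C is zero-initialized.
// Naive i-j-k order: the inner loop strides through B column by column,
// missing the cache on almost every access (row-major storage).
void matmul_ijk(const double* A, const double* B, double* C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
// i-k-j order: B and C are now read/written sequentially in the inner loop,
// so the same number of multiplications runs much faster on typical hardware.
void matmul_ikj(const double* A, const double* B, double* C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
        {
            const double aik = A[i*n + k];
            for (int j = 0; j < n; ++j)
                C[i*n + j] += aik * B[k*n + j];
        }
}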
Do you really need double precision? Use the right precision for what you need
Most common programming languages support both single- and double-precision floating point numbers. As I wrote above, memory access can be a bottleneck: if you can solve your problem with single-precision numbers, do it! If you can solve it with half precision or with integers, that is even better, because you have to read fewer KBs from RAM. Double precision takes 64 bits per number, whereas single precision and integers take 32 bits on most compilers, and half precision requires just 16 bits. Using integers has an additional benefit over floating point numbers: they are simply much faster to handle.
If you have a model, try to convert it to use at least single precision by changing the computation space. For example, if you need to do a lot of multiplications of very small numbers, you can convert those to logarithm space, where multiplication turns into addition. At the end of the computation, you can map the model output back to the original space by exponentiating. Using C++ templates, it is easy to make most of your code indifferent to the particular level of precision you are using, selecting the one you need at compile time.
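A minimal sketch of both ideas: the precision is a template parameter chosen at compile time, and a product of many small numbers is accumulated in log space to avoid underflow.
Precision as a template parameter (sketch)
#include <cmath>
#include <vector>
// The same code runs in float or double; pick the precision at compile time.
template <typename Real>
Real product_via_logs(const std::vector<Real>& xs)
{
    Real log_sum = Real(0);
    for (Real x : xs)
        log_sum += std::log(x);   // multiplication becomes addition in log space
    return std::exp(log_sum);     // map back to the original space at the end
}
// Usage:
//   float  p32 = product_via_logs<float>(probabilities);
//   double p64 = product_via_logs<double>(probabilities);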
Vectorize (Tensorize)
When you need to perform the same operation on multiple data elements (SIMD), vectorization is your friend. Imagine you want to add two long vectors. You can write a for loop that works element by element, or you can apply vectorization and work in chunks. Modern CPUs have registers that are 512 bits wide and can operate on 8 doubles or 16 floats at a time. Even if you buy a basic notebook, your CPU will support 256-bit registers at least. This means that you can load 4 doubles from vector A and 4 doubles from vector B into CPU registers and perform all four additions with a single instruction.
Different vendors have different instruction sets: Intel and AMD support AVX2/AVX512, ARM has Neon, and IBM has AltiVec in its PowerPC processors.
The good thing is that complex math functions like exp or log can have vectorized versions that operate on multiple doubles at the same time.
Complications begin when your code behaves differently on different data. In that case you cannot apply vectorization directly. You can still do comparisons, min/max, and similar operations, but if your code has many divergent branches, vectorization will not help.
Another problem arises when you have to randomly access the data. In this case you can use special instructions like gather, but you have to be prepared for performance to be worse than in the case of contiguous memory access.
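Two quick illustrations of these points with AVX2 intrinsics (assuming <immintrin.h> and AVX2 support): a data-dependent branch replaced by a compare-and-blend on masks, and a non-contiguous access expressed with a gather instruction.
Masked selection and gather (sketch)
#include <immintrin.h>
// Branch-free "take a where a < b, else take b" on 4 doubles at once:
// the comparison produces a per-lane mask and blendv selects per lane.
__m256d select_smaller(__m256d a, __m256d b)
{
    __m256d mask = _mm256_cmp_pd(a, b, _CMP_LT_OQ);
    return _mm256_blendv_pd(b, a, mask);   // lanes where the mask is set take 'a'
}
// Gather 4 doubles from non-contiguous positions given by 32-bit indices.
// Expect this to be slower than a contiguous load.
__m256d gather4(const double* base, __m128i indices)
{
    return _mm256_i32gather_pd(base, indices, 8);   // scale = sizeof(double)
}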
Now let’s see a simple code example in C++ with double precision.
The following loop can be simply vectorized by AVX2 instructions. In reality, many compilers will vectorize automatically, but let’s assume that it is not the case for the sake of this simple example.
Regular FOR loop
for (auto i = 0; i < n; ++i)
c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
Vectorized FOR loop
// Requires <immintrin.h>; assumes n is a multiple of 4 and that a, b, c are
// 32-byte aligned (otherwise use _mm256_loadu_pd/_mm256_storeu_pd).
for (auto i = 0; i < n; i += 4)
{
auto ai = _mm256_load_pd(&a[i]);    // load 4 doubles from a
auto bi = _mm256_load_pd(&b[i]);    // load 4 doubles from b
auto a2 = _mm256_mul_pd(ai, ai);    // a[i]*a[i] for all 4 lanes
auto b2 = _mm256_mul_pd(bi, bi);    // b[i]*b[i] for all 4 lanes
auto sum = _mm256_add_pd(a2, b2);
auto res = _mm256_sqrt_pd(sum);     // 4 square roots with one instruction
_mm256_store_pd(&c[i], res);        // write 4 results back
}
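Note that the vectorized loop above processes exactly 4 doubles per iteration. If n is not a multiple of 4, stop the vector loop at the last full chunk (i + 4 <= n) and finish the remaining elements with a scalar tail:
Scalar tail for the remainder
// Handle the last n % 4 elements element by element.
auto tail = n - (n % 4);
for (auto i = tail; i < n; ++i)
c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);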
Leverage multithreading
Nearly all modern CPUs have multiple cores. Even a simple CPU has at least 2 cores, but CPUs used in BLK production usually have 16–32 cores. This is good because we can run 32 different calculations at the same time. When you do heavy computations, be aware that running more threads than you have cores will kill your performance, because your threads will compete for the same resources. It is fine if your threads wait for the filesystem or a database, but they should not compete for CPU.
Luckily, at this point in time you do not have to build your own threading. There are many tools and libraries which can help you.
In C++ you can use std::thread, std::async, Intel TBB, OpenMP, etc.
Our choice is OpenMP because it is
- Natively supported by compilers
- Very easy to use
- Able to synchronize threads in a fast and convenient way via its explicit barriers
Now let’s look at the code above. It is a very good candidate to parallelize. Each thread can operate on different data, so there is no need for any kind of data protection. It is a lock-free loop. We do not have to synchronize the data access via mutexes, read-write locks, etc.
Using OpenMP, the parallelized for loop looks like this:
Regular FOR loop with OpenMP
#pragma omp parallel for
for (auto i = 0; i < n; ++i)
c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
And the vectorized version with OpenMP looks like this:
Vectorized FOR loop with OpenMP
#pragma omp parallel for
for (auto i = 0; i < n; i += 4)
{
auto ai = _mm256_load_pd(&a[i]);
auto bi = _mm256_load_pd(&b[i]);
auto a2 = _mm256_mul_pd(ai, ai);
auto b2 = _mm256_mul_pd(bi, bi);
auto sum = _mm256_add_pd(a2, b2);
auto res = _mm256_sqrt_pd(sum);
_mm256_store_pd(&c[i], res);
}
When you run a lot of these for loops, you have to know that if N is small, the overhead of threading will dominate your computation time, and false sharing can occur, which means that the CPU cores will keep synchronizing their caches.
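False sharing typically appears when different threads repeatedly write to variables that happen to share a cache line, such as per-thread accumulators packed next to each other in an array. Here is a minimal sketch of one common fix, padding each accumulator to its own cache line (for this particular sum, an OpenMP reduction clause would avoid the issue entirely):
Avoiding false sharing (sketch)
#include <omp.h>
#include <vector>
// Without padding, sums[t] for different threads would share cache lines and
// every write would force the cores to re-synchronize those lines.
struct alignas(64) PaddedDouble { double value = 0.0; };   // 64 bytes = typical cache line
double parallel_sum(const double* x, int n, int num_threads)
{
    std::vector<PaddedDouble> sums(num_threads);
    #pragma omp parallel num_threads(num_threads)
    {
        const int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; ++i)
            sums[t].value += x[i];    // each thread writes only its own cache line
    }
    double total = 0.0;
    for (const auto& s : sums) total += s.value;
    return total;
}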
Performance comparison
I ran a simple comparison among the implementations on a RHEL7 machine with 16 cores and AVX2 support, evaluating the Pythagorean formula 1 billion times.
Regular FOR loop: 8.7 sec
Regular FOR loop with OpenMP: 0.6 sec
Vectorized FOR loop: 2.9 sec
Vectorized FOR loop with OpenMP: 0.47 sec
From the measurements it is very clear that both parallelization and vectorization give us a huge bump in performance: the loop that ran for almost 9 seconds completes in under half a second after applying all these techniques.
Use GPUs
If your algorithm is heavily parallelizable, you can really leverage GPUs. CPUs were originally designed to do fast sequential calculations. GPUs, on the other hand, were designed to do large amounts of parallel calculations, but at a generally lower clock rate. If you compare a simple Intel i7 CPU with an Nvidia P100 GPU, you will see that i7 CPUs usually have 4–8 cores operating at 3–5 GHz, whereas an Nvidia P100 GPU operates at only ~1.4 GHz but has 3,584 FP32 cores.
Even though both GPUs and CPUs share many concepts such as registers, caches, cores and threads, writing code for GPUs requires a different mindset. GPU programming takes some time to learn even for an experienced C/C++ coder. To take full advantage of GPUs, a deep understanding of the underlying hardware is needed. However, once you adopt the mindset and master the techniques, the payoff can be very significant.
When you do N different computations, you get a grid of size N. There can be multiple dimensions to the grid depending on the dimensionality of the data you are operating on. Your grid is further subdivided into blocks and each Streaming Multiprocessor works on one block at a time, using M threads per block. The dimensions of the grid as well as the number of blocks and threads must be specified by the developer, and a given computation could require different setups on different GPU architectures.
Realistically, there are two options to code for Nvidia GPUs: CUDA and OpenCL. OpenCL is platform-independent and the same code can run on CPUs as well as GPUs. However, we chose CUDA because this is the official “language” of Nvidia GPUs and even though it is a vendor-specific solution, Nvidia dominates the market for now. Once we need to support AMD GPUs we will need to revisit this choice.
Working with CUDA offers some important benefits:
- There are tons of libraries officially supported by Nvidia to work with.
- You do not have to write your own kernels to compute common math stuff.
The code that I used to demonstrate the power of vectorization and OpenMP would look like this in CUDA:
CUDA loop
__global__ void calc(
const double* a,
const double* b,
double* c,
int n)
{
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i<n;
i += blockDim.x * gridDim.x)
c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
}
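For completeness, here is a hedged sketch of how a kernel like this might be launched from the host, assuming host arrays a, b, c of length n. The 256-thread block and the cap on the number of blocks are illustrative choices, not the production configuration; thanks to the grid-stride loop, the kernel covers any n regardless of the grid size.
Host-side launch (sketch)
// Error checking omitted for brevity.
double *d_a, *d_b, *d_c;
cudaMalloc(&d_a, n * sizeof(double));
cudaMalloc(&d_b, n * sizeof(double));
cudaMalloc(&d_c, n * sizeof(double));
cudaMemcpy(d_a, a, n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, n * sizeof(double), cudaMemcpyHostToDevice);
int threadsPerBlock = 256;                               // illustrative choice
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
if (blocks > 4096) blocks = 4096;                        // grid-stride loop covers the rest
calc<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
cudaMemcpy(c, d_c, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);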
This CUDA version runs in ~0.062 seconds on an Nvidia P100 GPU when calculating the formula 1 billion times. Compared to the fastest CPU version we’ve seen so far, which takes 0.47 seconds, this is a ~7.5x speedup.
Use more GPUs!
Sometimes “Less is more”, but in this case “More is more”!
You can plug two or more GPUs into the same machine and delegate different computations to different GPUs, or just split one big computation into chunks and assign those to different GPUs.
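A hedged sketch of the second strategy using the CUDA runtime API: query the available devices, give each one a contiguous chunk of the work, and wait for all of them to finish. The kernel launch is only indicated as a comment, since it depends on the model.
Splitting work across GPUs (sketch)
// Simplified: no streams, no error checking.
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
int chunk = (numPaths + deviceCount - 1) / deviceCount;
for (int dev = 0; dev < deviceCount; ++dev)
{
    cudaSetDevice(dev);
    int first = dev * chunk;
    int count = numPaths - first;
    if (count > chunk) count = chunk;
    // launch the simulation kernel for paths [first, first + count) on this GPU, e.g.
    // simulate_paths<<<blocks, threads>>>(d_state[dev], first, count);
}
for (int dev = 0; dev < deviceCount; ++dev)
{
    cudaSetDevice(dev);
    cudaDeviceSynchronize();   // wait for every device to finish its chunk
}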
Once done with points 1 through 7, distribute
But you must be careful. If you distribute a bad algorithm across ten machines simply because you were too lazy to apply points 1 through 7, or to move the “heavy” parts from Python/Java/etc. to C, you will just waste money on unnecessary hardware that would have been better spent on better algorithms.
Putting it all together — the Aladdin way
The Aladdin Economic Scenario Simulator was designed to project future realizations of various economic variables including macro factors (GDP, etc.), rates, equities, and so on. There are two general approaches to simulation supported within the framework: conditional and unconditional. We will not delve into the particular mathematical distinction between the two here, but broadly speaking the unconditional simulation is a classic Monte-Carlo simulation, whereas the conditional one uses a more complex technique based on Kalman filtering.
In the case of the conditional simulation, 90% of the calculation relies on regular matrix math, for which standard libraries are available. The remaining 10% consists of formulas for which custom GPU and CPU kernels were written in CUDA and in C with AVX2/AVX512.
The unconditional simulation is more interesting from the engineering point of view, since 90% of the code consists of custom functions and only 10% of the calculations could be handled by libraries like Intel MKL or cuBLAS/cuRAND.
A good example of the kind of thing that needs to be done in unconditional Monte Carlo is simulating the Shifted Lognormal Two Factor Model (SLN2f), which is the production model for most of the interest rates (IR) products in Aladdin. The following SDE describes the model dynamics:
As you see, these equations are well-suited to vectorization, since they consist of simple math functions and involve no branching. On the other hand, Monte Carlo simulation is a very good candidate for parallelization, because the individual paths are completely independent of each other (such problems are known as embarrassingly parallel).
On the CPU:
- If the simulation has N paths and there are X cores in the CPU, the calculation can be split into X chunks and each core can compute N/X paths (see the sketch after this list)
- Inside each chunk, Y paths can be computed at once via vectorization, where Y is (bitness of the instruction set) / (bitsize of double/float)
- For example, with AVX2 and floats, Y is 8, because AVX2 has 256-bit-wide registers and floats take up 32 bits
- Correlated Gaussian random numbers are generated using Intel MKL across X threads
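To make the CPU scheme concrete, here is a hedged sketch of the path loop. This is deliberately not the production SLN2f model: the dynamics are a plain lognormal (geometric Brownian motion) step, the Gaussian draws come from the C++ standard library rather than Intel MKL, and the inner time loop is left scalar where production code vectorizes it with AVX2/AVX512.
Monte Carlo path loop with OpenMP (sketch)
#include <cmath>
#include <random>
#include <vector>
#include <omp.h>
// Simulate numPaths independent paths of dS = mu*S*dt + sigma*S*dW (illustrative
// dynamics only). Paths are embarrassingly parallel, so OpenMP splits them across
// cores; each thread owns its own RNG so no state is shared.
std::vector<double> simulate(int numPaths, int numSteps, double s0,
                             double mu, double sigma, double dt,
                             unsigned long long seed)
{
    std::vector<double> finalValues(numPaths);
    #pragma omp parallel
    {
        std::mt19937_64 rng(seed + omp_get_thread_num());   // one RNG per thread
        std::normal_distribution<double> gauss(0.0, 1.0);
        #pragma omp for
        for (int p = 0; p < numPaths; ++p)
        {
            double s = s0;
            for (int t = 0; t < numSteps; ++t)
            {
                const double z = gauss(rng);
                s *= std::exp((mu - 0.5 * sigma * sigma) * dt
                              + sigma * std::sqrt(dt) * z);
            }
            finalValues[p] = s;
        }
    }
    return finalValues;
}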
On the GPU:
- Each thread calculates one path, and the number of threads per block is capped at the smaller of:
- (max size of shared memory) / (max number of SDE states * sizeof(T)), where T is double or float
- a predefined maximum based on the number of CUDA cores per Streaming Multiprocessor
- Each thread block has its own shared memory, and the SDE states are stored there to avoid round trips to global memory
- Gaussian random numbers are generated using cuRAND and correlation is applied via cuBLAS as a matrix multiplication (sketched below)
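To illustrate that last step, here is a hedged sketch of generating and correlating the Gaussian draws on the GPU. The device pointers (d_z, d_chol, d_corr), the sizes (numFactors, numDraws), the seed and the column-major layout are illustrative assumptions, and the Cholesky factor of the covariance matrix is assumed to be computed elsewhere.
cuRAND + cuBLAS correlation (sketch)
// Draw numFactors x numDraws standard normals, then impose correlation via
// d_corr = d_chol * d_z, where d_chol is the lower-triangular Cholesky factor.
curandGenerator_t gen;
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
curandSetPseudoRandomGeneratorSeed(gen, seed);
// Note: cuRAND expects an even count for normal generation.
curandGenerateNormalDouble(gen, d_z, (size_t)numFactors * numDraws, 0.0, 1.0);
cublasHandle_t handle;
cublasCreate(&handle);
const double one = 1.0, zero = 0.0;
// Column-major GEMM: d_corr (numFactors x numDraws) = d_chol (numFactors x numFactors) * d_z
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            numFactors, numDraws, numFactors,
            &one, d_chol, numFactors,
                  d_z,    numFactors,
            &zero, d_corr, numFactors);
cublasDestroy(handle);
curandDestroyGenerator(gen);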
Let’s not forget MultiGPU
We have two strategies:
- Delegate different simulations to different GPUs
- Split one big simulation into smaller chunks across paths, simulate these chunks on separate GPUs, then aggregate the results on the primary GPU or on the CPU
In production we decided to go with option 1, namely delegate different simulations to different GPUs.
For Monte-Carlo simulation, distribution is quite simple, because different machines can run different simulations using different random seeds. So we got this feature for free by exposing the random seed in our API.
The overall performance improvement versus current production, once points 1 through 7 were applied, was a whopping 26,600%. No, that is not a typo! In fact, even a simple four-core VDI machine with an AVX512 instruction set can outperform current production systems by a factor of 14, as you see in the plot below:
Optimization is an iterative process which proceeds by trial and error. It usually starts with a naive version of the algorithm, which can then be used to validate outputs and unit-test the optimized version. Next, memory access patterns are optimized to get the most out of CPU/GPU caches and to minimize the global memory footprint on the GPU. CPU vectorization is then applied to the hot spots, and finally parallel sections are identified and OpenMP is added. On the GPU, we leverage the shared and constant memory wherever it is possible and makes sense. We then use the profiler, or just measure the running times, and repeat the process. Sometimes assembly code is generated from the C++/CUDA to validate certain things like the location of local variables (register vs. memory). While plenty of people believe that simply parallelizing the code makes you “good”, as you can see there are many other things you can do besides.
Overall, this effort has made building interactive, real-time simulation-based applications possible. Creating APIs and UIs for these financial models is not a problem any more, because the backend can respond in under a second. As a matter of fact, database communication (loading exposures, covariance matrices) has now become the bottleneck.
Now let’s compare the performance in terms of hardware costs (speedup / hardware price in USD, based on Azure Pay-As-You-Go VM pricing):
As you see, optimizing performance also results in a significant reduction in costs. Even accounting for the initial investment associated with the development effort, in the long run your product will be not only performant, but cost-effective as well.
In order to take it to the next level, we plan to apply these approaches to more complex mathematical models including pricing options/MBS, building lattices, solving PDEs, and more. The potential impact could be dramatic!
The bottom line is: using the techniques described in this post, you can get your code to run orders of magnitude faster. While accomplishing this is no mean feat, the payoff is well worth it.