I feel the need, the need for speed

Menyhert describes the new Aladdin Economic Scenario Simulator and the various High-Performance Computing (HPC) techniques that they are using to build an interactive framework for users to run tens of thousands of portfolio simulations in seconds.

  • Interactive apps. If you are a portfolio manager thinking about whether to do a certain complex trade under changing market conditions, you cannot wait for hours to understand its value. Having the ability to produce near real-time analytics for complex products opens a completely new set of client opportunities that can ultimately make money for the firm. The new Aladdin Economic Scenario Simulator is an example of this — by bringing the techniques described in this post to bear, we were able to build an interactive framework for users to run tens of thousands of portfolio simulations in seconds.
  • The cost. If your algorithm requires an hour to finish and you have to run it a million times a day, it is going to cost you a fortune. Monte-Carlo simulations for pricing structured products are a good example. But imagine you could make it 10x faster while only marginally raising the hardware cost. It is usually possible to optimize code to run 10x faster on an Azure VM which costs only twice as much as a regular VM. You can also realize savings if your software is cross-platform. For example, Windows machines may cost less than RHEL machines on Azure. And our code is cross-platform…
  • Intellectual stimulation. Sure, this is not a direct business benefit, but happier and smarter engineers make better products. It is a fun journey from the engineering point of view. I promise — you will learn a lot. But there is no free lunch here: to make it work you must really get down in the gory low-level details of how the computer works.

Use the right programming language

We had to pick a language which fits our requirements to code these heavy models. There are many programming languages, each with its strengths and weaknesses. What’s good for a web app is not necessarily good for a backend server.

  • C++ is high-level enough (object-oriented)
  • C++ is fast. Of course, assembly could be even faster, but then we would never finish coding anything on time and maintenance would be a nightmare. Nevertheless, we are keeping our minds open to coding some hot spots in assembly in the future, if really really needed.
  • Lua is a very simple language and has LuaJIT, which is considerably faster than any other JIT compiler even though it was written largely by one person (partly in assembly!) as a side project. So Lua could theoretically be a good choice, but the language has only a few features compared with C++.
  • Python is generally slow. It has Numba to JIT-compile certain functions, with GPU support as well, but that is still just a library. We could use NumPy/TensorFlow or other libraries that are primarily written in C/C++ and exposed to Python, but these are not flexible enough for our needs.
  • Java has a JIT compiler as well, but auto-generated code will never be as fast as hand-tweaked C functions, and fundamentally Java was designed for ease of development rather than for performance.

High Performance Computing techniques

In this section I am going to describe the approaches we used in our computation engine to simulate markets including equities, rates, commodities, etc. Using these, our engine can simulate all the stocks in the S&P 500 in under a second.

Do not reinvent the wheel! Use the right libraries in the right way

There are many libraries built by hardware vendors to perform numerical computations such as matrix math, FFT, convolutions, etc. To name a few: Nvidia cuBLAS, cuFFT, cuDNN, Intel MKL, AMD Core Math Library (ACML), AMD Optimizing CPU Libraries (AOCL), etc.
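
For instance, a matrix multiplication should be a single call into the vendor's tuned GEMM rather than a hand-written triple loop. Here is a minimal sketch using Intel MKL's standard CBLAS interface (the wrapper function itself is illustrative):

#include <mkl.h>     // Intel MKL CBLAS interface
#include <vector>

// C = A * B using the vendor-tuned GEMM.
// A is m x k, B is k x n, C is m x n, all stored row-major.
void matmul_mkl(const std::vector<double>& A,
                const std::vector<double>& B,
                std::vector<double>& C,
                int m, int n, int k)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A.data(), k,
                     B.data(), n,
                0.0, C.data(), n);
}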

Memory access and cache awareness are key

On many occasions, memory access is the bottleneck: the CPU can execute instructions very quickly once the data is in its registers, but fetching that data from RAM is slow. Simple operations such as matrix transpose and matrix multiplication benefit greatly from cache-aware access patterns, because the additions and multiplications themselves are much cheaper than reading the operands from RAM. Mathematicians optimize the number of multiplications; engineers optimize memory access.
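
A minimal illustration (not from our engine): the two functions below perform exactly the same arithmetic on a row-major matrix, but the first one walks memory contiguously and is dramatically faster for large matrices because it works with the cache instead of against it.

// Sum of an n x n row-major matrix: same arithmetic, very different memory behaviour.
double sum_rows_first(const double* m, int n)   // cache-friendly: contiguous reads
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}

double sum_cols_first(const double* m, int n)   // cache-hostile: strides of n doubles
{
    double s = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}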

Do you really need double precision? Use the right precision for what you need

Most common programming languages support both single- and double-precision floating point numbers. As I wrote above, memory access can be a bottleneck, so if you can solve your problem with single-precision numbers, do it! If you can solve it with half precision or with integers, that is even better, because you have to read fewer KBs from RAM: double precision takes 64 bits per number, whereas single precision and integers take 32 bits on most platforms, and half precision requires just 16 bits. Integers have a further benefit over floating point numbers: they are simply much faster to handle. If you have a model, try to convert it to use at least single precision by changing the computation space. For example, if you need to do many multiplications of very small numbers, you can move to logarithm space, where multiplication turns into addition; at the end of the computation you map the model output back to the original space by exponentiating. Using C++ templates it is easy to make most of your code indifferent to the particular level of precision you are using, selecting the one you need at compile time.
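
As a rough sketch of both ideas (the function is illustrative, not part of the simulator): precision becomes a compile-time template parameter, and a product of many small numbers is accumulated in log space so that it stays representable even in single precision.

#include <cmath>

// Precision is a compile-time choice: instantiate with double or float
// depending on what the model actually needs.
template <typename Real>
Real product_of_small_numbers(const Real* p, int n)
{
    Real log_sum = Real(0);
    for (int i = 0; i < n; ++i)
        log_sum += std::log(p[i]);   // multiply in log space: products become sums
    return std::exp(log_sum);        // map back to the original space at the end
}

// Usage: auto x = product_of_small_numbers<float>(probs, n);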

Vectorize (Tensorize)

When you need to perform the same operations on multiple data (SIMD), vectorization is your friend. Imagine you want to add two long vectors. You can write a for loop to do that element by element, or you can apply vectorization to do it in chunks. Modern CPUs have registers up to 512 bits wide that can operate on 8 doubles or 16 floats at a time; even a basic notebook CPU supports 256-bit registers at least. This means that you can load 4 doubles from vector A and another 4 doubles from vector B into CPU registers and add them pairwise in a single instruction.

#include <cmath>        // sqrt
#include <immintrin.h>  // AVX intrinsics

// Scalar version: one element per iteration
for (auto i = 0; i < n; ++i)
    c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);

// Vectorized version: 4 doubles per iteration
// (assumes a, b and c are 32-byte aligned and n is a multiple of 4)
for (auto i = 0; i < n; i += 4)
{
    auto ai  = _mm256_load_pd(&a[i]);   // load 4 doubles from a
    auto bi  = _mm256_load_pd(&b[i]);   // load 4 doubles from b
    auto a2  = _mm256_mul_pd(ai, ai);   // a[i]^2
    auto b2  = _mm256_mul_pd(bi, bi);   // b[i]^2
    auto sum = _mm256_add_pd(a2, b2);   // a[i]^2 + b[i]^2
    auto res = _mm256_sqrt_pd(sum);     // element-wise square root
    _mm256_store_pd(&c[i], res);        // store 4 results into c
}

Leverage multithreading

Nearly all modern CPUs have multiple cores. Even a simple CPU has at least 2 cores, but the CPUs used in BLK production usually have 16–32 cores, so we can run 32 different calculations at the same time. When you do heavy computations, be aware that running more threads than you have cores will kill your performance, because your threads will compete for the same resources. It is fine if your threads are waiting on the filesystem or a database, but they should not compete for CPU.

We use OpenMP to spread the work across cores because it is:

  • Very easy to use
  • Able to synchronize threads in a fast and convenient way via its explicit barriers
// Regular loop, parallelized across cores with OpenMP
#pragma omp parallel for
for (auto i = 0; i < n; ++i)
    c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);

// Vectorized loop, also parallelized with OpenMP: each thread processes
// whole 4-double chunks (same alignment assumptions as above)
#pragma omp parallel for
for (auto i = 0; i < n; i += 4)
{
    auto ai  = _mm256_load_pd(&a[i]);
    auto bi  = _mm256_load_pd(&b[i]);
    auto a2  = _mm256_mul_pd(ai, ai);
    auto b2  = _mm256_mul_pd(bi, bi);
    auto sum = _mm256_add_pd(a2, b2);
    auto res = _mm256_sqrt_pd(sum);
    _mm256_store_pd(&c[i], res);
}
Regular FOR loop                  8.7  sec
Regular FOR loop with OpenMP      0.6  sec
Vectorized FOR loop               2.9  sec
Vectorized FOR loop with OpenMP   0.47 sec

Use GPUs

If your algorithm is heavily parallelizable, you can really leverage GPUs. CPUs were originally designed to do fast sequential calculations. GPUs, on the other hand, were designed to do large amounts of parallel calculations, but at a generally lower clock rate. Compare a simple Intel i7 CPU with an Nvidia P100 GPU: an i7 typically has 4–8 cores operating at 3–5 GHz, whereas the P100 runs at only ~1.4 GHz but has 3,584 FP32 cores.

  • You do not have to write your own kernels for common math operations (the vendor libraries above cover those), but when you do need a custom kernel it can be as simple as the grid-stride loop below:
// Grid-stride loop: each thread starts at its global index and then jumps
// ahead by the total number of threads in the grid until it runs off the array.
__global__ void calc(const double* a, const double* b, double* c, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
    }
}
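
For completeness, here is a minimal sketch of how such a kernel might be launched; the block and grid sizes are illustrative, not the values used in production.

// Allocate device buffers, copy the inputs over, launch, and copy the result back.
double *d_a, *d_b, *d_c;
cudaMalloc(&d_a, n * sizeof(double));
cudaMalloc(&d_b, n * sizeof(double));
cudaMalloc(&d_c, n * sizeof(double));
cudaMemcpy(d_a, a, n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, n * sizeof(double), cudaMemcpyHostToDevice);

const int threadsPerBlock = 256;                               // illustrative choice
const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
calc<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

cudaMemcpy(c, d_c, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);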

Use more GPUs!

Sometimes “Less is more”, but in this case “More is more”!

Once done with points 1 through 7, distribute

But you must be careful. If you distribute a bad algorithm across ten machines simply because you were too lazy to apply points 1 through 7, or to move the “heavy” parts from Python/Java/etc. to C, you will just waste money on unnecessary hardware that would have been better spent on better algorithms.

Putting it all together — the Aladdin way

The Aladdin Economic Scenario Simulator was designed to project future realizations of various economic variables including macro factors (GDP, etc.), rates, equities, and so on. There are two general approaches to simulation supported within the framework: conditional and unconditional. We will not delve into the particular mathematical distinction between the two here, but broadly speaking the unconditional simulation is a classic Monte-Carlo simulation whereas the conditional one uses a more complex technique based on Kalman filtering.

On the CPU:

  • If the simulation has N paths and there are X cores in the CPU, the calculation can be split into X chunks and each core can compute N/X paths
  • Inside each chunk, Y paths can be computed at once via vectorization, where Y = (register width of the instruction set in bits) / (size of a double/float in bits)
  • In the sketch after this list, with AVX2 and floats, Y is 8, because AVX2 has 256-bit-wide registers and a float takes up 32 bits
  • Correlated Gaussian random numbers are generated using Intel MKL across X threads
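
To make the bullets above concrete, here is a minimal sketch (the GBM-style update, the function name and the parameters are illustrative stand-ins, not the actual Aladdin model): the paths are split across the cores with OpenMP and, inside each chunk, 8 single-precision paths are advanced per AVX2 register.

#include <cmath>
#include <immintrin.h>

// Toy stand-in for the real model: one Euler step of dS = mu*S*dt + sigma*S*dW
// applied to nPaths paths. Assumes nPaths is a multiple of 8 and that z[]
// already holds (correlated) Gaussian draws, e.g. produced with Intel MKL.
void step_paths(float* S, const float* z, int nPaths, float mu, float sigma, float dt)
{
    const __m256 drift = _mm256_set1_ps(1.0f + mu * dt);
    const __m256 vol   = _mm256_set1_ps(sigma * std::sqrt(dt));

    #pragma omp parallel for              // split the paths across the CPU cores
    for (int i = 0; i < nPaths; i += 8)   // 8 floats per AVX2 register per iteration
    {
        const __m256 s  = _mm256_loadu_ps(&S[i]);
        const __m256 zi = _mm256_loadu_ps(&z[i]);
        // S <- S*(1 + mu*dt) + S*sigma*sqrt(dt)*z
        const __m256 next = _mm256_add_ps(_mm256_mul_ps(s, drift),
                                          _mm256_mul_ps(_mm256_mul_ps(s, vol), zi));
        _mm256_storeu_ps(&S[i], next);
    }
}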

On the GPU:

  • Each thread calculates one path, and the number of threads per thread block is capped at the minimum of:
    - (max size of shared memory) / (max number of SDE states × sizeof(T)), where T is double/float
    - a predefined maximum based on the number of CUDA cores per Streaming Multiprocessor
  • Each thread block has its own shared memory, and the states of the SDEs are stored there to avoid going through global memory
  • Gaussian random numbers are generated using cuRAND and the correlation is applied via cuBLAS as a matrix multiplication (a minimal sketch follows below)
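
A hedged sketch of that last bullet (buffer names and layout are illustrative): cuRAND fills a device buffer with standard normals, and the correlation is applied by multiplying with the pre-computed Cholesky factor of the correlation matrix via a cuBLAS GEMM.

#include <curand.h>
#include <cublas_v2.h>

// d_z: k x nPaths standard normals, d_L: k x k Cholesky factor of the correlation
// matrix, d_corr: k x nPaths correlated output. Everything column-major, on the device.
void correlated_normals(float* d_z, const float* d_L, float* d_corr, int k, int nPaths)
{
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
    curandGenerateNormal(gen, d_z, (size_t)k * nPaths, 0.0f, 1.0f);  // N(0,1) draws

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float one = 1.0f, zero = 0.0f;
    // corr = L * z: applying the correlation is just a matrix multiplication
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                k, nPaths, k,
                &one, d_L, k, d_z, k,
                &zero, d_corr, k);

    cublasDestroy(handle);
    curandDestroyGenerator(gen);
}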

Let’s not forget MultiGPU

We have two strategies:

  • Split one big simulation into smaller chunks across paths, simulate these chunks on separate GPUs, then aggregate the results on the primary GPU or on the CPU (sketched below)
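
As a rough sketch of that first strategy (simulate_chunk is a hypothetical placeholder for the real per-device simulation, not an actual Aladdin interface), the paths can be split across all visible GPUs with the standard CUDA runtime API and the partial results aggregated on the CPU:

#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Placeholder for the real per-device simulation of [firstPath, firstPath + numPaths).
double simulate_chunk(int device, int firstPath, int numPaths);

double run_multi_gpu(int nPaths)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<double> partial(deviceCount, 0.0);
    const int chunk = (nPaths + deviceCount - 1) / deviceCount;

    #pragma omp parallel for                 // one host thread drives each GPU
    for (int d = 0; d < deviceCount; ++d)
    {
        cudaSetDevice(d);                    // subsequent CUDA calls on this thread go to GPU d
        const int first = d * chunk;
        const int count = std::max(0, std::min(chunk, nPaths - first));
        partial[d] = simulate_chunk(d, first, count);
    }

    double total = 0.0;
    for (double p : partial) total += p;     // aggregate on the CPU
    return total;
}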
