Java on GPU: Pricing options with Monte Carlo simulation

Alexey Pirogov
Apr 27, 2019


About

In this article, I’m going to compare the performance of a Java application running on a CPU vs a GPU. As a benchmark, I’ll write an application that prices options using Monte Carlo simulation. For a fair comparison, I’ll run the benchmark on AWS, where we can select CPU and GPU instances at a similar price. I’ll also share the limitations I noticed while programming for the GPU.

Why GPU?

GPU computation is quite popular nowadays. People use GPUs for everything from mining crypto-currency to machine learning. Why is the GPU preferable in some areas while the CPU wins in others?

The GPU was designed for massively parallel tasks. Nowadays, CPUs in laptops have around 8 cores, while CPUs in servers have up to 50 cores. A GPU, on the other hand, may have thousands of cores. If your task can be divided into more than a thousand chunks, there is a high probability that the GPU will be faster (much faster).

The GPU has a huge number of cores, but these cores aren’t the same as general-purpose CPU cores. The GPU was designed around a Single Instruction, Multiple Data (SIMD) architecture. This means that different threads can only execute the same code.

The GPU has some limitations, most of which are described later. But here are the two main ones:

  1. The GPU is designed to execute the same code (input values may differ) on all threads.
  2. The GPU doesn’t have access to RAM. It has its own memory (actually, several types of memory). As a result, we need to copy the application’s inputs from RAM to GPU memory before we start the calculations. This may take quite a long time.

CUDA vs OpenCL

Natively, GPUs don’t support general-purpose programming languages. The situation is made worse by different GPU manufacturers promoting their own technologies. For instance, NVidia is famous for its CUDA platform, which has a lot of useful libraries.

However, in the benchmark I used OpenCL. OpenCL has less functionality than CUDA, but it is supported by most chip producers (NVidia, AMD, Intel, etc.). This means that your app will run even on a low-end laptop with an integrated GPU.

Why Java?

OpenCL (Open Computing Language) is a C-based language. Why do we need Java here? I think running an OpenCL app from Java makes sense in the following situation:

  1. You have an existing JVM-based application (Java, Scala, Kotlin, etc.).
  2. You aren’t sure whether running on the GPU will give any benefits, and you aren’t ready to write an OpenCL app from scratch.

In this case, running a computation on the GPU requires relatively little effort. If the results are good, you can consider writing a native OpenCL or CUDA app to get even better performance.

As I said, GPUs don’t support Java natively, but some libraries can help. We will use the Aparapi library. It was originally created by AMD but is currently developed by the community.

More about limitations

As mentioned before, the GPU has some limitations. But this isn’t the end of the story: OpenCL and Aparapi also have limitations of their own. Here is a list of the limitations that I found:

  1. No recursion.
  2. Thread can’t start another thread.
  3. No double type, only float.
  4. No multidimensional arrays (you need to emulate them on your own using a 1D array).
  5. No Java enums.
  6. No debugging.
  7. Sometimes, misleading error messages.
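Limitation 4 is easy to work around in practice. A minimal sketch of emulating a 2D array with a flat 1D array (class and method names are my own, for illustration only):

```java
public class FlatArrayDemo {
    // Emulate a rows x cols 2D array with a 1D array: index = row * cols + col
    static float get(float[] a, int cols, int row, int col) {
        return a[row * cols + col];
    }

    static void set(float[] a, int cols, int row, int col, float v) {
        a[row * cols + col] = v;
    }
}
```

The same `row * cols + col` indexing works inside a kernel, where only 1D arrays of primitives are available.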

Why Monte Carlo?

We have discussed the GPU architecture and how to write code for it. It is time to implement something. I was looking for a task that is massively parallel and, at the same time, used in real life.

Monte Carlo simulation is one of such tasks. We run a simulation using random variables thousands or millions of times and then calculate the average value as the expected value. We will use Monte Carlo (MC) to price financial options.

Three groups of numerical techniques are used in finance to price options:

  1. Solving the Partial Differential Equation;
  2. Lattice method (I talked about it in another post);
  3. Simulation: Monte Carlo simulation;

The fastest solution is to solve the Partial Differential Equation. However, such a solution isn’t available for all financial instruments (e.g. some kinds of path-dependent options). In those cases, MC can be used.

Also, MC offers additional flexibility, as we define the behaviour of random variables used in the simulation. For example, we can assume a non-normal distribution of the option’s underlying asset price.

Pricing options with Monte Carlo

My plan was to price options with MC. As a first step, I implemented pricing for a European call option. Here is the algorithm:

  1. Simulate the price of the option’s underlying in the future (assuming the random nature of the price movement);
  2. Repeat the process many times (e.g. 100k or 1m) and store the results;
  3. For every future price, calculate the option’s payoff;
  4. Calculate the average of the option payoffs;
  5. Discount the average payoff (a future value) back to the current day.

Future asset price:

S(T) = S(0) · exp((r − sigma²/2) · T + sigma · √T · N(0,1))

where:

S(0) — current asset price (stock, FX, etc.);
S(T) — price in the future period T;
r — interest rate;
sigma — standard deviation;
T — number of periods;
N(0,1) — random variable with mean 0 and standard deviation 1.

As you can see, pricing European options is straightforward. We don’t need any loops or if-conditions; we just plug the inputs into a mathematical formula.
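The five steps above fit in a few lines of plain Java. A minimal CPU sketch (class and method names are my own, not taken from the article’s repository):

```java
import java.util.Random;

public class EuropeanMc {
    // Monte Carlo price of a European call under geometric Brownian motion.
    static float priceCall(float s0, float strike, float r, float sigma,
                           float t, int simulations, long seed) {
        Random rnd = new Random(seed);
        double payoffSum = 0;
        for (int i = 0; i < simulations; i++) {
            // S(T) = S(0) * exp((r - sigma^2/2) * T + sigma * sqrt(T) * N(0,1))
            double z = rnd.nextGaussian();
            double st = s0 * Math.exp((r - sigma * sigma / 2) * t
                    + sigma * Math.sqrt(t) * z);
            payoffSum += Math.max(st - strike, 0); // call payoff at maturity
        }
        // Average payoff, discounted back to today
        return (float) (Math.exp(-r * t) * payoffSum / simulations);
    }
}
```

With enough simulations the result converges to the Black-Scholes price for the same inputs.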

For the GPU, most of the overhead is copying the inputs from RAM to the GPU and copying the results back after the computation is done. That is why I decided to create a second benchmark with a more computation-intensive task.

Knock out barrier options

Pricing knock-out barrier options (KO) isn’t as trivial as pricing European options. KO options belong to the path-dependent options and require some computation during each time period. In this case, we increase GPU utilization.

Pricing a barrier option is in some sense similar to pricing a European option. However, if the price of the asset exceeds some threshold at any time, the option becomes worthless. That is why we need to introduce a loop with a fixed number of iterations (time periods) to simulate the behaviour of the asset price during each time interval (day, hour, etc.).
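The inner loop described above can be sketched in plain Java like this (an up-and-out call; names and structure are my own illustration, not the article’s actual code):

```java
import java.util.Random;

public class BarrierMc {
    // Up-and-out call: the option is worthless if the asset ever reaches the barrier.
    static float priceUpAndOutCall(float s0, float strike, float barrier,
                                   float r, float sigma, float t,
                                   int steps, int simulations, long seed) {
        Random rnd = new Random(seed);
        double dt = t / steps;
        double payoffSum = 0;
        for (int i = 0; i < simulations; i++) {
            double s = s0;
            boolean knockedOut = false;
            // Fixed number of time periods: simulate the path step by step
            for (int step = 0; step < steps; step++) {
                double z = rnd.nextGaussian();
                s *= Math.exp((r - sigma * sigma / 2) * dt
                        + sigma * Math.sqrt(dt) * z);
                if (s >= barrier) { knockedOut = true; break; } // knocked out
            }
            if (!knockedOut) payoffSum += Math.max(s - strike, 0);
        }
        return (float) (Math.exp(-r * t) * payoffSum / simulations);
    }
}
```

Note that with a very high barrier the price should approach the plain European price, and a lower barrier can only make the option cheaper.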

Show me the code!

I tried to keep the introduction brief. Finally, it is time to see the code. I think you will be surprised by how much less code we need to write for the CPU implementation compared to the GPU one.

CPU Code:

Pricing European and Barrier options on CPU

GPU Code:

Pricing European Options on GPU

GPU Code for Barrier Options:

Pricing Barrier KO options on GPU
GPU code to schedule kernel (described below) execution

As you may have noticed, the GPU version requires significantly more code than the CPU one. This is caused by Aparapi’s design, the limitations on the types we can use, and the GPU architecture. Let’s go through the code.

The GPU executes special code called a kernel. Aparapi has a special abstraction for it: com.aparapi.Kernel. After we extend the Kernel class, we override the run() method to implement our logic. Every GPU thread executes this code.
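The shape of such a kernel is roughly the following. This is a minimal sketch of my own, not the article’s actual gist; it computes call payoffs for an array of simulated prices, one element per thread (Aparapi falls back to a Java thread pool when no OpenCL device is available):

```java
import com.aparapi.Kernel;
import com.aparapi.Range;

public class PayoffKernelDemo {
    // Compute max(S - K, 0) for many simulated prices, one GPU thread per element.
    static float[] callPayoffs(final float[] simulatedPrices, final float strike) {
        final float[] payoffs = new float[simulatedPrices.length];
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gid = getGlobalId();            // index of this thread's element
                float p = simulatedPrices[gid] - strike;
                payoffs[gid] = p > 0f ? p : 0f;     // call payoff
            }
        };
        // Range.create(n) asks for n parallel executions of run()
        kernel.execute(Range.create(simulatedPrices.length));
        kernel.dispose();
        return payoffs;
    }
}
```

Aparapi translates the bytecode of run() into OpenCL and copies the captured arrays to GPU memory before execution and back afterwards.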

The next step is to define the number of threads. Let’s think about the maximum number of computations we can run in parallel. If we have N trades and M Monte Carlo simulations, we need to calculate N*M option prices. Then, for each trade, we need to find the average price (sum all the prices per trade and divide by M).

Using createRange(…), we can tell Aparapi the maximum number of threads it should use. The minimum of this number and the number of available threads will be used.

Technically, we could run the MC simulation and calculate the average price in the same kernel method. To do this, we would need a trick: select N threads out of the N*M to aggregate the results for the N trades after the main price computation is done.

However, such an approach is quite slow. I think the reason is that each thread must execute the same code and can’t progress until the other threads have executed all previous statements. For example, in the next code snippet, int a = … will be executed only after all threads complete the if-else statement, no matter which branch a particular thread took.
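The snippet being described follows this pattern (an illustrative reconstruction of my own, written as ordinary Java):

```java
public class DivergenceDemo {
    // On a SIMD device, threads in the same group serialize both branches of the
    // if-else, so the statement after it starts only once every thread has passed
    // through the conditional, regardless of which branch each thread took.
    static int afterBranch(int gid) {
        int x;
        if (gid % 2 == 0) {
            x = gid * 2;    // branch A: even thread ids
        } else {
            x = gid + 100;  // branch B: odd thread ids
        }
        int a = x + 1;      // reached only after all threads finish the if-else
        return a;
    }
}
```

On a CPU each thread runs independently, but on the GPU this branch divergence stalls the whole group, which is why heavy per-thread branching hurts performance.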

That is why I split the work into two steps: calculating the price for each simulation, and finding the average.

  1. Calculate the price for every MC simulation and store the result to GPU memory (calcMc method, using N*M threads for N trades with M simulations).
  2. Calculate the final price by finding the average price for every trade (aggregate method, using N threads — one for each trade).
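Stripped of the Aparapi plumbing, the two-step split looks like this in plain Java (the method and variable names are illustrative, not the article’s):

```java
public class TwoStepAverage {
    // Step 1: one "thread" per (trade, simulation) pair writes its simulated price.
    // The real kernel would run the MC path here; we store a placeholder value.
    static void calcMc(float[] prices, int trades, int sims) {
        for (int gid = 0; gid < trades * sims; gid++) {
            int trade = gid / sims;        // which trade this thread belongs to
            prices[gid] = trade + 1f;      // placeholder for the simulated price
        }
    }

    // Step 2: one "thread" per trade averages its M simulation results.
    static void aggregate(float[] prices, float[] avg, int trades, int sims) {
        for (int trade = 0; trade < trades; trade++) {
            double sum = 0;
            for (int i = 0; i < sims; i++) {
                sum += prices[trade * sims + i];
            }
            avg[trade] = (float) (sum / sims);
        }
    }
}
```

On the GPU, each loop body becomes a kernel invocation: calcMc runs with N*M threads, and aggregate runs with N threads, reading the intermediate prices that stayed in GPU memory between the two steps.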

Benchmarks and cloud

In some articles about GPU vs CPU, you can find something like: “On my PC with an i7xxxx CPU and a GTXxxx GPU, the GPU is n times faster.” Such results don’t mean a lot to me, for the following reasons:

  1. Nobody will use desktop i7 in production.
  2. Even if I had a server Xeon, other people may have Xeons that are a few times more or less expensive (and faster or slower).

This is where the cloud comes in. I think it’s amazing that nowadays we can pay a few dollars per hour and get a GPU worth 10k dollars to experiment with.

To compare apples to apples, I started by running the benchmarks on a 1$/hour CPU-optimized instance and a 1$/hour GPU-optimized instance in AWS. It turned out that AWS has a few different families of GPU-optimized instances (e.g. g3 vs p3), which differ by GPU type and price. That is why I decided to make an extra comparison of 3$/hour CPU- vs GPU-optimized instances. As you will see later, the distance between winner and loser was similar in the 1$ and 3$ categories.

Regarding the tests, I made two groups. One group for European call options, which are quite easy to calculate, and another for barrier KO options, which require more computation. For each group, I created two tests with a different ratio between required memory and computation (the number of options vs the number of Monte Carlo simulations).

List of tests:

  1. Price 500 European call options with 1m MC simulations;
  2. Price 5000 European call options with 100k MC simulations;
  3. Price 500 barrier knock-out call options with 1m MC simulations;
  4. Price 5000 barrier knock-out call options with 100k MC simulations.

Results:

note: the time axis is different on each chart

As we can see, for European options the CPU is on average 4 times faster than the GPU, while for barrier options the GPU is 9 to 17 times faster than the CPU.

These results are in line with the assumption that the GPU’s main overhead is copying data from RAM to GPU memory, so invoking a trivial operation on a large amount of data doesn’t make sense.

AWS instances specs:

3$ GPU — p3.2xlarge, 3.06$, Xeon E5-2686v4 2.3 GHz, Tesla V100-SXM2-16GB;

3$ CPU — c5.18xlarge, 3.06$, Xeon Platinum 8124M 3 GHz (2 processors);

1$ GPU — g3.4xlarge, 1.14$, Nvidia Tesla M60-8GB;

1$ CPU — there was no CPU-optimized instance priced at 1$, so I ran the benchmark on two instances with an average price of 1.105$ and averaged the results: c5.4xlarge — 0.68$, Xeon Platinum 8124M 3 GHz, and c5.9xlarge — 1.53$, Xeon Platinum 8124M 3 GHz.

Conclusion

If you have Java code that you want to run on a GPU, the Aparapi library may help you a lot. Depending on your task, you may see a significant increase in performance; in my case, pricing barrier options on the GPU was 9 to 17 times faster. If you get a good result, you might consider rewriting your code on the native OpenCL or CUDA platform. Your code may become even faster, and you will get access to useful libraries (e.g. CUDA has useful random number generators).

Code used in the blog: https://github.com/AlexeyPirogov/MonteCarloGPUJava.
