by Yi Dong and Alex Volkov
Our previous blog gave a high-level overview of the examples in the gQuant repository, which use GPU-accelerated Python. Here we dive more deeply into the technical details. The gQuant examples are built on top of NVIDIA’s RAPIDS framework: they get fast data access from cuDF dataframes residing in high-bandwidth GPU memory and benefit from the vast compute capabilities of modern GPUs. We demonstrate a task-centric workflow that models dependencies as a directed acyclic graph (DAG) using the idea of “dataframe-flow”, which makes it possible to manipulate both the data and the compute at the graph level. Nodes are dataframe processors, and edges give the direction in which the resulting dataframes are passed. Organizing the quant’s workflow as a graph at this high level addresses the challenge of complicated workflows. By switching from cuDF to Dask-cuDF dataframes, the computation automatically becomes multi-node and multi-GPU.
In this blog, we present a simple example to show how easy it is to accelerate the quant workflow on the GPU and visualize the data flow.
A Toy Example:
To mimic the end-to-end quantitative analyst task, we are going to backtest a simple mean reversion trading strategy. The workflow includes the following steps:
- Load the end-of-day CSV data for 5000 stocks into the dataframe.
- Add a rate-of-return feature to the dataframe.
- Clean up the data by removing low-volume stocks and stocks with extreme rates of return.
- Compute the slow and fast exponential moving averages and derive the trading signal from them.
- Run backtesting and compute the returns from this strategy for each of the days and stock symbols.
- Run a simple portfolio optimization by averaging the stocks together for each of the trading days.
- Compute the Sharpe ratio and cumulative return results.
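The core of the steps above can be sketched for a single symbol in plain Python. This is only an illustration under stated assumptions: the helper names (`rate_of_return`, `ema`, `mean_reversion_signal`, `backtest`) and the window lengths are hypothetical, the signal rule (short when the fast average is above the slow one, long when below) is one common mean reversion convention and not necessarily the one used in gQuant, and plain lists stand in for cuDF columns.

```python
def rate_of_return(closes):
    """Simple daily return; the first day has no prior close, so use 0.0."""
    out = [0.0]
    for prev, cur in zip(closes, closes[1:]):
        out.append(cur / prev - 1.0)
    return out


def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out


def mean_reversion_signal(closes, fast=5, slow=10):
    """-1 (short) if fast EMA > slow EMA, +1 (long) if below, 0 if equal."""
    f, s = ema(closes, fast), ema(closes, slow)
    return [(-1 if a > b else 1 if a < b else 0) for a, b in zip(f, s)]


def backtest(closes, fast=5, slow=10):
    """Day t's strategy return uses day t-1's signal to avoid look-ahead."""
    rets = rate_of_return(closes)
    sig = mean_reversion_signal(closes, fast, slow)
    return [0.0] + [sig[t - 1] * rets[t] for t in range(1, len(closes))]


# Made-up prices, for illustration only.
closes = [100.0, 102.0, 101.0, 99.0, 100.0, 103.0]
strategy_returns = backtest(closes, fast=2, slow=4)
```

In a dataframe setting, each helper would correspond roughly to one node operating on whole columns for all symbols at once.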
The whole workflow can be organized into a computation graph, which is fully described in a YAML file. Here is a snippet of the YAML file:
Each node has a unique id, a node type, configuration parameters, and the ids of its input nodes. The example takes this YAML file and wires it into a graph that can be visualized.
From the graph, we can see that it performs the seven steps of computation described above.
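To make the wiring idea concrete, here is a minimal sketch of how node specifications with an id, a type, a configuration, and input ids could be turned into an evaluation order. Everything in it is hypothetical (the node ids, the `PROCESSORS` registry, and `run_graph` are illustrative names, not gQuant's API), and plain lists stand in for dataframes. It relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical node specs mirroring the fields described above:
# a unique id, a node type, configuration parameters, and input node ids.
specs = [
    {"id": "node_load", "type": "Load", "conf": {"values": [3, 1, 2]}, "inputs": []},
    {"id": "node_sort", "type": "Sort", "conf": {}, "inputs": ["node_load"]},
    {"id": "node_head", "type": "Head", "conf": {"n": 2}, "inputs": ["node_sort"]},
]

# Hypothetical registry mapping a node type to the function that processes it.
PROCESSORS = {
    "Load": lambda conf, inputs: list(conf["values"]),
    "Sort": lambda conf, inputs: sorted(inputs[0]),
    "Head": lambda conf, inputs: inputs[0][: conf["n"]],
}


def run_graph(specs, outputs):
    """Evaluate nodes in dependency order and return the requested outputs."""
    by_id = {s["id"]: s for s in specs}
    # TopologicalSorter takes a mapping of node -> predecessors.
    order = TopologicalSorter({s["id"]: s["inputs"] for s in specs}).static_order()
    results = {}
    for node_id in order:
        spec = by_id[node_id]
        upstream = [results[i] for i in spec["inputs"]]
        results[node_id] = PROCESSORS[spec["type"]](spec["conf"], upstream)
    return [results[o] for o in outputs]


(head,) = run_graph(specs, ["node_head"])
```

Because the graph is just data, the same specification can be loaded from a YAML file, drawn, or re-evaluated with different parameters.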
We implement a few common nodes that are useful for quantitative finance in the examples provided in the gQuant repository. With the help of Numba, we demonstrate 36 technical indicators that are used in computing trading signals. To extend one of these examples to a new computation, you would just need to override two functions in the parent class “Node”: “process” and “columns_setup”. The “process” function is the main function that takes input dataframes and computes the output dataframe. The “columns_setup” defines the required input dataframe column names and types and the output dataframe column names and types. In this way, the dataframes are strongly typed and errors can be detected early before the time-consuming computation happens. Here is the code example for implementing MaxNode, which computes the maximum value for a specified column in the dataframe.
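As a rough illustration of the two-method pattern, the sketch below uses a hypothetical, heavily simplified `Node` base class; it is not gQuant's actual class, and the attribute names `required` and `addition`, the `run` helper, and the `"volume"` column are assumptions made for this example. A dict of column name to list stands in for a dataframe.

```python
class Node:
    """Hypothetical, simplified base class; not gQuant's actual implementation."""

    def __init__(self, conf):
        self.conf = conf
        self.required = {}  # input column name -> type name
        self.addition = {}  # output column name -> type name
        self.columns_setup()

    def columns_setup(self):
        raise NotImplementedError

    def process(self, inputs):
        raise NotImplementedError

    def run(self, inputs, input_types):
        # Validate column names and types before any computation starts.
        for name, type_name in self.required.items():
            if input_types.get(name) != type_name:
                raise TypeError(f"column {name!r} must have type {type_name}")
        return self.process(inputs)


class MaxNode(Node):
    def columns_setup(self):
        column = self.conf["column"]
        self.required = {column: "float64"}
        self.addition = {column + "_max": "float64"}

    def process(self, inputs):
        frame = inputs[0]  # dict of column name -> list, standing in for a dataframe
        column = self.conf["column"]
        return {column + "_max": [max(frame[column])]}


node = MaxNode({"column": "volume"})
result = node.run([{"volume": [10.0, 42.0, 7.0]}], {"volume": "float64"})
```

Declaring the column contract separately from the computation is what lets a mismatch surface as an immediate error instead of failing deep inside a long run.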
It is straightforward to customize nodes, as shown in the example above. In case there is no direct dataframe method for a particular logic, a Numba GPU kernel can be used to implement it.
The example graphs are evaluated by specifying the output nodes and any input-node replacements.
The parameters for the “node_filterValue” node are overridden in the ‘input_replace’ arguments. The outputs from four nodes in the graph are computed and saved into the variable ‘o_gpu’. We cache the ‘node_csvdata’ and ‘node_sort2’ results in variables for later use.
After ‘node_filterValue’, the graph filters out the stocks that are not suitable for backtesting. Running the following commands shows that there are 5052 stocks loaded initially and 4598 stocks remaining after filtering.
To visualize the backtesting results in the JupyterLab notebooks, we use “bqplot”.
This toy example gets a Sharpe ratio of 1.6 without considering the transaction cost. Nice!
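For reference, the two summary statistics can be sketched as follows. The conventions here are assumptions rather than a statement of what the gQuant nodes do: the Sharpe ratio is annualized with 252 trading days and a zero risk-free rate, and the cumulative return compounds daily returns. The sample returns are made up.

```python
import math
import statistics


def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.fmean(daily_returns)
    std = statistics.stdev(daily_returns)  # sample standard deviation
    return mean / std * math.sqrt(periods_per_year)


def cumulative_return(daily_returns):
    """Compound the daily returns into a running total return."""
    total, out = 1.0, []
    for r in daily_returns:
        total *= 1.0 + r
        out.append(total - 1.0)
    return out


# Made-up daily portfolio returns, for illustration only.
daily = [0.01, -0.005, 0.007, 0.002, -0.003]
print(sharpe_ratio(daily), cumulative_return(daily)[-1])
```

Transaction costs would reduce each daily return before these statistics are computed, which is why a figure that ignores them is optimistic.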
We now compare the performance of CPU and GPU execution. The same computation graph can process a Pandas dataframe on the CPU with a few changes.
- The root node needs to be changed to load a Pandas dataframe.
- A few computation nodes that use Numba GPU kernels need to be changed to use CPU implementations.
Those nodes that use compatible dataframe API calls can be left as-is. We can simply alter the Node type in the graph to change the implementations:
It produces exactly the same results as the GPU version, but runs much more slowly. All the computation happens in a Pandas dataframe on the CPU. To quantify the performance, we run the following experiment.
We ran this using a Tesla V100 GPU and an Intel(R) Xeon(R) Gold 6148 CPU. It takes 73 seconds to run on the CPU and 4 seconds to run on the GPU. In other words, we achieve an 18x speedup by using cuDF. (Note that the input nodes load the dataframes from the cached variables to avoid counting disk IO time.)
Running this toy example in a Dask distributed environment is easy. We just need to swap the cuDF dataframe for a Dask-cuDF dataframe. We split the large dataframe into small pieces to be loaded by different workers in the cluster.
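The speedup figure follows directly from the two reported runtimes. The snippet below reproduces that arithmetic and includes a small, hypothetical timing helper (`time_it` is an illustrative name, not part of gQuant) of the kind one could use to collect such wall-clock numbers.

```python
import time


def time_it(fn, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


cpu_seconds = 73.0  # CPU runtime reported above
gpu_seconds = 4.0   # GPU runtime reported above

speedup = cpu_seconds / gpu_seconds
print(f"speedup: {speedup:.2f}x")  # prints "speedup: 18.25x"
```

Taking the best of several runs reduces noise from warm-up effects such as kernel compilation on the first call.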
The distributed computation is turned on by changing the root node type:
It produces the same results, but it runs slowly. Distributed computation only makes sense if we have a very large dataset that cannot fit into one GPU. In this example, the dataset is small enough to be loaded into a single GPU, so the between-GPU communication overhead dominates the runtime. Using more resources to solve a problem isn’t always better, even when it is this easy to set up!
Strategy parameter search
Quantitative analysts usually need to explore different parameters for their trading strategy, and this exploration is iterative. We can accelerate it by reusing cached dataframes and evaluating only sub-graphs.
To find the optimal parameters for this simple mean reversion strategy, we only need the dataframe from the “node_sort2” node, which is cached in the “strategy_cached” variable. Because the GPU computation is so fast, we can make the parameter exploration interactive in the JupyterLab notebook.
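The pattern of reusing a cached input while sweeping parameters can be sketched in plain Python. The assumptions are loud here: `strategy_cached` is only a made-up price list standing in for the cached dataframe, `score` is a hypothetical unannualized mean-over-standard-deviation measure, the signal rule is a generic mean reversion convention, and the window grid is arbitrary.

```python
import statistics
from itertools import product


def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out


def score(closes, fast, slow):
    """Mean over standard deviation of daily strategy returns (unannualized)."""
    f, s = ema(closes, fast), ema(closes, slow)
    rets = []
    for t in range(1, len(closes)):
        # Yesterday's signal decides today's position (no look-ahead).
        position = -1 if f[t - 1] > s[t - 1] else 1 if f[t - 1] < s[t - 1] else 0
        rets.append(position * (closes[t] / closes[t - 1] - 1.0))
    std = statistics.pstdev(rets)
    return statistics.fmean(rets) / std if std > 0 else 0.0


# Stand-in for the cached upstream result: built once, reused for every trial.
strategy_cached = [100.0, 102.0, 101.0, 99.0, 100.0, 103.0, 101.0, 100.0, 102.0, 104.0]

# Only keep combinations where the fast window is shorter than the slow one.
grid = [(f, s) for f, s in product([2, 3, 5], [5, 8, 13]) if f < s]
results = {params: score(strategy_cached, *params) for params in grid}
best = max(results, key=results.get)
```

Only the part of the pipeline downstream of the cached input is recomputed per trial, which is what keeps each iteration short enough to drive from interactive widgets.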
In this blog, we demonstrated how to backtest a simple trading strategy for 5000 stocks. The example shows that GPU-accelerated Python can handle the challenges data science faces from large datasets and complicated models and workflows. Organizing the data and processing as a graph makes the workflow easier for quants to understand and maintain. We showed that it is trivial to implement a dataframe processing node and adapt it to many data science applications. In addition, we showed that dataframe-agnostic features can be built into the examples, allowing them to switch easily between cuDF, Pandas, and Dask-cuDF dataframes. The benefits of a GPU dataframe over a CPU dataframe can be huge: in our toy example, we achieved a speedup of about 18x by switching to GPU execution. The more data processed on the GPU, the more speedup we can get. A recent benchmark uses a similar GPU dataframe approach to backtest on the GPU with customized Numba GPU kernels, achieving speedups of over 1000x.
Hopefully, you find this blog useful for your quant applications. We would love to hear your feedback!