Fractional differencing is a signal processing technique that removes nonstationarity from a time series while preserving as much of the series' memory as possible. Price series in stock and other capital markets are often nonstationary, which makes them hard to model; removing the nonstationarity yields a series that is far more amenable to prediction. Econometric time-series techniques have been studied for years, and fractional differencing is now widely used in the financial services industry to prepare training data for machine learning algorithms that generate stock trading signals.
Fractional differencing models such as the Auto-Regressive Fractionally Integrated Moving Average (ARFIMA) model build on the familiar ARIMA(p, d, q) model, except that the differencing order d can be a fraction instead of the usual 1 or 0 (difference by one period vs. no differencing), hence the extra ‘F’ in the acronym. Our goal is to hardware-accelerate the computation on GPUs using two approaches: (1) the RAPIDS cuDF open-source, GPU-accelerated DataFrame library, and (2) Numba, a high-performance Python compiler that uses NVIDIA CUDA primitives for GPU acceleration.
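To make the fractional d concrete, here is a minimal sketch of the standard recursion for the fractional differencing weights, w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k (the function name `frac_weights` is ours, for illustration). Setting d = 1 recovers the ordinary first difference, weights (1, -1, 0, 0, ...), while a fractional d such as 0.5 gives a slowly decaying tail that preserves long memory.

```python
import numpy as np

def frac_weights(d, num_weights):
    """Fractional differencing weights via the recursion
    w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, num_weights):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

# d = 1 reduces to the ordinary first difference: 1, -1, 0, 0, ...
print(frac_weights(1.0, 4))
# a fractional order keeps a decaying tail of nonzero weights
print(frac_weights(0.5, 4))
```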
In an open-source project by Ensemble Capital, the fractional differencing computation is accelerated on the GPU via the cudf.apply_chunks() method. As explained in their blog, this GPU approach achieves over 100x acceleration compared with a CPU implementation.
The apply_chunks() method from RAPIDS cuDF is the easiest way to customize GPU computations. In a previous blog, we covered in detail how to take advantage of these methods for customized computations. Keep in mind that, while the cuDF approach is easier to use, it sacrifices some ultimate performance for convenience. The Numba compiler approach has a steeper learning curve, but it delivers better GPU performance from Python.
In this blog, we show how to use Numba to compute fractional differencing efficiently. In the second part of the blog, we will show how easily data scientists can compute fractional differencing signals and use them to generate alpha signals.
The original implementation computes the fractional differencing in two steps. First, it calculates the weights with the get_weights_floored() method; these weights are applied later to compute the fractional differencing. This step is not computationally demanding, so we do not accelerate it. Second, it uses apply_chunks() to apply the weights to the individual elements of the array, with threads_per_block threads computing in parallel. However, it launches only one GPU block, so the GPU is severely underutilized.
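The second step, applying the weights, can be sketched as a plain-NumPy reference (the function name `frac_diff_cpu` is ours, not the original implementation): each output element is a dot product of the weights with the trailing window of the input, and the first few elements, which lack a full window of history, are left as NaN.

```python
import numpy as np

def frac_diff_cpu(arr, weights):
    """Reference fractional differencing: out[i] = sum_j w_j * arr[i-j].
    The first len(weights)-1 positions lack full history -> NaN."""
    n, k = len(arr), len(weights)
    out = np.full(n, np.nan)
    rev_w = weights[::-1]  # oldest weight first, to align with the window
    for i in range(k - 1, n):
        out[i] = np.dot(rev_w, arr[i - k + 1:i + 1])
    return out

# with weights (1, -1) this is just the ordinary first difference
print(frac_diff_cpu(np.array([1.0, 2.0, 4.0]), np.array([1.0, -1.0])))
```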
A More Efficient Implementation
Having identified the problems in the above code, we accelerate it in three places:
- Use Numba, a more flexible library, to write the customized GPU kernel,
- Maximize GPU threads to accelerate the calculation, and
- Cache all the necessary numbers in the GPU shared memory to accelerate the IO.
Following is the revised fractional differencing driver function, which you can find with our other GPU-accelerated Python examples in the gQuant repository. It converts the input array to a GPU array, computes the weights and the number of blocks, and passes these inputs to launch the GPU kernel.
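The grid-sizing arithmetic in the driver can be sketched as follows (names and default values here are illustrative, not the exact gQuant signature): each thread handles thread_tile elements, so each block covers threads_per_block * thread_tile elements, and the grid is sized to cover the whole array.

```python
import math

def launch_config(array_len, threads_per_block=512, thread_tile=2):
    """Compute the CUDA grid size so every input element is covered.

    Each block processes threads_per_block * thread_tile elements,
    so the number of blocks is the ceiling of the array length over
    that block capacity.
    """
    elements_per_block = threads_per_block * thread_tile
    number_of_blocks = math.ceil(array_len / elements_per_block)
    return number_of_blocks, threads_per_block
```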
Following is the GPU kernel implemented in Numba:
It uses the same get_weights_floored() method as the original implementation to get the weights; check the original notebook for the meaning of the weights. Each GPU thread is responsible for thread_tile elements of the input array. The kernel divides the large input array into pieces of thread_tile * threads_per_block elements and sends each piece to a different GPU block. All the information that block needs is loaded into GPU shared memory.
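The partitioning can be sketched as a small index-mapping function (a simplification assuming each thread owns a contiguous run of elements; the real kernel's layout may differ): blocks tile the array in chunks of threads_per_block * thread_tile, and each thread within a block takes thread_tile consecutive elements.

```python
def thread_elements(block_id, thread_id, threads_per_block, thread_tile):
    """Global indices of the input elements owned by one thread,
    assuming contiguous per-thread tiles within contiguous blocks."""
    start = (block_id * threads_per_block + thread_id) * thread_tile
    return list(range(start, start + thread_tile))

# block 0, thread 0 owns the first tile; block 1 starts one
# block-capacity (512 * 2 = 1024 elements) later
print(thread_elements(0, 0, 512, 2))
print(thread_elements(1, 0, 512, 2))
```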
The shared memory holds three chunks of information: the historical elements needed to compute the fractional differencing for the first few elements of the piece, the input elements assigned to this GPU block, and the fractional differencing weights.
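The size of that shared-memory buffer, in elements, can be sketched as below (an assumption on our part: the history chunk needs num_weights - 1 elements, one fewer than the window length, since the window's newest element belongs to the block itself).

```python
def shared_mem_elements(threads_per_block, thread_tile, num_weights):
    """Shared-memory elements per block: trailing history for the
    first outputs, the block's own input tile, and the weights."""
    history = num_weights - 1                    # chunk 1: history
    block_inputs = threads_per_block * thread_tile  # chunk 2: inputs
    return history + block_inputs + num_weights  # chunk 3: weights
```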
Fractional differencing is essentially a 1D convolution whose kernel values are the weights computed by get_weights_floored(). The device function conv_window performs the convolution for one thread. We can think of convolution as an operation that applies a filter to a signal; here, the filter defined by the fractional differencing weights is applied to the input array.
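This equivalence is easy to check on the CPU with NumPy (the function name here is ours): a 'valid'-mode convolution of the series with the weights produces exactly one output per position that has a full window of history, with w_0 applied to the newest element of each window.

```python
import numpy as np

def frac_diff_via_convolve(arr, weights):
    """Fractional differencing as 1D convolution. np.convolve flips
    its second argument, which matches applying w_0 to the newest
    element in each trailing window; 'valid' keeps only positions
    with full history."""
    return np.convolve(arr, weights, mode='valid')

# with weights (1, -1), convolution reproduces the first difference
print(frac_diff_via_convolve(np.array([1.0, 2.0, 4.0]),
                             np.array([1.0, -1.0])))
```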
We can compare the performance of this new implementation against the original one.
Here’s the output.
array size 100000, Ensemble: time 0.019 s, optimized Time 0.004 s, speed up 4.55, error 0.0000
array size 1000000, Ensemble: time 0.072 s, optimized Time 0.003 s, speed up 21.49, error 0.0000
array size 10000000, Ensemble: time 0.603 s, optimized Time 0.007 s, speed up 84.59, error 0.0000
array size 100000000, Ensemble: time 5.886 s, optimized Time 0.048 s, speed up 122.88, error 0.0000
For an array of length 100 million, the new implementation achieves a 100x speedup over Ensemble Capital's GPU implementation.
In this blog, we demonstrated how to implement the fractional differencing calculation on the GPU with Numba. By taking advantage of all the CUDA cores and fast shared memory, it achieves a 100x speedup over the method by Ensemble Capital.
To see how to integrate this with other examples found in gQuant to implement fractional differencing in a full backtest algorithm, please read part 2 of this blog.
References
Marcos Lopez de Prado, Advances in Financial Machine Learning, Wiley, 2018. ISBN: 978-1-119-48208-6.
Francisco Flores-Muñoz, Alberto Javier Báez-García, Josué Gutiérrez-Barroso, "Fractional differencing in stock market price and online presence of global tourist corporations," Journal of Economics, Finance and Administrative Science, ISSN: 2218-0648, 24 October 2018.