Fast Fractional Differencing on GPUs using Numba and RAPIDS (Part 1)

Yi Dong
Oct 8, 2019 · 4 min read

By: Yi Dong and Mark J. Bennett


Fractional differencing is a signal processing technique used to remove nonstationarity from a time series while preserving as much of the series' memory as possible. Nonstationarity of prices often occurs in stock markets and other capital markets, and it makes prices hard to predict; removing it makes them easier to model. Econometric time-series techniques for this have been studied for years, and fractional differencing is widely used today in the financial services industry to prepare training data for machine learning algorithms that generate stock-trading signals [1][2].

Fractional differencing models such as the Auto-Regressive Fractionally Integrated Moving Average (ARFIMA) model are based on the more familiar ARIMA(p, d, q) model, where the differencing order d can be a fraction instead of the usual 0 or 1 (no differencing vs. differencing by one period), hence the extra 'F' in the acronym. Our goal is to hardware-accelerate the process on GPUs using two approaches: (1) the RAPIDS cuDF open-source, GPU-accelerated DataFrame library, and (2) Numba, a high-performance Python compiler that uses NVIDIA CUDA primitives for GPU acceleration.

In this open-source project by Ensemble Capital, the fractional differencing computation is accelerated via the cudf.apply_chunks() method on the GPU. This approach achieves over 100x speedup compared with a CPU implementation, as explained in their blog.

Using the apply_rows() and apply_chunks() methods from RAPIDS cuDF is the easiest way to customize GPU computations. In a previous blog, we covered in detail how to take advantage of these two methods for customized computations. Keep in mind that while the cuDF approach is easier to use, it sacrifices some performance for convenience. The Numba compiler approach has a steeper learning curve but delivers better GPU performance from Python.

In this blog, we are going to show how to use Numba to do fractional differencing computation efficiently. In the second part of the blog, we will show how easy it is for data scientists to compute fractional differencing signals and use them to generate alpha signals.

Original Implementation

To follow our implementation, first copy the fractional differencing code from Ritchie Ng's open-source project. We will use this as our baseline.

The original implementation takes two steps to compute the fractional differencing. First, it calculates the weights with the get_weights_floored() method, to be applied later in the differencing computation. This step is not computationally demanding, so we do not accelerate it. Second, it uses apply_chunks() to apply the weights to individual elements of the array, using threads_per_block threads to compute in parallel. However, it launches only one GPU block, so the GPU is severely underutilized.
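The weights come from the binomial expansion of the fractional difference operator (1 − B)^d: w_0 = 1 and w_k = −w_{k−1} · (d − k + 1) / k, truncated once |w_k| falls below a floor threshold. A minimal NumPy sketch of this computation follows; the name and signature mirror the original get_weights_floored(), but treat this as an illustrative reconstruction, not the original code:

```python
import numpy as np

def get_weights_floored(d, num_k, floor=1e-3):
    """Fractional-differencing weights via the recursion
    w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k,
    truncated when |w_k| drops below `floor`.

    Returned as a column vector with the oldest lag first
    (i.e. reversed), so it can be dotted with a window of
    values ordered oldest-to-newest.
    """
    w, k = [1.0], 1
    while k < num_k:
        w_k = -w[-1] * (d - k + 1) / k
        if abs(w_k) <= floor:
            break
        w.append(w_k)
        k += 1
    return np.array(w[::-1]).reshape(-1, 1)

# e.g. d = 0.5 gives weights 1, -0.5, -0.125, -0.0625, ...
w = get_weights_floored(d=0.5, num_k=10, floor=5e-3)
```

For a fractional d, the weights decay slowly rather than cutting off after one lag, which is exactly how fractional differencing retains long memory while still detrending the series.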

A More Efficient Implementation

Having identified the problems in the code above, we accelerate it in three ways:

  1. Use Numba, a more flexible library, to write the customized GPU kernel,
  2. Maximize GPU threads to accelerate the calculation, and
  3. Cache all the necessary numbers in the GPU shared memory to accelerate the IO.

Following is the revised fractional differencing driver function, which you can find with our other GPU-accelerated Python examples in the gQuant repository. It converts the input array to a GPU array, computes the weights and the number of blocks, and passes the inputs to launch the GPU kernel.

Following is the GPU kernel implemented in Numba:

It uses the same get_weights_floored() method as the original implementation to compute the weights; check the original notebook for the details of what the weights mean. Each GPU thread is responsible for thread_tile elements of the input array. The kernel divides the large input array into pieces of size thread_tile * number_of_threads_per_block and sends each piece to a different GPU block. All the information needed by a block is loaded into that block's shared memory.

The shared memory holds three chunks of information: the historical values needed to compute the fractional differencing for the block's first few elements, the input elements assigned to this particular block, and the fractional differencing weights.

Fractional differencing is essentially a 1D convolution whose kernel values are the weights computed by get_weights_floored(). The device function conv_window performs this convolution for one thread. We can think of convolution as an operation that applies a filter to a signal; here, the filter defined by the fractional differencing weights is applied to the input array.
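As a quick sanity check of this view (a NumPy sketch with made-up example weights, not GPU code): applying the weights by an explicit dot product over each sliding window gives the same result as a 1D convolution with the weight sequence.

```python
import numpy as np

# Hypothetical example weights, ordered oldest lag first: (w_2, w_1, w_0)
weights = np.array([-0.125, -0.5, 1.0])
x = np.random.default_rng(42).standard_normal(100)
win = weights.size

# Direct definition: dot product of each full window with the weights
direct = np.array(
    [weights @ x[i - win + 1 : i + 1] for i in range(win - 1, x.size)]
)

# Same result as a 1D convolution; mode="valid" keeps only full-window
# outputs. np.convolve flips its kernel, so pass the weights reversed.
conv = np.convolve(x, weights[::-1], mode="valid")

assert np.allclose(direct, conv)
```

Seeing the operation as a convolution is what makes the shared-memory tiling natural: each output needs a small, fixed window of inputs, so neighboring threads can share cached data.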

Performance Comparison

We can compare the performance of this new implementation against the original one.
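The benchmark loop itself is straightforward. Here is a hedged sketch of the shape of such a harness, timing two implementations on growing arrays and reporting the error between them; since the real comparison requires a GPU, a pure-Python loop stands in for the baseline and a NumPy convolution for the optimized path:

```python
import time
import numpy as np

def fracdiff_loop(x, w):
    """Baseline stand-in: explicit Python loop over windows."""
    win = w.size
    out = np.full(x.size, np.nan)
    for i in range(win - 1, x.size):
        out[i] = w @ x[i - win + 1 : i + 1]
    return out

def fracdiff_fast(x, w):
    """Optimized stand-in: vectorized 1D convolution."""
    out = np.full(x.size, np.nan)
    out[w.size - 1 :] = np.convolve(x, w[::-1], mode="valid")
    return out

w = np.array([-0.125, -0.5, 1.0])
for n in (10_000, 100_000):
    x = np.random.default_rng(0).standard_normal(n)
    t0 = time.time(); slow = fracdiff_loop(x, w); t1 = time.time()
    fast = fracdiff_fast(x, w); t2 = time.time()
    # nanmax ignores the leading NaN entries that both outputs share
    err = np.nanmax(np.abs(slow - fast))
    print(f"array size {n}, baseline {t1 - t0:.3f} s, "
          f"optimized {t2 - t1:.3f} s, error {err:.4f}")
```

The error column matters as much as the timings: a near-zero maximum absolute difference confirms the optimized path computes the same result as the baseline.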

Here’s the output.

array size 100000, Ensemble: time 0.019 s, optimized Time 0.004 s, speed up 4.55, error 0.0000
array size 1000000, Ensemble: time 0.072 s, optimized Time 0.003 s, speed up 21.49, error 0.0000
array size 10000000, Ensemble: time 0.603 s, optimized Time 0.007 s, speed up 84.59, error 0.0000
array size 100000000, Ensemble: time 5.886 s, optimized Time 0.048 s, speed up 122.88, error 0.0000

For an array of length 100 million, the new implementation achieves over 100x speedup compared with Ensemble Capital's GPU implementation.


In this blog, we demonstrated how to use Numba to implement the fractional differencing calculation on the GPU. By taking advantage of all the CUDA cores and fast shared memory, it achieves over 100x speedup compared with the method from Ensemble Capital.

To see how to integrate this with other examples found in gQuant to implement fractional differencing in a full backtest algorithm, please read part 2 of this blog.

[1] Marcos Lopez de Prado, Advances in Financial Machine Learning, ISBN: 978–1–119–48208–6, Wiley, 2018.

[2] Francisco Flores-Muñoz, Alberto Javier Báez-García, Josué Gutiérrez-Barroso, Fractional differencing in stock market price and online presence of global tourist corporations, Journal of Economics, Finance and Administrative Science, ISSN: 2218–0648, 24 October 2018.


RAPIDS is a suite of software libraries for executing end-to-end data science & analytics pipelines entirely on GPUs.
