CUDA in FB Prophet: Bayesian modeling (Stan)

Jonathan Stoff
3 min read · Jun 25, 2023

Ever since I was a kid, AI and new technological breakthroughs have held my attention. I spent my childhood taking apart my toys to understand just how they worked. That childlike wonder has persisted throughout my life and helped drive me to work hard on this project.

Before embarking on this project, I had never used Stan. Everything I know now comes from teaching myself, along with trial and error.

My need for a Stan model that could run on CUDA started when predictions with FB's Prophet took over 14 hours to complete on my dataset. At that point I tried NeuralProphet, but training took even longer than FB Prophet.

Version 1: this consisted of using OpenCL to increase the speed. As it turns out, OpenCL support was already implemented in the code; the model just needed to be recompiled with stanc with the OpenCL option enabled:

# Link against the OpenCL library shipped with the CUDA toolkit.
OPEN_CL = "-L\"" + "%CUDA_PATH%/lib/Win32/OpenCL.lib\""

# Stan build options; with cmdstanpy these can be passed as, e.g.,
# CmdStanModel(stan_file=..., cpp_options=cpp_options).
cpp_options = {
    'STAN_THREADS': True,           # enable multithreading
    'STAN_CPP_OPTIMS': True,        # extra C++ optimizations
    'STAN_OPENCL': True,            # turn on the OpenCL (GPU) code paths
    'OPENCL_DEVICE_ID': 0,          # which OpenCL device to use
    'OPENCL_PLATFORM_ID': 0,        # which OpenCL platform to use
    'PRECOMPILED_HEADERS': False,
    'LDFLAGS': OPEN_CL + ' -lOpenCL ',
    'DSTAN_THREADS': True,
    'TBB_CXX_TYPE': 'gcc',
    'STAN_NO_RANGE_CHECKS': True,   # skip bounds checks for speed
}

After this I saw an improvement of about 10%–20%. That was not fast enough, since I was aiming to run the model multiple times during a 9–5 workday.

Version 2: At this point I knew I had one option: put my RTX 3070 to work. After a bit of research, I set up my .cu file and started using extern "C" to make my functions available to Stan. After some trial and error, I discovered that replacing Stan functions needed to happen in the user header, which is not compiled with NVCC. I compiled my .cu file as a shared library and was then able to call its functions from the user header.

One of my biggest problems while testing was the way NVCC optimized the code compared to the CPU build produced from stanc's output. By default, NVCC contracts multiplies and adds into fused multiply-add (fmad) instructions. After turning fmad off (--fmad=false), the numbers calculated by CUDA matched the CPU exactly.
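
For reference, here is a minimal sketch of what such a .cu file could look like. It is my reconstruction rather than the original source: the exported chek_pont signature and the cuda_func_help.dll name come from the user-header code below, the kernel body assumes Prophet's changepoint indicator logic (A[i][j] = 1 when t[i] >= t_change[j]), and the build command is just one way to produce the DLL.

// cuda_func_help.cu (sketch). Build as a shared library, e.g.:
//   nvcc -shared --fmad=false -o cuda_func_help.dll cuda_func_help.cu
#include <cuda_runtime.h>

// One thread per (i, j) entry of the T x S changepoint matrix:
// A[i][j] = 1 when t[i] >= t_change[j].
__global__ void chek_pont_kernel(const double* t, const double* t_change, double* A, int T, int S) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // time index (row)
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // changepoint index (column)
    if (i < T && j < S) {
        A[i * S + j] = (t[i] >= t_change[j]) ? 1.0 : 0.0;  // row-major output
    }
}

// extern "C" disables C++ name mangling so GetProcAddress can find "chek_pont";
// __declspec(dllexport) makes the symbol visible outside the DLL.
extern "C" __declspec(dllexport) void chek_pont(double* d_tz, double* d_t_change, double* d_A, int T, int S) {
    dim3 block(16, 16);
    dim3 grid((S + block.x - 1) / block.x, (T + block.y - 1) / block.y);
    chek_pont_kernel<<<grid, block>>>(d_tz, d_t_change, d_A, T, S);
    cudaDeviceSynchronize();  // make sure the result is ready before the host copies it back
}

Only this launcher needs to be exported; the kernel itself never has to be visible to the code stanc compiles.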

Here is an example of how to call a shared CUDA function from the user header that stanc compiles:

// Includes for DLL loading and the CUDA runtime calls used below;
// stan/math.hpp provides Eigen and stan::math::var_value.
#ifndef NOMINMAX
#define NOMINMAX  // stop windows.h from defining min/max macros that clash with Eigen
#endif
#include <windows.h>
#include <cuda_runtime.h>
#include <stan/math.hpp>

#define DLLIMPORT __declspec(dllimport)
// Declaration of the launcher exported by the CUDA shared library (cuda_func_help.dll).
DLLIMPORT void chek_pont(double* d_tz, double* d_t_change, double* d_A, int T, int S);

#ifndef STAN_CUDA_DLL
#define STAN_CUDA_DLL
// Load the DLL once; its functions are resolved with GetProcAddress below.
HMODULE cudaDll = LoadLibrary("cuda_func_help.dll");
#endif

Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic>, 0, Eigen::Stride<0, 0>>
get_changepoint_matrixx(Eigen::Matrix<stan::math::var_value<double>, Eigen::Dynamic, 1>& t,
                        Eigen::Matrix<stan::math::var_value<double>, Eigen::Dynamic, 1>& t_change,
                        const int& T, const int& S) {
    // Resolve the launcher exported by the CUDA DLL.
    void (*chek_pont)(double*, double*, double*, int, int) =
        (void (*)(double*, double*, double*, int, int))GetProcAddress(cudaDll, "chek_pont");
    const int T_C = T;
    const int S_C = S;

    // Allocate device memory for the inputs (t, t_change) and the T x S output matrix.
    double* d_tz;
    double* d_t_change;
    double* d_A;
    cudaMalloc((void**)&d_tz, T_C * sizeof(double));
    cudaMalloc((void**)&d_t_change, S_C * sizeof(double));
    cudaMalloc((void**)&d_A, T_C * S_C * sizeof(double));

    // Copy the values out of Stan's autodiff types into plain double arrays.
    double* t_arr = new double[T_C];
    double* t_c_arr = new double[S_C];
    for (int j = 0; j < T_C; j++) {
        t_arr[j] = t(j, 0).val();
    }
    for (int j = 0; j < S_C; j++) {
        t_c_arr[j] = t_change(j, 0).val();
    }

    // Copy data from host to device.
    cudaMemcpy(d_tz, t_arr, T_C * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_t_change, t_c_arr, S_C * sizeof(double), cudaMemcpyHostToDevice);

    // Call the launcher from the DLL, which runs the kernel on the GPU.
    chek_pont(d_tz, d_t_change, d_A, T_C, S_C);

    // Copy the result (row-major) from device to host.
    double* temp = new double[T_C * S_C];
    cudaMemcpy(temp, d_A, T_C * S_C * sizeof(double), cudaMemcpyDeviceToHost);

    // Switch from row-major to column-major, since Eigen is column-major by default.
    double* col_wise_data = new double[T_C * S_C];
    for (int i = 0; i < T_C; i++) {
        for (int j = 0; j < S_C; j++) {
            col_wise_data[j * T_C + i] = temp[i * S_C + j];
        }
    }

    // Wrap the column-major buffer in an Eigen::Map; col_wise_data is intentionally
    // left allocated because the returned Map points at it.
    Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic>, 0, Eigen::Stride<0, 0>> Ab(col_wise_data, T_C, S_C);

    // Free device memory and the temporary host buffers.
    cudaFree(d_tz);
    cudaFree(d_t_change);
    cudaFree(d_A);
    delete[] t_arr;
    delete[] t_c_arr;
    delete[] temp;
    return Ab;
}

// Expose the CUDA-backed helper inside the model's namespace so the function
// declared in the .stan file resolves to it. Xmaker is defined elsewhere in the
// header (not shown here), following the same pattern as get_changepoint_matrixx above.
namespace prophet_model_namespace {

    Eigen::Map<Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic>, 0, Eigen::Stride<0, 0>>
    cuStan_xmake(Eigen::Matrix<stan::math::var_value<double>, Eigen::Dynamic, Eigen::Dynamic>&& X,
                 Eigen::Matrix<stan::math::var_value<double>, Eigen::Dynamic, 1>&& s_a,
                 const int& T, std::ostream* pstream__ = nullptr) {
        return Xmaker(X, s_a, T);
    }

}

Then, all you have to do is declare cuStan_xmake (a signature with no body) in the functions block of the .stan file, compile with stanc's --allow-undefined option so the definition can come from the user header, and use it like any other function.

Link to CuProphet: https://github.com/StoffAudioLLC/CuProphet
