Apache Spark + RAPIDS: The Future of Enterprise Data Science with Native GPU Acceleration

By: Clement Farabet and Matei Zaharia

Data scientists spend a considerable amount of time exploring data and iterating on machine learning (ML) experiments. Every hour of compute required to sort through datasets, extract features, and fit ML algorithms hinders their ability to drive towards results.

Apache Spark™ is the most popular data processing engine in data centers for data science. It is used for interactive data science, from data preparation to running ML experiments, all the way to deploying ML applications. Apache Spark™ has a vibrant community, with thousands of contributors worldwide. A few months ago, a new Apache Spark™ effort named Project Hydrogen was announced. Project Hydrogen enables Apache Spark™ to schedule and run jobs with multiple distributed ML frameworks, and to run these jobs on GPUs.

RAPIDS is NVIDIA’s open-source accelerated platform for data science built on CUDA, launched today and available at www.rapids.ai. We believe that data science workflows can benefit tremendously from acceleration, which lets data scientists explore more, and larger, datasets and reach their business goals faster and more reliably.

Databricks, founded by the original creators of Apache Spark™, continues to contribute to the project, which is the basis for the Databricks Unified Analytics Platform for data and AI. Matei Zaharia, Chief Technologist at Databricks, commented on the RAPIDS platform: “Databricks is excited about RAPIDS’ potential to accelerate Apache Spark workloads. Databricks has multiple ongoing projects to integrate Spark better with native accelerators, including Apache Arrow support and GPU scheduling with Project Hydrogen, and we believe that RAPIDS is an exciting new opportunity to scale our customers’ data science and AI workloads.”

NVIDIA is integrating RAPIDS into Apache Spark™ in multiple stages. Up to today’s launch, we’ve focused primarily on Python integration. We’re excited to begin collaborating with the Spark community immediately on the following new integrations:

  • Spark Streaming to single-GPU cuDF
  • cuML and cuGraph integration
  • Multi-GPU cuDF UDF
  • Longer-term, native integration