BlazingSQL Part 1: The GPU DataFrame (GDF) and cuDF in RAPIDS AI

Published in RAPIDS AI · Feb 25, 2019

By: Rodrigo Aramburu

In November 2018, we released BlazingSQL, a GPU-accelerated SQL engine built entirely on top of RAPIDS AI. RAPIDS AI is a set of open source libraries for GPU-accelerated data preparation and machine learning, built by multiple contributors with heavy contributions from teams at BlazingDB, NVIDIA, and Anaconda.

This is the first of three (maybe more, we don’t know) posts where we talk about BlazingSQL’s underlying architecture and how multiple players came together to build an end-to-end analytics solution on GPUs.

This post focuses, above all else, on our GPU in-memory layer, commonly referred to as the GPU DataFrame (GDF), and on cuDF, the DataFrame manipulation library in RAPIDS AI that BlazingSQL uses.

The GDF was initially a project under GOAI, the GPU Open Analytics Initiative. In 2017, BlazingDB and other GPU application developers (Anaconda, Graphistry, Gunrock, OmniSci, H2O.ai) decided we needed a common data layer in GPU memory to extract the value of columnar GPU in-memory data.

Much of the vision behind the GDF comes from the Apache Arrow blueprint. Arrow is a widely adopted technology for supporting big data analytics pipelines: it dramatically increases performance and simplifies the integration of multiple technologies into a single stack.

At the core of Apache Arrow is a columnar memory representation for data.

(Image source: Apache Arrow)

Columnar data formats are well known for delivering performance improvements in data analytics relative to row-wise representations, as demonstrated in academic papers such as “C-Store: A Column-oriented DBMS” and “MonetDB/X100: Hyper-Pipelining Query Execution,” both from 2005, and Google’s 2010 paper “Dremel: Interactive Analysis of Web-Scale Datasets.” All of these papers touch on how column-based analytics drives performance by leveraging SIMD CPU instructions and making better use of the CPU cache. All told, Apache Arrow’s key triumph is establishing itself as the standard in-memory data representation.

In the world of large-scale data analytics, pipelines, and integrations, having a standard across various complementary, or even competing, technologies means end users can combine them and make them interoperate as needed. With Arrow as a standard, data types are shared, so you don’t have to convert between different data representations. Additionally, Apache Arrow can share data without serializing or even copying it, delivering improved performance and efficiency to all of these technologies. The Apache Arrow project used this common data layer to create a collaborative and widely adopted ecosystem.

With the increasing maturity of GPGPU processing, the GPU’s memory bandwidth (which far exceeds that of any system memory), and the GPU’s enormous parallel processing capacity, the GPU analytics community wanted to join the Apache Arrow ecosystem. These trends led to the creation of the GDF, the GPU DataFrame: a columnar memory representation for data in an ecosystem of GPU-based data analytics.

The GDF is, in essence, a collection of columns (gdf_columns). Each gdf_column is a simple structure that contains a data buffer, an optional null bitmask, and metadata.

The actual data lives in the data buffer, which holds it in one contiguous block of memory so it can be operated on efficiently. If any of the elements in the data buffer can be null, the null bitmask is used. The null bitmask is a contiguous buffer at least long enough to hold one bit for every element in the data buffer; a bit is set to 1 if the corresponding data element is valid (not null) and 0 if the element is null. The metadata records the size of the column, the number of nulls it contains, and its data type. With data in GDFs, we are now able to leverage cuDF, the DataFrame manipulation library.
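To make that layout concrete, here is a minimal sketch of the fields a gdf_column carries, written as a Python dataclass purely for illustration (the real gdf_column is a C struct defined in cuDF, and the field names here are approximate):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class GdfColumn:
    """Illustrative sketch only; the real gdf_column is a C struct in cuDF."""
    data: Any             # contiguous GPU buffer holding the column's values
    valid: Optional[Any]  # null bitmask: one bit per element; 1 = valid, 0 = null
    size: int             # number of elements in the column
    dtype: str            # element type, e.g. 'int32' or 'float64'
    null_count: int       # how many elements are null

# For an int32 column holding [7, None, 9]:
#   size = 3, dtype = 'int32', null_count = 1
#   valid holds the bits 101 (elements 0 and 2 valid), padded to a full byte.
```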

cuDF is the RAPIDS AI library focused on data preparation. cuDF holds the definition of the GDF memory model, specifically the gdf_column, along with data processing functions that operate on those columns. These functions cover a range, from simple data transformation primitives such as concatenation, arithmetic, and sorting, to higher-level data analysis functions such as joining, grouping, and aggregating. Additionally, cuDF contains many of the operations users familiar with Pandas, NumPy, or SQL might require. Finally, the cuDF library contains data ingest functions that let users read CSV or Apache Parquet files (Apache ORC coming soon) directly into a GDF.
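As a quick taste of that range, here is a small sketch using the cuDF Python API (the file names and columns are hypothetical):

```python
import cudf

# Data ingest: read CSV files directly into GDFs in GPU memory.
# 'sales.csv' and 'regions.csv' are made-up example files.
sales = cudf.read_csv('sales.csv')      # columns: region, price, qty
regions = cudf.read_csv('regions.csv')  # columns: region, manager

# Simple transformation primitives: arithmetic and sorting.
sales['total'] = sales['price'] * sales['qty']
sales = sales.sort_values('total')

# Higher-level analysis: join, group, and aggregate.
joined = sales.merge(regions, on='region')
summary = joined.groupby('manager').agg({'total': 'sum'})
print(summary)
```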

cuDF currently has two APIs: a C++ API that gives access to all the high-level functions, and a Python API that closely mimics Pandas. The Python API is the more user-friendly of the two, since data scientists can interact with it almost exactly as they would with Pandas for data transformation, cleaning, and analysis.
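For example, moving between Pandas and cuDF is nearly transparent. A minimal sketch (the DataFrame contents are made up):

```python
import pandas as pd
import cudf

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

# Copy the Pandas DataFrame into GPU memory as a GDF...
gdf = cudf.DataFrame.from_pandas(pdf)

# ...manipulate it with the same idioms you would use in Pandas...
gdf['c'] = gdf['a'] + gdf['b']

# ...and bring the result back to host memory when needed.
result = gdf.to_pandas()
print(result)
```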

(Image source: RAPIDS AI)

The C++ API allows software developers to use the tools and libraries exposed by cuDF to build powerful new software on top of the same data compute primitives.

At BlazingSQL’s core are the GDF memory model and the data processing functions exposed by cuDF’s C++ API.

This is one of the main reasons we are contributing so heavily to cuDF. We believe in the strength of an ecosystem where multiple groups of developers all contribute to common primitives that let their technologies intercommunicate through a standard data memory model. This ecosystem allows data scientists and engineers to build scalable, complex data pipelines that leverage the full computational capacity of the GPU.
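As a concrete sketch of that intercommunication, here is roughly what running SQL over a cuDF DataFrame looks like from BlazingSQL’s Python API (the BlazingContext naming follows later BlazingSQL releases; the table and data are hypothetical):

```python
import cudf
from blazingsql import BlazingContext

bc = BlazingContext()

# Any cuDF DataFrame (a GDF in GPU memory) can be registered as a SQL table.
gdf = cudf.DataFrame({'region': ['east', 'west', 'east'],
                      'total': [10.0, 20.0, 30.0]})
bc.create_table('sales', gdf)

# The query executes on the GPU and returns another cuDF DataFrame,
# ready to hand off to the rest of the RAPIDS ecosystem.
result = bc.sql('SELECT region, SUM(total) AS total FROM sales GROUP BY region')
print(result)
```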

In the graphic above, we show two other libraries beyond cuDF in the RAPIDS AI ecosystem: cuML and cuGraph. These libraries work on the same GDF memory model, which means we can hand result sets off from data preparation to subsequent GPU processes; this is a principal benefit of the ecosystem.

cuML is a collection of GPU-accelerated machine learning libraries that provides GPU versions of the algorithms available in Scikit-learn; the project’s vision is to eventually cover every algorithm Scikit-learn offers. cuGraph is a framework and collection of graph analytics libraries.
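Here is a minimal sketch of that handoff, feeding the output of a cuDF step straight into cuML’s KMeans (the data is made up; cuML’s estimator API mirrors Scikit-learn’s):

```python
import cudf
from cuml.cluster import KMeans

# Data preparation in cuDF produces a GDF in GPU memory...
gdf = cudf.DataFrame({'x': [1.0, 1.1, 8.0, 8.2],
                      'y': [0.9, 1.0, 7.9, 8.1]})

# ...which cuML can consume directly, with no copy back to host memory.
kmeans = KMeans(n_clusters=2)
kmeans.fit(gdf)
print(kmeans.labels_)
```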

Over the next few weeks and months, we’ll be releasing more posts describing how BlazingSQL works on top of the RAPIDS AI ecosystem. You can expect:

  • BlazingSQL Architecture — We will explain the unique components of BlazingSQL, and how we intertwine multiple open source and internal projects into a GPU query execution engine.
  • BlazingSQL Distribution — We’ll discuss how BlazingSQL distributes queries across multiple GPUs and multiple nodes.

As always, feel free to comment below with any questions, or learn more at our website. For daily updates and info, follow us on Twitter.

Originally published at blog.blazingdb.com on February 25, 2019.
