Winston Robson
Aug 26 · 5 min read

Overview

  • Can’t find it? Build it. — What is BlazingSQL?
  • Seconds vs Hours — Why use BlazingSQL?
  • Community & Demos — Try BlazingSQL for free

Can’t find it? Build it. — What is BlazingSQL?

BlazingSQL is a scalable and intuitive SQL interface for loading massive data sets from persistent storage solutions to GPU memory.

Developed in collaboration with RAPIDS AI, the GPU-accelerated data science suite, Blazing’s 5G-esque technology is open source and free to use.

Origins

The idea was born out of need when the consultancy run by founders Felipe and Rodrigo Aramburu was hired to build fraud detection software for Peru’s Ministry of Finance.

Multi-terabyte SQL joins were commonplace during the construction of this software, and some took 35 hours each time they were run. Waiting that long for every reload of the data was untenable. Constrained by budget, the team reexamined their inventory for solutions and found underutilized GPUs.

Though not yet “the big sexy,” the seeds of high-performance GPU computing had long been sown. Felipe had some familiarity, took a shot, and engineered a simple SQL table joiner for GPU in hopes of reducing ETL time.

It worked. Join execution on GPU cut the 35-hour query to a mere 30-second operation, 1/4,200 of the original execution time. BlazingSQL (originally BlazingDB) was born.
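As a quick sanity check on that ratio, here is a minimal sketch of the arithmetic using only the two durations quoted above (the variable names are illustrative):

```python
# Durations taken from the paragraph above.
cpu_seconds = 35 * 60 * 60  # the 35-hour join, expressed in seconds
gpu_seconds = 30            # the same join after moving it to the GPU

# Speedup factor: how many times shorter the GPU run is.
speedup = cpu_seconds / gpu_seconds
print(f"{cpu_seconds} s / {gpu_seconds} s = {speedup:,.0f}x faster")
```

Run as written, this prints `126000 s / 30 s = 4,200x faster`, which matches the 1/4,200 figure.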

The GPU DataFrame

Modeled on Apache Arrow, the GPU DataFrame was started in 2017 by a number of GPU developers hoping to hash out a common means of storing and handing off data on GPUs.

Prior to this collaborative effort, many GPU computing developments were thwarted by compatibility issues and the inability of applications to communicate with one another.

The GPU DataFrame (GDF) is a project whose goal is to support interoperability between GPU applications and to define a common GPU in-memory data layer. When we learned that NVIDIA and Anaconda were both looking for ways to expand the compute capability of the GDF, we wanted in.

Seconds vs Hours — Why use BlazingSQL?

With the GDF’s Arrow-like infrastructure, BlazingSQL is benchmarked against its Apache counterpart, Apache Spark.

We aren’t interested in optimizing our engine around industry benchmarks; we want to demonstrate our value with real-world workloads.

Benchmark Workload

Producing a home loan risk assessment model from the 400 GB Fannie Mae loan performance dataset, which covers the years 2000 to 2016, is the staple benchmark of the RAPIDS community, and BlazingSQL uses it too.

We run an end-to-end workload that loads the CSV files from an HDFS cluster, performs a series of ETL and feature engineering functions, and then submits the result to XGBoost to produce a score between 0 and 1 indicating the risk of delinquency and the risk of prepayment.
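To make the shape of that workload concrete, here is a rough CPU-only sketch of the load, SQL, and data conversion steps. It is not BlazingSQL or cuDF code: it swaps in Python’s standard `csv` and `sqlite3` modules as stand-ins. The inline CSV, the table name, and every column name are invented for illustration, and the XGBoost scoring step is left out entirely.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the CSV files read from HDFS; all columns are invented.
RAW_CSV = """loan_id,month,current_balance,days_delinquent
1,2000-01,100000,0
1,2000-02,99500,30
2,2000-01,250000,0
2,2000-02,249000,0
"""


def load(conn, text):
    """Load step: parse CSV rows into a SQL table."""
    conn.execute(
        "CREATE TABLE performance ("
        "loan_id INTEGER, month TEXT, current_balance REAL, days_delinquent INTEGER)"
    )
    rows = [
        (int(r["loan_id"]), r["month"], float(r["current_balance"]), int(r["days_delinquent"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO performance VALUES (?, ?, ?, ?)", rows)


def engineer_features(conn):
    """SQL step: aggregate one feature row per loan with a GROUP BY."""
    query = """
        SELECT loan_id,
               MAX(days_delinquent) AS worst_delinquency,
               MIN(current_balance) AS lowest_balance
        FROM performance
        GROUP BY loan_id
        ORDER BY loan_id
    """
    return conn.execute(query).fetchall()


def to_matrix(feature_rows):
    """Conversion step: separate loan ids from numeric feature vectors."""
    ids = [row[0] for row in feature_rows]
    matrix = [[float(value) for value in row[1:]] for row in feature_rows]
    return ids, matrix


conn = sqlite3.connect(":memory:")
load(conn, RAW_CSV)
ids, matrix = to_matrix(engineer_features(conn))
print(ids, matrix)
```

In the real benchmark the resulting feature matrix would be handed to XGBoost for scoring. Here it is only printed, so the three ETL stages are easy to see on their own.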

geographical visualization of loan risk analysis example (NVIDIA)

Having only become distributed last week with v0.4, BlazingSQL has so far released single-node benchmarks only, tested on 2 rather than 16 years of data.

Benchmark Outcome

As of February 2019, BlazingSQL executes the ETL phase of this workload 20x faster than Apache Spark.

ETL Phase (Load + SQL + Data Conversion)

This 20x performance boost is four times the 5x improvement Blazing offered just a few weeks prior (January 2019).
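For readers comparing the two figures, a small sketch of the arithmetic follows. The helper name `relative_gain` is made up for this example. It shows that going from 5x to 20x means 400% of the earlier speedup, which is a 300% increase.

```python
def relative_gain(old_speedup, new_speedup):
    """Return (ratio, percent_increase) between two speedup factors."""
    ratio = new_speedup / old_speedup
    percent_increase = (ratio - 1) * 100
    return ratio, percent_increase


# January 2019 figure (5x) versus February 2019 figure (20x), both from the text.
ratio, pct = relative_gain(5, 20)
print(f"{ratio:.0f}x the earlier speedup, a {pct:.0f}% increase")
```

This prints `4x the earlier speedup, a 300% increase`.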

More recent displays of Blazing’s speed include:

  • 100x faster log analysis than Spark in Graphistry, since April
  • 100x faster than Spark in Google Colab, with the help of free T4 instances, since May

Community & Demos — Try BlazingSQL for free

Numbers are great, but results are better. The team at BlazingSQL has built a group of launch-ready Google Colab demos that anyone can try for free.

Included in the collection of notebooks:

  • Federated Query — In a single query, join an Apache Parquet file, a CSV, and a GPU DataFrame (GDF) in GPU memory
  • Netflow — Query 65M rows of network security data (netflow) with BlazingSQL and then pass to Graphistry to visualize and interact with the data
  • Getting Started — Walk through the process for getting BlazingSQL and cuDF running; then go through a basic ETL process and query

For those interested in running BlazingSQL locally, Docker details are available here, source here.

Community & Contributing

As a core contributor to the RAPIDS community, BlazingSQL (team, info, troubleshooting) is most easily found in the growing RAPIDS-GoAi Slack workspace (#blazingsql).

Developers interested in contributing to BlazingSQL can explore:

  • The flagship, newly open-sourced pyBlazing repo on GitHub
  • BlazingSQL Blog for the latest on the ins and outs of the continually evolving software
  • Notebooks Contrib — a RAPIDS community repo of Jupyter Notebooks covering various implementations of RAPIDS

Future Vision

A publication centered around high quality storytelling
