What is BlazingSQL?
Nearing 100x faster than Apache Spark, BlazingSQL is backed by Samsung and NVIDIA — but what is it? (+ the GPU DataFrame)
- Can’t find it? Build it. — What is BlazingSQL?
- Seconds vs Hours — Why use BlazingSQL?
- Community & Demos — Try BlazingSQL for free
Can’t find it? Build it. — What is BlazingSQL?
BlazingSQL is a scalable and intuitive SQL interface for loading massive data sets from persistent storage solutions to GPU memory.
Developed in collaboration with the GPU accelerated data science suite known as RAPIDS AI, Blazing’s 5G-esque technology is open source and free to use.
The idea was born of necessity when the consultancy run by founders Felipe and Rodrigo Aramburu was hired to build fraud detection software for Peru’s Ministry of Finance.
Multi-terabyte SQL joins were commonplace while building this software, and some took 35 hours every time they ran. Waiting 35 hours for each reload was untenable. Constrained by budget, the team reexamined their inventory for solutions and found underutilized GPUs.
Though not yet “the big sexy,” the seeds of high-performance GPU computing had long been sown. Felipe had some familiarity, took a shot, and engineered a simple SQL table joiner for the GPU in hopes of reducing ETL time.
It worked. Join execution on GPU sliced the 35-hour query to a mere 30-second operation. That’s 1/4,200 the time to execute. BlazingSQL (originally BlazingDB) was born.
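As a quick check of that ratio, the arithmetic can be worked through in a few lines of Python. The variable names are only illustrative; the 35-hour and 30-second figures are the ones quoted in the story.

```python
# Figures quoted in the text: a 35-hour join reduced to 30 seconds.
cpu_seconds = 35 * 60 * 60   # 35 hours expressed in seconds
gpu_seconds = 30

speedup = cpu_seconds / gpu_seconds
print(f"{cpu_seconds:,} s / {gpu_seconds} s = {speedup:,.0f}x")
# prints: 126,000 s / 30 s = 4,200x
```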
The GPU DataFrame
Developed to resemble Apache Arrow, the GPU DataFrame took shape in 2017, when a number of GPU developers set out to hash out a common means of storing and handing off data on GPUs.
Prior to this collaborative effort, many GPU computing developments were thwarted by compatibility issues and the inability of applications to communicate or engage with one another.
The GPU DataFrame (GDF) is a project with the goal to support interoperability between GPU applications and define a common GPU in-memory data layer. When we understood NVIDIA and Anaconda were both looking for ways to expand the compute capability of the GDF, we wanted in.
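To make the idea of a shared in-memory data layer a little more concrete, here is a minimal CPU-only analogy in plain Python. It is not the GDF API: the column names, `producer`, and `consumer` are hypothetical, and the buffers live in host memory rather than on a GPU. It only illustrates the pattern of two pieces of code agreeing on a columnar layout and handing off views instead of copies.

```python
from array import array

# A tiny columnar "frame": one contiguous typed buffer per column.
# This is a CPU stand-in; a real GDF keeps its columns in GPU memory.
frame = {
    "loan_id": array("q", [101, 102, 103]),        # 64-bit integers
    "balance": array("d", [250.0, 90.5, 410.25]),  # 64-bit floats
}

def producer(frame):
    """Hand off column views without copying the underlying buffers."""
    return {name: memoryview(col) for name, col in frame.items()}

def consumer(views):
    """A second 'application' that works directly on the shared buffers."""
    balances = views["balance"]
    for i in range(len(balances)):
        balances[i] = balances[i] * 2  # in-place edit, visible to the producer

views = producer(frame)
consumer(views)
print(frame["balance"].tolist())  # the producer's data reflects the change
```

Because both sides agree on the layout, nothing is serialized or converted during the handoff, which is the kind of interoperability the GDF project is described as aiming for.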
Seconds vs Hours — Why use BlazingSQL?
Because the GDF’s design mirrors Apache Arrow, BlazingSQL benchmarks are set against a comparable engine from the Apache ecosystem, Apache Spark.
We aren’t interested in optimizing our engine around industry benchmarks; we want to demonstrate our value with real-world workloads.
Producing a home loan risk assessment model from the 400GB Fannie Mae loan performance dataset for the years 2000 to 2016 is the staple benchmark of the RAPIDS community, and BlazingSQL uses it as well.
We are running an end-to-end workload that loads the CSV files from an HDFS cluster, performs a series of functions for ETL and feature engineering, and then submits the result to XGBoost to produce a score between 0 and 1 indicating the risk of delinquency and the risk of prepayment.
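The shape of that workload (load, ETL and feature engineering, then scoring) can be sketched with the Python standard library alone. Everything specific here is an assumption for illustration: the sample rows and column names are made up, and a hand-picked logistic function stands in for XGBoost so the sketch stays self-contained. It is not the benchmark code.

```python
import csv
import io
import math

# Made-up sample rows standing in for the loan performance CSV files.
RAW_CSV = """loan_id,balance,original_balance,months_delinquent
1,90000,100000,0
2,99000,100000,4
3,20000,100000,1
"""

def load(text):
    """Load step: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def engineer_features(rows):
    """ETL and feature engineering: cast types and derive a ratio feature."""
    features = []
    for row in rows:
        remaining = float(row["balance"]) / float(row["original_balance"])
        features.append({
            "loan_id": int(row["loan_id"]),
            "remaining_ratio": remaining,
            "months_delinquent": int(row["months_delinquent"]),
        })
    return features

def score(feature):
    """Scoring stand-in: a hand-picked logistic function, not XGBoost."""
    z = (-3.0
         + 2.0 * feature["remaining_ratio"]
         + 0.8 * feature["months_delinquent"])
    return 1.0 / (1.0 + math.exp(-z))

scores = {f["loan_id"]: score(f) for f in engineer_features(load(RAW_CSV))}
for loan_id, risk in scores.items():
    print(loan_id, round(risk, 3))  # each risk falls strictly between 0 and 1
```

In the benchmark described here, it is the ETL and feature engineering stage that BlazingSQL runs on the GPU and that the timings measure.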
BlazingSQL only became distributed last week with v0.4, so the benchmarks released so far are single-node and test on 2 rather than 16 years of data.
As of February 2019, BlazingSQL executes the ETL phase of this workload 20x faster than Apache Spark.
This 20x performance boost is four times the 5x improvement Blazing offered just a few weeks prior (January 2019).
More recent displays of Blazing’s speed include:
- 100x faster than Spark for log analysis in Graphistry, since April
- 100x faster than Spark in Google Colab, with the help of free T4 instances, since May
Community & Demos — Try BlazingSQL for free
Numbers are great, results are better; the team at BlazingSQL has built a set of launch-ready Google Colab demos anyone can try for free.
Included in the collection of notebooks:
- Federated Query — In a single query, join an Apache Parquet file, a CSV, and a GPU DataFrame (GDF) in GPU memory
- Netflow — Query 65M rows of network security data (netflow) with BlazingSQL and then pass to Graphistry to visualize and interact with the data
- Getting Started — Walk through the process for getting BlazingSQL and cuDF running; then go through a basic ETL process and query
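For readers who want a feel for the Federated Query idea before opening a notebook, the sketch below joins data from two different origins in one SQL statement. It uses Python's built-in `sqlite3` as a stand-in engine, not BlazingSQL, and the table names, columns, and values are all made up. Unlike the demo as described, it first copies both sources into SQLite tables, and it omits the Parquet source because the standard library cannot read Parquet.

```python
import csv
import io
import sqlite3

# Source 1: CSV text (stands in for a CSV file on disk).
CSV_TEXT = "user_id,country\n1,PE\n2,US\n"
# Source 2: rows already in memory (stands in for a GPU DataFrame).
IN_MEMORY_ROWS = [(1, 120.0), (2, 75.5), (1, 30.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
conn.execute("CREATE TABLE payments (user_id INTEGER, amount REAL)")

reader = csv.DictReader(io.StringIO(CSV_TEXT))
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(int(r["user_id"]), r["country"]) for r in reader],
)
conn.executemany("INSERT INTO payments VALUES (?, ?)", IN_MEMORY_ROWS)

# One SQL statement joins data that originated in different places.
query = """
    SELECT u.country, SUM(p.amount) AS total
    FROM users AS u
    JOIN payments AS p ON p.user_id = u.user_id
    GROUP BY u.country
    ORDER BY u.country
"""
result = conn.execute(query).fetchall()
print(result)  # [('PE', 150.0), ('US', 75.5)]
```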
Community & Contributing
As a core contributor to the RAPIDS community, the BlazingSQL team, along with info and troubleshooting help, is most easily found in the growing RAPIDS-GoAi Slack workspace (#blazingsql).
Developers interested in contributing to BlazingSQL can explore:
- The flagship, newly open-sourced pyBlazing repo on GitHub
- BlazingSQL Blog for the latest on the ins and outs of the continually evolving software
- Notebooks Contrib — a RAPIDS community repo of Jupyter Notebooks covering various implementations of RAPIDS
Aramburu, Rodrigo. “BlazingSQL Part 1: The GPU DataFrame (GDF) and CuDF in RAPIDS AI.” Medium, BlazingSQL, 25 Feb. 2019, blog.blazingdb.com/blazingsql-part-1-the-gpu-dataframe-gdf-and-cudf-in-rapids-ai-96ec15102240
Aramburu, Rodrigo. “Introducing BlazingSQL V0.3 — CuStrings and a New Python API.” Medium, BlazingSQL, 24 Apr. 2019, blog.blazingdb.com/introducing-blazingsql-v0–3-custrings-and-a-new-python-api-4352fc4e3375
Weiss, Kyle. “Data Lake to AI — BlazingSQL + RAPIDS Initial Benchmark.” Medium, BlazingSQL, 22 Jan. 2019, blog.blazingdb.com/data-lake-to-ai-blazingsql-rapids-initial-benchmark-aa753031ac8b