Experience BlazingSQL Running 100X Faster than Apache Spark on Google Colab
By: Rodrigo Aramburu
Unlock the potential of Google Colab with RAPIDS AI + BlazingSQL
If you’ve been following our blog posts, you’ll know that last week we launched a version of the BlazingSQL + RAPIDS AI ecosystem with a free NVIDIA T4 GPU on Google Colab. Read our blog on how you can launch BlazingSQL on NVIDIA T4 GPUs in less than two minutes, for free.
As we’ve demonstrated in the past, the NVIDIA T4 is an amazingly powerful GPU, and now, in Google Colab, users can build end-to-end analytical workloads on millions of rows of data in Google Drive for free. That’s a price point we can all get behind!
Google Colab is pretty impressive, and also supports Apache Spark (which we think is awesome). The only issue is, while you can mock up Apache Spark workloads inside Google Colab, you can’t do real analytics since the runtime instance is not very powerful. The T4 GPU attached to said runtime instance, however, is very fast.
In this example, I run the exact same workload on more than 20M rows of NetFlow data twice. First, I run it on the BlazingSQL + RAPIDS AI stack, and then I run it again using PySpark (Apache Spark Version 2.4.1).
The difference is staggering. When you include the time it takes to load the CSV from Google Drive into each engine's DataFrame, BlazingSQL was 71X faster than Apache Spark.
If we look at just the ETL times, the BlazingSQL + RAPIDS AI stack was 100X faster than Apache Spark!
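To make the distinction between "load + ETL" time and "ETL only" time concrete, here is a minimal sketch of how the two phases could be timed separately and turned into a speedup factor. It is not the notebook code behind the numbers above: `time_phases`, `speedup`, `fake_load`, and `fake_etl` are hypothetical names, and the workload is a stand-in so the snippet runs anywhere. When timing a real engine, keep in mind that PySpark transformations are lazy, so the ETL callable should end in an action (for example `count()`) or the measured time will be misleadingly small.

```python
import time
from typing import Any, Callable, Dict


def time_phases(load: Callable[[], Any], etl: Callable[[Any], Any]) -> Dict[str, float]:
    """Time the load and ETL phases of a workload separately (seconds)."""
    t0 = time.perf_counter()
    data = load()            # e.g. read the CSV into a DataFrame
    t1 = time.perf_counter()
    etl(data)                # e.g. run the query and force its result
    t2 = time.perf_counter()
    return {"load": t1 - t0, "etl": t2 - t1, "total": t2 - t0}


def speedup(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """How many times faster the candidate is than the baseline, per phase."""
    return {phase: baseline[phase] / candidate[phase] for phase in ("etl", "total")}


# Stand-in workload (hypothetical): replace these with the real load and
# query steps for each engine you want to compare.
def fake_load():
    return list(range(100_000))


def fake_etl(rows):
    return sum(r for r in rows if r % 2 == 0)


timings = time_phases(fake_load, fake_etl)
print(timings)
```

Running `time_phases` once per engine and passing both results to `speedup` gives the two ratios discussed above: `total` includes the CSV load, while `etl` isolates the query work.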
This is just the beginning, but we want to make sure everyone understands that Google Colab is a very big opportunity for data analytics.
With only two minutes of setup, a few lines of code, and very little expertise, even novice Python users can start querying millions of rows of data and building end-to-end graph, machine learning, or AI workloads.
For anyone who has an issue with us comparing a GPU accelerated Colab vs. a non-GPU accelerated Colab (which is totally fair), you are welcome to look at our price parity comparisons below.
Here is what we are working on next:
- Distributed Execution: Testing is going well; we just need to get a lot of glue code built out!
- New demo with BlazingSQL feeding cuML
- Data Skipping
Originally published at https://blog.blazingdb.com on May 15, 2019.