Announcing the General Availability of Dask-SQL on GPUs

Published in RAPIDS AI, Oct 6, 2022

We’re excited to announce General Availability of Dask-SQL on GPUs! This matters to you because it means more interoperability between data analytics and data science teams, no need for a ‘special’ language for queries on very large data sets, less code rewriting, and the ability to use one of data’s true time-tested languages.

Dask-SQL provides a familiar SQL syntax on top of DataFrames without the need to move or convert data, and now has full GPU support. With it, you can seamlessly interoperate with SQL, Pandas, cuDF, and Dask DataFrames.
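As a rough illustration of that interoperability, here is a minimal sketch that registers a Pandas DataFrame as a SQL table and queries it. The table name, column names, and the `run_query` helper are made up for this example; `Context`, `create_table`, and `sql` are the dask-sql entry points, and the `gpu=True` path assumes a working RAPIDS install.

```python
# Hypothetical table and column names, chosen only for illustration.
QUERY = """
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
ORDER BY total DESC
"""


def run_query(use_gpu: bool = False):
    """Register a Pandas DataFrame as a SQL table and query it."""
    # Imports live inside the function so this module loads even where
    # dask-sql (or RAPIDS) is not installed.
    import pandas as pd
    from dask_sql import Context

    orders = pd.DataFrame(
        {
            "customer": ["ann", "bob", "ann", "cy"],
            "amount": [10.0, 5.0, 7.5, 1.0],
        }
    )

    c = Context()
    # gpu=True asks dask-sql to back the table with cuDF; this needs a
    # RAPIDS install and a supported GPU. The SQL text stays the same.
    c.create_table("orders", orders, gpu=use_gpu)

    # c.sql returns a lazy Dask DataFrame; compute() materializes it.
    return c.sql(QUERY).compute()
```

Calling `run_query()` should run on CPU, and `run_query(use_gpu=True)` is meant to run the same query on a GPU without touching the SQL.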

Core Features of Note:

Dask-SQL can run on both CPUs and GPUs without modification just like Dask DataFrame workflows, making it easy to develop locally and then deploy to GPU clusters for a performance boost.
It includes SQL language extensions for training and scoring with ML models such as XGBoost, LogisticRegression, and others.
It supports reading from a variety of data sources and formats: local or cloud storage; Parquet, ORC, CSV, or Pandas, cuDF, and Dask DataFrames; and Apache Hive.
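To give a feel for the ML extensions listed above, the sketch below trains and scores a model entirely from SQL. It assumes a table named `points` with numeric columns `x` and `y` is already registered on the `Context`; the table, model, and helper names are hypothetical, and the `WITH` options follow the dask-sql documentation as we understand it, so check them against your installed version.

```python
# Train a classifier from SQL. The subquery supplies the features (x, y)
# and a derived boolean target column.
TRAIN_SQL = """
CREATE MODEL quadrant_model WITH (
    model_class = 'sklearn.linear_model.LogisticRegression',
    wrap_predict = True,
    target_column = 'target'
) AS (
    SELECT x, y, x * y > 0 AS target
    FROM points
)
"""

# Score rows with the trained model; PREDICT wraps an ordinary SELECT.
PREDICT_SQL = """
SELECT * FROM PREDICT(
    MODEL quadrant_model,
    SELECT x, y FROM points
)
"""


def train_and_score(c):
    """Run both statements on a dask_sql.Context that already has `points`."""
    c.sql(TRAIN_SQL)  # fits the model and stores it under quadrant_model
    return c.sql(PREDICT_SQL).compute()  # features plus the model's prediction
```

Swapping `model_class` for another supported estimator, such as an XGBoost classifier, should be the only change needed to try a different model.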

Since we first demoed Dask-SQL at GTC last year, we’ve added support for more ML models, improved UDF support (now including strings), and expanded SQL grammar coverage to broaden the types of queries it can handle. This is also our first release using Apache Arrow DataFusion for query parsing and planning, which has allowed us to reduce both our package sizes and per-worker compute overhead. This is just the start of the work we plan to do with Dask-SQL, so expect more SQL semantics as well as additional query planning and optimization features.
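For readers who have not used UDFs in Dask-SQL, here is a minimal sketch of registering a Python function and calling it from a query. The function, table, and column names are invented for the example, and it uses a numeric UDF on CPU rather than the string UDFs mentioned above, since those depend on your RAPIDS version.

```python
def double_it(values):
    """Works on a scalar or a whole column, since it only multiplies."""
    return values * 2


def register_and_query():
    """Expose double_it to SQL and apply it to a small example table."""
    # Imports are local so the module loads without dask-sql installed.
    import numpy as np
    import pandas as pd
    from dask_sql import Context

    c = Context()
    c.create_table("orders", pd.DataFrame({"amount": [1.5, 2.0, 4.0]}))

    # Declare the SQL-visible name, the parameter names and types, and
    # the return type of the function.
    c.register_function(
        double_it, "double_it", [("amount", np.float64)], np.float64
    )

    return c.sql("SELECT double_it(amount) AS doubled FROM orders").compute()
```

Once registered, `double_it` can be used in queries like any built-in SQL function.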

To get started, check out the Dask-SQL docs, as well as our overview notebook, which demonstrates core features and CPU-GPU transparent execution. You can even try running Dask-SQL queries yourself on a free GPU instance on SageMaker Studio Lab! If you’re really curious, grab a data set from the folks at Google Research, try including SQL in your data science work, and let us know how it goes on Twitter @rapidsai. We might feature your project!

As with all RAPIDS projects, we love contributions, issues, and feature requests on GitHub! Come and let us know how you’re using it!


Written by Randy Gelhausen