How Apache Spark SQL taught me advanced SQL tricks…again (GROUPING SETS, CUBE, and ROLLUP)

I’m in the midst of preparing the curriculum for the upcoming 5-day Apache Spark workshop. It is aimed at data analysts and statisticians and therefore focuses mostly on Spark SQL 2.x (with just enough Scala to get going).

While exploring the available operators, I found rollup and cube, which turned out to be more advanced variants of groupBy. It didn’t take long before I realized how much I had been missing by not even knowing that these aggregation operators exist in pure SQL and now in Spark SQL. It was the same earth-shattering moment I experienced when I found out about the window operator.
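To make the difference from a plain groupBy concrete, here is a minimal plain-Python sketch of which grouping sets ROLLUP and CUBE expand to. It is only an illustration of the semantics, not Spark's implementation or API: the sample rows, the column names (city, year, amount) and the helper functions are all made up for this example, and it assumes the data itself contains no NULLs.

```python
from itertools import combinations

# Hypothetical sample data (not from the workshop): (city, year, amount).
sales = [
    ("Warsaw", 2016, 100),
    ("Warsaw", 2017, 200),
    ("Toronto", 2017, 50),
]
COLUMNS = ("city", "year")


def rollup_sets(cols):
    """ROLLUP groups by every prefix of cols, from the full list down to ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def cube_sets(cols):
    """CUBE groups by every subset of cols, largest subsets first."""
    return [
        tuple(subset)
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


def aggregate(rows, cols, grouping_sets):
    """Sum the amount per grouping set.

    Columns left out of a grouping set show up as None, which plays
    the role of SQL NULL in the subtotal and grand-total rows.
    """
    totals = {}
    for grouping_set in grouping_sets:
        for row in rows:
            record = dict(zip(cols, row))  # the trailing amount is dropped by zip
            key = tuple(record[c] if c in grouping_set else None for c in cols)
            totals[key] = totals.get(key, 0) + row[-1]
    return totals


rollup_result = aggregate(sales, COLUMNS, rollup_sets(COLUMNS))
cube_result = aggregate(sales, COLUMNS, cube_sets(COLUMNS))

for name, result in (("rollup", rollup_result), ("cube", cube_result)):
    print(name)
    for key, total in result.items():
        print("  ", key, total)
```

With two grouping columns, rollup adds per-city subtotals and a grand total on top of the plain groupBy rows, while cube additionally adds per-year subtotals.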

All in all, just exploring the features of Spark SQL in Apache Spark made me better at SQL (in databases like Hive, PostgreSQL, and Microsoft SQL Server) and gave me some foundation for how data analysts and statisticians could use SQL and Spark SQL in their jobs.

You can read about rollup and cube in my Mastering Apache Spark 2 notes, in Advanced Aggregation Using ROLLUP and CUBE. Note that I have yet to describe the performance differences between these groupBy variants and when one could give you better performance than the others. Stay tuned!

Let me know what you think and how you’ve been using rollup, cube and possibly window operators in your SQL.


P.S. I strongly recommend reading up on the advanced aggregation operators in PostgreSQL’s official documentation, in 7.2.4. GROUPING SETS, CUBE, and ROLLUP, as well as in Deeper into Postgres 9.5 — New Group By Options for Aggregation. I used both among many learning sources and found them the most comprehensive.


Contact me at jacek@japila.pl if you need support with Apache Spark or just follow @jaceklaskowski on Twitter to learn more in 140-char-long chunks. #SparkLikePro