Get Grouped and Groovy: How Spark Grouping Sets Can Turn Your Data Analysis into a Party (Advanced)

Archana Goyal
7 min read · May 6, 2023


I covered the basics of Grouping Sets in my previous article. Let's dive into performance tuning and examine the limitations and boundaries of the Grouping Sets functionality:

Grouping Sets:

Grouping Sets in Spark is an operator that lets you perform aggregations on multiple subsets of a dataset. It is similar to the GROUP BY operator, but it lets you group by multiple sets of columns in a single query.
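For intuition, a single Grouping Sets query produces the same result as a UNION ALL of several separate GROUP BY queries, but Spark can compute it in one pass over the data. Here is a minimal sketch, assuming a hypothetical sales(category, item, price) table (the names are illustrative, not from this article):

// Hypothetical table "sales(category, item, price)", used only for illustration.
// This single GROUPING SETS query...
spark.sql("""
  SELECT category, item, SUM(price) AS total
  FROM sales
  GROUP BY category, item
  GROUPING SETS ((category), (category, item))
""")

// ...is logically equivalent to unioning two plain GROUP BY queries:
spark.sql("""
  SELECT category, CAST(NULL AS STRING) AS item, SUM(price) AS total
  FROM sales GROUP BY category
  UNION ALL
  SELECT category, item, SUM(price) AS total
  FROM sales GROUP BY category, item
""")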

Let’s consider an example where we have a dataset of food items and their respective prices.

We want to perform a Grouping Sets operation that calculates total revenue at several levels of aggregation at once. Specifically, we want the revenue for the food items (burgers, pizzas, and hot dogs) and for the juices (orange juice and watermelon juice), both per category and per individual item.

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDF on a local Seq
val df = Seq(
  ("Burger", "Crispy Burger", 100),
  ("Burger", "Cheese Burger", 120),
  ("Burger", "Veggie Burger", 90),
  ("Pizza", "Margherita Pizza", 150),
  ("Pizza", "Pepperoni Pizza", 180),
  // The original snippet is truncated here; the rows below are plausible
  // completions based on the items the article describes.
  ("Hot Dog", "Classic Hot Dog", 80),
  ("Juice", "Orange Juice", 60),
  ("Juice", "Watermelon Juice", 70)
).toDF("category", "item", "price")
