Apache Spark Optimization — Avoid groupByKey()

Puneet Saha
Published in AllThingData
Dec 18, 2023

Apache Spark, a powerful open-source framework for big data processing, excels at both batch and real-time workloads. In this discussion, we focus on optimizing Spark pipelines to make the most efficient use of resources, be it compute, memory, storage, disk IOPS, or network IOPS. While tactical details abound online, including code and step-by-step guides, my focus is on deeper understanding: the ‘why’ and ‘what’ before the ‘how.’ A great starting point for beginners is this insightful article: Apache Spark Optimization Toolkit by Xinran Waibel.

The Case Against groupByKey()
From the article, we learn about the pitfalls of groupByKey(). It is a wide transformation, so it must shuffle data across partitions and potentially across nodes, which is costly in terms of serialization, network transfer, and deserialization. Worse, because every value for a key is collected on a single partition, a skewed key can concentrate a huge amount of data on one executor and cause it to crash with an Out Of Memory (OOM) error.

Consider a theoretical example: logs spread across 6 partitions in 2 nodes, with the task to count occurrences of WARN, FATAL, and ERROR logs. Using groupByKey() could, depending on the data, overwhelm node memory with “WARN” logs, leading to an OOM crash.
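To make this concrete, here is a minimal Java sketch of the groupByKey() approach. The file path, log-line format, class name, and partition count are assumptions for illustration, and the key-value mapping step (mapToPair) is explained in the next section. Notice that every log line for a given level is shuffled to a single partition and materialized in memory before it can be counted, which is exactly where the OOM risk lives.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class GroupByKeyLogCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("groupByKey-log-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Assumed input: one log line per record, e.g. "WARN disk usage at 85%",
    // read into 6 partitions as in the example above.
    JavaRDD<String> logs = sc.textFile("hdfs:///logs/app.log", 6)
        .filter(line -> line.startsWith("WARN")
            || line.startsWith("FATAL")
            || line.startsWith("ERROR"));

    // Key each full line by its log level.
    JavaPairRDD<String, String> byLevel =
        logs.mapToPair(line -> new Tuple2<>(line.split(" ", 2)[0], line));

    // groupByKey() shuffles EVERY line for a level to one partition and
    // materializes them as a single Iterable -- this is where OOM can strike
    // if, say, "WARN" dominates the data.
    JavaPairRDD<String, Long> counts = byLevel
        .groupByKey()
        .mapValues(lines -> {
          long n = 0;
          for (String ignored : lines) n++;
          return n;
        });

    counts.collect().forEach(t -> System.out.println(t._1() + " = " + t._2()));
    sc.stop();
  }
}
```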

A Smarter Approach: Using mapToPair() and reduceByKey()

Note: mapToPair() is part of Spark’s Java API; in PySpark and Scala, a plain map() that emits (key, value) tuples plays the same role. Please check the API documentation for your language.

Instead, we can start with a mapToPair() transformation. It is a narrow transformation and requires no shuffling: it simply converts each record of an RDD into a (key, value) pair, producing a pair RDD. mapToPair() is mostly used to prepare data for pair-oriented operations such as groupByKey() or reduceByKey(). The sketch below shows what the partitions might look like after mapToPair().
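The following is a hedged continuation of the sketch above (same assumed logs RDD and imports). After mapToPair(), each partition simply holds local (level, 1) pairs; nothing has moved across the network yet, and glom() lets you peek at partition contents on a small sample.

```java
// Continuing the hypothetical log example: map each line to a (level, 1) pair.
// This is a narrow transformation -- every pair stays in the partition where
// its source line already lives.
JavaPairRDD<String, Integer> pairs =
    logs.mapToPair(line -> new Tuple2<>(line.split(" ", 2)[0], 1));

// For a small sample, glom() collects each partition into a list so you can
// inspect it. The output (illustrative only) might look like:
//   partition 0: (WARN,1) (WARN,1) (ERROR,1)
//   partition 1: (WARN,1) (FATAL,1)
//   partition 2: (ERROR,1) (WARN,1) (WARN,1)
//   ...
pairs.glom().collect().forEach(System.out::println);
```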

The next step is to use reduceByKey() instead of groupByKey(). reduceByKey() (1) combines values for each key locally within every partition (a map-side combine) and only then (2) shuffles the partially aggregated results. Although both reduceByKey() and groupByKey() involve shuffling, reduceByKey() greatly reduces the volume of data transferred, which improves performance and lowers the OOM risk. The sketch below shows how the data might look after steps (1) and (2).
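Continuing the same hypothetical example, the reduceByKey() version looks like this; the comments indicate roughly what stages (1) and (2) produce.

```java
// (1) Map-side combine: inside each partition, values for a key are merged
//     first, e.g. (WARN,1)(WARN,1)(ERROR,1) becomes (WARN,2)(ERROR,1)
//     before anything leaves the node.
// (2) Shuffle: only these partial sums cross the network and are merged
//     into the final count per level.
JavaPairRDD<String, Integer> levelCounts =
    pairs.reduceByKey((a, b) -> a + b);

levelCounts.collect()
    .forEach(t -> System.out.println(t._1() + " = " + t._2()));
```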

As we can see, pairs are merged within each partition before the shuffle, which reduces the number of pairs moved across partitions and nodes. The counts are kept small here for the sake of explanation; in a real-life scenario there would be hundreds or thousands of records per key, and reduceByKey() shrinks that shuffle volume dramatically.

Is It Possible to Eliminate groupByKey() Entirely?
While there are alternative, more performant operators that can often replace groupByKey(), sometimes its use is unavoidable, for example when you genuinely need every value for a key in one place. Even then, adhering to other optimization practices can significantly reduce the amount of data that has to be shuffled.
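As one illustration of such a replacement (not from the original article), a task like “average message length per log level”, which might tempt you toward groupByKey() followed by an average, can instead be expressed with aggregateByKey(), which, like reduceByKey(), combines locally within partitions before shuffling. A sketch on the same hypothetical data:

```java
// Sketch: average line length per log level without groupByKey().
// aggregateByKey carries a (sum, count) accumulator per key and, like
// reduceByKey, merges it within each partition before the shuffle.
JavaPairRDD<String, Tuple2<Long, Long>> sumAndCount = logs
    .mapToPair(line -> new Tuple2<>(line.split(" ", 2)[0], (long) line.length()))
    .aggregateByKey(
        new Tuple2<>(0L, 0L),                                      // zero value
        (acc, len) -> new Tuple2<>(acc._1() + len, acc._2() + 1),  // within a partition
        (a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2())); // across partitions

JavaPairRDD<String, Double> avgLength =
    sumAndCount.mapValues(t -> t._1().doubleValue() / t._2());
```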

Closing Thoughts
Optimizing Spark for resource efficiency is a balance of choosing the right operators and understanding their impact on the overall system.
