SHUFFLE: Why Am I Getting OOM Errors !!

Aditya Sahu
Curious Data Catalog
3 min read · Nov 26, 2021

This is the fourth blog in the series 5 Most Common Spark Performance Problems. So far we have discussed Spark's Skew and Spill problems and their mitigation strategies. If you are interested in reading those, please do visit here.

In this blog we will be talking about Spark's most common problem: the shuffle. Shuffles are a key step in many stages of a Spark execution plan. They are required, and here we will mainly discuss how we can minimise them to get better performance. Shuffles can't be eliminated altogether (that's self-evident, right? Data distributed across multiple machines has to be moved at some point, correct?)

What is a Shuffle 🤔 ?

A shuffle, in very simple terms, is the movement of data across executors to get a job done. From this definition alone we can expect some IO, and that IO can become a bottleneck when we are dealing with huge volumes of data. Often, when the amount of data being shuffled exceeds the available memory, we get OOM (Out Of Memory) errors. As I said earlier, we cannot remove shuffles altogether; what we are really interested in is reducing the amount of data being shuffled.

source: Databricks' Data Engineering Pathway course
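To make this concrete, here is a minimal PySpark sketch (the dataset and column names are my own illustrative assumptions, not taken from any particular job) showing a wide transformation that forces a shuffle; the Exchange node in the physical plan is the shuffle we are talking about.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

    # Hypothetical sales dataset: groupBy on a key is a wide transformation,
    # so rows with the same store_id must be moved to the same executor.
    sales = spark.range(10_000_000).select(
        (F.col("id") % 1000).alias("store_id"),
        (F.rand() * 100).alias("amount"),
    )

    revenue = sales.groupBy("store_id").agg(F.sum("amount").alias("revenue"))

    # The physical plan contains an "Exchange hashpartitioning(store_id, ...)" node.
    # That Exchange is the shuffle, and its size is what we want to keep small.
    revenue.explain()

If you run an action on revenue and open the Spark UI, the stage details show the shuffle read/write sizes, which is a quick way to see how much data is actually crossing the network.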

There are some cases where shuffles can be avoided or mitigated, but don't get hung up on trying to remove each and every shuffle. Many shuffle operations are actually quite fast, and targeting spill, skew, tiny files and similar problems often yields a better payoff.

Mitigating Shuffle

There is no single straightforward technique to mitigate shuffle; instead, we will discuss some of the ways we can reduce it.

The biggest pain point of a shuffle operation is the amount of data that is being moved across the cluster.

There are multiple ways we can reduce this amount of data, such as:

  • Denormalize the dataset, especially when the shuffle is rooted in a join operation.
  • Reduce network IO by using fewer, larger workers (I'll be writing a separate blog on designing a performant cluster, where we will discuss this strategy in depth).
  • Broadcast smaller tables (there will be a separate blog on broadcasting, as it is a vital strategy for reducing shuffle and building optimized Spark jobs; a short preview sketch follows this list).
  • For joins, pre-shuffle the data using bucketing (there will be a separate blog on bucketing covering its use-cases, implementation and challenges).
  • Employ Spark features such as auto broadcasting and AQE (also shown in the sketch after this list).
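As a small preview of the broadcast and auto-broadcast/AQE points above (full details coming in the dedicated blogs), here is a hedged PySpark sketch; the paths, table names and the 50 MB threshold are illustrative assumptions, not recommendations.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("shuffle-mitigation").getOrCreate()

    # Let Spark auto-broadcast small tables (illustrative 50 MB threshold)
    # and enable Adaptive Query Execution so join plans can be revised at runtime.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    # Hypothetical tables: a large fact table and a small dimension table.
    orders = spark.read.parquet("/data/orders")        # large
    countries = spark.read.parquet("/data/countries")  # small

    # Explicit broadcast hint: the small table is copied to every executor,
    # so the large table is joined in place and never shuffled.
    joined = orders.join(broadcast(countries), on="country_id")

    # Look for BroadcastHashJoin instead of SortMergeJoin + Exchange.
    joined.explain()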

These are, at a high level, the ways we can reduce the amount of data that gets shuffled across the cluster. The two main techniques are broadcasting and bucketing, and they often yield the best results. Although both are good techniques, their use-cases are very different, which is another reason I'll be writing two separate blogs to discuss them in depth, with use-cases and trade-offs. As a small teaser, a bucketing sketch follows below.
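For bucketing (again, the table names, join key and bucket count below are hypothetical), the idea is to pay the shuffle cost once at write time by persisting both sides bucketed on the join key, so later joins on that key can skip the exchange:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bucketing-demo").getOrCreate()

    # Hypothetical inputs that will be joined repeatedly on customer_id.
    orders = spark.read.parquet("/data/orders")
    customers = spark.read.parquet("/data/customers")

    # Write both sides bucketed (and sorted) on the join key. Bucketed tables
    # must be saved via saveAsTable so the bucket metadata lands in the catalog.
    orders.write.bucketBy(64, "customer_id").sortBy("customer_id") \
        .mode("overwrite").saveAsTable("orders_bucketed")
    customers.write.bucketBy(64, "customer_id").sortBy("customer_id") \
        .mode("overwrite").saveAsTable("customers_bucketed")

    # Joins on the bucketed tables can reuse the matching bucket layout,
    # so the plan should have no Exchange before the SortMergeJoin.
    joined = spark.table("orders_bucketed").join(
        spark.table("customers_bucketed"), on="customer_id"
    )
    joined.explain()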

The takeaway from this blog is to keep in mind what a shuffle problem is and what its mitigation strategies are. At this point we are not concerned with implementing these in full; we just want to know which options are available. Subsequent blogs will cover each technique in detail, along with its code implementation.

The next blog in the series will be on implementing broadcasting. So stay tuned, be safe and keep surfing 😛.

Till then Byeee Byeeee !!!
