“spark-sight” Shows Spill: Skewed Data and Executor Memory

Alfredo Fomitchenko
3 min read · Jun 7, 2022



spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance.

This story is part of a series:

  1. Part 1: Meet “spark-sight”: Spark Performance at a Glance
  2. Part 2: This story

When you launch your Spark application, do you ever wonder

  1. Whether it suffers from skewed data?
  2. Whether the value of spark.executor.memory is right?

Wonder no more!

spark-sight v0.1.8 adds a new chart for spill information:

The middle chart shows, for each executor, when and how much the executor has spilled to disk in that time interval.

And maybe, with the help of spark-sight, one day you will get to see this:

Ok, I need it NOW

$ pip install "spark-sight>=0.1.8"

Why care about spill?

I’ll let this wonderful story guide you: Understanding common Performance Issues in Apache Spark — Deep Dive: Data Spill

Simply put, when one of the partitions is too large to fit in the executor memory, the disk is accessed and performance decreases by orders of magnitude.
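To get an intuition for when a partition stops fitting, you can estimate the slice of unified (execution + storage) memory each concurrent task gets. The sketch below uses the documented Spark defaults (`spark.memory.fraction = 0.6`, roughly 300 MB reserved for the system); the executor sizes are illustrative, not a prescription:

```python
# Back-of-the-envelope estimate of unified memory per Spark task,
# based on Spark's documented memory model. Values are illustrative.

RESERVED_MB = 300        # memory Spark reserves for internal objects
MEMORY_FRACTION = 0.6    # spark.memory.fraction default

def unified_memory_per_task_mb(executor_memory_mb, executor_cores):
    """Approximate unified (execution + storage) memory per concurrent task."""
    usable = (executor_memory_mb - RESERVED_MB) * MEMORY_FRACTION
    return usable / executor_cores

# With 4 GB executors running 4 tasks concurrently, each task gets roughly:
per_task = unified_memory_per_task_mb(4096, 4)
print(f"{per_task:.0f} MB per task")  # → 569 MB per task
```

A partition noticeably larger than that per-task slice is a likely candidate for spilling to disk.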

A real-world scenario

This chart is so helpful because it gives you a lot of information at a glance.

To illustrate this, the following is a real-world Spark application I analyzed with spark-sight:

First half: spark.executor.memory is too low

The first half of the spill chart shows that

  • all executors are spilling
  • the executors are spilling equal amounts

This is a clear indication that the parameter spark.executor.memory is too low: all executors are equally struggling to keep their partitions in memory.
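When all executors spill equally, the usual first remedy is to raise spark.executor.memory (or lower spark.executor.cores, so each task gets a bigger slice of unified memory). A sketch of what that looks like at submission time; the values and the application name are placeholders, not recommendations:

```shell
# Illustrative values only: tune to your cluster and workload.
# Doubling executor memory (e.g. 4g -> 8g) gives each of the
# 4 concurrent tasks a larger share of unified memory.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  your_app.py
```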

Second half: skewed data

The second half of the spill chart shows that

  • the executors are spilling different amounts

For example, during stage 180, executor 7 has spilled three times as much as the other executors.

This is a clear indication of skewed data. In fact, if you open stage 180 in the Spark UI, you will see the following:
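A standard fix for this kind of skew is "salting": append a small synthetic suffix to the hot key before the shuffle, so its rows spread over several partitions instead of piling onto one executor, then strip the suffix after aggregating. The plain-Python model below illustrates the idea; it is not Spark's actual HashPartitioner, just a deterministic stand-in:

```python
import zlib

# Illustrative model of key salting for skewed data.
# One hot key normally hashes to a single partition; splitting it
# into N_SALTS sub-keys spreads its rows across N_SALTS partitions.

NUM_PARTITIONS = 16
N_SALTS = 8  # the hot key becomes 8 synthetic sub-keys

def partition_for(key, salt=None):
    """Map a (possibly salted) key to a partition, CRC32 as a stand-in hash."""
    h = zlib.crc32(key.encode())
    if salt is not None:
        h += salt  # models hashing "key_<salt>" instead of "key"
    return h % NUM_PARTITIONS

rows = ["hot_key"] * 10_000  # heavily skewed: one key dominates

unsalted = {partition_for(k) for k in rows}
salted = {partition_for(k, salt=i % N_SALTS) for i, k in enumerate(rows)}

print(len(unsalted), "partition(s) without salting")  # → 1
print(len(salted), "partition(s) with salting")       # → 8
```

In Spark itself you would build the salted key in the DataFrame (e.g. concatenating the key with a random integer below N_SALTS) before the groupBy or join, at the cost of a second aggregation step to merge the sub-keys.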

What’s next?

Let me know in the comments what I should address next:

  • Adding a chart for peak memory usage per executor
  • Converting the current figure into a full-fledged Dash (Plotly) application to improve the UX

What’s next for you?
