“spark-sight” Shows Spill: Skewed Data and Executor Memory
spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance.
This story is part of a series:
- Part 1: Meet “spark-sight”: Spark Performance at a Glance
- Part 2: This story
When you launch your Spark application, do you ever wonder
- Whether it suffers from skewed data?
- Whether the value of spark.executor.memory is right?
Wonder no more!
spark-sight v0.1.8 adds a new chart for spill information:
The middle chart shows, for each executor, when and how much the executor has spilled to disk in that time interval.
And maybe, with the help of spark-sight, one day you will get to see this:
Ok, I need it NOW
$ pip install "spark-sight>=0.1.8"
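(The version specifier is quoted so the shell does not interpret `>=` as a redirect.) Once installed, you point spark-sight at a Spark event log. The flag names below are my recollection of the project's README and may differ in your version; run `spark-sight --help` to confirm:

```shell
# Flag names are an assumption based on the project's README;
# verify with `spark-sight --help` for your installed version.
spark-sight \
    --path "/path/to/spark-event-log" \
    --cpus 32 \
    --deploy_mode "cluster_mode"
```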
Why care about spill?
I’ll let this wonderful story guide you:
Understanding common Performance Issues in Apache Spark — Deep Dive: Data Spill
Simply put, when one of the partitions is too large to fit in the executor memory, the disk is accessed and performance decreases by orders of magnitude.
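Spill is recorded per task in the Spark event log: each `SparkListenerTaskEnd` event carries `Task Metrics` with `Memory Bytes Spilled` and `Disk Bytes Spilled` fields, which is essentially what spark-sight aggregates per executor. As a minimal sketch (key names follow Spark's JSON event log schema; verify against your Spark version), here is how you could total disk spill per executor yourself:

```python
import json
from collections import defaultdict

def spill_per_executor(event_log_lines):
    """Sum 'Disk Bytes Spilled' per executor from Spark event log lines."""
    totals = defaultdict(int)
    for line in event_log_lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        executor_id = event["Task Info"]["Executor ID"]
        metrics = event.get("Task Metrics") or {}
        totals[executor_id] += metrics.get("Disk Bytes Spilled", 0)
    return dict(totals)

# Two synthetic task-end events: executor "7" spills three times more than "1".
log = [
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Info": {"Executor ID": "1"},
                "Task Metrics": {"Disk Bytes Spilled": 1024}}),
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Info": {"Executor ID": "7"},
                "Task Metrics": {"Disk Bytes Spilled": 3 * 1024}}),
]
print(spill_per_executor(log))  # {'1': 1024, '7': 3072}
```

Non-zero spill anywhere in this output is your cue to look at memory settings or data distribution, which is exactly what the chart below makes visible at a glance.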
A real-world scenario
This chart is so helpful because it gives you a lot of information at a glance.
To illustrate this, the following is a real-world Spark application I analyzed with spark-sight:
First half: spark.executor.memory is too low
The first half of the spill chart shows that
- all executors are spilling
- the executors are spilling equal amounts
This is a clear indication that the parameter spark.executor.memory is set too low: all executors are struggling equally to keep their partitions in memory.
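When spill is uniform across executors, the usual first move is to raise executor memory at submit time. The values below are placeholders to size against your cluster, not recommendations:

```shell
# Example values only: size these against your cluster and workload.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=1g \
  your_app.py
```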
Second half: skewed data
The second half of the spill chart shows that
- the executors are spilling different amounts
For example, during stage 180, executor 7 spilled about three times as much as the other executors.
This is a clear indication of skewed data. In fact, if you open stage 180 in the Spark UI, you will see the following:
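A common mitigation for this kind of skew is key salting: appending a small random suffix to the hot key so its rows spread across several partitions instead of piling onto one executor. In real Spark code you would add the salt column before the shuffle and aggregate twice (per salted key, then per original key); the plain-Python sketch below only illustrates how salting spreads a hot key:

```python
import random
from collections import Counter

def salted(keys, n_salts=4, seed=0):
    """Append a random salt in [0, n_salts) to each key so a hot key
    is spread across n_salts buckets instead of landing in one."""
    rng = random.Random(seed)
    return [f"{k}_{rng.randrange(n_salts)}" for k in keys]

# 1000 occurrences of a single hot key.
hot = ["user_42"] * 1000

# Unsalted: one bucket receives all 1000 rows.
plain_max = max(Counter(hot).values())

# Salted: the largest bucket holds roughly 1000 / n_salts rows.
salted_max = max(Counter(salted(hot)).values())

print(plain_max, salted_max)
```

The trade-off is an extra aggregation step, but it turns one executor spilling heavily into several executors each handling a manageable slice.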
What’s next?
Let me know in the comments what I should address next:
- Adding a chart for peak memory usage per executor
- Converting the current figure into a full-fledged Dash (Plotly) application to improve the UX
What’s next for you?
- Clap this story
- Read part 1 of this series: Meet “spark-sight”: Spark Performance at a Glance
- Follow me here on Medium for future stories
- If you find this project useful, head over to the spark-sight GitHub repo and don’t be gentle on the star and watch buttons
- If you encounter any problems, head over to the spark-sight GitHub repo and don’t be gentle on the issue button