Meet “spark-sight”: Spark Performance at a Glance
An open-source project I created
spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance.
This story is part of a series:
- Part 1: This story
- Part 2: “spark-sight” Shows Spill: Skewed Data and Executor Memory
I love Spark, but I don’t love the Spark UI.
This is why, at first, I was excited to find out that Data Mechanics was developing Delight, sharing with the world:
- an open-source agent collecting information from within the Spark application;
- a closed-source dashboard displaying the performance of a Spark application in terms of CPU efficiency and memory usage.
Even though the service is free of charge,
- you need to integrate their custom listener into your Spark application, which collects data and sends it out to their servers;
- it may be difficult for your boss to approve such a practice (privacy concerns), or it may be outright impossible (e.g. your application runs in a Glue job inside a VPC without internet access).
Drawing on everything I know about Spark, I took on the challenge of recreating the same amazing experience of Delight for everybody to enjoy and contribute to.
This is why I am sharing with you spark-sight.
spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance:
- CPU time spent doing the “actual work”
- CPU time spent doing shuffle reading and writing
- CPU time spent doing serialization and deserialization
- Spill intensity per executor (Out now! Read part 2 of the series: “spark-sight” Shows Spill: Skewed Data and Executor Memory)
- (coming) Memory usage per executor
spark-sight is not meant to replace the Spark UI altogether; rather, it provides a bird’s-eye view of the stages, allowing you to identify at a glance which portions of the execution may need improvement.
Plotly figure
The Plotly figure consists of two charts with a synced x-axis:
- Top: efficiency in terms of CPU cores available for tasks
- Bottom: stages timeline
No more talk, let me use it IMMEDIATELY
To install it,
$ pip install spark-sight
To meet it,
$ spark-sight --help
To launch it,
- Unix
$ spark-sight \
--path "/path/to/spark-application-12345" \
--cpus 32 \
--deploy_mode "cluster_mode"
- Windows PowerShell
$ spark-sight `
--path "C:\path\to\spark-application-12345" `
--cpus 32 `
--deploy_mode "cluster_mode"
A new browser tab will be opened.
For more information, head over to the spark-sight GitHub repo.
Ok, but how did you do this?
The Spark event log is a simple text file that Spark is natively able to store somewhere for you to open and look through:
--conf spark.eventLog.enabled=true
--conf spark.eventLog.dir=file:///c:/somewhere
--conf spark.history.fs.logDirectory=file:///c:/somewhere
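For example, assuming a PySpark application launched with spark-submit (the application file name here is just a placeholder), you could enable the event log like this:
$ spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///c:/somewhere \
  --conf spark.history.fs.logDirectory=file:///c:/somewhere \
  your_application.py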
As described in the Spark documentation, the performance data is hidden somewhere in that text file:
- SparkListenerTaskEnd events: for how long the task was shuffling, serializing, deserializing, and doing the “actual work” it was supposed to do in the first place
- SparkListenerStageCompleted events: for how long the corresponding stage was in the submitted state
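To give a rough idea of what reading those events looks like, here is a minimal sketch (not the actual spark-sight code; field names follow the Spark event-log JSON and are best double-checked against your Spark version):

import json

# Minimal sketch: each line of the event log is a standalone JSON document.
def parse_event_log(path):
    tasks, stages = [], []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerTaskEnd":
                tasks.append({
                    "stage_id": event["Stage ID"],
                    "task_id": event["Task Info"]["Task ID"],
                    "launch_time": event["Task Info"]["Launch Time"],    # ms since epoch
                    "finish_time": event["Task Info"]["Finish Time"],    # ms since epoch
                    "cpu_time_ns": event["Task Metrics"]["Executor CPU Time"],
                })
            elif event.get("Event") == "SparkListenerStageCompleted":
                info = event["Stage Info"]
                stages.append({
                    "stage_id": info["Stage ID"],
                    "submission_time": info["Submission Time"],          # ms since epoch
                    "completion_time": info["Completion Time"],          # ms since epoch
                })
    return tasks, stages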
Given a stage, the efficiency of the stage is the ratio between
- Used CPU time: total CPU time of “actual work” across all the tasks of the stage
- Available CPU time: total CPU time (idle or busy) across all cluster nodes during the stage submission
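In code, with the tasks and stages parsed as in the sketch above and a known number of cluster CPUs, that boils down to something like:

# Naive per-stage efficiency (a sketch; assumes a fixed number of cluster CPUs).
def stage_efficiency(stage, tasks, total_cpus):
    used_cpu_ns = sum(
        t["cpu_time_ns"] for t in tasks if t["stage_id"] == stage["stage_id"]
    )
    stage_duration_ms = stage["completion_time"] - stage["submission_time"]
    available_cpu_ns = total_cpus * stage_duration_ms * 1_000_000  # ms -> ns
    return used_cpu_ns / available_cpu_ns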
Well, that was simple, right?
No.
Parallel stages
Multiple stages can be submitted to the cluster at the same time. So you can’t really compute the efficiency of a stage by itself, because in the meantime the cluster could be executing other stages.
We need to change the definition of efficiency.
Given a time interval, the efficiency of the time interval is the ratio between
- Used CPU time: total CPU time of “actual work” ~~across all the tasks of the stage~~ across all the tasks across all the stages submitted in that time interval
- Available CPU time: total CPU time (idle or busy) ~~across all cluster nodes during the stage submission~~ across all cluster nodes in that time interval
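The computation itself barely changes (again a sketch): the difference is that the used CPU time is now summed over the task slices of every stage falling inside the interval, not over the tasks of a single stage.

# Efficiency of a generic time interval (a sketch).
def interval_efficiency(interval_start_ms, interval_end_ms, used_cpu_ns, total_cpus):
    # used_cpu_ns: total "actual work" CPU time of all task slices falling
    # inside this interval, across all stages submitted in the interval.
    available_cpu_ns = total_cpus * (interval_end_ms - interval_start_ms) * 1_000_000
    return used_cpu_ns / available_cpu_ns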
Well, which time interval? Splitting tasks
You may think stage boundaries would be enough, but tasks can run across those boundaries. In the following diagram, notice task 1 running across the boundary:
How do you split task metrics in two?
For task 1, only compound information about CPU usage is reported. For example, a task that
- ran for 10 seconds
- used the CPU for 4 seconds
is reported as follows, regardless of when within those 10 seconds the CPU was actually busy:
{
  "Event": "SparkListenerTaskEnd",
  "Stage ID": 0,
  "Task Info": {
    "Task ID": 0
  },
  "Task Metrics": {
    "Executor CPU Time": 4000000000
  }
}
where "Executor CPU Time" is expressed in nanoseconds.
The simplest solution for splitting the task is to split CPU usage and the other metrics proportionally to the duration of each side of the split (technically speaking, we assume the CPU time is uniformly distributed across the task duration).
Notice that this approximation may create artifacts, e.g. the efficiency going above 100%.
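For instance, here is a sketch of such a proportional split for a single task and a single boundary (the uniformity assumption above is what justifies the simple ratio):

# Split a task's CPU time at a boundary, proportionally to the duration
# of each side (sketch; assumes CPU time is spread uniformly over the task).
def split_cpu_time(task, boundary_ms):
    duration_ms = task["finish_time"] - task["launch_time"]
    if duration_ms <= 0:
        return task["cpu_time_ns"], 0.0
    left_fraction = (boundary_ms - task["launch_time"]) / duration_ms
    left_fraction = min(max(left_fraction, 0.0), 1.0)
    left_ns = task["cpu_time_ns"] * left_fraction
    return left_ns, task["cpu_time_ns"] - left_ns

For the task above (10 seconds of duration, 4 seconds of CPU), a boundary falling exactly in the middle would assign 2 seconds of CPU time to each side.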
The figure
I used the Python library Plotly, which makes it easy to put together a simple visualization like this one and provides a lightweight, interactive interface.
Notice that the visualization improves on the time intervals discussed above: the top bar chart further splits the intervals, identifying when the first and the last task of each stage actually started.
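To give an idea of the layout (an illustrative sketch with made-up numbers, not the actual spark-sight code), two stacked charts sharing the x-axis are straightforward with plotly.subplots:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Two stacked charts sharing the time (x) axis:
# top = efficiency over time, bottom = stage timeline.
fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                    row_heights=[0.7, 0.3], vertical_spacing=0.05)
fig.add_trace(go.Bar(x=[5, 15, 25], y=[0.8, 0.4, 0.9], width=10, name="efficiency"),
              row=1, col=1)
fig.add_trace(go.Bar(y=["stage 0", "stage 1"], x=[12, 18], base=[0, 10],
                     orientation="h", name="stages"),
              row=2, col=1)
fig.show()  # opens the figure in a new browser tab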
What’s next?
Medium term
I plan to add charts for
- Spill intensity per executor (Out now! Read part 2 of the series: “spark-sight” Shows Spill: Skewed Data and Executor Memory)
- Memory usage per executor
and then convert the simple figure into a full-fledged Dash (Plotly) application to improve the UX.
Long term
I plan to add
- the ability to read the Spark event log from other data sources (e.g. S3, GCS, Azure Storage, HDFS, …)
- showing multiple Spark applications at the same time so that performance can be compared (e.g. you ran the same application with different spark.sql.shuffle.partitions, spark.executor.memory, …)
- showing the efficiency of multiple Spark applications running on the same cluster at the same time
- taking into account non-static configuration of the cluster (now it is assumed the number of CPUs does not change)
What’s next for you?
- Clap this story
- Read part 2 of this series: “spark-sight” Shows Spill: Skewed Data and Executor Memory
- Follow me here on Medium for future stories
- If you find this project useful, head over to the spark-sight GitHub repo and don’t be gentle on the star and watch buttons
- If you encounter any problems, head over to the spark-sight GitHub repo and don’t be gentle on the issue button