Meet “spark-sight”: Spark Performance at a Glance
An open-source project I created
spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance.
This story is part of a series:
- Part 1: This story
- Part 2: “spark-sight” Shows Spill: Skewed Data and Executor Memory
I love Spark, but I don’t love the Spark UI.
This is why, at first, I was excited to find out that Data Mechanics was developing Delight, sharing with the world:
- an open-source agent collecting information from within the Spark application;
- a closed-source dashboard displaying the performance of a Spark application in terms of CPU efficiency and memory usage.
Even though the service is free of charge,
- you need to integrate their custom listener into your Spark application, which collects data and sends it out to their servers;
- it may be difficult for your boss to approve such a practice (privacy concerns), or it may be outright impossible (e.g. your application runs in a Glue job inside a VPC without internet access).
Drawing on everything I know about Spark, I took on the challenge of recreating the same amazing experience of Delight for everybody to enjoy and contribute to.
This is why I am sharing with you spark-sight.
spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance:
- CPU time spent doing the “actual work”
- CPU time spent doing shuffle reading and writing
- CPU time spent doing serialization and deserialization
- Spill intensity per executor (Out now! Read part 2 of the series: “spark-sight” Shows Spill: Skewed Data and Executor Memory)
- (coming) Memory usage per executor
spark-sight is not meant to replace the Spark UI altogether; rather, it provides a bird’s-eye view of the stages, allowing you to identify at a glance which portions of the execution may need improvement.
Plotly figure
The Plotly figure consists of two charts with a synced x-axis:
- Top: efficiency in terms of CPU cores available for tasks
- Bottom: stages timeline
No more talk, let me use it IMMEDIATELY
To install it,
$ pip install spark-sight
To meet it,
$ spark-sight --help
To launch it,
- Unix
$ spark-sight \
--path "/path/to/spark-application-12345" \
--cpus 32 \
--deploy_mode "cluster_mode"
- Windows PowerShell
$ spark-sight `
--path "C:\path\to\spark-application-12345" `
--cpus 32 `
--deploy_mode "cluster_mode"
A new browser tab will be opened.
For more information, head over to the spark-sight GitHub repo.
Ok, but how did you do this?
The Spark event log is a simple text file that Spark is natively able to store somewhere for you to open and look through:
--conf spark.eventLog.enabled=true
--conf spark.eventLog.dir=file:///c:/somewhere
--conf spark.history.fs.logDirectory=file:///c:/somewhere
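For example, assuming a PySpark application launched with spark-submit (the application file name here is just a placeholder), you could enable the event log like this:
$ spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///c:/somewhere \
  --conf spark.history.fs.logDirectory=file:///c:/somewhere \
  your_application.py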
As described in the Spark documentation, the performance data is hidden somewhere in that text file:
- SparkListenerTaskEnd events: for how long the task was shuffling, serializing, deserializing, and doing the “actual work” it was supposed to do in the first place
- SparkListenerStageCompleted events: for how long the corresponding stage was in the submitted state
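To give a rough idea of what reading those events looks like, here is a minimal sketch (not the actual spark-sight code; field names follow the Spark event-log JSON and are best double-checked against your Spark version):

import json

# Minimal sketch: each line of the event log is a standalone JSON document.
def parse_event_log(path):
    tasks, stages = [], []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerTaskEnd":
                tasks.append({
                    "stage_id": event["Stage ID"],
                    "task_id": event["Task Info"]["Task ID"],
                    "launch_time": event["Task Info"]["Launch Time"],    # ms since epoch
                    "finish_time": event["Task Info"]["Finish Time"],    # ms since epoch
                    "cpu_time_ns": event["Task Metrics"]["Executor CPU Time"],
                })
            elif event.get("Event") == "SparkListenerStageCompleted":
                info = event["Stage Info"]
                stages.append({
                    "stage_id": info["Stage ID"],
                    "submission_time": info["Submission Time"],          # ms since epoch
                    "completion_time": info["Completion Time"],          # ms since epoch
                })
    return tasks, stages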
Given a stage, the efficiency of the stage is the ratio between
- Used CPU time: total CPU time of “actual work” across all the tasks of the stage
- Available CPU time: total CPU time (idle or busy) across all cluster nodes during the stage submission
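In code, with the tasks and stages parsed as in the sketch above and a known number of cluster CPUs, that boils down to something like:

# Naive per-stage efficiency (a sketch; assumes a fixed number of cluster CPUs).
def stage_efficiency(stage, tasks, total_cpus):
    used_cpu_ns = sum(
        t["cpu_time_ns"] for t in tasks if t["stage_id"] == stage["stage_id"]
    )
    stage_duration_ms = stage["completion_time"] - stage["submission_time"]
    available_cpu_ns = total_cpus * stage_duration_ms * 1_000_000  # ms -> ns
    return used_cpu_ns / available_cpu_ns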
Well, that was simple, right?
No.
Parallel stages
Multiple stages can be submitted to the cluster at the same time. So you can’t really compute the efficiency of a stage by itself, because in the meantime the cluster could be executing other stages.
We need to change the definition of efficiency.
Given a time interval, the efficiency of the time interval is the ratio between
- Used CPU time: total CPU time of “actual work” ~~across all the tasks of the stage~~ across all the tasks across all the stages submitted in that time interval
- Available CPU time: total CPU time (idle or busy) ~~across all cluster nodes during the stage submission~~ across all cluster nodes in that time interval
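The computation itself barely changes (again a sketch): the difference is that the used CPU time is now summed over the task slices of every stage falling inside the interval, not over the tasks of a single stage.

# Efficiency of a generic time interval (a sketch).
def interval_efficiency(interval_start_ms, interval_end_ms, used_cpu_ns, total_cpus):
    # used_cpu_ns: total "actual work" CPU time of all task slices falling
    # inside this interval, across all stages submitted in the interval.
    available_cpu_ns = total_cpus * (interval_end_ms - interval_start_ms) * 1_000_000
    return used_cpu_ns / available_cpu_ns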
Well, which time interval? Splitting tasks
You may think stage boundaries would be enough, but tasks can run across those boundaries. In the following diagram, notice task 1 running across the boundary:
How do you split task metrics in two?
For task 1, only compound information about CPU usage is reported. For example, a task that
- ran for 10 seconds
- used the CPU for 4 seconds
is reported as follows, regardless of when within those 10 seconds the CPU was actually busy:
{
  "Event": "SparkListenerTaskEnd",
  "Stage ID": 0,
  "Task Info": {
    "Task ID": 0
  },
  "Task Metrics": {
    "Executor CPU Time": 4000000000
  }
}
where "Executor CPU Time" is expressed in nanoseconds.
The simplest solution for splitting the task is to split CPU usage and the other metrics proportionally to the duration of each side of the split (technically speaking, we assume the CPU time is uniformly distributed across the task duration).
Notice that this approximation may create artifacts, e.g. the efficiency going above 100%.
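For instance, here is a sketch of such a proportional split for a single task and a single boundary (the uniformity assumption above is what justifies the simple ratio):

# Split a task's CPU time at a boundary, proportionally to the duration
# of each side (sketch; assumes CPU time is spread uniformly over the task).
def split_cpu_time(task, boundary_ms):
    duration_ms = task["finish_time"] - task["launch_time"]
    if duration_ms <= 0:
        return task["cpu_time_ns"], 0.0
    left_fraction = (boundary_ms - task["launch_time"]) / duration_ms
    left_fraction = min(max(left_fraction, 0.0), 1.0)
    left_ns = task["cpu_time_ns"] * left_fraction
    return left_ns, task["cpu_time_ns"] - left_ns

For the task above (10 seconds of duration, 4 seconds of CPU), a boundary falling exactly in the middle would assign 2 seconds of CPU time to each side.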
The figure
I used the Python library Plotly, which makes it easy to put together a simple visualization like this one and provides a lightweight, interactive interface.
Notice that the visualization improves on the time intervals discussed above: the top bar chart further splits the intervals, identifying when the first and the last task of each stage actually started.
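To give an idea of the layout (an illustrative sketch with made-up numbers, not the actual spark-sight code), two stacked charts sharing the x-axis are straightforward with plotly.subplots:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Two stacked charts sharing the time (x) axis:
# top = efficiency over time, bottom = stage timeline.
fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                    row_heights=[0.7, 0.3], vertical_spacing=0.05)
fig.add_trace(go.Bar(x=[5, 15, 25], y=[0.8, 0.4, 0.9], width=10, name="efficiency"),
              row=1, col=1)
fig.add_trace(go.Bar(y=["stage 0", "stage 1"], x=[12, 18], base=[0, 10],
                     orientation="h", name="stages"),
              row=2, col=1)
fig.show()  # opens the figure in a new browser tab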
What’s next?
Medium term
I plan to add charts for
- Spill intensity per executor (Out now! Read part 2 of the series: “spark-sight” Shows Spill: Skewed Data and Executor Memory)
- Memory usage per executor
and then convert the simple figure into a full-fledged Dash (Plotly) application to improve the UX.
Long term
I plan to add
- the ability to read the Spark event log from other data sources (e.g. S3, GCS, Azure Storage, HDFS, …)
- showing multiple Spark applications at the same time so that performance can be compared (e.g. you ran the same application with different spark.sql.shuffle.partitions, spark.executor.memory, …)
- showing the efficiency of multiple Spark applications running on the same cluster at the same time
- taking into account non-static configuration of the cluster (now it is assumed the number of CPUs does not change)
What’s next for you?
- Clap this story
- Read part 2 of this series: “spark-sight” Shows Spill: Skewed Data and Executor Memory
- Follow me here on Medium for future stories
- If you find this project useful, head over to the spark-sight GitHub repo and don’t be gentle on the star and watch buttons
- If you encounter any problems, head over to the spark-sight GitHub repo and don’t be gentle on the issue button