Understanding Spark UI

Sriramrimmalapudi
6 min read · Jul 27, 2020


Apache Spark provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource consumption of your Spark cluster.

Before going into the Spark UI, you should understand two concepts: transformations and actions.

Let me give a brief overview of those concepts. Your application code is the set of instructions that tells the driver (SparkSession) what job to do, and the driver decides how to achieve it with the help of the executors.

Instructions to the driver are called transformations, and an action triggers the execution.

I have written a small application that performs a transformation and an action.

pic1: Application Code

Here we create a DataFrame by reading a .csv file and then check the count of the DataFrame, so two things happen here:
1. Create a DataFrame by reading a .csv file.
2. Get the count of the DataFrame.
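
Since the application code is shown only as an image (pic1), here is a minimal sketch of what such an application could look like; the file path, application name, and options are placeholders, not the exact code from the screenshot:

```scala
import org.apache.spark.sql.SparkSession

// Run locally with 3 threads; this is why the Executors tab later shows 3 cores.
val spark = SparkSession.builder()
  .master("local[3]")
  .appName("SparkUIDemo")          // placeholder application name
  .getOrCreate()

// Transformation: describes how to build the DataFrame from a CSV file.
// inferSchema makes Spark read the file once just to work out the column types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/sample.csv")          // placeholder path

// Action: count() triggers the actual execution and shows up as a job in the UI.
println(s"Row count: ${df.count()}")
```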

Let’s understand how an application is projected in the Spark UI.

pic2: Spark UI

The above snapshot shows how the Spark UI looks when it first opens.
The basic tabs you will find in the Spark UI are -

  • Jobs
  • Stages
  • Tasks
  • Storage
  • Environment
  • Executors
  • SQL

Jobs Tab

Pic3: Jobs Tab

The details to be aware of under the Jobs tab are the scheduling mode, the number of jobs, the number of stages each job has, and the job description.

Scheduling Mode

Spark can run under different cluster managers:
1. Standalone mode
2. YARN
3. Mesos
As I was running on a local machine, I used local (standalone) mode. The Scheduling Mode value that actually appears on the Jobs tab (FIFO by default, or FAIR) describes how jobs inside the application are scheduled against each other.
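
If you want to change that value, a minimal sketch would look like this (the application name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Switch the in-application job scheduler from the default FIFO to FAIR;
// this is what the "Scheduling Mode" field on the Jobs tab reflects.
val spark = SparkSession.builder()
  .master("local[3]")
  .appName("SchedulingModeDemo")           // placeholder app name
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()
```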

Pic4: Number of jobs

Number of Jobs

Always keep in mind that the number of jobs is equal to the number of actions in the application, and each job has at least one stage.
In our application above, 3 jobs (0, 1, 2) were run:
0. Read the CSV file.
1. Infer the schema from the file.
2. Count check.

So if we look at the figure, it clearly shows 3 jobs. Even though we wrote only one explicit action (count), reading the CSV file and inferring its schema each trigger a job of their own, which is why three jobs appear.

Number of Stages

Each Wide Transformation results in a separate Stage.

In our case, job 0 and job 1 each have a single stage, but job 2 has two stages. That is because the data is split into partitions (two here by default), so the count is first computed per partition and the partial results are then combined in a second stage.

Description

The Description links to the complete details of the associated job, such as the job status, DAG visualization, and completed stages.
I will explain the description part in the coming sections.

Stage Tab

Pic5: StagesTab

We can navigate to stages in two ways:
1. Select the Description of the respective job (shows stages only for the selected job).
2. At the top of the Jobs tab, select the Stages option (shows all stages in the application).
In our application, we have a total of 4 stages.

The Stages tab displays a summary page that shows the current state of all stages of all jobs in the Spark application.

The number of tasks you see in each stage is the number of partitions that Spark is going to work on, and each task inside a stage does the same work, but on a different partition of the data.
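
A quick way to relate tasks to partitions is to check the partition count directly. A sketch, assuming the df DataFrame from the first snippet:

```scala
// Each task in a stage processes one partition, so the task count in the UI
// should line up with the partition count of the underlying data.
println(df.rdd.getNumPartitions)

// Repartitioning changes the number of partitions, and therefore the number
// of tasks you will see for the stages that read this DataFrame.
val df8 = df.repartition(8)
println(df8.rdd.getNumPartitions)   // 8
```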

Stage detail

The details of a stage showcase the Directed Acyclic Graph (DAG) of that stage, where the vertices represent the RDDs or DataFrames and the edges represent an operation to be applied.

Let us analyze the operations in the stages.

Stage 0

Operations in Stage 0 are:
1. FileScanRDD
2. MapPartitionsRDD

FileScanRDD

FileScan represents reading the data from a file.
It is given FilePartitions, which are custom RDD partitions backed by PartitionedFiles (file blocks).
In our scenario, the CSV file is read.

MapPartitionsRDD

A MapPartitionsRDD is created by transformations such as map, filter, and mapPartitions, which apply a function to each partition of the parent RDD.
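
For illustration, here is a small sketch of an explicit mapPartitions call; it reuses the df DataFrame from the first snippet, and the per-partition logic is just an example:

```scala
// mapPartitions applies a function once per partition rather than once per row;
// internally the result is represented as a MapPartitionsRDD.
val rowsPerPartition = df.rdd
  .mapPartitions(iter => Iterator(iter.size))
  .collect()

rowsPerPartition.zipWithIndex.foreach { case (n, i) =>
  println(s"partition $i holds $n rows")
}
```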

Stage 1

Operations in Stage 1 are:
1. FileScanRDD
2. MapPartitionsRDD
3. SQLExecutionRDD

As FileScanRDD and MapPartitionsRDD are already explained, let us look at SQLExecutionRDD.

SQLExecutionRDD

SQLExecutionRDD is a Spark internal RDD wrapper used to track the multiple Spark jobs that should all together constitute a single structured query execution.

Stages 2 & 3

Operations in Stage 2 and Stage 3 are:
1. FileScanRDD
2. MapPartitionsRDD
3. WholeStageCodegen
4. Exchange

Exchange

An Exchange is performed because of calling the count method.
As the data is divided into partitions and spread among the executors, getting the total count requires adding up the counts from the individual partitions.

Exchange represents the shuffle, i.e. data movement across the cluster (executors).
It is the most expensive operation, and the more partitions there are, the more data has to be exchanged between executors.

WholeStageCodegen

WholeStageCodegen is a physical query optimization in Spark SQL that fuses multiple physical operators together into a single generated function, reducing per-row overhead.
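
Both operators can be seen by printing the physical plan of the equivalent aggregation. A sketch, again assuming the df DataFrame from the first snippet; the plan shown in the comments is abridged and its exact shape varies by Spark version:

```scala
// df.count() is executed as a two-step aggregation: a partial count per partition,
// an Exchange that moves the partial results to a single partition, and a final sum.
// The *(n) markers indicate operators fused together by WholeStageCodegen.
df.groupBy().count().explain()

// == Physical Plan == (abridged; details differ between Spark versions)
// *(2) HashAggregate(keys=[], functions=[count(1)])
// +- Exchange SinglePartition
//    +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
//       +- FileScan csv ...
```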

Task

Tasks

Tasks are located at the bottom of the respective stage detail page.
Key things to look at on the task page are:
1. Input Size — the input for the stage.
2. Shuffle Write — the output the stage has written.

Storage

The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes, and partitions of all RDDs, and the details page shows the sizes and the executors used for all partitions in an RDD or DataFrame.
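
Nothing appears on this tab unless something is actually persisted and materialized. A minimal sketch, reusing the df DataFrame from the first snippet:

```scala
import org.apache.spark.storage.StorageLevel

// Mark the DataFrame for caching; the entry shows up on the Storage tab
// only after an action has materialized (some of) its partitions.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

// Remove it from the Storage tab again when it is no longer needed.
df.unpersist()
```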

Environment Tab

Environment tab

The Environment page has five sections. It is a useful place to check whether your properties have been set correctly.
Runtime Information: simply contains the runtime properties, like the versions of Java and Scala.
Spark Properties: lists the application properties, like ‘spark.app.name’ and ‘spark.driver.memory’.
Hadoop Properties: displays properties relative to Hadoop and YARN. Note that properties whose names start with ‘spark.hadoop’ are shown not in this part but under ‘Spark Properties’.
System Properties: shows more details about the JVM.
Classpath Entries: lists the classes loaded from different sources, which is very useful for resolving class conflicts.

Env tab extended View

The Environment tab displays the values for the different environment and configuration variables, including JVM, Spark, and system properties.
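
You can also read these values programmatically, which is handy for confirming what the Environment tab will show. A sketch, assuming the spark session from the first snippet:

```scala
// Individual properties listed under "Spark Properties"
println(spark.conf.get("spark.app.name"))
println(spark.conf.get("spark.master"))

// Dump everything the Environment tab lists under Spark Properties
spark.conf.getAll.foreach { case (key, value) =>
  println(s"$key = $value")
}
```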

Executors Tab

The Executors tab displays summary information about the executors that were created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data.
The Executors tab provides not only resource information (the amount of memory, disk, and cores used by each executor) but also performance information.

In our Executors tab:
Number of cores = 3, as I set the master to local with 3 threads.
Number of tasks = 4

SQL Tab

SQL TAB

If the application executes Spark SQL queries, the SQL tab displays information, such as the duration, jobs, and physical and logical plans for the queries.

In our application, we performed read and count operations on the file and the resulting DataFrame, so both read and count are listed in the SQL tab.

Some of the resources were gathered from sparkbyexamples.com and spark.org; thanks for the information.

If you would like to, you can connect with me on LinkedIn: SriramRimmalapudi.

If you enjoyed reading it, you can click the clap and let others know about it. If you would like me to add anything else, please feel free to leave a response 💬

“…………….Keep learning and keep growing…………………”

