Understanding Spark UI — Part 1

Tharun Kumar Sekar · Published in Analytics Vidhya · Jan 5, 2020

The basic sections you will find in the Spark UI are:
1. Jobs
2. Stages
3. Tasks
4. Storage
5. Environment
6. Executors
7. SQL

Jobs

Job Snapshot

A job can be considered a physical unit of your ETL code: Spark creates one job for every action it executes. The first detail to be aware of in the Jobs section is the number of stages a job has. In the above image, it is 4.
The second detail that needs attention is the number of running tasks in your job. In this case, it is 97, which corresponds directly to the number of cores available in your cluster.
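
For context, a job appears in this tab each time an action runs. Below is a minimal PySpark sketch, where the application name, input path and column name are hypothetical, whose final count() would show up as a single job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-ui-demo").getOrCreate()

df = spark.read.parquet("/data/events")   # hypothetical input path
ok = df.filter(df["status"] == "OK")      # transformation only, no job yet
ok.count()                                # action: this appears as a job in the UI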

Stages

Stages Snapshot

To navigate to the Stages page, click on the Description of the respective job. All the stages that you see on the Stages page belong to the specific job you are working on.

If your job has 2 stages, it means 1 shuffle is happening in your job. If it has 5 stages, 4 shuffles are happening. The number of shuffles is always one less than the number of stages. In an ideal job, you would have a single stage, which means no shuffle and every task running in parallel.
The number of tasks you see in each stage is the number of partitions that Spark will work on, and each task inside a stage performs the same work on a different segment of the data.

Same work is done on different partitions of data
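
As a rough sketch of how a shuffle adds a stage, continuing the hypothetical DataFrame from the Jobs sketch above (the column names are also hypothetical): reading and filtering stay in one stage, while the groupBy forces a shuffle, so the resulting job typically shows two stages.

from pyspark.sql import functions as F

# ok is the hypothetical filtered DataFrame from the Jobs sketch.
daily = (
    ok.groupBy("event_date")              # shuffle boundary -> new stage
      .agg(F.count("*").alias("events"))
)
daily.collect()                           # action that runs both stages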

Tasks

Tasks Snapshot

To navigate to the Tasks page, click on the Description in the respective stage.
Key things to look for in the Tasks page are:
1. Input Size — the input for the stage.

  • The expectation is that the min, 25th percentile, median, 75th percentile and max values of the input size should be almost the same, ideally somewhere between 128 MB and 256 MB.
  • The reason for expecting the task input size to be roughly the same across these percentiles is to put an equal volume of data on each of the cores we allot for the job.
  • In the above case, the min input size is 0 and the 25th percentile is 27.1 MB, which means some cores are working on no data at all, so we are under-utilizing the cores.
  • In other cases the input size may start from 1 GB, which means we are over-loading the cores and should repartition the data (see the sketch after this list).

2. Shuffle Write — the output of the stage, written to locally mounted disks. In the next stage, this becomes the input and is read back from disk.
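
A minimal sketch of how you might even out task sizes when the Summary Metrics look skewed; the partition count and key column below are hypothetical, and the aim is roughly 128 to 256 MB per partition:

# Raise or lower the partition count used by subsequent shuffles.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Explicitly repartition an existing DataFrame ...
balanced = df.repartition(400)

# ... or repartition by a key column if one key dominates the data.
balanced_by_key = df.repartition(400, "customer_id")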

To understand Spark memory management, refer to this article.

Summary Metrics is one of the most important parts of the Spark UI. It shows you how your data is distributed among your partitions.

Storage

The Storage tab is where you can see details about your persisted data.
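
For example, a DataFrame appears in the Storage tab only after it is persisted and then materialized by an action. A minimal sketch, reusing the hypothetical df from the Jobs sketch:

from pyspark import StorageLevel

cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()        # an action materializes the cache and creates the Storage entry
cached.unpersist()    # removes the entry from the Storage tab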

Environment

This page gives you information about your Spark configuration properties.
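
The properties listed here come from spark-defaults.conf, spark-submit arguments, or values set in code. A minimal sketch of setting a couple of hypothetical values at session creation and reading them back at runtime:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-ui-demo")
    .config("spark.sql.shuffle.partitions", "400")   # hypothetical values
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

print(spark.sparkContext.getConf().getAll())         # the same values the Environment tab lists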

Executors

This tab helps you identify node-level issues.
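
The per-executor numbers shown in this tab (tasks, memory used, GC time, shuffle read/write) are also exposed by Spark's monitoring REST API, which can help spot a misbehaving node programmatically. A sketch, assuming a running SparkSession named spark and a driver UI reachable on the default port 4040:

import requests

app_id = spark.sparkContext.applicationId
executors = requests.get(
    f"http://localhost:4040/api/v1/applications/{app_id}/executors"
).json()

for e in executors:
    print(e["id"], e["totalTasks"], e["memoryUsed"], e["totalGCTime"])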

SQL

This is similar to an execution plan in SQL. It is a great place to start if you are trying to optimize your code.
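
The same plan that the SQL tab draws as a DAG can also be printed from code with explain(), which is useful when comparing plans before and after a change. A sketch, reusing the hypothetical daily DataFrame from the Stages sketch:

daily.explain(True)   # True -> also prints the parsed, analyzed and optimized plans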

To understand more about SQL in the Spark UI, refer to this article.
