Understanding Spark UI
By the end of this article, you will be able to look at the Spark UI and understand the basics.
Some of the basic operators that show up in the Spark UI are:
FileScan
It represents reading the data from a file format.
Things to look out for in FileScan
- Number of Files Read
  - The number of files or part-files that Spark needs to read into memory
- File System Read Data Size Total
  - The actual size of data that will go to the next stage
  - This size will match the input size of the next stage
- Size of Files Read Total
  - The total size of data that Spark reads while scanning the files
- Rows Output
  - The number of records that will be passed to the next stage
  - The row count will match the input row count of the next stage
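To make these metrics concrete, here is a plain-Python sketch (not Spark itself) that "scans" a set of in-memory part-files and computes the same kind of counters a FileScan reports:

```python
import csv
import io

# Hypothetical in-memory "part files", standing in for CSV files on disk.
part_files = {
    "part-00000.csv": "id,name\n1,a\n2,b\n",
    "part-00001.csv": "id,name\n3,c\n",
}

def file_scan(files):
    """Compute FileScan-style metrics: files read, bytes read, rows output."""
    size_of_files_read = sum(len(data.encode()) for data in files.values())
    rows = []
    for data in files.values():
        rows.extend(csv.DictReader(io.StringIO(data)))
    return {
        "number_of_files_read": len(files),
        "size_of_files_read_total": size_of_files_read,
        "rows_output": len(rows),
    }, rows

metrics, rows = file_scan(part_files)
# Here: 2 files read, 3 rows output to the next stage.
```

The rows returned are exactly what flows into the next stage, which is why Rows Output here matches the input row count there.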
Exchange
It represents a Shuffle, the physical movement of data across the cluster.
Exchange is one of the most expensive operations in a Spark job.
There are multiple ways in which data can be re-partitioned when it is shuffled. To see which type of partitioning happened, hover over the Exchange node in the Spark UI.
Single Partition
With a single-partition exchange, all your data is moved to a single partition and processed on a single core. This creates a serious bottleneck in your job, and if your data is large you might even hit an OutOfMemoryError.
This happens when we use a window function but do not provide a key by which the data should be split.
Triggers
- window.partitionBy()
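A plain-Python sketch (illustration, not Spark code) of why this hurts: with no partition key, every row maps to the same partition, so one core does all the work.

```python
def single_partition(rows):
    # No partition key: every row lands in partition 0, which is what
    # an unkeyed window function forces Spark to do.
    return {0: list(rows)}

parts = single_partition(range(1000))
# All 1000 rows end up on one partition, i.e. one core.
```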
Hash Partitioning
Triggers
- groupBy
- distinct
- join
- repartition(column)
- window.partitionBy(key)
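The idea behind hash partitioning can be sketched in plain Python: each row's partition is the hash of its key modulo the partition count, so rows with the same key always land together. (Sketch only; Spark uses Murmur3 hashing, not Python's built-in hash.)

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its key."""
    partitions = {i: [] for i in range(num_partitions)}
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"dept": "eng", "salary": 1},
        {"dept": "hr", "salary": 2},
        {"dept": "eng", "salary": 3}]
parts = hash_partition(rows, "dept", 4)
# Both "eng" rows land in the same partition, which is what makes
# a subsequent groupBy or join on "dept" possible without further movement.
```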
Round Robin Partition
Triggers
- repartition(number)
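Round-robin partitioning simply deals rows out evenly, as this plain-Python sketch shows (real Spark additionally starts each task at a random partition to avoid skew across tasks):

```python
from itertools import cycle

def round_robin_partition(rows, num_partitions):
    """Deal rows out one partition at a time, as repartition(n) does."""
    partitions = {i: [] for i in range(num_partitions)}
    for row, i in zip(rows, cycle(range(num_partitions))):
        partitions[i].append(row)
    return partitions

parts = round_robin_partition(list(range(10)), 3)
# Partition sizes differ by at most one: 4, 3, 3.
```

This is why `repartition(number)` gives evenly sized partitions, at the cost of a full shuffle.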
Range Partition
Triggers
- orderBy
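Range partitioning assigns rows to partitions by value ranges, so that sorting each partition independently yields a globally ordered result. A plain-Python sketch with hand-picked boundaries (Spark samples the data to pick them):

```python
import bisect

def range_partition(rows, boundaries):
    """Assign each value to a partition based on range boundaries."""
    partitions = {i: [] for i in range(len(boundaries) + 1)}
    for row in rows:
        partitions[bisect.bisect_right(boundaries, row)].append(row)
    return partitions

parts = range_partition([5, 12, 30, 7, 25], boundaries=[10, 20])
# Partition 0 holds values up to 10, partition 1 values up to 20, and so on.
```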
Aggregate
Triggers
- groupBy
- distinct
- dropDuplicates
There are 3 types of aggregates that can happen on your data:
- Hash Aggregate
  - Comes in pairs:
    1. A partial sum that happens on each executor, after which the data is shuffled
    2. A final merge sum that happens after the Exchange and merges all the partial sums
  - The Exchange between the hash aggregate pair may or may not happen.
- Sort Aggregate
- Object Hash Aggregate
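The partial/final pair can be sketched in plain Python for a groupBy-sum, with the shuffle sitting between the two phases (illustration, not Spark code):

```python
from collections import defaultdict

def partial_sum(rows):
    """Phase 1: runs independently on each executor's partition."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)

def final_merge_sum(partials):
    """Phase 2: after the Exchange, merge the partial sums per key."""
    acc = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            acc[key] += value
    return dict(acc)

# Two hypothetical "executors", each holding one partition of the data.
executor_1 = [("a", 1), ("b", 2), ("a", 3)]
executor_2 = [("a", 4), ("b", 5)]
partials = [partial_sum(executor_1), partial_sum(executor_2)]
totals = final_merge_sum(partials)
# totals == {"a": 8, "b": 7}
```

The partial phase is what shrinks the data before the shuffle, which is why the Exchange between the pair is usually cheap.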
Sort Merge Join
Sort Merge Join represents joining two DataFrames.
An Exchange and a Sort happen before the join when the output partitioning and output ordering requirements are not already met by the individual datasets.
If you hover over the Sort Merge Join in the Spark UI, you can see which type of join actually happened.
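The core of the operator can be sketched in plain Python: sort both sides by the join key, then walk them together with two pointers (simplified sketch assuming unique keys on each side; the real operator handles duplicates):

```python
def sort_merge_join(left, right, key):
    """Inner-join two lists of dicts on `key` by sorting then merging."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out

users = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
orders = [{"id": 1, "total": 10}, {"id": 3, "total": 30}]
joined = sort_merge_join(users, orders, "id")
# joined == [{"id": 1, "name": "a", "total": 10}]
```

This is why the Sort nodes appear in the plan: once both sides are sorted on the key, the merge itself is a single linear pass.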
Broadcast Hash Join
Broadcast Hash Join comes in pairs:
- Broadcast Exchange
  - This is where the data is broadcast to each executor
- Broadcast Hash Join
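A plain-Python sketch of the pair: the small side is turned into a hash table (the broadcast), and each partition of the big side simply probes it, so the big side never needs to be shuffled. (Simplified sketch assuming unique keys on the small side.)

```python
def broadcast_hash_join(big, small, key):
    """Inner-join by broadcasting the small side as a hash table."""
    # Broadcast Exchange: the small side is shipped to every executor
    # and built into a lookup table.
    lookup = {row[key]: row for row in small}
    # Broadcast Hash Join: each partition of the big side probes the table.
    return [{**row, **lookup[row[key]]} for row in big if row[key] in lookup]

orders = [{"cust": 1, "amt": 5}, {"cust": 2, "amt": 7}, {"cust": 1, "amt": 3}]
customers = [{"cust": 1, "name": "a"}]  # small enough to broadcast
joined = broadcast_hash_join(orders, customers, "cust")
# The two orders for customer 1, enriched with the customer's name.
```

Avoiding the shuffle of the big side is what makes this join so much cheaper than a Sort Merge Join when one side is small.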