Understanding Spark UI

Tharun Kumar Sekar · Analytics Vidhya · Jan 1, 2020

By the end of this article, you will be able to look at the Spark UI and understand the basics.

Some of the basic operators that show up in the Spark UI are:

FileScan

It represents reading data from a file-based source such as Parquet, ORC, CSV, or JSON.

Things to look out for in FileScan

  • Number of Files Read
    — The number of files or part-files that Spark needs to read into memory
  • File System Read Data Size Total
    — The actual size of the data that will go to the next stage
    — This size will match the input size of the next stage
  • Size of Files Read Total
    — The total size of the data that Spark reads while scanning the files
  • Rows Output
    — The number of records that will be passed to the next stage
    — This row count will match the input row count of the next stage

Exchange

It represents a Shuffle — physical data movement across the cluster.

Exchange is one of the most expensive operations in a Spark job.

There are multiple ways in which data can be re-partitioned when it is shuffled. To know the type of partitioning that happened, mouse over the Exchange node in the Spark UI.

Single Partition

With a single partition, all your data is moved to one partition and processed on a single core. This creates a serious bottleneck in your job, and if your data is large, you might even face an OutOfMemoryError.

This happens when we use a window function but do not provide a key by which the data has to be split.

Triggers

  • Window.partitionBy() with no key specified

Hash Partitioning

Triggers

  • groupBy
  • distinct
  • join
  • repartition(column)
  • Window.partitionBy(key)

Round Robin Partition

Triggers

  • repartition(number)

Range Partition

Triggers

  • orderBy

Aggregate

Triggers

  • groupBy
  • distinct
  • dropDuplicates

There are 3 types of aggregates that can happen on your data:

  • Hash Aggregate
    — Comes in pairs: a Partial aggregate (e.g., a partial sum) that happens on each executor, after which the data is shuffled, and a Final merge that happens after the Exchange and combines all the partial results
    — The Exchange between the pair of hash aggregates may or may not happen
  • Sort Aggregate
  • Object Hash Aggregate

Sort Merge Join

Sort Merge Join represents joining two dataframes.

An Exchange and a Sort are added before the join only when the output partitioning and output ordering requirements are not already met by the individual datasets; if they are already met, Spark skips those steps.

If you mouse over the Sort Merge Join in your Spark UI, you will be able to see which type of join (inner, left outer, etc.) was actually performed.

Broadcast Hash Join

Broadcast Hash Join shows up as a pair of operators:

  • Broadcast Exchange
    — This is where the smaller dataset is broadcast to each executor
  • Broadcast Hash Join
