Understanding Spark UI
By the end of this article, you will be able to look at the Spark UI and understand the basics.
Some of the basic operators that show up in the Spark UI are:
FileScan
It represents reading the data from a file format.
Things to look out for in FileScan
- Number of Files Read
  - The number of files or part-files that Spark needs to read into memory
- File System Read Data Size Total
  - The actual size of data that will go to the next stage
  - This size will match the input size of the next stage
- Size of Files Read Total
  - The total size of data that Spark reads while scanning the files
- Rows Output
  - The number of records that will be passed to the next stage
  - The row count will match the input row count of the next stage
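To make these metrics concrete, here is a plain-Python sketch (not Spark itself) that "scans" a set of in-memory part-files and computes the same kind of counters a FileScan reports:

```python
import csv
import io

# Hypothetical in-memory "part files", standing in for CSV files on disk.
part_files = {
    "part-00000.csv": "id,name\n1,a\n2,b\n",
    "part-00001.csv": "id,name\n3,c\n",
}

def file_scan(files):
    """Compute FileScan-style metrics: files read, bytes read, rows output."""
    size_of_files_read = sum(len(data.encode()) for data in files.values())
    rows = []
    for data in files.values():
        rows.extend(csv.DictReader(io.StringIO(data)))
    return {
        "number_of_files_read": len(files),
        "size_of_files_read_total": size_of_files_read,
        "rows_output": len(rows),
    }, rows

metrics, rows = file_scan(part_files)
# Here: 2 files read, 3 rows output to the next stage.
```

The rows returned are exactly what flows into the next stage, which is why Rows Output here matches the input row count there.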
Exchange
It represents a Shuffle, the physical movement of data across the cluster.
Exchange is one of the most expensive operations in a Spark job.
There are multiple ways in which data can be re-partitioned when it is shuffled. To see which type of partitioning happened, hover over the Exchange node in the Spark UI.
Single Partition
With a single-partition exchange, all your data is moved to a single partition and processed on a single core. This creates a serious bottleneck in your job, and if your data is large you might even hit an OutOfMemoryError.
This happens when we use a window function but do not provide a key by which the data should be split.
Triggers
- window.partitionBy()
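A plain-Python sketch (illustration, not Spark code) of why this hurts: with no partition key, every row maps to the same partition, so one core does all the work.

```python
def single_partition(rows):
    # No partition key: every row lands in partition 0, which is what
    # an unkeyed window function forces Spark to do.
    return {0: list(rows)}

parts = single_partition(range(1000))
# All 1000 rows end up on one partition, i.e. one core.
```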
Hash Partitioning
Triggers
- groupBy
- distinct
- join
- repartition(column)
- window.partitionBy(key)
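The idea behind hash partitioning can be sketched in plain Python: each row's partition is the hash of its key modulo the partition count, so rows with the same key always land together. (Sketch only; Spark uses Murmur3 hashing, not Python's built-in hash.)

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its key."""
    partitions = {i: [] for i in range(num_partitions)}
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"dept": "eng", "salary": 1},
        {"dept": "hr", "salary": 2},
        {"dept": "eng", "salary": 3}]
parts = hash_partition(rows, "dept", 4)
# Both "eng" rows land in the same partition, which is what makes
# a subsequent groupBy or join on "dept" possible without further movement.
```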
Round Robin Partition
Triggers
- repartition(number)
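Round-robin partitioning simply deals rows out evenly, as this plain-Python sketch shows (real Spark additionally starts each task at a random partition to avoid skew across tasks):

```python
from itertools import cycle

def round_robin_partition(rows, num_partitions):
    """Deal rows out one partition at a time, as repartition(n) does."""
    partitions = {i: [] for i in range(num_partitions)}
    for row, i in zip(rows, cycle(range(num_partitions))):
        partitions[i].append(row)
    return partitions

parts = round_robin_partition(list(range(10)), 3)
# Partition sizes differ by at most one: 4, 3, 3.
```

This is why `repartition(number)` gives evenly sized partitions, at the cost of a full shuffle.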
Range Partition
Triggers
- orderBy
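Range partitioning assigns rows to partitions by value ranges, so that sorting each partition independently yields a globally ordered result. A plain-Python sketch with hand-picked boundaries (Spark samples the data to pick them):

```python
import bisect

def range_partition(rows, boundaries):
    """Assign each value to a partition based on range boundaries."""
    partitions = {i: [] for i in range(len(boundaries) + 1)}
    for row in rows:
        partitions[bisect.bisect_right(boundaries, row)].append(row)
    return partitions

parts = range_partition([5, 12, 30, 7, 25], boundaries=[10, 20])
# Partition 0 holds values up to 10, partition 1 values up to 20, and so on.
```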
Aggregate
Triggers
- groupBy
- distinct
- dropDuplicates
There are 3 types of aggregates that can happen on your data:
- Hash Aggregate
  - Comes in pairs:
    1. A partial sum that happens on each executor, after which the data is shuffled
    2. A final merge sum that happens after the Exchange and merges all the partial sums
  - The Exchange between the hash aggregate pair may or may not happen.
- Sort Aggregate
- Object Hash Aggregate
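The partial/final pair can be sketched in plain Python for a groupBy-sum, with the shuffle sitting between the two phases (illustration, not Spark code):

```python
from collections import defaultdict

def partial_sum(rows):
    """Phase 1: runs independently on each executor's partition."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)

def final_merge_sum(partials):
    """Phase 2: after the Exchange, merge the partial sums per key."""
    acc = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            acc[key] += value
    return dict(acc)

# Two hypothetical "executors", each holding one partition of the data.
executor_1 = [("a", 1), ("b", 2), ("a", 3)]
executor_2 = [("a", 4), ("b", 5)]
partials = [partial_sum(executor_1), partial_sum(executor_2)]
totals = final_merge_sum(partials)
# totals == {"a": 8, "b": 7}
```

The partial phase is what shrinks the data before the shuffle, which is why the Exchange between the pair is usually cheap.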
Sort Merge Join
Sort Merge Join represents joining two DataFrames.
An Exchange and a Sort happen before the join when the output partitioning and output ordering requirements are not already met by the individual datasets.
If you hover over the Sort Merge Join in the Spark UI, you can see which type of join actually happened.
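The core of the operator can be sketched in plain Python: sort both sides by the join key, then walk them together with two pointers (simplified sketch assuming unique keys on each side; the real operator handles duplicates):

```python
def sort_merge_join(left, right, key):
    """Inner-join two lists of dicts on `key` by sorting then merging."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out

users = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
orders = [{"id": 1, "total": 10}, {"id": 3, "total": 30}]
joined = sort_merge_join(users, orders, "id")
# joined == [{"id": 1, "name": "a", "total": 10}]
```

This is why the Sort nodes appear in the plan: once both sides are sorted on the key, the merge itself is a single linear pass.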
Broadcast Hash Join
Broadcast Hash Join comes in pairs:
- Broadcast Exchange
  - This is where the data is broadcast to each executor
- Broadcast Hash Join
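A plain-Python sketch of the pair: the small side is turned into a hash table (the broadcast), and each partition of the big side simply probes it, so the big side never needs to be shuffled. (Simplified sketch assuming unique keys on the small side.)

```python
def broadcast_hash_join(big, small, key):
    """Inner-join by broadcasting the small side as a hash table."""
    # Broadcast Exchange: the small side is shipped to every executor
    # and built into a lookup table.
    lookup = {row[key]: row for row in small}
    # Broadcast Hash Join: each partition of the big side probes the table.
    return [{**row, **lookup[row[key]]} for row in big if row[key] in lookup]

orders = [{"cust": 1, "amt": 5}, {"cust": 2, "amt": 7}, {"cust": 1, "amt": 3}]
customers = [{"cust": 1, "name": "a"}]  # small enough to broadcast
joined = broadcast_hash_join(orders, customers, "cust")
# The two orders for customer 1, enriched with the customer's name.
```

Avoiding the shuffle of the big side is what makes this join so much cheaper than a Sort Merge Join when one side is small.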