Persistence Vs. Broadcast

Tharun Kumar Sekar
Published in Analytics Vidhya
Jan 4, 2020 · 2 min read

Most people don't clearly understand the difference between persisting a DataFrame and broadcasting one.
Let me make it simple for you.

Persist Process

Let's say you have a DataFrame of 12 GB, split into 6 partitions, running on 3 executors. Each partition holds 2 GB of data in memory, and each executor reads 2 partitions, so each executor ends up with 4 GB of data in memory.

One key thing to note here: the persisted data in each executor is stored in Storage Memory, serialized or deserialized depending on the storage level you choose.
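Here is a minimal sketch of what that looks like in code. The input path, the 6-way repartition, and the storage level are illustrative and simply mirror the 12 GB / 6 partitions / 3 executors example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-example").getOrCreate()

// Hypothetical 12 GB dataset, repartitioned into 6 partitions of ~2 GB each.
val df = spark.read.parquet("/data/events").repartition(6)

// persist() keeps the partitions in each executor's Storage Memory.
// MEMORY_ONLY caches deserialized objects; MEMORY_ONLY_SER would cache
// serialized bytes instead (smaller footprint, more CPU to read back).
df.persist(StorageLevel.MEMORY_ONLY)

// Caching happens lazily: the first action materializes the partitions,
// so each of the 3 executors ends up holding about 4 GB (2 x 2 GB).
df.count()
```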

Broadcast Process

Now consider the same setup: 12 GB of data, 6 partitions, and 3 executors. Spark reads each partition the same way it did during Persist, but this time it stores the data in each executor's Working Memory, taking the same amount of space: 4 GB per executor, or 12 GB in total across the three executors.

From Working Memory, the data is then pulled to the Driver through a COLLECT. The Driver converts the DataFrame into a broadcastable object (a hash map).

This hash map object will be 12 GB in size.

Finally, this object will be sent to each executor.
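As a sketch, here are the two common ways this plays out in code, reusing the `spark` session from the earlier snippet. `largeDf`, `smallDf`, the paths, and the column names are hypothetical: a broadcast join hint makes Spark collect the smaller side to the Driver, build a hashed relation, and ship it to every executor, while a broadcast variable does the same for a driver-side object you build yourself.

```scala
import org.apache.spark.sql.functions.broadcast

// Hypothetical inputs: a large fact table and a small lookup table.
val largeDf = spark.read.parquet("/data/events")
val smallDf = spark.read.parquet("/data/lookup")

// Broadcast join hint: Spark collects smallDf to the Driver, builds a
// hashed relation from it, and sends one copy to every executor.
val joined = largeDf.join(broadcast(smallDf), Seq("id"))

// The same idea with an explicit broadcast variable: collect() pulls the
// rows to the Driver, and sparkContext.broadcast ships the resulting map
// to each executor once.
val lookup = smallDf.collect().map(row => row.getString(0) -> row.getString(1)).toMap
val lookupBc = spark.sparkContext.broadcast(lookup)
```

In practice, broadcasting is meant for small tables (Spark's automatic broadcast-join threshold, spark.sql.autoBroadcastJoinThreshold, defaults to 10 MB); the 12 GB figure here is only to make the memory accounting easy to follow.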


The total space used on the executors during a broadcast is the original data plus one copy of the broadcasted object per executor, which works out to x + n(x), where x is the size of the data and n is the number of executors.

In this case, the total space used for broadcasting will be 12 + 3(12) = 48 GB.
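The same accounting as a tiny sketch, using only the numbers from this example:

```scala
val x = 12                        // GB: size of the data
val n = 3                         // number of executors
val duringBroadcast = x + n * x   // 12 + 3 * 12 = 48 GB
val afterGc = n * x               // 36 GB of broadcast copies remain
```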

One key thing to note: all of this space comes out of Working Memory.

This remains the case until the garbage collector runs and clears the partitions of data stored in each executor. Once the garbage collector clears that data, you are finally left with 36 GB, the combined size of the broadcasted object across the three executors (12 GB each).
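If you don't want to wait for garbage collection, both the cached partitions and the broadcast copies can be released explicitly. A small sketch, using the hypothetical `df` and `lookupBc` from the snippets above:

```scala
// Free the cached partitions from Storage Memory on the executors.
df.unpersist()

// Remove the broadcast copies from the executors; destroy() also deletes
// the driver-side copy so the object can no longer be re-broadcast.
lookupBc.unpersist()
lookupBc.destroy()
```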

If you want to learn more about Working Memory and Storage Memory in the executors, refer to this article: https://medium.com/@tharun026/spark-memory-management-583a16c1253f

This article is a transcript of Daniel Tomes's video from the Databricks Spark Summit.
