PySpark Broadcast and Accumulator

4 min read · May 9, 2022

As we know, Apache Spark uses shared variables for parallel processing…

📌 If you’re not familiar with Apache Spark, do refer to my blog on Apache Spark 📌

→ Shared variables are of two types: Broadcast and Accumulator.

Before getting into them, let me explain what shared variables are…

🔎 Shared variables are variables that many functions and methods need to use in parallel.

So, now let’s start with PySpark Broadcast and Accumulator.

PySpark Broadcast and Accumulator

→ In parallel processing, when the driver sends a task to an executor on the cluster, a copy of the shared variable goes to each node of the cluster, so that the tasks running there can use it.

Let’s take a closer look at the two types…

1. Broadcast Variables — PySpark

→ Basically, Broadcast variables are used to keep a read-only copy of data on all the nodes.

→ The variable is cached on every machine, rather than being sent along with each task.

Also, we can use it to broadcast some information to all the executors.

It can be of any type…
→ Either a primitive type or a hash map

Single Value:

A single value is one common value that applies to all the products.
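
To make the single-value case concrete, here is a minimal sketch. The tax rate, the prices and the helper names `apply_tax` and `prices_with_tax` are made up for illustration; only `sc.broadcast()` and `.value` come from PySpark itself.

```python
def apply_tax(price, tax_rate):
    """Return the price including tax, rounded to 2 decimals."""
    return round(price * (1 + tax_rate), 2)


def prices_with_tax(sc, prices, tax_rate):
    """Apply one broadcast tax rate to every price (illustrative helper)."""
    # Broadcast the single value; executors cache it and tasks read it via .value
    rate_bc = sc.broadcast(tax_rate)
    return (
        sc.parallelize(prices)
        .map(lambda price: apply_tax(price, rate_bc.value))
        .collect()
    )


def main():
    # Assumes PySpark and a Java runtime are installed on this machine.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-single-value")
    try:
        print(prices_with_tax(sc, [100.0, 250.0], 0.18))
    finally:
        sc.stop()
```

Calling `main()` runs the sketch on a local Spark context. For a value this small, a plain Python variable captured in the lambda would also work; broadcasting starts to pay off as the shared data gets bigger.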

Hashmap:

A hash map is used for lookups or map-side joins.
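
A lookup with a broadcast hash map could look roughly like this. The product codes, the names and the helper `label_orders` are hypothetical; the point is only that every task reads the same cached dictionary through `.value`.

```python
def label_orders(sc, orders, product_names):
    """Swap each product code for its name using a broadcast dict (illustrative)."""
    # The whole dictionary is cached on every executor
    names_bc = sc.broadcast(product_names)

    def to_named(order):
        code, quantity = order
        # .get() avoids a KeyError for codes missing from the lookup table
        return (names_bc.value.get(code, "unknown"), quantity)

    return sc.parallelize(orders).map(to_named).collect()


def main():
    # Assumes PySpark and a Java runtime are installed on this machine.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-lookup")
    try:
        orders = [("P1", 2), ("P2", 1), ("P9", 5)]
        names = {"P1": "Laptop", "P2": "Mouse"}
        print(label_orders(sc, orders, names))
    finally:
        sc.stop()
```

Because the lookup happens inside `map`, the orders never have to be shuffled to meet the names, which is what makes it a map-side join.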

→ Broadcasting the dimension can improve performance considerably when a very large data set (the fact) is joined with a much smaller data set (the dimension), because the large data set does not have to be shuffled across the cluster. In addition, these variables are immutable.
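
With DataFrames, the same fact and dimension idea can be sketched using a broadcast join hint. The table contents, the column names and the helper `join_sales_with_products` are assumptions for the sake of the example; `DataFrame.hint("broadcast")` is the real PySpark call that asks Spark to broadcast the smaller side.

```python
def join_sales_with_products(spark, sales_rows, product_rows):
    """Join a large fact table with a small dimension table (illustrative)."""
    sales_df = spark.createDataFrame(sales_rows, ["product_id", "quantity"])
    products_df = spark.createDataFrame(product_rows, ["product_id", "name"])

    # Ask Spark to ship the small dimension table to every executor,
    # so the fact table can be joined without being shuffled
    joined = sales_df.join(products_df.hint("broadcast"), on="product_id", how="inner")

    # Sort on the driver so the result order is stable
    return sorted((r["product_id"], r["name"], r["quantity"]) for r in joined.collect())


def main():
    # Assumes PySpark and a Java runtime are installed on this machine.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("broadcast-join").getOrCreate()
    try:
        sales = [(1, 3), (2, 5), (3, 1)]
        products = [(1, "Laptop"), (2, "Mouse")]
        print(join_sales_with_products(spark, sales, products))
    finally:
        spark.stop()
```

The function form `pyspark.sql.functions.broadcast(products_df)` expresses the same hint. Either way it is only a hint, so Spark may still pick another join strategy.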

Code Block of Broadcast class: