PySpark Broadcast and Accumulator

R RAMYA
4 min read · May 9, 2022


As we know, Apache Spark uses shared variables for parallel processing…

📌 If you’re not aware of what Apache Spark is, do refer to my blog on Apache Spark 📌

→ Shared variables are of two types: Broadcast and Accumulator.

Before getting into them, let me explain what shared variables are…

🔎 Shared variables are variables that need to be used by many functions and methods in parallel.

So, now let’s start with PySpark Broadcast and Accumulator.

PySpark Broadcast and Accumulator

→ In parallel processing, when the driver sends a task to an executor on the cluster, a copy of each shared variable goes to every node of the cluster, so that it can be used while performing the tasks.

Let’s take a deeper look at its types…

1. Broadcast Variables - PySpark

→ Basically, broadcast variables are used to keep a copy of some data on all the nodes.

→ The variable is cached once on every machine, rather than being shipped to the machines again with each task.

Also, we can use it to broadcast some information to all the executors.

It can be of any type…

→ Either a primitive type or a hash map

Single value:

A single value is a common value shared by all the products.

Hash map:

A hash map is used for a lookup or a map-side join.

→ Broadcasting the dimension can improve performance considerably when a very large data set (fact) is joined with a smaller data set (dimension). In addition, these variables are immutable.

Code Block of Broadcast class:

Example:

--------------------------------------------------

2. Accumulators - PySpark 📌

→ Accumulator variables are used for aggregating information through associative and commutative operations.

For example, we can use an accumulator for a sum operation or for counters (as in MapReduce).

Code block of an Accumulator class:

class pyspark.Accumulator(aid, value, accum_param)

--------------------------------------------------

Note: The value of an accumulator can be read only in the driver program; the tasks running on the workers can only add to it.

--------------------------------------------------

“An accumulator variable is used by multiple workers and returns an accumulated value.”

Example:

Output:

This is all about PySpark Broadcast and Accumulator…

Hope you got an idea of it…

Still, if you have any doubts, ask in the comments…

Cheers !!! : )

Ramya R : )

Resources:

https://data-flair.training/blogs/pyspark-broadcast-and-accumulator/
