As we know, Apache Spark uses shared variables for parallel processing…
If you're not aware of what Apache Spark is, do refer to my blog on Apache Spark.
Shared variables are of two types: Broadcast and Accumulator.
Before getting into them, let me explain what shared variables are…
Shared variables are variables that need to be used by many functions and methods in parallel.
So, now let's start with PySpark Broadcast and Accumulator.
PySpark Broadcast and Accumulator
In parallel processing, when the driver sends a task to the executors on the cluster, a copy of each shared variable goes to every node of the cluster, so it can be used while performing tasks.
Let's take a deeper look at both types…
1. Broadcast Variables in PySpark
Basically, broadcast variables are used to save a copy of data on all nodes.
However, this variable is cached on every machine, rather than being shipped to the machines with each task.
Also, we can use it to broadcast some information to all the executors.
It can be of any type…
Either a primitive type or a hashmap.
Single value:
A single value refers to a common value shared by all the tasks, for example one value applied to all the products.
Hashmap:
A hashmap is used as a lookup table, i.e., for a map-side join.
Broadcasting the dimension table can give a considerable performance improvement when a very large data set (the fact table) is joined with a smaller data set (the dimension table). In addition, broadcast variables are immutable.
Code block of the Broadcast class:
class pyspark.Broadcast(sc=None, value=None, pickle_registry=None, path=None)
Example:
2. Accumulators in PySpark
Accumulator variables are used for aggregating information through associative and commutative operations.
For example, we can use an accumulator for a sum operation or for counters (as in MapReduce).
Code block of the Accumulator class:
class pyspark.Accumulator(aid, value, accum_param)
Note: an accumulator's value can only be read in the driver program; tasks can only add to it.
An accumulator variable can be updated by multiple workers and returns an accumulated value.
Example:
This is all about PySpark Broadcast and Accumulator…
Hope you got an idea of it…
Still, if you have any doubt, ask in the comments…
Cheers!!! :)
Ramya R :)
Resources:
https://data-flair.training/blogs/pyspark-broadcast-and-accumulator/