Broadcast & Accumulator — Shared Variables in Spark

Siddharth Ghosh
5 min read · Aug 10, 2022

Access this article for free at Broadcast & Accumulator Variables in Spark. However, I highly recommend becoming a Medium member to explore more engaging content and support the talented writers.

In the Big Data world, code runs on remote machines in containers of their own, each creating its own copies of the variables needed for execution. Sharing these variables and keeping their values in sync across the cluster would be inefficient. However, Spark does provide two types of shared variables with limited usage patterns: Broadcast & Accumulator variables.

Photo by Christopher Gower on Unsplash

Broadcast Variables

Broadcast variables allow developers to cache a read-only copy of a variable on each machine/node rather than shipping a copy of it with every task. In other words, whenever the driver program encounters a broadcast variable (or one that can be broadcast), it creates a copy of it and shares it with all the machines/nodes where the tasks are supposed to be executed.
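A minimal PySpark sketch of that flow (the names country_lookup, orders and bc_countries are just illustrative): the driver broadcasts a small lookup dictionary once, and each task reads the executor-local cached copy through .value instead of receiving the data with every task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Small read-only lookup table that lives on the driver
country_lookup = {"IN": "India", "US": "United States", "DE": "Germany"}

# Ship it to every executor once as a broadcast variable
bc_countries = sc.broadcast(country_lookup)

orders = sc.parallelize([("o1", "IN"), ("o2", "US"), ("o3", "DE")])

# Tasks read the executor-local copy via bc_countries.value
resolved = orders.map(lambda o: (o[0], bc_countries.value.get(o[1], "Unknown")))
print(resolved.collect())
# [('o1', 'India'), ('o2', 'United States'), ('o3', 'Germany')]
```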

The benefit of copying the data using broadcast is that it stays on the worker node for the lifetime of the Spark application. So if the same variable or data is used across multiple stages, it does not need to be shipped every time and can be read from the cached copy. The data is cached in serialized form and deserialized before each task runs.
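Continuing the sketch above (assuming the same orders RDD and bc_countries broadcast), the same broadcast variable can back several jobs: once the first action has pulled the broadcast data onto the executors, later actions in the same application reuse the cached copy instead of fetching it again.

```python
# First action: broadcast data is fetched to the executors and cached there
per_country_counts = (
    orders.map(lambda o: (bc_countries.value.get(o[1], "Unknown"), 1))
          .reduceByKey(lambda a, b: a + b)
)
print(per_country_counts.collect())

# Second action: reuses the copy already cached on each executor
only_india = orders.filter(lambda o: bc_countries.value.get(o[1]) == "India")
print(only_india.collect())
```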
