Spark Memory Management

Priyanka Bhakuni
Apr 22, 2020


You must have heard Jim Gray's storage latency analogy about how far away data really is.

[Image: Jim Gray's storage latency hierarchy — courtesy of Slideplayer]

“To simplify the analogy with food: groceries available in the kitchen are like data in memory, while going to the store to buy groceries is like fetching data from disk.”

In this big data world, everyone wants to fit all of their data in memory!

Terms used in the article:

a) Execution Memory — as the name suggests, the memory used for execution. It holds temporary data and intermediate results, for example during shuffles, joins, sorts, and aggregations.

b) Storage Memory — stores cached data as well as broadcast variables.

Static Memory Management

From Spark 1.0 up to 1.5, memory was statically assigned: one fixed fraction of the heap for “Execution” and another fixed fraction for “Storage”.

But the problem here was: if execution memory got “FULL”, data was spilled to disk regardless of whether storage memory had free space. Likewise, if storage memory filled up, cached blocks were evicted in LRU (least recently used) order even if execution memory sat idle. In short, there was no sharing of memory between the two regions.
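The problem can be illustrated with a toy model. This is a hypothetical sketch, not Spark's actual `MemoryManager` classes: the names, sizes, and API are made up purely to show why fixed regions waste memory.

```python
# Toy model of static memory management (Spark 1.x style).
# Two fixed regions; execution cannot see free storage memory.

class StaticMemoryManager:
    def __init__(self, execution_mb, storage_mb):
        self.execution_free = execution_mb  # fixed region for shuffles/joins
        self.storage_free = storage_mb      # fixed region for cached blocks
        self.spilled_mb = 0

    def acquire_execution(self, mb):
        # Execution may only use its own region; the excess is spilled
        # to disk even if the storage region has plenty of free space.
        granted = min(mb, self.execution_free)
        self.execution_free -= granted
        self.spilled_mb += mb - granted
        return granted

mgr = StaticMemoryManager(execution_mb=200, storage_mb=600)
mgr.acquire_execution(300)
# 100 MB spills to disk even though 600 MB of storage memory is free.
```

Here the task needed 300 MB but the execution region only had 200 MB, so 100 MB spilled while 600 MB of storage memory went unused.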

Unified Memory Management

Spark 1.6 solved this problem by introducing a unified region shared by both execution and storage memory. “They learnt that sharing is caring.” This means that if execution memory is full, it can borrow free storage memory, and vice versa.

But this leads to a few questions:

a) Which memory gets priority? In the above case, what happens if storage memory is full?

Execution memory can evict cached blocks from storage memory, but only down to a certain threshold. Cached data within that threshold is immune from eviction by execution.
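The borrowing and eviction rules above can be sketched with another toy model. Again, this is an illustrative simulation, not Spark's real implementation; the protected threshold plays the role of `spark.memory.storageFraction`.

```python
# Toy model of unified memory management (Spark 1.6+ style).
# One shared pool; execution may evict cached blocks, but never
# below a protected storage threshold.

class UnifiedMemoryManager:
    def __init__(self, total_mb, storage_fraction=0.5):
        self.total = total_mb
        self.protected_storage = total_mb * storage_fraction  # immune to eviction
        self.execution_used = 0
        self.storage_used = 0

    def cache_block(self, mb):
        # Storage may borrow free memory but can never evict execution.
        free = self.total - self.execution_used - self.storage_used
        granted = min(mb, free)
        self.storage_used += granted
        return granted

    def acquire_execution(self, mb):
        free = self.total - self.execution_used - self.storage_used
        if mb > free:
            # Evict cached blocks, but only down to the protected threshold.
            evictable = max(0, self.storage_used - self.protected_storage)
            evicted = min(mb - free, evictable)
            self.storage_used -= evicted
            free += evicted
        granted = min(mb, free)
        self.execution_used += granted
        return granted

mgr = UnifiedMemoryManager(total_mb=1000, storage_fraction=0.5)
mgr.cache_block(800)        # storage borrows beyond its 500 MB share
mgr.acquire_execution(400)  # evicts 200 MB of cache; 500 MB stays protected
```

After these two calls, execution holds 400 MB and storage is evicted back to 600 MB; a further execution request could evict at most another 100 MB before hitting the 500 MB protected floor.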

b) What happens in the reverse scenario, when storage needs more memory but execution memory is full?

Storage cannot forcibly evict execution memory, so the cached data will be spilled to disk (or dropped, depending on the storage level).

c) Why is priority given to execution memory?

The intermediate data is always needed by the running computation, so if we spilled it to disk it would have to be read back immediately. Cached data, on the other hand, may never be needed again by the computation.
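In recent Spark versions, the size of the unified region and the protected storage threshold are tunable. The values below are the documented defaults (a configuration sketch, e.g. in `spark-defaults.conf`):

```
# Fraction of (heap - 300 MB reserved) used for the unified
# execution + storage region. Default: 0.6
spark.memory.fraction          0.6

# Fraction of the unified region that is protected storage,
# i.e. immune to eviction by execution. Default: 0.5
spark.memory.storageFraction   0.5
```

Spark's tuning guide recommends leaving these at their defaults for most workloads.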

References:

https://spark.apache.org/docs/latest/tuning.html#memory-tuning

https://www.youtube.com/watch?v=dPHrykZL8Cg
