Spark Memory Management [Prior to v1.6.0]
The Spark Memory Management Model described in this article is deprecated as of Apache Spark v1.6.0; the newer memory model is covered in a separate article.
Here’s the diagram of Spark memory allocation inside the JVM heap, as per the Memory Management Model prior to Apache Spark v1.6.0:
Let’s go through these components one by one:
- Safe Heap : Any Spark process that runs on a cluster or a local machine is a JVM process. As with any JVM process, the heap size can be configured with the -Xmx and -Xms flags. By default, Apache Spark starts with a 512 MB JVM heap. To avoid Out Of Memory errors, Spark allows only 90% of the total heap to be used as Safe Heap, which is controlled by the spark.storage.safetyFraction parameter. A Spark process operates within this safe heap memory.
- Storage Memory : Spark can keep some data in memory, and it uses this memory for its LRU (Least Recently Used) cache. By default, 60% of the safe heap memory is reserved for caching the data being processed. This fraction is controlled by the spark.storage.memoryFraction parameter. So, data that can be cached = Heap Size * spark.storage.safetyFraction * spark.storage.memoryFraction. For example, if you have an 8 GB heap, the data you can cache is 8192 * 0.9 * 0.6 = 4423.68 MB, or roughly 4423 MB.
- Shuffle Memory : Shuffle memory is calculated as Heap Size * spark.shuffle.safetyFraction * spark.shuffle.memoryFraction. By default, spark.shuffle.safetyFraction is 0.8 and spark.shuffle.memoryFraction is 0.2, so the default Shuffle Memory works out to 16% (0.8 * 0.2 = 0.16) of the JVM heap. Shuffling is the process of redistributing data across partitions, which may or may not move data across JVM processes. Spark uses this memory to hold data while executing shuffle tasks. For example, a shuffle sometimes sorts data, and a buffer is usually needed to hold the sorted data (data in the LRU cache cannot be modified, as it is kept there to be reused later). Shuffle Memory is the RAM used to store those sorted chunks of data.
ShuffleMemoryManager manages shuffle memory by allocating a pool of memory to task threads for use in shuffle operations.
- Unroll Memory : Spark allows data to be stored in both serialized and deserialized form. Data in serialized form cannot be used directly, so it must be unrolled before use. Unroll Memory is the memory used while Spark unrolls a block of data into memory. It is shared with Storage Memory, which means that if Spark needs memory to unroll data, it may drop some of the partitions stored in the Spark LRU cache. The amount of RAM that the unroll process is allowed to use is spark.storage.unrollFraction * spark.storage.memoryFraction * spark.storage.safetyFraction of the heap, which by default is 10.8% (0.2 * 0.6 * 0.9 = 0.108).
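To make the arithmetic above concrete, here is a minimal Python sketch that computes the size of each region from a given heap size. It is only an illustration of the formulas in this article, not Spark code: the class name LegacyMemoryConfig and the function legacy_memory_regions are hypothetical names made up for this example, and the default values are simply the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegacyMemoryConfig:
    """Hypothetical holder for the pre-1.6.0 defaults quoted in the article."""
    storage_safety_fraction: float = 0.9  # spark.storage.safetyFraction
    storage_memory_fraction: float = 0.6  # spark.storage.memoryFraction
    storage_unroll_fraction: float = 0.2  # spark.storage.unrollFraction
    shuffle_safety_fraction: float = 0.8  # spark.shuffle.safetyFraction
    shuffle_memory_fraction: float = 0.2  # spark.shuffle.memoryFraction


def legacy_memory_regions(heap_mb, conf=LegacyMemoryConfig()):
    """Return the size in MB of each region, using the article's formulas."""
    safe_heap = heap_mb * conf.storage_safety_fraction
    storage = safe_heap * conf.storage_memory_fraction
    shuffle = heap_mb * conf.shuffle_safety_fraction * conf.shuffle_memory_fraction
    # Unroll memory is carved out of storage memory, not added on top of it.
    unroll = storage * conf.storage_unroll_fraction
    return {
        "safe_heap": safe_heap,
        "storage": storage,
        "shuffle": shuffle,
        "unroll": unroll,
    }


# Example: the 8 GB heap used in the Storage Memory bullet.
regions = legacy_memory_regions(8192)
for name, size_mb in regions.items():
    print(f"{name}: {size_mb:.2f} MB")
```

For an 8192 MB heap this gives about 4423.68 MB of storage memory, matching the figure in the Storage Memory bullet, along with about 1310.72 MB of shuffle memory and about 884.74 MB that unrolling may use out of the storage region.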
I hope this article helped you understand the Apache Spark memory management model.