Spark Memory Management
Let's try to understand how memory is distributed inside a Spark executor.
In each executor, Spark allocates a minimum of 384 MB for memory overhead; the rest is allocated for the actual workload.
The formula for calculating the memory overhead is max(Executor Memory * 0.1, 384 MB). A small code sketch of this calculation follows the two scenarios below.
- 1st scenario: if your executor memory is 5 GB, then memory overhead = max(5 * 1024 MB * 0.1, 384 MB) = max(512 MB, 384 MB) = 512 MB. This leaves 4.5 GB in each executor for Spark processing.
- 2nd scenario: if your executor memory is 1 GB, then memory overhead = max(1 * 1024 MB * 0.1, 384 MB) = max(~102 MB, 384 MB) = 384 MB. This leaves 640 MB in each executor for Spark processing.
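The calculation above is easy to reproduce. Here is a minimal sketch in Scala; the object and method names are made up for illustration, and the 0.10 factor and 384 MB floor simply restate the formula from this article, not Spark's internal code.

```scala
// Minimal sketch of the memory-overhead formula described above.
object MemoryOverhead {
  val OverheadFactor = 0.10   // 10% of executor memory
  val MinOverheadMb  = 384L   // floor of 384 MB

  def memoryOverheadMb(executorMemoryMb: Long): Long =
    math.max((executorMemoryMb * OverheadFactor).toLong, MinOverheadMb)

  def main(args: Array[String]): Unit = {
    println(memoryOverheadMb(5 * 1024)) // 512 MB -> ~4.5 GB left for processing
    println(memoryOverheadMb(1 * 1024)) // 384 MB -> 640 MB left for processing
  }
}
```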
On-Heap Memory
By default, Spark uses on-heap memory only. The on-heap memory area in the executor can be roughly divided into the following four blocks (a short code sketch after the list shows which operations touch which block):
- Storage Memory: It’s mainly used to store Spark cache data, such as RDD cache, Unroll data, and so on.
- Execution Memory: It’s mainly used to store temporary data in the calculation process of Shuffle, Join, Sort, Aggregation, etc.
- User Memory: It’s mainly used to store the data needed for RDD transformations, such as RDD dependency information.
- Reserved Memory: This memory is reserved for the system and is used to store Spark’s internal objects.
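To make the split more concrete, here is a small illustrative Scala snippet; the app name and numbers are arbitrary, and the comments map each operation to the region it mainly exercises.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative only: which kinds of operations exercise which on-heap region.
object MemoryRegionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("memory-regions").getOrCreate()
    import spark.implicits._

    val df = spark.range(0, 10000000L).toDF("id")

    // Storage Memory: cached / persisted data lives here.
    df.persist(StorageLevel.MEMORY_ONLY)
    df.count() // materialize the cache

    // Execution Memory: shuffle, sort, and aggregation buffers live here.
    val agg = df.groupBy(($"id" % 100).as("bucket")).count()
    agg.show()

    spark.stop()
  }
}
```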
To understand how this split is controlled, you have to consider two configuration parameters that Spark sets by default.
- spark.memory.fraction controls the split between the Unified Memory Region (Storage + Execution) and User Memory. The default is 0.6 (60%). Say we have 1 GB of Spark executor memory: roughly 600 MB will be allocated to the Unified Memory Region and 400 MB to User Memory (this is a simplification that ignores the 300 MB of Reserved Memory).
- spark.memory.storageFraction controls the split between Execution Memory and Storage Memory inside the Unified Memory Region. The default value provided by Spark is 0.5 (50%), but the boundary is soft: depending on the load on Execution Memory, Storage Memory will be reduced so the task can complete. A short configuration sketch follows this list.
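Here is a minimal configuration sketch, assuming a 1 GB executor; the values only restate the defaults explicitly, and the arithmetic in the comments follows the simplified example above.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: restate the default fractions explicitly so the split is visible.
// With 1 GB of executor memory (simplified, ignoring Reserved Memory):
//   Unified Memory Region ~= 1024 MB * 0.6 = ~600 MB
//     Storage Memory      ~= 600 MB * 0.5  = ~300 MB (soft boundary)
//     Execution Memory    ~= 600 MB * 0.5  = ~300 MB
//   User Memory           ~= 1024 MB * 0.4 = ~400 MB
val spark = SparkSession.builder()
  .appName("memory-fractions")
  .config("spark.executor.memory", "1g")
  .config("spark.memory.fraction", "0.6")         // Unified vs. User Memory
  .config("spark.memory.storageFraction", "0.5")  // Storage vs. Execution split
  .getOrCreate()
```

In practice, spark.executor.memory is usually passed at submit time (for example via spark-submit's --executor-memory flag), since it must be known before the executors launch.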
One of the reasons Spark leverages memory so heavily is speed: a CPU can read data from memory at roughly 10 GB/s, whereas reading from a spinning disk drops to about 100 MB/s and from an SSD to about 600 MB/s.
If the data has to be read over the network, the speed drops to about 125 MB/s (roughly a 1 Gbit/s link).
Common Issue Checklist
- Have enough partitions for concurrency. If you have 20 cores, make sure you have at least 20 partitions, and preferably more (see the sketch after this checklist).
- Minimize memory consumption by filtering down to only the data you need.
- Minimize the amount of data shuffled. Shuffle is expensive.
- Know the standard library and use the right functions in the right place.
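The first three items can be sketched in a few lines of Scala; the input path, column names, and filter condition below are purely hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the checklist above; names and values are made up.
object ChecklistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checklist").getOrCreate()
    import spark.implicits._

    val events = spark.read.parquet("/data/events")  // hypothetical input

    // Enough partitions for concurrency: at least as many as available cores.
    val cores = spark.sparkContext.defaultParallelism

    val prepared = events
      .filter($"event_date" >= "2024-01-01")  // filter early: less memory, less shuffle
      .select("user_id", "event_type")        // keep only the columns you need
      .repartition(cores * 2)

    // The aggregation still shuffles, but over far less data.
    prepared.groupBy("user_id").count().write.parquet("/data/events_per_user")

    spark.stop()
  }
}
```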