Decoding Memory in Spark — Parameters that are often confused


Spark Memory Blocks — Quick Recap

Spark was developed with Scala as its primary language. Hence, operations in Spark happen inside a JVM, even when the user’s code is written in a different language such as Python or R. The Spark runtime segregates the JVM heap space in the driver and executors into four different parts:

  1. Storage Memory — JVM heap space reserved for cached data
  2. Execution Memory — JVM heap space used by data structures during shuffle operations (joins, group-bys and aggregations). Earlier (before Spark 1.6), the term shuffle memory was also used to describe this section of the memory.
  3. User Memory — for storing the data structures created and managed by the user’s code
  4. Reserved Memory — reserved by Spark for internal purposes

Apart from the JVM heap, Spark also makes use of two segments of memory that reside outside the JVM:

  1. Off-Heap Memory — this segment lies outside the JVM but is used by the JVM for certain use-cases (e.g. interning of Strings). Spark can also use off-heap memory explicitly, to store serialized DataFrames and RDDs.
  2. External Process Memory — specific to PySpark and SparkR, this is the memory used by the Python/R process, which resides outside the JVM.
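
To make the heap split concrete, here is a back-of-the-envelope sketch (plain Python, no Spark required) of how the Unified Memory Manager divides an executor heap under the documented defaults: a fixed 300 MB reserved block, spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5. The 4 GB heap size is an arbitrary example:

    # Sketch: how the Unified Memory Manager (Spark 1.6+) splits the executor heap.
    # Defaults taken from the Spark documentation.
    RESERVED_MB = 300  # Reserved Memory: fixed, for Spark internals

    def heap_breakdown(executor_memory_mb,
                       memory_fraction=0.6,       # spark.memory.fraction
                       storage_fraction=0.5):     # spark.memory.storageFraction
        usable = executor_memory_mb - RESERVED_MB
        spark_pool = usable * memory_fraction     # storage + execution, shared
        return {
            "reserved_mb": RESERVED_MB,
            "user_mb": round(usable * (1 - memory_fraction)),            # user code
            "storage_mb": round(spark_pool * storage_fraction),          # cached data
            "execution_mb": round(spark_pool * (1 - storage_fraction)),  # shuffles
        }

    print(heap_breakdown(4096))
    # {'reserved_mb': 300, 'user_mb': 1518, 'storage_mb': 1139, 'execution_mb': 1139}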

Spark Storage Memory

spark.storage.memoryFraction vs spark.memory.storageFraction

Both of these parameters set the amount of JVM heap to be used as storage memory (for cached data), which makes it unclear which one should be set. The answer depends on which memory manager the application runs under.

Spark Static Memory Manager

spark.storage.memoryFraction belongs to the legacy Static Memory Manager (the default before Spark 1.6), where the storage, execution (shuffle) and user regions have fixed boundaries. It takes effect only when legacy mode is enabled via spark.memory.useLegacyMode and is deprecated in recent Spark versions.

Spark Unified Memory Manager

spark.memory.storageFraction belongs to the Unified Memory Manager (the default since Spark 1.6), where storage and execution share a single pool and can borrow unused memory from each other. Rather than a hard cap on cached data, it sets the portion of the shared pool that is immune to eviction by execution.
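
For a modern Spark build, only the unified parameter matters. A minimal PySpark sketch of setting it (the heap size and app name are arbitrary):

    from pyspark.sql import SparkSession

    # Unified Memory Manager (Spark 1.6+). The legacy
    # spark.storage.memoryFraction would only apply under
    # spark.memory.useLegacyMode and is ignored here.
    spark = (
        SparkSession.builder
        .appName("storage-fraction-demo")               # arbitrary app name
        .config("spark.executor.memory", "4g")          # JVM heap per executor
        .config("spark.memory.fraction", "0.6")         # storage + execution pool
        .config("spark.memory.storageFraction", "0.5")  # eviction-immune share
        .getOrCreate()
    )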

Off-Heap Memory

spark.executor.memoryOverhead vs spark.memory.offHeap.size

JVM Heap vs Off-Heap Memory
  • A part of off-heap memory is used by Java internally for purposes like String interning and JVM overheads.
  • Off-Heap memory can also be used by Spark explicitly for storing its data as part of Project Tungsten [5].
Off-Heap Memory Allocation in Spark

spark.memory.offHeap.size controls only the off-heap memory that Spark itself uses for serialized data, and it takes effect only when spark.memory.offHeap.enabled is true. spark.executor.memoryOverhead, on the other hand, provisions for everything else that lives outside the heap: JVM overheads, interned Strings and memory used by native libraries.
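
A minimal PySpark sketch, assuming a Spark 3.x build, of enabling explicit off-heap storage and caching a DataFrame there (the sizes and app name are arbitrary):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("off-heap-demo")                        # arbitrary app name
        .config("spark.memory.offHeap.enabled", "true")  # required for the next line
        .config("spark.memory.offHeap.size", "2g")       # Spark-managed off-heap
        .config("spark.executor.memoryOverhead", "1g")   # everything else off-heap:
                                                         # JVM overheads, native libs
        .getOrCreate()
    )

    df = spark.range(10_000_000)
    df.persist(StorageLevel.OFF_HEAP)  # cache the data off-heap (Tungsten format)
    df.count()                         # materialize the cache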

Python Memory

spark.python.worker.memory vs spark.executor.pyspark.memory

While executing PySpark, both of these parameters appear to limit the memory allocated to Python, but in reality they limit very different sections of the executor’s memory. In PySpark, two separate processes run in each executor: a JVM that executes the Spark part of the code (joins, aggregations and shuffles), and a Python process that executes the user’s code. The two processes communicate via a Py4J bridge that exposes the JVM objects to the Python process and vice versa.

Configuring Python worker memory

spark.python.worker.memory (512m by default) limits the memory used by each Python worker during aggregation; when the limit is exceeded, the worker spills data to disk, so it is a spill threshold rather than a hard cap. spark.executor.pyspark.memory, on the other hand, places a hard limit on the memory of the Python process itself and, when set, is added to the executor’s memory request.
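
As an illustration, a PySpark sketch setting both limits (the values and app name are arbitrary):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pyspark-memory-demo")                 # arbitrary app name
        .config("spark.python.worker.memory", "512m")   # spill threshold for
                                                        # Python-side aggregation
        .config("spark.executor.pyspark.memory", "2g")  # hard cap on the Python
                                                        # process, added to the
                                                        # container request
        .getOrCreate()
    )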

Total Container Memory

Putting the pieces together: the total memory of an executor container comprises the JVM heap, the off-heap segments (the overhead plus Spark’s explicit off-heap allocation) and, for PySpark, the memory of the Python process.

Total Memory Request to YARN
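
On YARN, the per-executor container request is the sum of the four memory settings below; this is how the Spark running-on-YARN documentation describes it for recent releases. A quick worked example in plain Python, with arbitrary sizes:

    # Total memory YARN allocates per executor container:
    #   spark.executor.memory          -> JVM heap
    #   spark.executor.memoryOverhead  -> non-Spark off-heap (default:
    #                                     max(384 MB, 10% of executor memory))
    #   spark.memory.offHeap.size      -> Spark's explicit off-heap
    #   spark.executor.pyspark.memory  -> Python process cap (PySpark only)
    executor_memory_mb = 4096
    memory_overhead_mb = max(384, int(0.10 * executor_memory_mb))  # default rule
    off_heap_mb        = 2048  # counted only if spark.memory.offHeap.enabled=true
    pyspark_memory_mb  = 2048  # counted only if set for a PySpark job

    container_mb = (executor_memory_mb + memory_overhead_mb
                    + off_heap_mb + pyspark_memory_mb)
    print(container_mb)  # 8601 MB requested from YARN per executor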

References

  1. Spark Configuration
  2. Tuning Spark: Memory Management Overview
  3. Spark Memory Management
  4. Apache Spark Memory Management
  5. Project Tungsten: Bringing Apache Spark Closer to Bare Metal


