Solving “Container Killed by Yarn For Exceeding Memory Limits” Exception in Apache Spark

Chandan Bhattad · Published in Analytics Vidhya · Oct 22, 2019 · 4 min read

Introduction
Apache Spark is an open-source framework for distributed big-data processing. Originally written in Scala, it also has native bindings for the Java, Python, and R programming languages, and it supports SQL, streaming data, machine learning, and graph processing.

All in all, Apache Spark is often described as a unified analytics engine for large-scale data processing.

If you have been using Apache Spark for some time, you have probably faced an exception that looks something like this:
Container killed by YARN for exceeding memory limits, 5 GB of 5GB used

The error can occur on either the driver node or an executor node. In simple words, the exception says that, while processing, Spark had to hold more data in memory than the executor/driver actually has.
There are a few common causes, which can be addressed in the following ways:

  • Your data is skewed: the data is not partitioned evenly, so a particular task ends up with far more data to process than the others. In this case, examine your data and try a custom partitioner that partitions the dataset uniformly.
  • Your Spark job might be shuffling a lot of data over the network. Of the memory available to an executor, only a portion is set aside for the shuffle. Prefer shuffle-efficient Spark APIs such as reduceByKey over groupByKey, if you are not doing so already (see the sketch after this list). Sometimes the shuffle is unavoidable; in that case, increase the memory configurations as discussed in the points below.
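
To make these two points concrete, here is a minimal Scala sketch; the class name, the partition count of 200, and the input/output paths (args(0), args(1)) are illustrative assumptions, not part of any particular job:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object ShuffleFriendlyWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ShuffleFriendlyWordCount").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Spread keys over more partitions so a single skewed task does not have to
    // hold most of the data, then aggregate with reduceByKey, which combines
    // values on each executor before the shuffle instead of shipping every
    // record across the network the way groupByKey does.
    val counts = pairs
      .partitionBy(new HashPartitioner(200))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))
    spark.stop()
  }
}

The point of reduceByKey here is its map-side combine: far less data crosses the network than with groupByKey, which shuffles every individual record.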

If the above two points are not applicable, try the following steps in order until the error is resolved. Revert any changes you may already have made to the Spark conf files before moving ahead.

  • Increase Memory Overhead
    Memory overhead is the amount of off-heap memory allocated to each executor (or driver) container. By default, it is set to the higher of 10% of the executor memory and 384 MB. Memory overhead is used for Java NIO direct buffers, thread stacks, shared native libraries, and memory-mapped files.
    The above exception can occur on either the driver or an executor node. Wherever the error occurs, try gradually increasing the overhead memory for that container only (driver or executor) and re-run the job. The maximum recommended memoryOverhead is 25% of the executor memory.
    Caution: Make sure that the driver or executor memory plus the corresponding memory overhead always stays below the value of yarn.nodemanager.resource.memory-mb, i.e. spark.driver/executor.memory + spark.driver/executor.memoryOverhead < yarn.nodemanager.resource.memory-mb. For example, on a node with yarn.nodemanager.resource.memory-mb set to 8192 (8 GB), an executor with spark.executor.memory 6g and spark.executor.memoryOverhead 1024 fits (7 GB < 8 GB), but 8g plus 1024 would not.
    You have to change the property by editing the spark-defaults.conf file on the master node.
sudo vim /etc/spark/conf/spark-defaults.conf

spark.driver.memoryOverhead 1024
spark.executor.memoryOverhead 1024

You can set the above properties cluster-wide for all jobs, or pass them as configuration for a single job, like below:

spark-submit --class org.apache.spark.examples.WordCount --master yarn --deploy-mode cluster --conf spark.driver.memoryOverhead=512 --conf spark.executor.memoryOverhead=512 <path/to/jar> 

If this doesn’t solve your problem, try the next point.

  • Reduce the Number of Executor Cores
    Each core in an executor runs a task concurrently, and every concurrent task needs its own share of executor memory, so a higher number of executor cores means more memory is required. Try reducing the number of cores per executor, which reduces the number of tasks that can run on it at the same time and thus the memory required; for example, an executor with 5g of memory and 5 cores leaves roughly 1 GB per concurrent task, whereas 3 cores leaves roughly 1.7 GB. Again, change the driver or executor configuration depending on where the error occurs.
sudo vim /etc/spark/conf/spark-defaults.conf

spark.driver.cores 3
spark.executor.cores 3

As with the previous point, you can set these properties cluster-wide for all jobs or pass them as configuration for a single job, like below:

spark-submit --class org.apache.spark.examples.WordCount --master yarn --deploy-mode cluster --executor-cores 5 --driver-cores 4 <path/to/jar>

If this doesn’t work, see the next point.

  • Increase the number of partitions
    With more partitions, each partition (and therefore each task) holds less data, so less memory is required per partition. Memory usage can be monitored with Ganglia. You can increase the number of partitions by invoking .repartition(<num_partitions>) on an RDD or DataFrame, as in the sketch below.
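
As a minimal sketch (the Parquet paths and the partition count of 400 are placeholder assumptions, not recommendations):

import org.apache.spark.sql.SparkSession

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RepartitionExample").getOrCreate()

    val df = spark.read.parquet(args(0))
    println(s"Partitions before: ${df.rdd.getNumPartitions}")

    // More partitions means each task handles a smaller slice of the data,
    // which lowers the per-task memory footprint.
    val repartitioned = df.repartition(400)
    println(s"Partitions after: ${repartitioned.rdd.getNumPartitions}")

    repartitioned.write.parquet(args(1))
    spark.stop()
  }
}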

No luck yet? Increase executor or driver memory.

  • Increase Driver or Executor Memory
    Depending on where the error has occurred, increase the memory of the driver or the executor.
    Caution:
    spark.driver/executor.memory + spark.driver/executor.memoryOverhead < yarn.nodemanager.resource.memory-mb
sudo vim /etc/spark/conf/spark-defaults.conf

spark.executor.memory 2g
spark.driver.memory 1g

Just like the other properties, these can also be overridden per job:

spark-submit --class org.apache.spark.examples.WordCount --master yarn --deploy-mode cluster --executor-memory 2g --driver-memory 1g <path/to/jar>

Most likely by now, you should have resolved the exception.

If not, you might need more memory-optimized instances for your cluster!

Happy Coding!
Reference: https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-yarn-memory-limit/
