Debug Flink OOM in Docker Container

Hao Gao · Hadoop Noob · Sep 10, 2018

Recently I planned to deploy a new Flink pipeline. I tested it in my local and staging environments, but when I deployed it to serve full traffic, the TaskManagers were killed at random. Since I have externalized checkpoints enabled, I don’t lose any data or state, but every crash still causes some delay in processing, and it really annoys me.

From stderr, I got Container exited with status 137.
From the exit code, we can tell the container was killed by the Linux OOM killer.
Running dmesg on the host machine, we got:

As I understand it, the OOM killer kills the process when its RSS (the resident memory actually in RAM) reaches Docker’s memory limit (the -m flag).
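As a quick sanity check, something like the following confirms an OOM kill on the host (standard dmesg/docker commands; the container name is a placeholder):

```bash
# Look for OOM-killer activity in the kernel log on the Docker host
dmesg -T | grep -iE "oom|killed process"

# Exit code 137 = 128 + SIGKILL(9); Docker also records whether the cgroup hit its -m limit
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <taskmanager-container>
```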

To give some context:

  1. I deploy Flink 1.3.3 under Marathon in cluster mode.
  2. In the production environment, I give 24GB to each TaskManager and 22GB to taskmanager.heap.mb, so there is 2GB for everything else.
  3. In the staging environment, I give 8GB to each TaskManager and 6GB to taskmanager.heap.mb.
  4. In this post, I actually use 24GB for the TaskManager and 19GB for taskmanager.heap.mb.
  5. The Java version is 8.
  6. The host shows a “WARNING: No swap limit support” message :)
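For reference, the production sizing above looks roughly like this in config terms (taskmanager.heap.mb is the Flink 1.3.x key; the 24GB container limit itself comes from the Marathon/Docker side):

```bash
# Original production sizing: 24 GB container limit, 22 GB of it handed to the heap,
# leaving only ~2 GB for Metaspace, GC structures, thread stacks, direct buffers, ...
echo "taskmanager.heap.mb: 22528" >> conf/flink-conf.yaml
```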

My Flink application is very heavy on memory consumption, so at first I thought it might be related to GC or the heap.
I turned on JMX and got the following diagram.
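For completeness, this is roughly how JMX gets wired up (standard HotSpot JMX flags passed through Flink’s env.java.opts; the port is a placeholder):

```bash
# Expose JMX from the TaskManager JVM so JConsole/VisualVM can chart heap usage and GC activity
cat >> conf/flink-conf.yaml <<'EOF'
env.java.opts: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
EOF
```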

One thing to notice, from the code here: if we add any java.opts, the script won’t set the GC to G1GC. The diagram above is therefore actually ParallelGC, which is the Java 8 default.
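If you do pass your own java.opts, you can name the collector yourself; a minimal sketch (standard HotSpot flag, appended to whatever env.java.opts already contains):

```bash
# The launch script skips its G1 default once env.java.opts is non-empty,
# so request the collector explicitly alongside your other options
sed -i 's/^env.java.opts: /env.java.opts: -XX:+UseG1GC /' conf/flink-conf.yaml
```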

Anyway, I changed it back to G1GC, and here is what I got:

It all looks normal to me.

If it is not the heap, then what else? Since we are on Java 8, I can enable Native Memory Tracking (NMT) with:

-XX:NativeMemoryTracking=detail -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics

After running jcmd $pid VM.native_memory, I got:
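Putting the pieces together, the workflow looks roughly like this (the NMT flags above go through env.java.opts like the earlier ones; the pgrep pattern is just one way to find the TaskManager pid and assumes the Flink 1.3 main class name):

```bash
# 1. Restart the TaskManager with the NMT flags enabled (via env.java.opts)
# 2. Ask the running JVM for its native-memory breakdown
pid=$(pgrep -f org.apache.flink.runtime.taskmanager.TaskManager)
jcmd "$pid" VM.native_memory summary   # per-category reserved/committed totals
jcmd "$pid" VM.native_memory detail    # adds per-call-site detail for mmap/malloc
```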

If we sum up everything off-heap, the reserved total is actually more than 2GB. We may wonder why GC takes around 800MB and Class reserves around 1GB.
GC accounts for GC native structures. It grows roughly linearly with the heap size once the heap is big (for example, more than 500MB).
Class is for the Metaspace. Before Java 8, class metadata was stored in PermGen, which was part of the heap; now it lives off-heap.
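As an aside, a good chunk of that ~1GB of reserved Class space is plausibly the compressed class space, which HotSpot reserves up front (1GB by default). Both it and the Metaspace can be capped with standard flags if needed; the sizes below are illustrative, not recommendations:

```bash
# Illustrative caps on the "Class" category; add to env.java.opts if Metaspace growth is a concern
-XX:MaxMetaspaceSize=512m          # Metaspace is unbounded by default on Java 8
-XX:CompressedClassSpaceSize=256m  # shrinks the up-front compressed class space reservation
```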

**Reserved memory for Class should not count toward the RSS; only committed memory should/could. But from my observation of my long-running TaskManagers, RSS is actually close to Reserved. So if anyone has an answer, do let me know :)**

In conclusion, for the production setup I simply cut the heap a little and give more to non-heap.
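Concretely, the cut amounts to something like this (using the 19GB heap figure from the context list; the exact split just needs to leave enough non-heap headroom):

```bash
# Revised production sizing: keep the 24 GB container limit but hand only 19 GB to the heap,
# leaving ~5 GB for Metaspace, GC native structures, thread stacks and direct buffers
echo "taskmanager.heap.mb: 19456" >> conf/flink-conf.yaml
```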

Along the way, I also came across some interesting findings I would love to share:

  1. taskmanager.memory.preallocate: if this configuration is set to false (the default), clean-up of the allocated off-heap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached, by triggering a full GC. But if we run Flink in cluster mode, MaxDirectMemorySize is by default set to 8388607T, which effectively means no limit (see the sketch after this list). That said, if taskmanager.memory.off-heap is false, Flink should not use too much off-heap memory.
  2. Some interesting JIRAs: https://issues.apache.org/jira/browse/FLINK-9904, https://issues.apache.org/jira/browse/FLINK-6217, https://issues.apache.org/jira/browse/FLINK-4094
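For the first point, one option is to put an explicit ceiling on direct memory instead of relying on that effectively unlimited default (a sketch; -XX:MaxDirectMemorySize is a standard HotSpot flag and the 2g value is illustrative):

```bash
# Add to env.java.opts: a real cap so the "clean up on full GC" path can actually trigger
# before the container's memory limit is hit
-XX:MaxDirectMemorySize=2g
```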

References:
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
https://shipilev.net/jvm-anatomy-park/12-native-memory-tracking/
