After upgrade to 6.4 or 7.x - Jobs failing with "Container is Running Beyond Physical Memory Limits."
After upgrading from 6.1 to a 6.4 or 7.x release, jobs that previously ran successfully now fail with the following error:
WARN [<timestamp>] [ConcurrentJobExecutor-0] (DefaultMrJobClient.java:185) - attempt_<id>: Container [pid=<pid>,containerID=container_<id>] is running beyond physical memory limits. Current usage: <value> GB of <value> GB physical memory used.
Configuring the typical memory parameters and increasing their values does not alleviate the problem:
Additionally, increasing the off-heap memory space by using the heap fraction parameter does not alleviate the issue until the heap fraction is set to 0.5. Reducing the heap fraction this much requires increasing the overall container memory, since half of the container is then dedicated to off-heap memory.
In very busy clusters, the YARN resource manager enforces memory limits more strictly. This strict behavior, combined with the high rate at which the Parquet implementation allocates direct memory buffers, leads to containers being killed before JVM garbage collection can occur.
To mitigate this issue, slightly increase the fraction of memory dedicated to off-heap operations, and cap the amount of direct memory allocated within the JVM at 25% of the total container memory.
To do this, add -XX:MaxDirectMemorySize to the current tez.task.launch.cmd-opts string. To determine the current value, download a Full Job Trace for a successful job, open job-conf-cluster.xml, and note the value of tez.task.launch.cmd-opts. The default value is: -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC
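The lookup and edit described above can be sketched in Python. This is only an illustration: it assumes job-conf-cluster.xml uses the usual Hadoop configuration layout (property elements containing name and value children), which should be checked against the actual file. The helper names read_property and with_max_direct_memory, the inline sample XML, and the 1024 MB size are all hypothetical.

```python
import xml.etree.ElementTree as ET


def read_property(conf_xml, name):
    """Return the value of a named property from Hadoop-style config XML, or None.

    Assumes <property><name>...</name><value>...</value></property> entries.
    """
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def with_max_direct_memory(opts, size_mb):
    """Append -XX:MaxDirectMemorySize, dropping any earlier setting of that flag."""
    kept = [o for o in opts.split() if not o.startswith("-XX:MaxDirectMemorySize=")]
    kept.append(f"-XX:MaxDirectMemorySize={size_mb}m")
    return " ".join(kept)


# Inline stand-in for the contents of job-conf-cluster.xml (assumed layout).
SAMPLE = """<configuration>
  <property>
    <name>tez.task.launch.cmd-opts</name>
    <value>-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC</value>
  </property>
</configuration>"""

current = read_property(SAMPLE, "tez.task.launch.cmd-opts")
updated = with_max_direct_memory(current, 1024)
print("tez.task.launch.cmd-opts=" + updated)
```

Starting from the value found in the trace, rather than retyping the default, preserves any site-specific JVM options already in use.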
As an example, when working with a 4 GB container, the following properties should be used:
tez.task.launch.cmd-opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:MaxDirectMemorySize=1024m
The heap fraction is the share of the overall container memory given to the JVM heap. With a heap fraction of 0.7, 30% of the 4 GB container (roughly 1.2 GB) is allocated to off-heap operations. Capping direct memory at 25% of the container (1024 MB) while allocating 30% to off-heap means direct buffer allocations should not be able to overrun the off-heap memory space.
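The arithmetic behind the 4 GB example can be checked with a short Python sketch. The function name memory_budget and its dictionary keys are hypothetical, and the 25% direct-memory share is simply the guideline given in this article; the sketch models only the split described here, not how the product or YARN accounts for memory internally.

```python
def memory_budget(container_mb, heap_fraction, direct_fraction=0.25):
    """Split a container into heap and off-heap, and size the direct memory cap.

    Both fractions are taken relative to the total container size,
    as in the article's example.
    """
    heap_mb = int(container_mb * heap_fraction)
    off_heap_mb = container_mb - heap_mb
    direct_mb = int(container_mb * direct_fraction)
    return {
        "heap_mb": heap_mb,
        "off_heap_mb": off_heap_mb,
        "direct_mb": direct_mb,
        # Off-heap space left over for non-direct-buffer native usage.
        "headroom_mb": off_heap_mb - direct_mb,
    }


budget = memory_budget(container_mb=4096, heap_fraction=0.7)
print(budget)
print("-XX:MaxDirectMemorySize=%dm" % budget["direct_mb"])
```

For a 4096 MB container this gives a 1024 MB direct memory cap inside roughly 1229 MB of off-heap space. A positive headroom value means the cap fits within the off-heap share; a negative one would indicate that the heap fraction is too high for the chosen direct memory size.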