Job Failure - org.apache.tez.dag.api.TezException: org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: java.io.IOException: java.io.EOFException

Problem

A Datameer job fails with an error similar to the following stack trace:

ERROR [2015-01-01 00:00:00.000] [MrPlanRunnerV2] (ClusterSession.java:192) - Failed to run cluster job 'Workbook job (123456): MyWorkbook#MySheet(Expression record processor)#' [10 mins, 10 sec] java.lang.RuntimeException: com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: 'mynode1/10.10.10.123'; destination host is: 'mynode2':33337;
at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:49)
at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:31)
at datameer.plugin.tez.DagRunner.submit(DagRunner.java:96)
at datameer.plugin.tez.TezJob.runTezDag(TezJob.java:159)
at datameer.plugin.tez.TezJob.runImpl(TezJob.java:132)
at datameer.dap.common.graphv2.ClusterJob.run(ClusterJob.java:129)
at datameer.dap.common.graphv2.ClusterSession.execute(ClusterSession.java:186)
at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.execute(MixedClusterSession.java:48)
at datameer.dap.common.graphv2.ClusterSession.runAllClusterJobs(ClusterSession.java:360)
at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:86)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: 'mynode1/10.10.10.123'; destination host is: 'mynode2':33337;
at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:406)
at org.apache.tez.client.TezClient.submitDAG(TezClient.java:342)
at datameer.plugin.tez.DagRunner.<init>(DagRunner.java:55)
at datameer.plugin.tez.DagRunner.<init>(DagRunner.java:40)
at datameer.plugin.tez.DagRunner$1.run(DagRunner.java:105)
at datameer.plugin.tez.DagRunner$1.run(DagRunner.java:102)
at datameer.dap.common.entity.properties.SecureGridMode.executePossiblyImpersonated(SecureGridMode.java:257)
at datameer.plugin.tez.DagRunner.getPossiblyImpersonatedDagRunner(DagRunner.java:102)
at datameer.plugin.tez.DagRunner.submit(DagRunner.java:77)
... 8 more
Caused by: com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: 'mynode1/10.10.10.123'; destination host is: 'mynode2':33337;
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:216)
at com.sun.proxy.$Proxy125.submitDAG(Unknown Source)
at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:399)
... 16 more
Caused by: java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: 'mynode1/10.10.10.123'; destination host is: 'mynode2':33337;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
... 18 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1054)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:949)

Cause

This matches a known Tez bug: TEZ-494. The root cause is that the RPC request in the Tez container exceeds the maximum size the cluster allows. That limit is set by the ipc.maximum.data.length property.

Solution

To work around this issue, force the affected job to run on the MapReduce framework by adding this Custom Property to the affected workbook: das.execution-framework=MapReduce.

To resolve this issue, set ipc.maximum.data.length=134217728 (128 MB, expressed in bytes) in the Custom Hadoop Properties, as documented in: Smart Execution - Activation
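
The value 134217728 is 128 MB expressed in bytes (128 * 1024 * 1024). If a different limit is needed, the property line can be derived the same way. The following is a minimal sketch in Python; the helper names mb_to_bytes and property_line are hypothetical and are not part of Datameer or Hadoop.

```python
# Hypothetical helper for illustration only; not part of Datameer or Hadoop.
# It derives the byte value for ipc.maximum.data.length from a size in MB.

PROPERTY_NAME = "ipc.maximum.data.length"


def mb_to_bytes(megabytes: int) -> int:
    """Convert a size in MB to bytes, using 1 MB = 1024 * 1024 bytes."""
    if megabytes <= 0:
        raise ValueError("size must be a positive number of MB")
    return megabytes * 1024 * 1024


def property_line(megabytes: int) -> str:
    """Build a key=value line suitable for pasting into a properties field."""
    return f"{PROPERTY_NAME}={mb_to_bytes(megabytes)}"


if __name__ == "__main__":
    # 128 MB is the value recommended in this article.
    print(property_line(128))
```

Running the script prints ipc.maximum.data.length=134217728, which matches the value given above.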