Hadoop Task Failed - Timed out After 600 secs

Problem

A Datameer job fails, and the job log shows the following stack trace:

ERROR [2015-01-01 00:00:00.000] [ConcurrentJobExecutor-4] (ClusterSession.java:186) - Failed to run cluster job 'Workbook job (12345): MyWorkbook with MyJob#Joined(Disconnected record stream)' [1 hrs, 18 mins, 24 sec] java.lang.RuntimeException: Job job_1447373200318_0080 failed! Failure info: Task failed task_1447373200318_0080_r_000071 Job failed as tasks failed. failedMaps:0 failedReduces:1
 at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:49)
 at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:31)
 at datameer.dap.common.graphv2.hadoop.MrJob.runImpl(MrJob.java:228)
 at datameer.dap.common.graphv2.ClusterJob.run(ClusterJob.java:128)
 at datameer.dap.common.graphv2.ClusterSession.execute(ClusterSession.java:181)
 at datameer.dap.common.graphv2.ConcurrentClusterSession$1.run(ConcurrentClusterSession.java:48)
 at datameer.dap.common.security.DatameerSecurityService$1.call(DatameerSecurityService.java:135)
 at datameer.dap.common.security.DatameerSecurityService$1.call(DatameerSecurityService.java:129)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.IOException: Job job_1447373200318_0080 failed! Failure info: Task failed task_1447373200318_0080_r_000071 Job failed as tasks failed. failedMaps:0 failedReduces:1
 at datameer.dap.common.job.mr.DefaultMrJobClient.waitUntilJobCompletion(DefaultMrJobClient.java:234)
 at datameer.dap.common.job.mr.DefaultMrJobClient.runJobImpl(DefaultMrJobClient.java:91)
 at datameer.dap.common.job.mr.MrJobClient.runJob(MrJobClient.java:34)
 at datameer.dap.common.graphv2.hadoop.MrJob.runImpl(MrJob.java:216)
 ... 9 more
 Caused by: java.lang.RuntimeException: Task: AttemptID:attempt_1447373200318_0080_r_000071_3 Timed out after 600 secs

Cause

Hadoop terminated the task because it did not report progress to the cluster within the configured timeout. This can happen when, for example, other tasks on the same node take priority at that time. The timeout is set in milliseconds by the following property, and the 600 secs in the log correspond to its Hadoop default of 600000:

mapreduce.task.timeout
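
To relate the log message to the property value, the following minimal sketch extracts the attempt ID and the timeout from a log line like the one shown in the Problem section and converts the seconds to milliseconds. The function name and the regular expression are illustrative assumptions, not part of Datameer or Hadoop.

```python
import re

# Assumed pattern, modeled on the timeout message in the job log above.
TIMEOUT_PATTERN = re.compile(r"AttemptID:(\S+) Timed out after (\d+) secs")


def parse_task_timeout(log_line):
    """Return (attempt_id, timeout_ms) for a timeout log line, or None."""
    match = TIMEOUT_PATTERN.search(log_line)
    if match is None:
        return None
    attempt_id = match.group(1)
    seconds = int(match.group(2))
    # mapreduce.task.timeout is expressed in milliseconds.
    return attempt_id, seconds * 1000


line = ("Caused by: java.lang.RuntimeException: Task: "
        "AttemptID:attempt_1447373200318_0080_r_000071_3 Timed out after 600 secs")
print(parse_task_timeout(line))
# ('attempt_1447373200318_0080_r_000071_3', 600000)
```

The trailing _3 in the attempt ID shows that this was not the first attempt of the task.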

Solution

Increase the timeout by setting the following property for this job, and then re-run the job. The value is in milliseconds, so 6000000 corresponds to 100 minutes:

mapreduce.task.timeout=6000000

A Datameer administrator can implement this recommendation.

If that doesn't resolve the issue, contact Datameer Support for further assistance. 
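
If a different timeout is preferred, the property line can be derived from a value in minutes. The helper below is a hypothetical sketch for that arithmetic only; it does not apply the setting to a job.

```python
def task_timeout_property(minutes):
    """Format the mapreduce.task.timeout line for a timeout given in minutes."""
    if minutes < 0:
        raise ValueError("timeout must not be negative")
    # The property expects milliseconds: minutes * 60 s * 1000 ms.
    milliseconds = int(minutes * 60 * 1000)
    return f"mapreduce.task.timeout={milliseconds}"


print(task_timeout_property(100))  # mapreduce.task.timeout=6000000
print(task_timeout_property(10))   # mapreduce.task.timeout=600000
```

The first call reproduces the value recommended above, and the second reproduces the 600 secs seen in the log.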

Further Information

The Apache Hadoop documentation for mapred-default.xml describes the mapreduce.task.timeout property as follows:

"The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout."
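
The rule quoted above can be illustrated with a simplified model. This is a sketch under stated assumptions, not Hadoop's implementation: the function and parameter names are hypothetical, and a single "last activity" timestamp stands in for reading input, writing output, or updating the status string.

```python
def is_timed_out(last_activity_ms, now_ms, timeout_ms):
    """Simplified model: has the task been inactive longer than the timeout?"""
    # Per the documentation, a value of 0 disables the timeout.
    if timeout_ms == 0:
        return False
    # Assumption: the task counts as timed out once the idle time exceeds the limit.
    return now_ms - last_activity_ms > timeout_ms


print(is_timed_out(0, 600001, 600000))   # True: idle for more than 600 secs
print(is_timed_out(0, 600001, 6000000))  # False: within the increased timeout
print(is_timed_out(0, 600001, 0))        # False: timeout disabled
```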