Shuffle Failures

Problem

A workbook fails with one of the following errors:

ERROR [2016-08-03 11:30:49.312] [MrPlanRunnerV2] (ClusterSession.java:198) - Failed to run cluster job 'Workbook job
datameer.com.google.common.base.VerifyException: Finished DAG 'Workbook job with state FAILED and diagnostics:[Task failed, taskId=task_1468512490689_41979_2_01_000001, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in fetcher
...
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

or

ERROR [2017-01-23 09:04:39.842] [MrPlanRunnerV2] (ClusterSession.java:220) - Failed to run cluster job 'Workbook job (78177): 
datameer.com.google.common.base.VerifyException: Finished DAG 'Workbook job (78177)with state FAILED and diagnostics:[Task failed, taskId=task_1484145103264_5612_1_00_000001, diagnostics=[TaskAttempt 0 failed, info=[attempt_1484145103264_5612_1_00_000001_0 being failed for too many output errors. failureFraction=0.1, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=3, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0] 

Cause

These failures occur during the shuffle phase, in which map outputs are sorted and transferred to the reducers as input. The shuffle sometimes fails at the fetch stage, when data is being transferred from a mapper to a reducer, and this is usually caused by network communication issues between cluster nodes.

Solution

To gather more information about the affected execution, enable debug logging by adding the following property to the Logging Customization section. This property produces a detailed network communication log.

log4j.category.datameer.dap.common.util.network=DEBUG

Collect the Resource Manager and Node Manager logs for the corresponding application ID. These logs help you understand what happened on the cluster side.
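Assuming log aggregation is enabled on the cluster, the aggregated container logs can be retrieved with the standard YARN CLI. The application ID below is taken from the first error excerpt; substitute the ID of your failed job:

```shell
# Fetch the aggregated container logs for the failed application
yarn logs -applicationId application_1468512490689_41979 > app_logs.txt

# Narrow the output down to shuffle- and fetch-related messages
grep -iE "shuffle|fetcher" app_logs.txt
```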

Ensure that the mapreduce.shuffle.port configured in mapred-site.xml is open between the cluster nodes and isn't blocked by a firewall. You can verify connectivity with the following command:

openssl s_client -connect <datanode>:<shuffleport> -state -msg
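If openssl is not available on the node, a plain TCP reachability check works as well, for example with netcat. The port 13562 used here is only the common default for mapreduce.shuffle.port; verify the actual value in your mapred-site.xml:

```shell
# Quick TCP connectivity check against the shuffle port on a data node
# (replace the host and port with your environment's values)
nc -zv datanode1.example.com 13562
```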

Case 1

The workbook fails under both the Tez and MapReduce execution engines with the following error:

java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES 

The Resource Manager logs report:

2016-08-16 15:57:52,619 DEBUG [fetcher#17] datameer.dap.common.util.network.NetworkProxySelector: Using proxy [HTTP @ example.com/192.168.12.23:8000] for uri http://example2.com:13562/mapOutput?
...
2016-08-16 15:57:52,627 WARN [fetcher#17] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to example2.com:13562 with 15 map outputs
java.io.IOException: Got invalid response code 403 from http://example2.com:13562/mapOutput?job=job_1468512490689_69301&reduce=0&map=attempt_1468512490689_69301_m_000059_0,attempt_1468512490689_69301_m_000075_0,attempt_1468512490689_69301_m_000080_0,attempt_1468512490689_69301_m_000089_0,attempt_1468512490689_69301_m_000094_0,attempt_1468512490689_69301_m_000085_0,attempt_1468512490689_69301_m_000058_0,attempt_1468512490689_69301_m_000067_0,attempt_1468512490689_69301_m_000086_0,attempt_1468512490689_69301_m_000073_0,attempt_1468512490689_69301_m_000093_0,attempt_1468512490689_69301_m_000084_0,attempt_1468512490689_69301_m_000061_0,attempt_1468512490689_69301_m_000074_0,attempt_1468512490689_69301_m_000062_0: Forbidden
...
2016-08-16 15:57:52,632 FATAL [fetcher#18] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Shuffle failed with too many fetch failures and insufficient progress!

This error is caused by an incorrect HTTP proxy server configuration in Datameer (specifically, the proxy blacklist): shuffle traffic between cluster nodes is being routed through the proxy, which rejects it with HTTP 403. Ensure that Datameer is set up according to the Configuring Network Proxy documentation.
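As a rough illustration of the idea, assuming the installation honors the standard JVM proxy settings, internal cluster hosts would be excluded from proxying via http.nonProxyHosts. The host patterns below are purely hypothetical; the exact mechanism and property names for your installation are described in the Configuring Network Proxy documentation:

```properties
# Illustrative JVM options only -- adapt to your actual proxy and cluster hosts
-Dhttp.proxyHost=proxy.example.com
-Dhttp.proxyPort=8000
-Dhttp.nonProxyHosts=*.cluster.internal|192.168.*
```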

Case 2

The workbook runs fine under the MapReduce engine but fails under Tez with the following error:

failed for too many output errors

The Resource Manager logs report:

2017-01-17 15:20:23,440 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: attempt_1484665825016_0373_1_00_000227_0 being failed for too many output errors. failureFraction=0.33333334, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=1, MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0

The Node Manager logs report:

2017-01-17 15:20:23,251 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error: 
org.jboss.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: 
...
2017-01-17 15:20:23,252 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error [id: 0x70fb52f3, /192.158.23.112:36818 => /192.168.23.112:13562] EXCEPTION: org.jboss.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record:

Most likely, this indicates that the Tez framework hasn't been configured to match the environment: SSL support for the shuffle process is disabled on the Tez side, while the cluster's ShuffleHandler expects SSL connections.

Set the following custom properties (they can be set at either the job or cluster level):

tez.runtime.shuffle.ssl.enable=true
tez.runtime.shuffle.keep-alive.enabled=true
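Note that the Tez-side setting must match the cluster side: SSL for the shuffle service is controlled in Hadoop by mapreduce.shuffle.ssl.enabled in mapred-site.xml. A cluster whose ShuffleHandler runs with SSL enabled would typically carry a setting like this (verify the property against your distribution's documentation):

```xml
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
```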

Related documentation

Enable SSL for WebHDFS, MapReduce Shuffle, Tez, and YARN