Application Master Connectivity Issue

Problem

Datameer is not installed on an edge node, and a firewall sits between the Datameer host and the cluster. All jobs fail in Datameer without showing any progress. The job logs report the following errors, even though the corresponding cluster application runs and completes successfully.

2018-02-01 11:01:17 - Retrying connect to server: datanode1.internal.com/10.11.12.167:45823. Already tried 0 time(s); maxRetries=45
2018-02-01 11:15:58 - Retrying connect to server: datanode1.internal.com/10.11.12.167:45823. Already tried 44 time(s); maxRetries=45

2018-02-01 11:16:18 - Failed to retrieve AM Status via proxy
com.google.protobuf.ServiceException: org.apache.hadoop.net.ConnectTimeoutException: Call From <datameer hostname>/<datameer IP> to datanode1.internal.com/10.11.12.167:45823 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=datanode1.internal.com/10.11.12.167:45823]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
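When triaging several failed jobs, it can help to pull the Application Master endpoint out of the retry messages automatically. The following is a minimal sketch, assuming the log lines follow the "Retrying connect to server: host/ip:port" layout shown above; the function name extract_am_endpoint is hypothetical and not part of Datameer or Hadoop.

```python
import re
from typing import Optional, Tuple

# Matches the "host/ip:port" part of a "Retrying connect to server" log line.
# The pattern is an assumption based on the log excerpt in this article.
_RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def extract_am_endpoint(line: str) -> Optional[Tuple[str, str, int]]:
    """Return (hostname, ip, port) from a retry log line, or None if absent."""
    match = _RETRY_RE.search(line)
    if match is None:
        return None
    return match.group("host"), match.group("ip"), int(match.group("port"))
```

The extracted port is the one that must be reachable from the Datameer host.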


Cause

As soon as the Resource Manager accepts the application, it starts the Application Master on one of the DataNodes. From then on, Datameer communicates directly with the Application Master. The Application Master log shows the address and port it listens on:

2018-02-01 11:00:55,849 [INFO] [IPC Server Responder] |ipc.Server|: IPC Server Responder: starting
2018-02-01 11:00:55,849 [INFO] [IPC Server listener on 45823] |ipc.Server|: IPC Server listener on 45823: starting
2018-02-01 11:00:55,850 [DEBUG] [IPC Server handler 0 on 45823] |ipc.Server|: IPC Server handler 0 on 45823: starting
2018-02-01 11:00:55,850 [INFO] [ServiceThread:DAGClientRPCServer] |client.DAGClientServer|: Instantiated DAGClientRPCServer at datanode1.internal.com/10.11.12.167:45823

Datameer tries to connect to this address using the port reported by the Resource Manager. If this port is closed to incoming connections from the Datameer host, the connection attempt fails with a socket timeout exception.
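To confirm that the firewall is the cause, you can test from the Datameer host whether the Application Master port accepts TCP connections. The following is a minimal sketch using only the Python standard library; the function name can_connect and the default timeout are assumptions for illustration. Because the Application Master only listens while the application is running, the check is meaningful only during a job.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False
```

For example, can_connect("datanode1.internal.com", 45823) run on the Datameer host while the job is active would be expected to return False if the firewall blocks that port.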

 

Solution

Hadoop properties control the port range that the Application Master can bind to for client connections. If these properties are unset or blank on the cluster, any available port can be used.

To dedicate a specific port range to the Application Masters started for Datameer executions, set the following Hadoop custom properties on the Datameer Admin page, and make sure the firewall allows connections from the Datameer host to the cluster nodes on that range.

yarn.app.mapreduce.am.job.client.port-range for MR jobs and tez.am.client.am.port-range for Tez jobs.


For example:

tez.am.client.am.port-range=50000-55000
yarn.app.mapreduce.am.job.client.port-range=50000-55000
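After changing the properties, you can check whether the port seen in a job log falls inside the configured range. The following is a minimal sketch, assuming the range value is a comma-separated list of single ports or start-end pairs; the helper names parse_port_ranges and port_in_ranges are hypothetical.

```python
from typing import List, Tuple


def parse_port_ranges(spec: str) -> List[Tuple[int, int]]:
    """Parse a value such as "50000-55000" into inclusive (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, sep, end = part.partition("-")
        low = int(start)
        high = int(end) if sep else low  # a single port is a one-element range
        if low > high:
            raise ValueError(f"invalid port range: {part!r}")
        ranges.append((low, high))
    return ranges


def port_in_ranges(port: int, spec: str) -> bool:
    """Return True if port lies inside any range described by spec."""
    return any(low <= port <= high for low, high in parse_port_ranges(spec))
```

With the example values above, port 45823 from the failing job lies outside 50000-55000, so port_in_ranges(45823, "50000-55000") returns False. Seeing such a port in a new job log would suggest that the property did not take effect for that job.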