Spark Job Fails When It's the First Execution After Datameer Restart

Problem

If the first job executed after a Datameer restart is a Spark job, it fails with the following error:

Error: UnknownHostException: mycluster

The issue doesn't occur for Tez jobs, or for any subsequent execution regardless of its engine.

 

Cause

The root cause is currently under investigation by the Datameer engineering team. The internal ticket number is DAP-32386.

 

Workaround

The issue has been confirmed to occur only when both of the following are true: HDFS High Availability is enabled in the environment, and the first job executed after the Datameer restart uses the Spark engine. If you encounter this issue, restart the affected job.
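
To check whether an environment meets the HDFS High Availability condition, the client-side hdfs-site.xml can be inspected. The following is a minimal sketch, not a Datameer tool: it assumes the standard Hadoop properties dfs.nameservices and dfs.ha.namenodes.<nameservice> are used, and that the "mycluster" in the error is the logical nameservice name (this is an assumption, since the root cause is still under investigation). The function names and the sample configuration are hypothetical.

```python
import xml.etree.ElementTree as ET


def hdfs_ha_nameservices(hdfs_site_xml: str) -> dict:
    """Map each configured nameservice to its list of NameNode IDs.

    Takes the text content of an hdfs-site.xml file.
    """
    root = ET.fromstring(hdfs_site_xml)

    # Collect all <property><name>..</name><value>..</value></property> pairs.
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()

    result = {}
    for ns in props.get("dfs.nameservices", "").split(","):
        ns = ns.strip()
        if not ns:
            continue
        ids = props.get(f"dfs.ha.namenodes.{ns}", "")
        result[ns] = [i.strip() for i in ids.split(",") if i.strip()]
    return result


def is_ha_enabled(hdfs_site_xml: str) -> bool:
    """Treat HA as enabled if any nameservice lists two or more NameNodes."""
    return any(len(ids) >= 2 for ids in hdfs_ha_nameservices(hdfs_site_xml).values())


# Hypothetical sample configuration, for illustration only.
SAMPLE = """
<configuration>
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
</configuration>
"""

if __name__ == "__main__":
    print(hdfs_ha_nameservices(SAMPLE))
    print(is_ha_enabled(SAMPLE))
```

To run this against a real cluster configuration, read the contents of the hdfs-site.xml used by the Datameer host and pass the text to is_ha_enabled. If it reports HA and the failed job was the first Spark job after a restart, the workaround above applies.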

If restarting the job doesn't help, please get in touch with Datameer support.