Export Job to Hive Dropping Records
When exporting to Hive, a large number of records is consistently dropped. Exporting the same records directly to HDFS completes without issue; the dropped records only appear when exporting to either a new or an existing Hive table.
There is no exact error message; instead, the job completes with warnings. The log snippet below shows the output from a Hive export job that is dropping records.
INFO [<timestamp>] [MrPlanRunnerV2] (JobExecutionTraceService.java:106) - Copying job execution trace log from /app/datameer/Datameer-<version>-<dist>/build/cache/dfscache/local-job-execution-traces/<jobID> to maprfs:/datalake/corporate/dataliberation/datameer/exportjobs/<configID>/<jobID>/job-execution-trace.log
INFO [<timestamp>] [JobScheduler worker1-thread-253] (DapJobCounter.java:176) - Job SUCCESS with '1' mr-jobs and following counters:
INFO [<timestamp>] [JobScheduler worker1-thread-253] (DapJobCounter.java:179) - EXPORT_RECORDS: 1908
INFO [<timestamp>] [JobScheduler worker1-thread-253] (DapJobCounter.java:179) - EXPORT_DROPPED_RECORDS: 3617
Further investigation in the task logs revealed the following nondescript error message:
!message:NullPointerException:
!Error Repeated:>100 times
!stack:java.lang.NullPointerException
    at datameer.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:210)
    at datameer.com.google.common.collect.Lists.newArrayList(Lists.java:142)
    at datameer.das.plugin.hive.RecordObjectInspector.coerceColumnForHive(RecordObjectInspector.java:118)
    at datameer.das.plugin.hive.RecordObjectInspector.getStructFieldsDataAsList(RecordObjectInspector.java:108)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:396)
    at datameer.das.plugin.hive.HiveFileOutputAdapter.write(HiveFileOutputAdapter.java:80)
    at datameer.das.plugin.hive.HiveOutputAdapter.write(HiveOutputAdapter.java:74)
    at datameer.dap.common.job.dapexport.ExportJob$ExportRecordProcessor$1.computeValue(ExportJob.java:115)
    at datameer.dap.common.job.dapexport.ExportJob$ExportRecordProcessor$1.computeValue(ExportJob.java:108)
    at datameer.dap.sdk.sequence.Sequence$Singleton.moveToNext(Sequence.java:246)
    at datameer.dap.sdk.sequence.Sequence$14.computeNext(Sequence.java:647)
    at datameer.dap.sdk.sequence.Sequence$Simple.moveToNext(Sequence.java:157)
    ...
In this instance, the cause was identified by isolating the columns being exported to Hive until the offending fields were determined. Through this process of elimination, following a binary search pattern, two columns were found to contain null values.
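The isolation procedure above can be sketched in a few lines. This is a minimal illustration only, not Datameer code: `export_succeeds` stands in for re-running the export job against a subset of columns and checking that EXPORT_DROPPED_RECORDS is zero.

```python
def find_offending_columns(columns, export_succeeds):
    """Binary-search isolation: split the column list and recurse only
    into halves whose export still drops records."""
    if export_succeeds(columns):
        return []            # no offending column in this subset
    if len(columns) == 1:
        return columns       # narrowed down to a single culprit
    mid = len(columns) // 2
    return (find_offending_columns(columns[:mid], export_succeeds)
            + find_offending_columns(columns[mid:], export_succeeds))

# Hypothetical check: pretend columns "B" and "E" contain nulls,
# so any export including them drops records.
bad = {"B", "E"}
succeeds = lambda cols: not (set(cols) & bad)
print(find_offending_columns(list("ABCDEF"), succeeds))  # → ['B', 'E']
```

Each level of recursion halves the search space, so the culprits are found in far fewer export runs than testing columns one at a time.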
The formulas attached to those columns were updated to include the DENULLIFY wrapper, which resolved the NullPointerException listed above. All records were then exported to Hive as expected.
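As a hedged illustration (the sheet and column names here are hypothetical), wrapping a column's formula in DENULLIFY replaces null values with a supplied default before the export runs, for example substituting an empty string:

```
DENULLIFY(#RawData!Amount; "")
```

With the nulls replaced, the Hive serializer no longer encounters null column values and the records are written rather than dropped.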