GROUPCOUNTDISTINCT Function Fails with NullPointerException

Problem

When the GROUPCOUNTDISTINCT function is executed on a very large data set (more than 10,000 distinct elements in a group), the following error message appears in the Hadoop syslogs for the failing job:

!message:ComputationException: DistinctSubIds: =GROUPCOUNTDISTINCT(#SheetName!ColumnName) failed with NullPointerException:
!Record (current sheet):ColumnA: "(null)", ColumnB: 12345678, ColumnC: (null), ColumnD: "(null)", ColumnE: "(null)" 
!stack:datameer.dap.common.exception.ComputationException: DistinctSubIds: =GROUPCOUNTDISTINCT(#SheetName!ColumnName) failed with NullPointerException:
        at datameer.dap.common.formula.RecordContext.createComputationException(RecordContext.java:128)
        at datameer.dap.common.formula.lazy.RecordEvalSequence.toComputationException(RecordEvalSequence.java:123)
        at datameer.dap.common.formula.lazy.RecordEvalSequence.moveToNext(RecordEvalSequence.java:135)
        at datameer.dap.common.formula.lazy.ExpressionEvaluator2$2.computeNext(ExpressionEvaluator2.java:114)
        at datameer.dap.common.formula.lazy.ExpressionEvaluator2$2.computeNext(ExpressionEvaluator2.java:111)
        at datameer.dap.sdk.sequence.Sequence$Simple.moveToNext(Sequence.java:157)
        at datameer.dap.sdk.sequence.Sequence$23.moveToNext(Sequence.java:1140)
        at datameer.dap.sdk.sequence.Sequence$27.moveToNext(Sequence.java:1240)
        at datameer.dap.sdk.sequence.Sequence$13.moveToNext(Sequence.java:602)
        at datameer.dap.sdk.sequence.Sequence$14.computeNext(Sequence.java:647)
        at datameer.dap.sdk.sequence.Sequence$Simple.moveToNext(Sequence.java:157)
        at datameer.plugin.tez.processing.AggregationVertexRecordProcessor.run(AggregationVertexRecordProcessor.java:161)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:172)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:167)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.NullPointerException
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:3402)
        at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:3196)
        at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:3172)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2735)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2774)
        at datameer.das.functions.grouping.spills.SpillFile.eachSorted(SpillFile.java:98)
        at datameer.das.functions.grouping.spills.SpillingSet.aggregateMergedValues(SpillingSet.java:142)
        at datameer.das.functions.grouping.spills.SpillingSet.aggregateValues(SpillingSet.java:149)
        at datameer.das.functions.grouping.GroupCountDistinctFunction$GroupCountDistinctAggregator.computeAggregationResult(GroupCountDistinctFunction.java:69)
        at datameer.dap.common.formula.lazy.EvalSequence$6.computeValue(EvalSequence.java:122)
        at datameer.dap.common.formula.lazy.SingleEvalSequence.currentValue(SingleEvalSequence.java:31)
        at datameer.dap.common.formula.lazy.EvalSequence.currentIsError(EvalSequence.java:47)
        at datameer.dap.common.formula.lazy.RecordEvalSequence.moveToNext(RecordEvalSequence.java:134)
        ... 21 more

Cause

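This is a limitation of the GROUPCOUNTDISTINCT function. For large data sets, the function spills values to files in temporary storage. That storage can be exhausted, and the stack trace above shows the resulting NullPointerException being thrown while the spill files are sorted and merged.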
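
The spilling approach described above can be illustrated with a minimal sketch. This is not Datameer's SpillingSet implementation: the class name, the max_in_memory threshold, and the one-value-per-line file format are all assumptions made for illustration. Values are buffered in memory, written out as sorted runs once the buffer fills up, and merged at the end to count the distinct values.

```python
import heapq
import os
import tempfile


class SpillingDistinctCounter:
    """Hypothetical illustration of spill-based distinct counting."""

    def __init__(self, max_in_memory=10_000, spill_dir=None):
        self.max_in_memory = max_in_memory  # assumed threshold, not Datameer's
        self.spill_dir = spill_dir          # None uses the default temp directory
        self.buffer = set()
        self.spill_paths = []

    def add(self, value):
        # Values are stored as strings; they must not contain newlines.
        self.buffer.add(str(value))
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the buffered values as one sorted run, one value per line.
        fd, path = tempfile.mkstemp(suffix=".spill", dir=self.spill_dir)
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            for value in sorted(self.buffer):
                out.write(value + "\n")
        self.spill_paths.append(path)
        self.buffer.clear()

    def count_distinct(self):
        # Merge all sorted runs with the in-memory remainder and count
        # each change of value in the merged, sorted stream.
        files = [open(p, encoding="utf-8") for p in self.spill_paths]
        try:
            runs = [(line.rstrip("\n") for line in f) for f in files]
            runs.append(iter(sorted(self.buffer)))
            count, previous = 0, None
            for value in heapq.merge(*runs):
                if value != previous:
                    count += 1
                    previous = value
            return count
        finally:
            for f in files:
                f.close()

    def close(self):
        # Delete the spill files once the result is no longer needed.
        for path in self.spill_paths:
            os.remove(path)
        self.spill_paths.clear()
```

In this sketch, the merge step depends on the temporary directory being writable and having enough free space for every run. That dependency is the part of the process the Cause section identifies as the failure point.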

Solution

As of Datameer versions 4.5.10, 5.0.5, 5.1.1, and 5.2 and later, Datameer uses spill files more efficiently and can better handle large data sets processed through the GROUPCOUNTDISTINCT function.
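
To check whether a given installation already includes the enhancement, the version list above can be encoded as a small helper. This is a sketch only: the function names are hypothetical, version strings are assumed to be plain dotted numbers (no suffixes such as build tags), and "5.2+" is read as 5.2 and every later release.

```python
# First fixed patch release per maintenance line, taken from the list above.
FIXED_IN = {(4, 5): (4, 5, 10), (5, 0): (5, 0, 5), (5, 1): (5, 1, 1)}


def parse_version(text):
    # "5.0.5" -> (5, 0, 5); assumes purely numeric, dot-separated parts.
    return tuple(int(part) for part in text.split("."))


def has_spill_fix(version):
    v = parse_version(version)
    v = v + (0,) * (3 - len(v))  # pad "5.2" to (5, 2, 0)
    if v[:2] >= (5, 2):
        return True  # assumption: 5.2 and all later releases include the fix
    minimum = FIXED_IN.get(v[:2])
    return minimum is not None and v[:3] >= minimum
```

For example, has_spill_fix("5.0.4") evaluates to False and has_spill_fix("5.0.5") to True, so the first installation would need an upgrade before running GROUPCOUNTDISTINCT on very large groups.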

The enhancements are tracked internally under DAP-21628.