GROUPCOUNTDISTINCT may result in Out of Memory (OOM) errors for large datasets

Problem

When working with the GROUPCOUNTDISTINCT function on a dataset with a very high number of records in a group, out of memory exceptions may be encountered.

The following illustrates the operation:

Example

Sheet_1: #user_id, #device 
Sheet_2: GROUPBY(#Sheet_1.device), GROUPCOUNTDISTINCT(#Sheet_2.user_id)

Running this job may result in the following error:

job.log

ERROR [ConcurrentJobExecutor-0] (ClusterSession.java:239) - Failed to generate file for 'GroupByOperation[sheetName=Sheet_2]' [14 mins, 36 sec]
java.lang.RuntimeException: Job job_ failed! Failure info: NA
        at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:38)
        at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:30)
        at datameer.dap.common.graphv2.hadoop.MrJob.runImpl(MrJob.java:193)
        at datameer.dap.common.graphv2.ClusterJob.run(ClusterJob.java:134)
        at datameer.dap.common.graphv2.ClusterSession.execute(ClusterSession.java:227)
        at datameer.dap.common.graphv2.ConcurrentClusterSession$1.run(ConcurrentClusterSession.java:53)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Job job_ failed! Failure info: NA
        at datameer.dap.common.job.mr.HadoopMrJobClient.waitUntilJobCompletion(HadoopMrJobClient.java:189)
        at datameer.dap.common.job.mr.HadoopMrJobClient.runJobImpl(HadoopMrJobClient.java:81)
        at datameer.dap.common.job.mr.MrJobClient.runJob(MrJobClient.java:32)
        at datameer.dap.common.graphv2.hadoop.MrJob.runImpl(MrJob.java:188)
        ... 9 more
Caused by: java.lang.RuntimeException: Task: Error: Direct buffer memory

Cause

This issue is caused by the fact that in order to determine the true distinct value, GROUPCOUNTDISTINCT need to shift all mapped information to disk on one reducer. This is a limitation and can cause disk space and performance/memory issues on large datasets. 

Solution

In order to work around this issue, an intermediary sheet along with a combination of GROUPBY and GROUPCOUNT functions is suggested as described below.

Original

Sheet_1: #user_id, #device 
Sheet_2: GROUPBY(#Sheet_1.device), GROUPCOUNTDISTINCT(#Sheet_2.user_id)

Workaround

Sheet_1: #user_id, #device 
Sheet_2: GROUPBY(#Sheet_1.device), GROUPBY(#Sheet_1.user_id) 
Sheet_3: GROUPBY(#Sheet_2.device), GROUPCOUNT

This will give the same desired output of the GROUPCOUNTDISTINCT function.