Duplicate Records Processed in Workbook Functions
Functions produce duplicate records on some workbook sheets. Specifically, source Parquet files smaller than a size threshold may be read twice, so any downstream calculations include the duplicated data. No errors are displayed or logged to indicate the issue.
For example, a grouping Workbook is expected to produce 15 results, but actually produces 20 results.
This is a software defect in Datameer, known internally as DAP-36752.
The problem occurs only if the job processes at least one small file (smaller than the threshold) and at least one large file (larger than the threshold). The exact threshold depends on the system settings for the following values:
- minSplitSize: Value of the property mapreduce.input.fileinputformat.split.minsize (default 128MB)
- maxSplitSize: Value of the property mapreduce.input.fileinputformat.split.maxsize (default 512MB)
- parquetMaxBlockSize: Value of the property das.parquet-storage.max-parquet-block-size (default 256MB)
The threshold is calculated as max(minSplitSize, min(parquetMaxBlockSize, maxSplitSize)). With the default values, the minimum of parquetMaxBlockSize (256MB) and maxSplitSize (512MB) is 256MB, and the maximum of that result and minSplitSize (128MB) is 256MB, so the default threshold is 256MB.
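The threshold rule and the small-file/large-file condition described above can be sketched as follows. This is only an illustration, not Datameer code: the function names are hypothetical, "MB" is assumed to mean 1024 * 1024 bytes, and a file exactly equal to the threshold is treated as neither small nor large because the article does not say how that case is handled.

```python
MB = 1024 * 1024  # assumption: "MB" in the property defaults means 1024 * 1024 bytes


def split_threshold(min_split_size, max_split_size, parquet_max_block_size):
    """Return max(minSplitSize, min(parquetMaxBlockSize, maxSplitSize))."""
    return max(min_split_size, min(parquet_max_block_size, max_split_size))


def job_is_at_risk(file_sizes, threshold):
    """Hypothetical check: True if the input mixes at least one file smaller
    than the threshold with at least one file larger than it."""
    has_small = any(size < threshold for size in file_sizes)
    has_large = any(size > threshold for size in file_sizes)
    return has_small and has_large


# Default property values listed above: 128MB, 512MB, 256MB.
DEFAULT_THRESHOLD = split_threshold(128 * MB, 512 * MB, 256 * MB)
```

For example, `job_is_at_risk([10 * MB, 300 * MB], DEFAULT_THRESHOLD)` evaluates to true, while a job that reads only files below the threshold does not meet the condition.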
The following Datameer versions are affected:
- 7.1.3, 7.1.4 and 7.1.5
- 6.4.7, 6.4.8 and 6.4.9
- 6.3.9 and 6.3.10
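Assuming the versions listed above are the affected releases, a small helper like the one below could be used to flag an installation. The function name and the plain string format of the version are hypothetical. How the installed version is obtained is left out.

```python
# Versions taken from the list above; assumed to be the affected releases.
AFFECTED_VERSIONS = {
    "7.1.3", "7.1.4", "7.1.5",
    "6.4.7", "6.4.8", "6.4.9",
    "6.3.9", "6.3.10",
}


def is_affected(version):
    """Hypothetical helper: True if the given version string is in the list."""
    return version.strip() in AFFECTED_VERSIONS
```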
To work around this issue, disable splitting by adding the following Custom Property to the Hadoop Cluster's Custom Properties configuration:
Adding this property negatively affects job performance, so it is advised to install the maintenance release as soon as possible.
If the workaround is applied, deactivate it after updating to a fixed release.
Apply the latest Datameer maintenance release to resolve this issue. The fix is included in releases 6.4.10, 7.1.6, and higher.