Filesystem full due to excessive reduce-part-#### files generated in dfsCache folder

Problem

The Datameer file system has filled up due to excessive reduce-part-#### files generated in the Datameer dfsCache folder.

Cause

This issue occurs when the number of preview records (# Sample Records) configured for Import Jobs, Datalinks or File Uploads is set too high across several artifacts. Datameer generates a preview data cache for each artifact and keeps these files in the dfsCache folder. The higher the preview record count, and the more artifacts that use it, the more space these cache files consume, until the filesystem fills up.

Solution

  • First, confirm that the dfsCache folder is the one being filled, and that the files are of the type reduce-part-####. To do so, run the following command on the filesystem to list files larger than 100 MB:
    find <datameer-install-dir>/ -size +100M
  • Note the output and determine whether any of these files are in the dfsCache folder and named reduce-part-####.
  • Once confirmed, run the following query on Datameer's application database to identify the artifacts with preview records greater than 5000:
    SELECT id, max_preview_records FROM data_source_configuration WHERE max_preview_records > 5000;
  • The names of these artifacts can be identified by the following query:
    SELECT ds.type_id DataSourceType, df.id DataStoreId, df.name DataStoreName,
           df2.extension Type, df2.id ImportDataId, df2.name ImportDataName
    FROM data_store ds
    INNER JOIN dap_file df ON ds.dap_file__id = df.id
    INNER JOIN data_source_configuration dsc ON ds.id = dsc.connection_fk
    INNER JOIN dap_job_configuration djc ON dsc.id = djc.id
    INNER JOIN dap_file df2 ON djc.dap_file__id = df2.id
    WHERE dsc.max_preview_records > 5000;
  • Once identified, reduce this number to the default of 5000:
    UPDATE data_source_configuration SET max_preview_records=5000 WHERE max_preview_records > 5000;
  • Once the update is complete, it is safe to delete the dfsCache folder that is consuming the space.
  • Restarting Datameer afterward is recommended as a best practice.
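
The first two steps above can be narrowed to the cache files themselves. The following is a minimal sketch, not an official Datameer tool: the function names list_reduce_parts and total_reduce_parts_kib are hypothetical, and it assumes the dfsCache folder sits somewhere below the directory you pass in (for example, the Datameer install directory).

```shell
#!/bin/sh
# Hypothetical helpers (not part of Datameer): report reduce-part-* files
# that live anywhere under a dfsCache folder below the given directory.

# List matching files with their size in KiB, largest first.
# $1 = directory to search (assumed to contain the dfsCache folder)
list_reduce_parts() {
    find "$1" -type f -path '*/dfsCache/*' -name 'reduce-part-*' \
        -exec du -k {} + | sort -rn
}

# Print the combined size of those files in KiB (0 if none are found).
# $1 = directory to search
total_reduce_parts_kib() {
    list_reduce_parts "$1" | awk '{ sum += $1 } END { print sum + 0 }'
}
```

For example, after sourcing the functions, `list_reduce_parts <datameer-install-dir>` shows which cache files are largest, and `total_reduce_parts_kib <datameer-install-dir>` gives a single figure to compare before and after lowering the preview record count.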