Best Practices - Data Retention

Goal

Learn about best practices related to Data Retention and Minimizing your Data Footprint in HDFS.

Learn

Data Retention (under Workbook Configuration)

By default, Datameer purges all historical data for a Workbook except the most recent result (the "Last 1 results" setting). Do not change this default unless a specific use case requires additional history. Keeping additional results wastes storage in HDFS.
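
To make the storage cost concrete, the following is a rough back-of-the-envelope sketch, not a Datameer API. The function name and the example sizes are hypothetical. It assumes each retained result is about the same size and that HDFS uses a replication factor of 3 (the stock HDFS default, which your cluster may override).

```python
def estimated_footprint_gb(result_size_gb, results_kept, replication=3):
    """Rough HDFS space (in GB) used by the retained results of one Workbook.

    Assumptions (not taken from Datameer): every retained result is about
    the same size, and each HDFS block is stored `replication` times.
    """
    if results_kept < 1:
        raise ValueError("results_kept must be at least 1")
    return result_size_gb * results_kept * replication


# Hypothetical Workbook whose result is 20 GB per run.
default_retention = estimated_footprint_gb(20, results_kept=1)
extended_retention = estimated_footprint_gb(20, results_kept=10)

print(f"Keep last 1 result:   {default_retention} GB")
print(f"Keep last 10 results: {extended_retention} GB")
```

Under these assumptions the footprint grows linearly with the number of retained results, so the cost of extra history multiplies across every Workbook that deviates from the default.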

Data Links vs Import Jobs

Data Link is best:

  • When you don't want to import data to the Hadoop cluster (HDFS) and create a data footprint. Note that the Sample Data (default 5,000 rows) is still stored in HDFS, even when using a Data Link.
  • When the data you are working with is already in HDFS (e.g. in a local Hive table).
  • When you need fresh data on every run, and stale data won’t work.
  • When the Data Link fetch time is not critical for your workload.

Import Job is best:

  • When you do want to migrate the data to the Hadoop cluster (HDFS), so that it is local to Datameer.
  • When a Data Link is too slow because of the fetch time.
  • When you are okay with the data being stale after import.
  • When you want to repartition the data.
  • When the data has a "half-life" and only a recent window of it is needed (append mode with a sliding window).
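
The two lists above can be read as a simple decision checklist. The sketch below is only an illustration of that reasoning. The class, field names, and the order in which the criteria are checked are assumptions made for this example and are not part of Datameer.

```python
from dataclasses import dataclass


@dataclass
class SourceNeeds:
    """Hypothetical checklist mirroring the criteria listed above."""
    already_in_hdfs: bool = False
    needs_fresh_data: bool = False
    fetch_time_matters: bool = False
    needs_repartitioning: bool = False
    sliding_window_append: bool = False


def recommend(needs: SourceNeeds) -> str:
    """Return "Data Link" or "Import Job" for a data source.

    The priority order here is an assumption: requirements that only an
    Import Job can satisfy are checked first.
    """
    # Repartitioning and sliding-window appends require imported data.
    if needs.needs_repartitioning or needs.sliding_window_append:
        return "Import Job"
    # Data already in HDFS, or data that must be fresh, favors a link.
    if needs.already_in_hdfs or needs.needs_fresh_data:
        return "Data Link"
    # If fetch time is the bottleneck and stale data is fine, import.
    if needs.fetch_time_matters:
        return "Import Job"
    # Otherwise prefer the option with the smaller HDFS footprint.
    return "Data Link"


print(recommend(SourceNeeds(already_in_hdfs=True)))
print(recommend(SourceNeeds(fetch_time_matters=True)))
```

Real sources often match criteria from both lists, so treat the result as a starting point for discussion, not a rule.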

Minimizing Data Footprint in HDFS

If your goal is to minimize the data footprint in HDFS, consider:

  • Using Data Links instead of Import Jobs, especially if performance is not a concern (see above)
  • Using Export-Only Workbooks 
  • Saving Workbook Results only for the Sheets needed by your downstream processes

Further Reading

How to Optimize a Workbook

Best Practices - Efficiently Handling Multiple Data Sources