How to Use Datameer Smart Sampling
Datameer offers many features that enable users to work with large data sets. One of those features is Smart Sampling, which reliably provides a view of your data that lets you build an analytic model with real-time feedback.
For optimized performance and an accurate representation of your data, Datameer provides a default initial sample of 5000 records (which can be customized) when you start your analysis. This is a representative sample generated via the distributed reservoir sampling technique:
* Each task generates a sample of size (k / n-tasks) * oversample-factor
* At the end of the job, the samples are merged to generate a final sample of size k
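The two steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Datameer's implementation: the function names, the `oversample_factor` default, and the `seed` parameter are hypothetical, and the merge step is assumed to be a simple uniform draw from the pooled task samples, which is only unbiased when the tasks process roughly equal numbers of records.

```python
import math
import random


def reservoir_sample(records, size, rng):
    """Uniform sample of up to `size` records from a stream (Algorithm R)."""
    sample = []
    for seen, record in enumerate(records, start=1):
        if len(sample) < size:
            # Fill the reservoir with the first `size` records.
            sample.append(record)
        else:
            # Replace a random slot with probability size / seen.
            j = rng.randrange(seen)
            if j < size:
                sample[j] = record
    return sample


def distributed_sample(partitions, k, oversample_factor=2.0, seed=0):
    """Each 'task' samples its own partition; the samples are then merged."""
    rng = random.Random(seed)
    # Per-task sample size: (k / n-tasks) * oversample-factor.
    per_task = max(1, math.ceil((k / len(partitions)) * oversample_factor))
    task_samples = [reservoir_sample(p, per_task, rng) for p in partitions]
    # Assumed merge step: pool the task samples and draw k of them uniformly.
    merged = [record for sample in task_samples for record in sample]
    return rng.sample(merged, min(k, len(merged)))


# Example: four tasks of 1,000 records each, final sample of 100 records.
partitions = [range(i * 1000, (i + 1) * 1000) for i in range(4)]
final_sample = distributed_sample(partitions, k=100)
```

Oversampling in each task gives the merge step more candidates than it needs, so the final sample can still reach size k when some tasks see fewer records than expected.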
While designing an analysis on the initial representative sample, you may end up with an empty sheet (e.g., because of a filter), even though the full data set contains records that would pass the filter. Smart Sampling overcomes this problem by resampling the data, taking the operations designed up to that point into account. Simply run your partially designed workbook, and the new smart sample will populate the sheet with records.
Unbiased Randomized Samples
Unbiased randomized samples do have limitations:
- Increasing the sample size can reduce some of them
- Outliers could be missed or included, either of which affects the preview calculations
- Join and Filter operations can result in empty sheets when trying to match against an underrepresented subset of the records
In order to deal with the empty sheet problem, we bias the samples.
Filtered and Joined Sheets
For filtered sheets, we generate a biased sample based on which records pass through the filter by assigning a larger weight to records that pass.
- This requires computing the formulas and the filter twice
- CPU time isn’t usually the bottleneck, so the extra computation is generally acceptable
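One common way to bias a sample by weight is weighted reservoir sampling, sketched below. Whether Datameer uses this particular algorithm is not stated here, so treat it as an assumption. The function name, the `pass_weight` value, and the `passes_filter` callback are hypothetical names chosen for the example.

```python
import heapq
import random


def biased_filter_sample(records, passes_filter, k, pass_weight=10.0, seed=0):
    """Weighted reservoir sample that favors records passing the filter."""
    rng = random.Random(seed)
    heap = []  # min-heap of (priority, index, record); keeps the k largest
    for index, record in enumerate(records):
        # Records that pass the filter get a larger weight (assumed value).
        weight = pass_weight if passes_filter(record) else 1.0
        # Efraimidis-Spirakis priority: u ** (1 / weight), u uniform in [0, 1).
        priority = rng.random() ** (1.0 / weight)
        entry = (priority, index, record)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif priority > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [record for _, _, record in heap]


# Example: only 1% of the records pass the filter, yet most of the
# 50 sampled records are expected to be ones that pass.
records = range(10_000)
sample = biased_filter_sample(records, lambda r: r % 100 == 0, k=50,
                              pass_weight=1000.0)
passing = [r for r in sample if r % 100 == 0]
```

Records that fail the filter still have a nonzero chance of being sampled, so the preview of the source sheet is biased toward passing records rather than restricted to them.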
For Join sheets, we generate a biased sample by evenly distributing the join keys.
- This lowers the chance of sampling a record if its join key has already been seen too many times
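The join-key idea can be illustrated with the same weighted-reservoir approach, where a record's weight shrinks each time its join key is seen again. The weighting rule used here (1 divided by the number of times the key has been seen so far) is an assumption made for the sketch, as are the function and parameter names.

```python
import heapq
import random
from collections import Counter


def key_balanced_sample(records, join_key, k, seed=0):
    """Weighted reservoir sample that down-weights frequently seen join keys."""
    rng = random.Random(seed)
    seen = Counter()  # how often each join key has appeared so far
    heap = []         # min-heap of (priority, index, record)
    for index, record in enumerate(records):
        key_value = join_key(record)
        seen[key_value] += 1
        # Assumed rule: the n-th record with a given key gets weight 1 / n,
        # so its priority is u ** (1 / weight) = u ** n.
        priority = rng.random() ** seen[key_value]
        entry = (priority, index, record)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif priority > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [record for _, _, record in heap]


# Example: one dominant key ("common") and ten rare keys.
records = [("common", i) for i in range(9_900)]
records += [(f"rare{i % 10}", i) for i in range(100)]
sample = key_balanced_sample(records, join_key=lambda r: r[0], k=50)
distinct_keys = {key for key, _ in sample}
```

With a uniform sample of 50 records, the rare keys in this example would usually be absent, and a join against them would produce an empty sheet. Down-weighting repeated keys makes it much more likely that every key is represented.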
The optimized previews (smart samples) can only be generated when the Workbook is run, and they are generated for the data source sheets and kept sheets. Optimized Preview generation can be disabled by setting the property "das.sampling.lookahead.maxdepth" to 0.
- This can be useful for critical, time-constrained Workbooks
The initial sample for a Data Link isn’t representative. Currently, the sample is just the first k records of the data source. Once the Data Link is used in a Workbook, though, an Optimized Preview is generated for it.