Flip Side Results Accuracy

Goal

Get a better understanding of how Datameer generates values for Flip Side sheets.

Learn

The Flip Side sheet is meant for data exploration, giving you a quick, rough understanding of your data. Not all of its values are exact; some are estimates. When exact values are needed, you can compute them with Datameer spreadsheets and functions.

For the sake of performance, Datameer uses algorithms that read a dataset only once before generating the values for the Flip Side sheet. The accuracy of each value therefore depends on whether an exact result can be computed in a single pass over the data.

 

For example, Count, Min, Max, and Mean are easy to compute with single-pass algorithms, so these values should always be accurate. (Count is displayed as a rounded number of records for easier reading.)
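As an illustration of why these statistics are exact, here is a minimal Python sketch of a single-pass computation. The function name single_pass_stats and the sample values are hypothetical; this is not Datameer's implementation, only an example of the general technique of keeping a few running values while reading the data once.

```python
def single_pass_stats(values):
    """Compute count, min, max and mean while reading the values only once."""
    count = 0
    total = 0.0
    minimum = None
    maximum = None
    for v in values:
        count += 1
        total += v
        if minimum is None or v < minimum:
            minimum = v
        if maximum is None or v > maximum:
            maximum = v
    mean = total / count if count else None
    return {"count": count, "min": minimum, "max": maximum, "mean": mean}


# An iterator can be consumed only once, which mimics a single read of the data.
stats = single_pass_stats(iter([4, 8, 15, 16, 23, 42]))
print(stats)
```

Only four running values are kept in memory, no matter how many records are read, which is why a single pass is enough for exact results here.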

 

However, counting unique values exactly in a single pass requires remembering every distinct value seen so far, which is impractical for large datasets. Therefore the Unique count on the Flip Side sheet displays an estimated count of unique values. The precision of these estimates decreases as data volume increases.

 

To estimate the number of unique values in a column, Datameer uses the HyperLogLog algorithm with 14 index bits (b = 14), which corresponds to 2^14 = 16,384 registers.
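To make the idea more concrete, the following Python sketch shows a textbook-style HyperLogLog with 2^14 registers. It is an assumption-laden illustration, not Datameer's code: the class name, the use of a truncated SHA-1 as the 64-bit hash, and the simple small-range correction are all choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook-style HyperLogLog sketch (illustrative, not Datameer's code)."""

    def __init__(self, b=14):
        self.b = b                      # number of index bits
        self.m = 1 << b                 # number of registers, m = 2^b
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.b)                    # first b bits pick a register
        rest = h & ((1 << (64 - self.b)) - 1)       # remaining 64 - b bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.b) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(b=14)
for i in range(200_000):
    hll.add(f"user-{i}")
for i in range(50_000):
    hll.add(f"user-{i}")    # repeated values do not change the registers
print(round(hll.estimate()))
```

The sketch stores only 16,384 small integers regardless of how many records it reads, which is what makes a single pass feasible. The price is that the result is an estimate rather than an exact count.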

 

You can calculate the expected error rate (the relative standard error) of this algorithm with the formula:

error rate = 1.04/sqrt(m), where m = 2^b and b = 14
error rate = 1.04/sqrt(16384) = 1.04/128 = 0.0081, or 0.81%
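The same arithmetic can be checked with a few lines of Python. The function name hll_standard_error is hypothetical and used only for this example.

```python
import math


def hll_standard_error(b):
    """Relative standard error of HyperLogLog with m = 2^b registers."""
    m = 2 ** b
    return 1.04 / math.sqrt(m)


error = hll_standard_error(14)
print(f"{error:.4f} ({error:.2%})")  # prints "0.0081 (0.81%)"
```

Because the error shrinks with the square root of m, each extra pair of index bits roughly halves the error at the cost of four times as many registers.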

The error rate observed in practice varies slightly with the number of records, e.g. for 10k-100k records it's ~0.4% and for 1M records it's ~0.7%.

 

The following article provides more details on how HLL works: Understanding the HyperLogLog.