• Official comment
Joel Stewart

Hello Fritz, you're absolutely right: the documentation as published does not adequately describe the expected behavior of this function. The name Jaccard is also somewhat misleading, since Jaccard is about comparing two sets of data with each other, while this function takes only a single set as input.

The current behavior is that the function performs an order-specific segmentation of the input data into two equally sized subsets and then computes the Jaccard Index between these subsets. You're correct that the order of the inputs affects the result, because it determines the membership of the subsets. I hope this clarifies what you have observed.

Ultimately, if you're looking to compute the Jaccard distance between two well-defined sets, I recommend using the SDK to create a function that does this precisely the way you'd like.

• Fritz Schinkel

Hi Joel, thanks for the explanation. Do you have any hint as to which positions belong to which set, and how to guarantee the order? I just tried a,b,c,d,a,b,c,e and was surprised to get 0, since there are 2 unique elements, so the cardinality of the intersection should be less than the cardinality of the union.

• Joel Stewart

I don't know yet, but I'll try to find out. I'll send another update tomorrow.

• Joel Stewart

Hi Fritz, after some research and consultation with colleagues I can confirm that records are currently assigned to the two sets in simple alternating order. Note that the order displayed in the preview data is not guaranteed to be the order in which the records are processed when the data set is run on the cluster. How the data is distributed, and when earlier steps complete in a distributed run, will affect the order of the input records for this step. The order could only be guaranteed by using a GROUP_SORT_ type function instead of a GROUPBY.

To confirm with your example of records a,b,c,d,a,b,c,e: the two sets would be {a,c,a,c} and {b,d,b,e}. These two sets do in fact have a Jaccard Index of 0, because their intersection is the empty set.
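The alternating split described above can be reproduced outside Datameer with a minimal Python sketch. This is not the Datameer implementation; the helper names `alternating_split` and `jaccard_index` are made up for illustration, and it assumes the first record goes to the first subset.

```python
def alternating_split(records):
    """Assign records to two subsets by alternating position (assumed: first record -> first subset)."""
    return records[0::2], records[1::2]


def jaccard_index(left, right):
    """Jaccard index: size of the intersection divided by size of the union, on plain sets."""
    a, b = set(left), set(right)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


records = ["a", "b", "c", "d", "a", "b", "c", "e"]
first, second = alternating_split(records)
print(first, second)                 # ['a', 'c', 'a', 'c'] ['b', 'd', 'b', 'e']
print(jaccard_index(first, second))  # 0.0
```

Because the subsets depend only on position, reordering the same eight records can change the result.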

• Fritz Schinkel

Hi Joel, thanks for your further explanation; at least I understand the implementation now. But I see no way to use it. GROUP_JACCARD_DIST is a group aggregate function requiring a GROUP_BY, and adding a serial function like GROUP_SORT_ASC gives the expected error.

The function seems to distinguish identically named elements when they are at different positions. For example, a,a,b,b,c,c,c,d is not treated as {a,b,c,c}={a,b,c} and {a,b,c,d}, but as something like {a1,b2,c3,c4} and {a1,b2,c3,d4}. Yet a,a,b,b,c,d,d,c behaves like {a,b,c,d} and {a,b,d,c}={a,b,c,d}. So the implementation does not consistently treat the values as elements of a set (case 1).
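For comparison, here is a small Python sketch of what the two readings mentioned above would yield for both sequences after an alternating split. It does not reproduce the actual output of GROUP_JACCARD_DIST; the function names are hypothetical, and the position-tagged variant is only one guess at what "something like {a1,b2,c3,c4}" could mean.

```python
from fractions import Fraction


def alternating_split(records):
    """Assumed split: even positions to the first subset, odd positions to the second."""
    return records[0::2], records[1::2]


def index_as_sets(left, right):
    """Jaccard index with values treated as plain set elements (inputs must be non-empty)."""
    a, b = set(left), set(right)
    return Fraction(len(a & b), len(a | b))


def index_position_tagged(left, right):
    """Jaccard index where each value is tagged with its position inside its subset."""
    a, b = set(enumerate(left)), set(enumerate(right))
    return Fraction(len(a & b), len(a | b))


for records in ("aabbcccd", "aabbcddc"):
    left, right = alternating_split(records)
    print(records, index_as_sets(left, right), index_position_tagged(left, right))
# aabbcccd 3/4 3/5
# aabbcddc 1 1/3
```

Comparing these values with what the function returns for the same input shows which reading, if either, it follows for a given sequence.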

The result is indeed, sometimes and with the right interpretation of the sets, the Jaccard index, but not the Jaccard distance that the function name and the documentation promise, which is 1 minus the Jaccard index.
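The relationship between the two quantities is easy to state in code. The following is a minimal Python sketch on plain sets, with a hypothetical function name, and is not tied to any Datameer API:

```python
def jaccard_distance(left, right):
    """Jaccard distance = 1 - Jaccard index, with inputs treated as plain sets."""
    a, b = set(left), set(right)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


print(jaccard_distance("acac", "bdbe"))  # 1.0  (disjoint sets: index 0, distance 1)
print(jaccard_distance("abc", "abcd"))   # 0.25 (index 3/4, distance 1/4)
```

So a function named for the distance would be expected to return 1 for disjoint sets, where the index is 0.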

Do you know of any practical use for this implementation? In my view nobody can use it and it can't be repaired. So Datameer could replace it with a usable design.

• Joel Stewart

I cannot recommend any practical use of the current implementation. Based on this, I'm coordinating with our Product team to replace this function in a future release with one that takes two set inputs and computes the Jaccard Index. I am recommending that GROUP_JACCARD_DIST, in its current implementation, be deprecated and removed.