group_previous

3 followers
0
Avatar

I used group_previous on smaller datasets and worked fine. However, on a 10GB dataset (204M rows), it did not work. Is there a size restriction?

YM Wang

11 comments

  • Avatar
    Joel Stewart

    There is not a size restriction on this function. Can you expand on the issues you observed with the larger data set? 

    0
  • Avatar
    YM Wang

    I used 8 patients data, groupby(A), groupby(B), group_sort_asc(C), group_sort_asc(D), group_previous(E). It worked. When I tried to use the same fields and the same syntax for 204M rows, the group_previous(E) did not show up in the preview. It also seems to slow the Datameer processing.

    -1
  • Avatar
    Konsta Danyliuk

    Hello YM Wang,

    >When I tried to use the same fields and the same syntax for 204M rows, the group_previous(E) did not show up in the preview.

    Do you mean there is no data in Column E? Have you executed the Workbook after you applied formulas to larger dataset, whether there are still no results?

    0
  • Avatar
    Joel Stewart

    Did the workbook complete successfully? Or did it complete with any warnings or errors? 

    What are the data types of the columns you're referencing for columns A-E? 

    0
  • Avatar
    YM Wang

    The fields are string and date types. I think there is no error in the full run. As I mentioned, the weird part is the same setup with <200 rows worked. 

    In addition, if I change the first field to be group_sort_asc(A), second to groupby(B), the group_previous(E) field will show, even in the preview. However, it did not sort right and so no right previous values attained.

    0
  • Avatar
    Joel Stewart

    It is notable that the GROUP_PREVIOUS function does in fact display a null record type when a group's size is 1 because there is no previous value to pull data from. It would be valuable to check the cardinality of the data groups to ensure that there is more than one entry there. 

    Specifically, if you replace the GROUP_PREVIOUS(E) function with a GROUP_COUNT() function. Do you receive counts other than 1 in the resulting column? 

    0
  • Avatar
    YM Wang

    I was wrong that after running, it did get values. And there were errors: Errors

    failed with NullPointerException

     

    I found a thread with the same error message for GROUPCOUNTDISTINCT function. The cause was noted as "This is a limitation for the GROUPCOUNTDISTINCT function. For large datasets, the file spilling to temporary storage can get exhausted and lead to a NullPointerException." I think it is the same situation. I hope to know how big  the data could be before the temp storage gets exhausted.

    In addition, it is interesting when I change the first field to group_sort_asc() instead if groupby(), the preview shows values. However, they are not the values that I need...

    0
  • Avatar
    Joel Stewart

    NullPointerException is a very generic error. Are you able to share more context on the stacktrace for the error itself? For the test I posed in my last update, I was specifically referring to the GROUPCOUNT() function and not the GROUPCOUNTDISTINCT() function. For GROUPCOUNT() temporary space is not a concern. 

    For clarity, here is the test that I am curious to know about:

    • GROUPBY(A) 
    • GROUPBY(B)
    • GROUP_SORT_ASC(C)
    • GROUP_SORT_ASC(D)
    • GROUPCOUNT()

    Then are there values in the GROUPCOUNT column that are greater than 1. If not, it sounds like the GROUP_SORT_ASC results are leading to groupings that are of cardinality (think of group "size") 1. If groups of cardinality 1 exist, the GROUP_PREVIOUS function will return a null value by design. 

    0
  • Avatar
    YM Wang

    It looks like groupcount() cannot be used after group_sort_asc(D) or group_sort_asc(C). It showed red error warning.

    0
  • Avatar
    Joel Stewart

    The sorting is not relevant for the purposes of this test, can you convert them to GROUPBYs for the first four columns directly? 

    0
Please sign in to leave a comment.