Removing Duplicate Rows

2 followers
0
Avatar

Hi,

Is there any other efficient way to remove duplicate rows instead of applying "GROUP BY" to all the columns? Kindly suggest.

Thank You in Advance,

Swetha Telu

Swetha Telu

Official comment

  • Avatar
    Konsta Danyliuk

    Hello Swetha,

    I could suggest the following approach to remove duplicated records from a dataset.

    Let's say you have a dataset and need to remove duplicated records based on data in first 3 columns - this means that, if there are any other records that contain the same values in first 3 columns (let's call them Key Columns), they will be considered as duplicates.

    You could change number of key columns according to your requirements.

    Please try the following steps:

    Prepare the data

    1. Add a column to your sheet and apply function GROUPBY(1) to it. This will fill whole column with the same value ("1" in this case) in each row.
    2. Add one more column (lets name it RowNumber) and apply function GROUPROWNUMBER() to it. This will assign unique number to every row across the sheet (as all of them are at the same group we've created at the step one).
    3. At new column (lets call it KeyColumnsHash). Apply CONCUT function to your 3 key columns, then hash them in order to make result string shorter - you could use formula sha_256(CONCAT(sha_256(#KeyCol1); sha_256(#KeyCol2); sha_256(#KeyCol3)))

    Remove Duplicates

    1. At the new sheet execute GROUPBY function under KeyColumnsHash column. It returns Hash values for rows with unique data in key columns.
    2. At the same sheet apply GROUPFIRST function with RowNumber column as both arguments. It will display corresponding row number for values from KeyColumnsHash column.

    After all these steps you should have two sheets. The first contains whole your data set and additional columns with row number and hash string of your key columns. The second - rows with unique key columns' hash string and their numbers. You could perform Inner Join over these two sheets by RowNumber and exclude unnecessary columns. This will give you your dataset without duplicated rows.

    Hope you will find this helpful.

    Best regards,
    Konsta.

    0

4 comments

  • Avatar
    Swetha Telu

    Hi Konsta,

    I have tried the way you suggested. It is working fine as expected. Thank You so much for the quick response.

    But here if we import the data into a workbook, we need to create a duplicate data source sheet as we cannot apply any modifications to the original data source sheet. So in this case we will have three sheets : 1) Original Data Source sheet 2) Duplicate of the Data Source sheet 3) Required Final Output.

    Please correct me if I am wrong.

    Thanks in Advance,

    Swetha

    0
  • Avatar
    Konsta Danyliuk

    Hello Swetha,

    Yes, you are right, you need to duplicate source sheet to be able to perform action of the dataset.

    In total you should have the following sheets at your workbook:

    1. Original data source.
    2. Duplicate of Original data source (at this sheet you will add GroupRowNumber and Hash or your key columns).
    3. Data preparation sheet (here you should apply GROUPBY function under KeyColumnsHash column and  GROUPFIRST function for RowNumber column).
    4. Result sheet, where you will get results of your Join operation.

    Best regards,
    Konsta.

    0
Please sign in to leave a comment.