## How to eliminate duplicate rows?

3 followers

I have rows like
- 1, abc, xyz, 123, 567
- 2, def, str, 123, 789
- 1, abc, xyz, 123, 567
- 3, abc, xyz, 123, 567

Whats the best way to eliminate duplicate rows and get the result as

• 1, abc, xyz, 123, 567
• 2, def, str, 123, 789
• 3, abc, xyz, 123, 567
sr

## Official comment

• Joel Stewart

To combine duplicated rows, you may use a grouping function.

In the example above, a GROUPBY(#FirstColumn) would result in only three rows. The other values could be copied over from the original sheet.

• John Mulhall

So to clarify, if you use GROUPBY(#FirstColumn), Datameer will match rows by first column, and then check every column on the sheet to make sure the role is truly duplicated?

e.g.
abc, xyz, 123, 567
abc, xyz, 123, 567
abc, xyz, 123, 568

GROUPBY (FirstColumn)
Result:
abc, xyz, 123, 567
abc, xyz, 123, 568

Is this correct?

• Joel Stewart

John, the GROUPBY(#FirstColumn) would only output a single value. So for this example data, you'd get the following results:
e.g.
abc, xyz, 123, 567
abc, xyz, 123, 567
abc, xyz, 123, 568

GROUPBY(#FirstColumn) -- Result:
abc

However, if you used consecutive GROUPBY statements for all four columns, you'd get the result that you reported:
GROUPBY(#FirstColumn), GROUPBY(#SecondColumn), GROUPBY(#ThirdColumn), GROUPBY(#FourthColumn) -- Result:
abc, xyz, 123, 567
abc, xyz, 123, 568

• John Mulhall

Joel, thanks for this. Much appreciated! It seems like a cumbersome way to deduplicate records in a process. Is there any specific function to do de-deduplicate records like the "UNIQUE" function for lists?

• Joel Stewart

The number of GROUPBY functions used does increase with the number of columns that need to be deduplicated together. There is not another function that will automatically deduplicate records based on all columns.

The GROUPBY approach is very flexible. If there is only a handful of columns that need to be deduplicated, these columns may be used specifically in the Workbook and the others may be omitted.

• John Mulhall

Thanks Joel for your continuing engagement on this. So, if I GROUPBY (#Column A) and row 1 and 3 (col A) have identical identifiers, will the dulicate cell in Col A be removed or will be row/record be removed? If the cell is removed, will the other columnner information in the row/record be out of what with Col A? I presume the duplicate row goes right?

• Joel Stewart

Datameer processes functions on an entire column, not at a cell level. When thinking about the data set in this way, the functions are creating a new manipulated version of the data on this new worksheet.

The operation of GROUPBY lists out all unique values for the requested column. If multiple GROUPBY statements are included on a worksheet, they unique groupings are nested.

I'll work through the latest example you commented about. In this example, you have a set of data in which Column A has the same value for Row 1 and Row 3. Here is an example that meets this requirement:
ColumnA, ColumnB, ColumnC
abc, 123, apple
xyz, 987, apple
abc, 456, banana

If I create a new worksheet with the function "GROUPBY(#ColumnA)", I would receive these results:
GROUPBY(#ColumnA)
abc
xyz
Of note, this is only 2 records because the unique number of values in ColumnA is just 2.

If I add a second function "GROUPBY(#ColumnB)" then I will see these results:
GROUPBY(#ColumnA), GROUPBY(#ColumnB)
abc, 123
abc, 456
xzy, 987
Of note, the output has increased from 2 records to 3 records because the unique pairings of ColumnA and ColumnB are 3.

(The order of the displayed records was changed here for clarity of the grouping concept. I wanted to keep the "abc" records adjacent. This re-order is intended for the demonstration and is not related to this discussion.)

• John Mulhall

Thanks Joel, so it basically works like a manual version of the DISTINCT command in SQL, right?

• Joel Stewart

Yes, the SQL DISTINCT function is a good parallel to draw to the GROUPBY function.