Easy deduping of a workbook?

3 followers
0
Avatar

I have a workbook with 14 columns of interest derived from a data source with a couple hundred columns.  No rows are duplicated in the source, but in the subset of 14 columns there are duplicates.  What's the simplest, and hopefully fastest, way to create a deduped copy?  I could do 14 groupbys, but....

Thanks!

 

Steve Bernstein

4 comments

  • Avatar
    Alan Mark

    Hi Steve,

    It's really difficult to say based on the description here.  But there are a few approaches you can take.  First is understanding why there are duplicates.  Exactly when in your workflow are they being created?  Once you understand why they are there to begin with you could address that directly to remove the duplicates.

    If this isn't possible...  It doesn't seem like adding a groupby to the columns wouldn't be too intensive.  This does seem like the fastest way to get to the result you want - since it doesn't require an understanding of why things are duplicated.

    Sincerely,

    Alan Mark

    0
  • Avatar
    Steve Bernstein

    The duplicates are legitimate.  The source file is essentially a log file of changes to one of dozens of properties of a ticket listing on Stubhub, and the millisecond-grain timestamp is not granular enough to eliminate dupes on its own.  There are no dupes when all of the columns are accounted for--something did change.  But I am only interested in a relatively small handful of these properties, so when I remove the columns representing irrelevant properties, dupes are created, and I understand why.

    I ran with the groupby for every column, but notwithstanding I don't understand the internals such that I can predict the performance ramifications, wouldn't it be handy to have a dedupe function which just removes all dupes, and/or splits out the duped records into a separate workbook.  For example, you could write each record as a delimited string, dedup the strongs and parse back to the original record.  Could take just as long as the comprehensive groupby, but way easier on the user...?

    0
  • Avatar
    Joel Stewart

    Thank you for the feedback Steve, we'll gladly share this with our Product team for consideration of a new function.

    From a design point-of-view, would you say that the ideal flow is for a user to select one or more columns (maybe as checkboxes) and then press a "Deduplication" button to implement? If not, do you have any alternative suggestions? 

    0
  • Avatar
    Steve Bernstein

    That's an awesome way to implement the idea.  Maybe could implement it in the right-click-on-tab to "duplicate sheet" process.  Managing it at the column level is a great idea and, and as you may know, parallels the "Remove Duplicates" functionality in Excel (under the Data menu).

    0
Please sign in to leave a comment.