Removing duplicates from a table

Posted by Rus Paul-Adrian on 25-Feb-2017 04:49

So I have a table which has duplicate rows for everything, but there is a catch: the duplicate elements are not consecutive and they have different primary keys. Is there a code snippet that can work for this case?

All Replies

Posted by James Palmer on 25-Feb-2017 08:22

You'll need to define your table more clearly for us. What makes something a duplicate? What indexes are on that table.
A solution should be simple but it won't be quick to run if the table is large.

Get Outlook for Android





Posted by Patrick Tingen on 25-Feb-2017 13:33

How can you have duplicates if the pk is not the same? It's part of the table as well as the rest of the data. If - and I am starting to assume here - you have a synthetic key (a single integer) and all the rest is the same, then something along this lines should do the trick:

for each <yourtable>
  break by <field-1>
        by <field-2>
        by <field-3>
        ...
        by <field-n>:

  if not first-of(<field-n>) then delete <yourtable>.
end.

The idea is to do a break-by on all fields except the primary index field. Then, if you find a duplicate, delete it. This may not work if your table is large though.

 

Posted by OctavioOlguin on 25-Feb-2017 15:20

Some times, what it makes duplicate some data, it's the iterpretation that some actor on the system gives to the data, dont you think?

Assuming data is at least 4NF, we can consider its economical and practically efficient for database storing, but perhaps some department gives the same value to datum with, let's say, KEY(CustNum, DateOperation) and they consider this as duplicate record (if customer showed  up two times in a date), and Rus Paul was asked to create a data table with the concentrated data about this scenario...

Just a thought .  

And [mention:e666fffb14004b29b4bad87b731999a8:e9ed411860ed4f2ba0265705b8793d05] showed what most of us would think as solution to the task at hand..

Posted by marian.edu on 26-Feb-2017 00:50

There can be many ways to find a ‘duplicate’ once one is able to define what a duplicate means, Patrick’s break-by approach can work - probably audit fields need to be excluded as well (create, last update, user) and I imagine it might have issues with ‘large objects’ but the idea is there… however, I would not simply delete the ‘duplicate’ like that as it’s PK might be used in child tables (FK) and that could make a mess out of the database :(


So, there hardly is a code snippet for this but coding it once you have the requirements should not be difficult:
- define the criteria for ‘duplicate’ identification
- decide which of the dupes to keep (some timestamp field maybe)
- for each dupes to be deleted go through the child tables and update the PK to point to the ‘winner’ and only then delete the dupe


Marian Edu

Acorn IT 
+40 740 036 212

Posted by gus bjorklund on 26-Feb-2017 07:44

before deleting duplicates, consider whether you should.

if the data appear to be duplicates but maybe they aren’t, then maybe you shouldn’t be deleting business records just because they don’t conform to 4NF or some other “normal” form.

if they’re real duplicates then fine=

This thread is closed