Identify Duplicate Blobs

Posted by James Palmer on 03-Dec-2014 08:23

In our database we have a table that stores files as BLOBS. Due to the sorts of files people are adding it's bloating our database very quickly. I'm in the process of implementing compression on the files which is gaining us around 30GB already, but it's come to my attention that because of the way the code works we quite often end up with duplicate files in different BLOB records. 

We currently store the File Name in the DB record, and I can easily extrapolate the size of the BLOB using LENGTH(). I can test this for equality, but is there an alternative method I can use to establish if files are identical? 

Progress 11.2.1 on Windows

All Replies

Posted by Peter Judge on 03-Dec-2014 08:30

Technically, you can read the BLOB fields and calculate a hash (using the MESSAGE-DIGEST or related functions). Compare the hashes to determine uniqueness.
 
Challenge is when the files change – do all the stored versions need to change? Ie if foo.pdf is stored in 3 records, and someone updates foo.pdf with a signature, do all 3 records need to change? If so, you may want to normalise the db some more, to reduce duplicates.
 
-- peter
 
[collapse]
From: James Palmer [mailto:bounce-jdpjamesp@community.progress.com]
Sent: Wednesday, 03 December, 2014 09:25
To: TU.OE.Development@community.progress.com
Subject: [Technical Users - OE Development] Identify Duplicate Blobs
 
Thread created by James Palmer

In our database we have a table that stores files as BLOBS. Due to the sorts of files people are adding it's bloating our database very quickly. I'm in the process of implementing compression on the files which is gaining us around 30GB already, but it's come to my attention that because of the way the code works we quite often end up with duplicate files in different BLOB records. 

We currently store the File Name in the DB record, and I can easily extrapolate the size of the BLOB using LENGTH(). I can test this for equality, but is there an alternative method I can use to establish if files are identical? 

Progress 11.2.1 on Windows

Stop receiving emails on this subject.

Flag this post as spam/abuse.

[/collapse]

Posted by ssouthwe on 03-Dec-2014 08:31

Perhaps you could add an indexed checksum field, and upon storage of a blob, calculate the checksum, and compare it to the other records to detect a duplicate.

Posted by James Palmer on 03-Dec-2014 08:42

Thanks [mention:9e4ee96fac634b8f91b580e1fb4f7e71:e9ed411860ed4f2ba0265705b8793d05], that's the sort of thing I'm after. I'm inserting a CrossRef table meaning we should be able to get rid of the duplicates.

This thread is closed