What with digital photography, recordings from satellite TV and a digitised music collection (what else does one do with vinyl LPs and 45s?), one ends up with lots of files.
Over the years I have progressively upgraded my digital storage (who remembers 5 1/4 inch floppies?) until today I have 3 TB disks (about 6 of them, plus numerous 2, 1.5 and 1 TB disks).
About 3 years ago one of my master disks packed in. No problem, every file existed on a back-up. But the files were spread over multiple (smaller) disks, and in some cases the same file was on two different back-up disks, yet the back-up disks were not identical.
So I ended up with a new “master” disk that unfortunately contained multiple copies of the same files.
I tried a few freeware programs to find duplicates, but it was just too difficult.
So I reached for the Microsoft equivalent of a soldering iron (Visual Basic) and got programming.
At least I had the sense to stop using VB6 and move on to .Net and VB-2008. I like VB6, but it is just too old!
My VB program links to an embedded database (SQLite). It runs an MD5 hash function over each file under a given starting point (typically “D:/”) and stores the file’s path, name and MD5 hash value in the database. On a not too powerful PC, a complete scan of a 2 TB disk took 14 hours! Then I run SQL commands to find files with the same MD5 hash value. Initially I had a loop that said:
for each file in the database
get its MD5 hash and mark all other files with the same MD5 as duplicate
The principle was sound, but the practice was rubbish. The loop was taking about 2 seconds per iteration (about 140 hours for a 2 TB disk). With a bit of research, and quite a few false starts, I was able to create 5 SQL statements, each containing nested statements, so that every duplicate in the database was identified in less than 30 seconds. It even identified whether a whole directory was duplicated.
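To give a feel for why the set-based approach is so much faster than the loop, here is a minimal sketch in Python rather than VB. It is not my five statements: the table name `files`, its columns, and the function names `md5_of_file`, `scan` and `find_duplicates` are all made up for illustration. It hashes every file under a starting directory, stores the result in SQLite, and then lets one nested SQL statement pick out every file whose hash occurs more than once.

```python
import hashlib
import os
import sqlite3


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(conn, root):
    """Walk root and record directory, name and MD5 for every readable file."""
    # Hypothetical schema, chosen only for this example.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (dir TEXT, name TEXT, md5 TEXT)"
    )
    # An index on the hash keeps the grouping query cheap on large tables.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_md5 ON files (md5)")
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            full_path = os.path.join(directory, name)
            try:
                file_hash = md5_of_file(full_path)
            except OSError:
                continue  # unreadable file: skip it
            conn.execute(
                "INSERT INTO files VALUES (?, ?, ?)",
                (directory, name, file_hash),
            )
    conn.commit()


def find_duplicates(conn):
    """Return (md5, dir, name) for every file whose hash occurs more than once."""
    return conn.execute(
        """
        SELECT md5, dir, name
        FROM files
        WHERE md5 IN (
            SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1
        )
        ORDER BY md5, dir, name
        """
    ).fetchall()
```

Calling `scan(conn, "D:/")` followed by `find_duplicates(conn)` would list the duplicates grouped by hash. The database does the grouping in a single pass, instead of one query per file as in the loop.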
This program has been progressively updated. For example, it cannot identify whether an MP3 track is duplicated (in the sense of same song but different quality), but it can identify whether two files are the same MP3 with different tagging information. I can create a database for my back-up disk, compare that database with the master disk, and then automatically synchronise the back-up.
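The comparison step might look something like the sketch below, again in Python and again only an illustration. It assumes each disk has its own SQLite file containing a table `files` with columns `dir`, `name` and `md5`; that schema and the function name `missing_from_backup` are assumptions for the example, not what my program actually uses. It attaches the back-up database to the master one and lists every master file whose hash does not appear anywhere on the back-up.

```python
import sqlite3


def missing_from_backup(master_db, backup_db):
    """List (dir, name, md5) of master files whose hash is absent from the back-up."""
    conn = sqlite3.connect(master_db)
    try:
        # ATTACH lets a single query see both database files at once.
        conn.execute("ATTACH DATABASE ? AS backup", (backup_db,))
        return conn.execute(
            """
            SELECT m.dir, m.name, m.md5
            FROM main.files AS m
            WHERE NOT EXISTS (
                SELECT 1 FROM backup.files AS b WHERE b.md5 = m.md5
            )
            ORDER BY m.dir, m.name
            """
        ).fetchall()
    finally:
        conn.close()
```

The rows returned are the files that would still need copying across. Matching on the hash rather than on the path means a file that has merely been moved or renamed on the back-up is not copied a second time.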
That was my first real play with hash functions and with SQL. What I wanted to do next was find a simple way to “publish” the application without having to redistribute .Net. If I publish the files with the Visual Studio publishing function the resulting file is 135 Mbytes!
Oh, it was written in Visual Basic and linked to the “SQLite ADO.Net 2.0/3.5 Provider”, which provided a Visual Basic wrapper for the embedded database SQLite.