Rapid Interactive Programming Environment

Finding Duplicates

There are three tools for finding duplicate entries (entries with the same hash checksum).

  • Find duplicates within the master part of the current database.
  • Find files in the master part of the current database that have a duplicate in the slave part of the current database.
  • Find files in the slave part of the current database that have a duplicate in the master part of the current database.

By duplicate I mean that two files have the same MD5 hash checksum.
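As a minimal sketch of what "same MD5 hash checksum" means in practice, the Python below hashes a file and runs the three kinds of check on two sets of entries. The function names, and the use of plain dictionaries (path to checksum) to stand in for the master and slave parts, are assumptions made for illustration; they do not describe how the program itself stores or compares its data.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicates_within(part):
    """Paths in `part` whose checksum is shared by another path in `part`."""
    by_hash = defaultdict(list)
    for path, checksum in part.items():
        by_hash[checksum].append(path)
    return {p for paths in by_hash.values() if len(paths) > 1 for p in paths}


def duplicated_in_other(part, other):
    """Paths in `part` whose checksum also occurs somewhere in `other`."""
    other_hashes = set(other.values())
    return {path for path, checksum in part.items() if checksum in other_hashes}


# Hypothetical entries; "h1" and "h2" stand in for real MD5 checksums.
master = {"a/x.txt": "h1", "a/y.txt": "h1", "b/z.txt": "h2"}
slave = {"s/q.txt": "h2"}

within_master = duplicates_within(master)            # master with master
master_in_slave = duplicated_in_other(master, slave)  # master files duplicated in slave
slave_in_master = duplicated_in_other(slave, master)  # slave files duplicated in master
```

The last two calls use the same function with the arguments swapped, which mirrors how the second and third tools differ only in which part is reported.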


The following screenshots show the menus that can be activated to compare entries and set the internal duplicate status accordingly.


A folder (path) can have one of three duplicate states:

  • Yes = all files directly under the folder's path are duplicated
  • Mix  = some but not all of the files directly under the folder's path are duplicated
  • No   = none of the files directly under the folder's path are duplicated
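The three folder states can be sketched as a small Python function. This is only an illustration: the function name is hypothetical, and treating a folder with no files directly under it as "No" is my assumption, since the text above does not cover that case.

```python
def folder_duplicate_state(file_flags):
    """Return "Yes", "Mix" or "No" for a folder.

    `file_flags` holds one boolean per file directly under the folder's
    path: True if that file is a duplicate, False if it is unique.
    """
    flags = list(file_flags)
    if flags and all(flags):
        return "Yes"   # every file is duplicated
    if any(flags):
        return "Mix"   # some, but not all, are duplicated
    return "No"        # none duplicated (assumed for an empty folder too)
```

Only files directly under the folder's path are passed in, so files in subfolders do not influence the result.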

A file can have one of two duplicate states:

  • Yes = the file is a duplicate of (has the same MD5 hash checksum as) one or more other files in the database
  • No   = the file has a unique MD5 hash checksum

 

Note: you can select more than one duplicate state. The following selection would include only folders that are either fully duplicated or entirely unique.

Upper line options are only for folders.
Lower line options are only for files.
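Selecting several states at once behaves like a membership test on the state. The Python below is a minimal sketch with made-up folder names; the function and the (name, state) pairs are assumptions for illustration, not the program's own filter code.

```python
def filter_by_state(entries, selected_states):
    """Keep the (name, state) pairs whose state is among the selected ones."""
    wanted = set(selected_states)
    return [(name, state) for name, state in entries if state in wanted]


# Hypothetical folders with their duplicate states.
folders = [("photos", "Yes"), ("docs", "Mix"), ("music", "No")]

# Selecting both "Yes" and "No" leaves out the partly duplicated folder.
selected = filter_by_state(folders, {"Yes", "No"})
```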


Note: you can combine duplicate filters with the SQL filters and with the simpler path and file filters (or even use all three kinds of filter together).


The following screenshot is based on the demo described in an earlier part of this documentation.

I ran the duplicate check, master with master, and then displayed all files without any display filters.

 

Scan data for the file entries.

The left side of the display shows the file entries in terms of their:

  • Index inside the relevant file table
  • Path of the file
  • Name of the file
  • Duplicate status
  • MD5 checksum of the file

The right side of the display shows some details about the first duplicate that was found for the file on the left side:

  • Index inside the relevant file table
  • Path of the file
  • Name of the file


The left file may have more than one duplicate; the display shows only one of them on the right.

Since I requested a master with master duplicate check and a display of all files, a given file can appear on both the left and right sides.

Each file appears once and only once on the left side, but it can appear on the right side more than once.

If the left file has one or more duplicates, the duplicate with the lowest index is the one shown on the right side.
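The rule for choosing the right-side entry can be sketched in Python as follows. This reflects my reading of the rule above (the lowest-indexed other file with the same checksum); the function name, the indices and the checksums are made up for illustration.

```python
def first_duplicates(entries):
    """Map each index to the lowest-indexed other entry with the same checksum.

    `entries` is a list of (index, checksum) pairs. Entries without a
    duplicate map to None.
    """
    by_hash = {}
    for index, checksum in entries:
        by_hash.setdefault(checksum, []).append(index)

    result = {}
    for index, checksum in entries:
        # A file is never reported as a duplicate of itself.
        others = [i for i in by_hash[checksum] if i != index]
        result[index] = min(others) if others else None
    return result


# Hypothetical table: entries 1, 3 and 4 share a checksum, entry 2 is unique.
table = [(1, "aa"), (2, "bb"), (3, "aa"), (4, "aa")]
pairs = first_duplicates(table)
```

In this made-up table every index appears exactly once as a key (the left side), entry 1 is the right-side value for both 3 and 4, and entry 4 never appears on the right because a lower index is always found first.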


In the following screenshot, I pressed the column header for the Hash field. Pressing any column header causes the program to redisplay the data with the lines sorted by the selected column. Press the same header again to reverse the sort order.

Information about the first found duplicate.


You can now see that file entries 3, 5 and 7 are identical to one another, but 7 does not appear on the right side because 3 is found first.

 

File entry 6 has no duplicate.


Hard rule:

A file displayed on the right is always in the master part of the current database.

 

I know that the files displayed on the left belong (in this example) to the master because of the green area on the left side of the display:


If I had pressed the Get Files button on the right side (the slave area), I would have got the following display, since in this example there are no entries in the slave area:
