Duplicate files and hashing in Visual Basic.

Duplicate file detection is fairly easy to do, as it turns out, and it goes like this.

  1. Read first file.
  2. Calculate a unique fingerprint, and store the fingerprint.
  3. Read second file.
  4. Calculate a unique fingerprint, and compare the fingerprint to the first files’ fingerprint.
  5. If they equal, the two files are duplicate files.

One way of doing this fingerprinting is to use a hashing algorithm, such as MD5 or SHA-1.  A hashing algorithm should give you a unique fingerprint for each file.  The snippet of code to do that, which I borrowed from the Visual Basic Knowledgebase is below the fold.

Now there are applications, such as DiskState, which scan your hard disk for duplicates by using MD5 hashing.  But they do this:

  1. Read first file.
  2. Do While we have a file to read
       Calculate a MD5 Hash, and store that value in a list, alone with the filename and location
  3. If we have another file, go back to 2.
  4. At this point, we’ve checked all the files on hard disk. 
       Now search though the list, and do we have any hashes that are the same?
  5. If we do, we have duplicate files.

Why do I need to know how to detect duplicate files?

Well I was updating the Lotus Notes Mail Exporter program the other day, and decided to implement some duplicate attachment file checking.  I’m not sure if I’m that happy with how I’ve implemented it, time will tell.  It does give me an idea for some other programs …


(Click here to continue reading Duplicate files and hashing in Visual Basic.)