Duplicate files and hashing in Visual Basic.

Duplicate file detection is fairly easy to do, as it turns out, and it goes like this.

  1. Read first file.
  2. Calculate a unique fingerprint, and store the fingerprint.
  3. Read second file.
  4. Calculate its fingerprint, and compare it to the first file’s fingerprint.
  5. If they are equal, the two files are duplicates.

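One way of doing this fingerprinting is to use a hashing algorithm, such as MD5 or SHA-1.  A hashing algorithm always gives files with identical contents the same fingerprint, and for practical purposes gives different files different fingerprints, although collisions are possible in theory (and can be deliberately constructed for both MD5 and SHA-1).  The snippet of code to do that, which I borrowed from the Visual Basic Knowledgebase, is below the fold.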
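To make the five steps concrete, here is a minimal sketch in Python (not Visual Basic, and not the Knowledgebase snippet). The names `file_fingerprint` and `are_duplicates` are made up for illustration, and MD5 is just the default choice; any algorithm name that `hashlib` supports would do.

```python
import hashlib


def file_fingerprint(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of the file at `path` (hypothetical helper)."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so a large file is never loaded into memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def are_duplicates(first_path, second_path):
    """Steps 1 to 5: fingerprint both files and compare the results."""
    return file_fingerprint(first_path) == file_fingerprint(second_path)
```

If an accidental collision is a worry, a matching fingerprint can be followed by a byte-by-byte comparison of the two files before they are treated as duplicates.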

Now there are applications, such as DiskState, which scan your hard disk for duplicates by using MD5 hashing.  But they do this:

  1. Read the first file.
  2. Calculate an MD5 hash, and store that value in a list, along with the filename and location.
  3. If there is another file to read, read it and go back to 2.
  4. At this point, we’ve checked all the files on the hard disk.
       Now search through the list: do we have any hashes that are the same?
  5. If we do, we have duplicate files.
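The scan-everything approach can be sketched along the same lines. This is a Python illustration under my own assumptions, not how DiskState actually works: `md5_of_file` and `find_duplicates` are hypothetical names, and a dictionary keyed by hash stands in for the list, so that finding matching hashes needs no separate search.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of one file, read in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root):
    """Map each MD5 hash seen more than once under `root` to its file paths."""
    paths_by_hash = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                paths_by_hash[md5_of_file(path)].append(path)
            except OSError:
                # Skip files that cannot be opened or read.
                continue
    # Keep only the hashes shared by two or more files.
    return {digest: paths
            for digest, paths in paths_by_hash.items()
            if len(paths) > 1}
```

Calling `find_duplicates` on a folder returns one entry per group of identical files, so an empty result means no duplicates were found.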

Why do I need to know how to detect duplicate files?

Well, I was updating the Lotus Notes Mail Exporter program the other day, and decided to implement some duplicate attachment file checking.  I’m not sure I’m that happy with how I’ve implemented it; time will tell.  It does give me an idea for some other programs …



“Man invented fire here”

I’ve done a bit of programming over the years, and I’m currently working with a Visual Basic 6 (VB6) code base.  I’ve found the following quotation so true:

It is a weird thing though if you’ve noticed that for very old applications, you know, I feel like for some of them you often have multiple ways of doing things, they just end up co-existing.

It’s kind of like you look around and you can say, oh, man invented fire here, and it’s like, oh, man discovered wheel here and you’re finding the entire history of 15 years of software development stuck in one codebase…

Michael Feathers, speaking with Scott Hanselman, on Hanselminutes podcast 165.
