Detecting duplicate files turns out to be fairly easy. For two files, it goes like this:
- Read first file.
- Calculate a unique fingerprint, and store the fingerprint.
- Read second file.
- Calculate a unique fingerprint, and compare the fingerprint to the first file’s fingerprint.
- If they are equal, the two files are duplicates.
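One way of doing this fingerprinting is to use a hashing algorithm, such as MD5 or SHA-1. A hashing algorithm gives you a fingerprint that is, for practical purposes, unique to each file’s contents: two different files can in principle produce the same hash, but for accidental duplicates that is extremely unlikely. The snippet of code to do that, which I borrowed from the Visual Basic Knowledgebase, is below the fold.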
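As a rough illustration of the fingerprinting step, here is a minimal sketch that hashes a file’s raw bytes with either MD5 or SHA-1 and returns the result as a hex string. The names FingerprintSketch, Fingerprint and useSha1 are my own placeholders for this example, not part of any library.

```vbnet
Imports System
Imports System.IO
Imports System.Security.Cryptography

Module FingerprintSketch
    ' Hypothetical helper: returns the hash of a file as an upper-case hex string.
    ' useSha1 = False selects MD5, useSha1 = True selects SHA-1.
    Public Function Fingerprint(ByVal filePath As String, Optional ByVal useSha1 As Boolean = False) As String
        Dim hasher As HashAlgorithm
        If useSha1 Then
            hasher = SHA1.Create()
        Else
            hasher = MD5.Create()
        End If

        Using hasher
            Using stream As FileStream = File.OpenRead(filePath)
                ' ComputeHash reads the stream to the end and hashes the raw bytes,
                ' so this works for binary files as well as text files.
                Return BitConverter.ToString(hasher.ComputeHash(stream)).Replace("-", "")
            End Using
        End Using
    End Function
End Module
```

Two files with identical contents give the same string from Fingerprint, as long as the same algorithm is used for both.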
There are also applications, such as DiskState, that scan your whole hard disk for duplicates using MD5 hashing. They work like this:
- Read the first file.
- Calculate an MD5 hash, and store that value in a list, along with the filename and location.
- If there is another file, read it and go back to the previous step.
- At this point, every file on the hard disk has been checked.
- Now search through the list for any hashes that are the same.
- If there are any, those files are duplicates.
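The scanning steps above can be sketched roughly as follows. This is only my guess at the general approach, not DiskState’s actual code. The names DuplicateScanSketch, HashFile and FindDuplicates are placeholders I made up. Instead of searching a flat list afterwards, the sketch groups file paths by hash in a dictionary as it goes.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module DuplicateScanSketch
    ' Hypothetical helper: MD5 hash of a file's raw bytes, as a hex string.
    Private Function HashFile(ByVal filePath As String) As String
        Using hasher As MD5 = MD5.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(hasher.ComputeHash(stream))
            End Using
        End Using
    End Function

    ' Walks rootFolder (including subfolders) and returns only the hashes
    ' shared by two or more files, each mapped to the paths that produced it.
    Public Function FindDuplicates(ByVal rootFolder As String) As Dictionary(Of String, List(Of String))
        Dim filesByHash As New Dictionary(Of String, List(Of String))
        For Each filePath As String In Directory.GetFiles(rootFolder, "*", SearchOption.AllDirectories)
            Dim fileHash As String = HashFile(filePath)
            If Not filesByHash.ContainsKey(fileHash) Then
                filesByHash(fileHash) = New List(Of String)
            End If
            filesByHash(fileHash).Add(filePath)
        Next

        ' Keep only the groups that contain more than one file.
        Dim duplicates As New Dictionary(Of String, List(Of String))
        For Each entry As KeyValuePair(Of String, List(Of String)) In filesByHash
            If entry.Value.Count > 1 Then
                duplicates.Add(entry.Key, entry.Value)
            End If
        Next
        Return duplicates
    End Function
End Module
```

Calling FindDuplicates("c:\temp") would return one entry per group of identical files under that folder. Directory.GetFiles throws an UnauthorizedAccessException if it meets a folder it cannot read, so a real whole-disk scan would need to handle that.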
Why do I need to know how to detect duplicate files?
Well, I was updating the Lotus Notes Mail Exporter program the other day, and decided to implement some duplicate attachment file checking. I’m not sure I’m that happy with how I’ve implemented it; time will tell. It does give me an idea for some other programs …
If CompareFiles("c:\temp\fileone.txt", "c:\temp\filetwo.asc") Then
    ' The two files have the same MD5 hash, so treat them as duplicates.
End If

' code from: http://www.vbknowledgebase.com/Default.aspx?Id=88&Desc=Find-Duplicate-files-using-Vb.Net-using-MD5-Hash
Public Function CompareFiles(ByVal FirstFile As String, ByVal SecondFile As String) As Boolean
    Return ReadFile(FirstFile) = ReadFile(SecondFile)
End Function

' Returns the MD5 hash of the file at Path as a hex string.
Private Function ReadFile(ByVal Path As String) As String
    Dim ReadFileStream As System.IO.FileStream
    Dim HashData As New System.Security.Cryptography.MD5CryptoServiceProvider
    ReadFileStream = New System.IO.FileStream(Path, System.IO.FileMode.Open, System.IO.FileAccess.Read)
    Try
        ' Hash the raw bytes of the file. Decoding the file, or the hash, as
        ' ASCII text would turn every byte above 127 into "?" and lose information.
        Dim FetchedContent = System.BitConverter.ToString(HashData.ComputeHash(ReadFileStream))
        Return FetchedContent
    Finally
        ' Always release the file handle, even if hashing fails.
        ReadFileStream.Close()
    End Try
End Function