Duplicate files and hashing in Visual Basic.

Duplicate file detection is fairly easy to do, as it turns out, and it goes like this.

  1. Read first file.
  2. Calculate a unique fingerprint, and store the fingerprint.
  3. Read second file.
  4. Calculate a unique fingerprint, and compare the fingerprint to the first file’s fingerprint.
  5. If they are equal, the two files are duplicates.

One way of doing this fingerprinting is to use a hashing algorithm, such as MD5 or SHA-1.  A good hashing algorithm gives files with different contents different fingerprints in practice; collisions are possible, but very unlikely to happen by accident.  The snippet of code to do that, which I borrowed from the Visual Basic Knowledgebase, is below the fold.
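As a stand-in for the borrowed snippet, here is a minimal sketch of the two-file comparison described in the steps above. It assumes SHA-1 via the .NET System.Security.Cryptography classes; the module and function names (FileFingerprint, GetFileHash, AreDuplicates) are made up for illustration and are not the Knowledgebase code.

```vbnet
Imports System
Imports System.IO
Imports System.Security.Cryptography

Module FileFingerprint

    ' Returns the SHA-1 hash of a file as a hex string (hypothetical helper).
    Function GetFileHash(ByVal filePath As String) As String
        Using sha As SHA1 = SHA1.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Dim hashBytes As Byte() = sha.ComputeHash(stream)
                ' BitConverter gives "AA-BB-CC"; strip the dashes.
                Return BitConverter.ToString(hashBytes).Replace("-", "")
            End Using
        End Using
    End Function

    ' True when both files produce the same fingerprint.
    Function AreDuplicates(ByVal firstPath As String, ByVal secondPath As String) As Boolean
        Return GetFileHash(firstPath) = GetFileHash(secondPath)
    End Function

    ' Usage: pass two file paths on the command line.
    Sub Main(ByVal args As String())
        If args.Length = 2 Then
            Console.WriteLine(AreDuplicates(args(0), args(1)))
        End If
    End Sub

End Module
```

If you need to be certain, a matching hash can be followed by a byte-by-byte comparison of the two files.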

Now there are applications, such as DiskState, which scan your hard disk for duplicates by using MD5 hashing.  But they do this:

  1. Read first file.
  2. Calculate an MD5 hash, and store that value in a list, along with the filename and location.
  3. If we have another file, read it and go back to 2.
  4. At this point, we’ve checked all the files on the hard disk.
       Now search through the list: do we have any hashes that are the same?
  5. If we do, we have duplicate files.
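The scan-then-compare steps above could be sketched roughly as follows. This is not how DiskState (or any particular product) is actually written; it is only an illustration that uses a Dictionary keyed by hash instead of searching a list afterwards. The names GetMd5 and GroupByHash are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography

Module DuplicateScan

    ' Hex MD5 fingerprint of one file (hypothetical helper).
    Function GetMd5(ByVal filePath As String) As String
        Using md5Hasher As MD5 = MD5.Create()
            Using stream As FileStream = File.OpenRead(filePath)
                Return BitConverter.ToString(md5Hasher.ComputeHash(stream))
            End Using
        End Using
    End Function

    ' Maps each fingerprint to every file that produced it.
    ' Note: GetFiles throws if it meets a folder it is not allowed to read.
    Function GroupByHash(ByVal rootFolder As String) As Dictionary(Of String, List(Of String))
        Dim groups As New Dictionary(Of String, List(Of String))
        For Each filePath As String In Directory.GetFiles(rootFolder, "*", SearchOption.AllDirectories)
            Dim hash As String = GetMd5(filePath)
            If Not groups.ContainsKey(hash) Then
                groups(hash) = New List(Of String)
            End If
            groups(hash).Add(filePath)
        Next
        Return groups
    End Function

    ' Usage: pass the folder to scan on the command line.
    Sub Main(ByVal args As String())
        If args.Length <> 1 Then Return
        For Each entry As KeyValuePair(Of String, List(Of String)) In GroupByHash(args(0))
            ' Any hash seen more than once marks a set of duplicate files.
            If entry.Value.Count > 1 Then
                Console.WriteLine(String.Join(", ", entry.Value.ToArray()))
            End If
        Next
    End Sub

End Module
```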

Why do I need to know how to detect duplicate files?

Well, I was updating the Lotus Notes Mail Exporter program the other day, and decided to implement some duplicate attachment file checking.  I’m not sure I’m that happy with how I’ve implemented it; time will tell.  It does give me an idea for some other programs …



Sew me a Thread or three.

Threading, simply put, is a way for your program to do multiple things at once.

Sort of like driving and talking on the cell/mobile phone at the same time.

Yes, you can do both, but it can cause bad things to happen.  Like driving through red lights, or having your computer freeze on you.

Coincidentally, I use threading in my Lotus Notes Mail Exporter (LNME) program.
[Screenshot: Lotus Notes Mail Exporter]

The part which does the actual mail message exporting runs in its own thread.

I needed to use threading as I wanted the program screen to update in real time.  Before I implemented a BackgroundWorker thread, the program wasn’t able to update the Status section (highlighted above).  Using threading, it was simple.

But it was difficult to implement, because most programming examples for threading don’t explain the WHY of using a particular thread technique.
So I made a guess and used a BackgroundWorker thread.

Which turned out ok.
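For anyone wondering what that pattern looks like, here is a minimal sketch of a BackgroundWorker reporting progress. It is not the LNME code: the module name, the Thread.Sleep stand-in for "export one message" and the console output are all assumptions, and it runs as a console program so that it is self-contained.

```vbnet
Imports System
Imports System.ComponentModel
Imports System.Threading

Module ExportWorkerSketch

    Private WithEvents worker As New BackgroundWorker()
    Private finished As New ManualResetEvent(False)

    ' Runs on a background thread: the long-running work goes here.
    Private Sub worker_DoWork(ByVal sender As Object, ByVal e As DoWorkEventArgs) Handles worker.DoWork
        Dim messageCount As Integer = CInt(e.Argument)
        For i As Integer = 1 To messageCount
            Thread.Sleep(100) ' stand-in for exporting one message
            worker.ReportProgress(i * 100 \ messageCount)
        Next
    End Sub

    ' In a Windows Forms program this handler runs on the UI thread,
    ' so it is the safe place to update a status label.
    ' In this console sketch it runs on a thread-pool thread instead,
    ' so the order of the output lines is not guaranteed.
    Private Sub worker_ProgressChanged(ByVal sender As Object, ByVal e As ProgressChangedEventArgs) Handles worker.ProgressChanged
        Console.WriteLine("Exported " & e.ProgressPercentage & "%")
    End Sub

    ' Raised once DoWork has returned.
    Private Sub worker_RunWorkerCompleted(ByVal sender As Object, ByVal e As RunWorkerCompletedEventArgs) Handles worker.RunWorkerCompleted
        finished.Set()
    End Sub

    Sub Main()
        worker.WorkerReportsProgress = True
        worker.RunWorkerAsync(5) ' pretend there are 5 messages to export
        finished.WaitOne()       ' a form would simply keep running its message loop
    End Sub

End Module
```

The point of the design is that the slow loop never touches the screen directly; it only calls ReportProgress, and the screen update happens in the ProgressChanged handler.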

Though, had I seen Joseph Albahari’s website, and the section on Threading in C#, I might have done it differently.

Why?

Well Joseph has written a guide to Threading in C# which forms THE reference to threading.  It’s packed full of information, and I wish I had it when I was starting to write LNME.

Go there, and have a read.


Lotus Notes Mail Exporter – writing into Outlook looks hard.

Which is why I purchased the Outlook 2007 Programming book (ISBN 0470049944).

To import messages into Outlook, what I need to do is set the Outlook Sender field to something other than “me”.

Why?  Well I’m trying to import messages from Lotus Notes, and I want to copy the Lotus Notes “From” field into the Outlook “From” field (aka Sender).

Unfortunately, the Outlook Object Model only has this Sender field as ReadOnly.

Searching around the web hasn’t helped, but looking at the Amazon “Look Inside” view of this book, it looks like there are a couple of other things I can try.

My fallback position is that I could write all the messages out into a Eudora format (mbox), and have the user import the messages into Outlook that way.


I don’t know how they got a TAB into the filename,

but they did.

I had a report from a user of LNME that LNME was producing an error message, then crashing.
[Screenshot: LNME error message]

The cause was that a message attachment had a TAB character in its filename, and Windows wouldn’t accept that, so it raised an error.
(How the TAB got into an attachment name is another question, for another day.)

Now, if you believe Windows, there is only a small number of characters it doesn’t like in file names:
A file name cannot contain any of the following characters: \ / : * ? " < > |
… which proves to be utter tosh.  And Microsoft provides you with the proof with the Path.GetInvalidFileNameChars method.

If you use the Path.GetInvalidFileNameChars method on Windows, it will tell you there are actually 41, including both TABs (horizontal and vertical).
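You can check this for yourself with a few lines. The sketch below only prints what Path.GetInvalidFileNameChars returns on the machine it runs on; the count depends on the operating system, so no particular output is promised here. The module name is made up.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

Module InvalidCharsDemo

    Sub Main()
        Dim invalidChars As Char() = Path.GetInvalidFileNameChars()

        ' The count depends on the operating system the code runs on.
        Console.WriteLine("Invalid file name characters: " & invalidChars.Length)

        ' ControlChars.Tab is the horizontal TAB, ControlChars.VerticalTab the vertical one.
        Console.WriteLine("Horizontal TAB invalid: " & (Array.IndexOf(invalidChars, ControlChars.Tab) >= 0))
        Console.WriteLine("Vertical TAB invalid: " & (Array.IndexOf(invalidChars, ControlChars.VerticalTab) >= 0))
    End Sub

End Module
```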

So now, before LNME writes out the message and attachment files, it checks their names for more invalid characters.

What this means is …
You shouldn’t notice MUCH difference in speed, as originally I was blindly checking every filename by doing something like this:

If FilenameField <> "" Then
    ' Trim leading and trailing spaces from FilenameField
    FilenameField = FilenameField.Trim

    FilenameField = Replace(FilenameField, "/", "[SLASH]")
    FilenameField = Replace(FilenameField, "\", "[BACKSLASH]")

End If

Now I check first to see if the FilenameField has invalid characters, and if so, then do the replacement of “invalid” characters.

' Get the list of invalid characters for this operating system
Dim cInvalidFileNameChars As Char() = System.IO.Path.GetInvalidFileNameChars()

' Check if any of the invalid characters are in our FilenameField
' (IndexOfAny returns -1 when none of them are present)
Dim iReturnValue As Integer = FilenameField.IndexOfAny(cInvalidFileNameChars)

' >= 0 means that we found some invalid characters.
If iReturnValue >= 0 Then
    If FilenameField <> "" Then
        ' Trim leading and trailing spaces from FilenameField
        FilenameField = FilenameField.Trim

        FilenameField = Replace(FilenameField, "/", "[SLASH]")
        FilenameField = Replace(FilenameField, "\", "[BACKSLASH]")

    End If
End If

The performance improvement appears because the majority of subjects and attachments are correctly named, so for them the replacement pass is skipped entirely.
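The same check-first idea can be generalised so that every character the operating system rejects gets replaced, not just the ones you thought of in advance. The sketch below is an assumption about how that could look, not the LNME code: it uses a single caller-supplied placeholder rather than named tokens like [SLASH], and the name CleanFileName is made up.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module FileNameCleaner

    ' Replaces every character the OS rejects with a placeholder (hypothetical helper).
    Function CleanFileName(ByVal fileName As String, ByVal placeholder As String) As String
        Dim invalidChars As Char() = Path.GetInvalidFileNameChars()
        Dim trimmed As String = fileName.Trim()

        ' Fast path: most names contain nothing invalid.
        If trimmed.IndexOfAny(invalidChars) < 0 Then
            Return trimmed
        End If

        ' Slow path: rebuild the name one character at a time.
        Dim cleaned As New StringBuilder(trimmed.Length)
        For Each c As Char In trimmed
            If Array.IndexOf(invalidChars, c) >= 0 Then
                cleaned.Append(placeholder)
            Else
                cleaned.Append(c)
            End If
        Next
        Return cleaned.ToString()
    End Function

    Sub Main()
        ' A name with a TAB in the middle, like the one in the user's report.
        Console.WriteLine(CleanFileName("report" & ControlChars.Tab & "final.txt", "_"))
    End Sub

End Module
```

On Windows, where TAB is in the invalid list, the example name should come out as report_final.txt; on an operating system that allows TABs in file names it is left alone.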

You can download the fixed LNME here.


Programming Lotus Notes with Visual Basic – much is wrong.

Much of what is out there is plain wrong, which is not surprising.  The stuff which does work is poorly documented.  That does not surprise either.

Here are a couple of pointers for novice Lotus Notes/Visual Basic programmers.

Lotus Notes and Domino 6 Programming Bible
Get a copy of the Lotus Notes and Domino 6 Programming Bible.  It is considered a good source for Lotus Notes programming.  And if you want to do LotusScript or JavaScript programming, it’s really helpful.  After having done some Notes related programming, I’m going to re-read it to pick up what I missed the first time around.

NotesPeek
I spent plenty of time peering at NSF files using NotesPeek.  NotesPeek lets you look into the structure of Notes’ database files.  A much simpler way of working things out than the “French Cafe” technique.
NotesPeek also allowed me to identify an issue with a popular Notes <-> Windows Mobile synchronisation product.
I’m very grateful that Ned Batchelder wrote it.  You can download it here.

Some websites/articles
Lotus Notes Ninjas
Notes411
OpenNTF – Detach files using COM
Language differences between LotusScript and Visual Basic
Common ground- COM access to Domino objects
Calling Notes CAPI from C#/Visual Studio