October 2006 - Posts - Jon Galloway

[.NET Gotcha] SettingsPropertyValue typed as a string will never be null

I like to use chained null coalesce operators (??) to handle default console application parameters. The idea is that I provide a default value, override it via application settings (app.config and user.config), and then override it with command-line arguments. I think it provides for a nice user experience - if you're always going to run the application with the same settings, you can set them in the config; if you need to script it you can control everything from the command-line.

Unfortunately, this doesn't appear to work for string-valued settings, since they will always return at least string.Empty. Here's the example code:

string logFilename = arguments["logFile"] ?? Settings.Default.logFile ?? "application.log";

The intent is to check for command-line arguments, then application settings, and if both are missing to use the default value. However, Settings.Default.logFile (a string backed by a System.Configuration.SettingsPropertyValue) will never be null, so the null coalesce operator (??) never falls through to the default and is worthless in this case:

If there is no serialized value to deserialize when the deserialization is attempted, then SettingsPropertyValue object will instead attempt to return a default value if one was configured as defined on the associated SettingsProperty instance. In this case, if the DefaultValue property was set to either a null reference (Nothing in Visual Basic), or to the string "[null]", then the SettingsPropertyValue object will initialize the PropertyValue property to either a null reference (Nothing in Visual Basic) for reference types, or to the default value for the associated value type. On the other hand, if DefaultValue property holds a valid object reference or string value (other than "[null]"), then the DefaultValue property is returned instead. If there is no serialized value to deserialize when the deserialization is attempted, and no default value was specified, then an empty string will be returned for string types.
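To make the failure concrete, here's a minimal sketch using stand-in values. CoalesceDemo and Pick are made-up names, and the two parameters are hypothetical substitutes for arguments["logFile"] and Settings.Default.logFile:

```csharp
using System;

public static class CoalesceDemo
{
    // Same chain as the example above, with the two sources passed in
    // as plain strings (hypothetical stand-ins, not the real settings API).
    public static string Pick(string fromArgs, string fromSettings)
    {
        // ?? only skips null. An empty string is a perfectly valid
        // non-null value, so it stops the chain right there.
        return fromArgs ?? fromSettings ?? "application.log";
    }
}
```

Calling Pick(null, string.Empty) returns the empty string rather than "application.log", which is exactly what happens when the setting was never configured.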

A better pattern would be a function which takes the key ("logFile") and uses string.IsNullOrEmpty() to determine if it should move on to the next setting source. If it was vital that a setting be nullable, it would be possible to use a StringCollection, but that still wouldn't support simple ?? chaining.
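Here's a rough sketch of what that function might look like. SettingResolver, SettingSource, and Resolve are names I'm inventing for illustration; nothing here is part of the .NET settings API, and it assumes each source hands back null or an empty string when the key isn't set:

```csharp
using System;

public static class SettingResolver
{
    // A source maps a key to a value. It may return null or "" when
    // the key is not set (hypothetical contract for this sketch).
    public delegate string SettingSource(string key);

    // Tries each source in order and returns the first value that is
    // neither null nor empty; falls back to defaultValue otherwise.
    public static string Resolve(string key, string defaultValue, params SettingSource[] sources)
    {
        foreach (SettingSource source in sources)
        {
            string value = source(key);
            if (!string.IsNullOrEmpty(value))
            {
                return value;
            }
        }
        return defaultValue;
    }
}
```

With that in place, the logFile lookup would pass "logFile", "application.log", and one delegate per source (command-line arguments first, then application settings), so an unset setting no longer blocks the default. The trade-off is that an intentionally empty value can't be distinguished from a missing one.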

Posted by Jon Galloway | with no comments

Free OCR software? You may already have it...

Update (2011): This post describes using a little-known feature of Microsoft Office which does a good job with OCR. It was written 5 years ago, but still gets a lot of traffic. Unfortunately, Office 2010 removes that feature (Microsoft Office Document Imaging).

There are a few workarounds for this:

Summary

OCR (Optical Character Recognition) can really come in handy. For example, I previously wrote about how I use Timesnapper as a black box to recover work which would otherwise be lost. Since most of my work is text based (C#, SQL, HTML, documentation, communications, etc.), the obvious next step is to grab the code from a screenshot. Of course I can retype it, but OCR would be better. 

There are some great commercial OCR packages out there. My company recently used OmniPage Pro in a project which loaded data from hundreds of PowerPoint slides into SQL Server for reporting and analysis [1]. OmniPage is great software, but it costs $149 for the basic version, which doesn't really make sense if you're just using it to avoid retyping a little text from a screenshot every now and then.

I looked around for free OCR software, and was a little bit surprised that there wasn't much out there. Here's a rundown of what I found, wrapping up with a program that wasn't technically free, but I already had it. There's a good chance you've got it, too. 

GOCR

I first tried out GOCR (a.k.a. JOCR). The easiest way to try it out is the GOCR Win Frontend, which installs GOCR as well. My opinion matched Pitor's:

To let things be clear - gocr is not ready, to say the least. Personally I'd even say the effect of trying to OCR a page is so crappy it is not even worth installing the gocr engine (seems like the total rewrite in 0.40 did not help much). And I am talking about an ascii black text on a white page, without other elements. Gocr seems to go all the way down here - error in 98% of recognized characters, randomly added spaces, etc. For example: content is C unrir in gocr, sounds like drunken elvish to me.

Tesseract OCR

Yeah, there's been some chatter in the blogospheres and internets about Tesseract since Google assisted in re-releasing it as an open source project. I have no doubts that the press alone (not to mention Google's involvement) will propel Tesseract towards OCR fame and fortune, but it sounds like it's not usable at this point:

It only is configured to build under MSVC++6 for Windows.
It only accepts uncompressed bitonal tiffs.
It's command-line only.
No GUI.
It performed abysmally on the provided testimage.tif
But it did build. :)

Microsoft Office Document Imaging

By accident, I stumbled across Microsoft Office Document Imaging. It's included in Microsoft Office Tools ("Microsoft Office \ Microsoft Office Tools" folder in the start menu, default installation location is "C:\Program Files\Common Files\Microsoft Shared\MODI\11.0\"). The interface looks like a "My First VB5 Application" reject, but it works great.

It handles scanned documents via TWAIN. The image import's a bit lame - it only handles TIF files. You can convert to TIF in just about any graphics application (e.g. MSPAINT - open the file, Save As TIF file). An easier method is to just copy the image to the clipboard and paste as a new page into MODI.

Here's a quick walkthrough of how I grabbed some text from a PDF [2].

Step 1. I selected the text I wanted to OCR with Cropper (output set to Clipboard)

Step 2. I opened Microsoft Office Document Imaging and loaded my image with Page / Paste Page

Step 3. I ran the OCR process by clicking on the "funky eye" toolbar button (or in the Tools menu)

Step 4. I clicked the Export to Word toolbar button

Step 5. I copied the text and pasted it where I wanted it

In this case, it was an e-mail. I've done the same thing to grab SQL or C# code which I then paste into the editor and compile (Ctrl-F5 for SQL, Ctrl-Shift-B for C#) to catch the things that didn't make it through the OCR cleanly.

I haven't tried it, but apparently you can automate MODI from .NET.


[1] Yes, it sounds insane, but it actually worked, and the business value of the data more than justified it.
[2] Yes, you can select and copy text in a PDF. This is just an example, but in this case the final result of the OCR'd text was a lot cleaner than the oddly mangled and mis-formatted text I got from the PDF select / copy approach.

Posted by Jon Galloway | 28 comment(s)