Indexing PDF in Orchard (and elsewhere in .NET)

Monday, January 5, 2015

Indexing custom content in Orchard is really easy: write a new handler derived from ContentHandler, then register an event handler for OnIndexing:

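public class PdfIndexingHandler : ContentHandler {
    private readonly IStorageProvider _storageProvider;

    public PdfIndexingHandler(IStorageProvider storageProvider) {
        // Keep the storage provider around: it is needed later to open the PDF file.
        _storageProvider = storageProvider;
        OnIndexing<DocumentPart>((context, part) => {
            // thePdfText is a placeholder for the text extracted from the PDF.
            context.DocumentIndex
                .Add("body", thePdfText).Analyze();
        });
    }
}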

Orchard will then hand the text over to Lucene, which will index it. Orchard already handles PDF documents stored in its media gallery, so we should be good to go if we can somehow extract the text from a PDF file. Unfortunately, that’s a rather big if, and the main difficulty.

There are a few libraries available in .NET to handle PDFs. They are usually built mainly to create new PDF files, but most can also read them. Text in PDF files is a scattered set of fragments of text in a complex tree structure (not a big concern for indexing), when it’s not a dirty set of scanned images that would need to be OCR’ed in order to be read. I’ll ignore the OCR case for this post. Libraries give you access to the document’s tree, but usually don’t hand you a text property directly, so we’ll have to build this ourselves.

Here’s a list of some of the libraries available, with the challenges they present:

IFilter is the venerable, antique, COM-based way of indexing documents on Windows. It requires an install on the server. I need something that is xcopy-deployable.
iTextSharp is the .NET version of a quite commonly used Java library, but it has an exotic GPL-like license designed to push you to buy a commercial license for an undisclosed amount of money.
SquarePdf.Net is another adaptation of a Java library, but it uses IKVM to emulate a Java virtual machine, where it runs the original Java library. This is clearly insane.
Aspose is pure .NET, but is quite expensive.
PdfSharp is under MIT, is pure .NET, but hasn’t been updated in a very long while. It also suffers from some nasty bugs.

As you can see, there’s no great solution. I picked PdfSharp because it’s real open source, and real .NET, despite the bugs and lack of updates. One bug in particular was an infinite loop that some documents generated from Word can cause. Fortunately, I was able to find a fix for that on the PdfSharp forums, and recompile the latest source code with it.

The following code (adapted from this forum post) walks the tree and adds the strings it finds to a StringBuilder:

private static void ExtractText(CObject cObject, StringBuilder builder) {
    if (cObject is COperator) {
        // Only the text-showing operators (Tj and TJ) carry strings worth indexing.
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name != OpCodeName.Tj.ToString()
            && cOperator.OpCode.Name != OpCodeName.TJ.ToString()) return;
        foreach (var cOperand in cOperator.Operands) {
            ExtractText(cOperand, builder);
        }
    }
    else if (cObject is CSequence) {
        // Sequences are walked recursively, element by element.
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence) {
            ExtractText(element, builder);
        }
    }
    else if (cObject is CString) {
        // Leaf node: append the string value as-is.
        var cString = cObject as CString;
        builder.Append(cString.Value);
    }
}
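
One caveat with this walk: fragments are appended back to back, so text from consecutive Tj/TJ operators may run together into a single token when the PDF positions words without emitting space characters. Here's a minimal, self-contained sketch of the same recursion with a space appended after each text-showing operator. It uses hypothetical stand-in types (Node, TextNode, OperatorNode, SequenceNode, TextWalker) rather than PdfSharp's classes, so it runs without the library. The separator is a heuristic of mine, not something PdfSharp or the forum post prescribes.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-ins for a PDF content tree; these are NOT PdfSharp types.
public abstract class Node { }

public sealed class TextNode : Node {
    public string Value { get; }
    public TextNode(string value) { Value = value; }
}

public sealed class OperatorNode : Node {
    public string Name { get; }
    public List<Node> Operands { get; } = new List<Node>();
    public OperatorNode(string name, params Node[] operands) {
        Name = name;
        Operands.AddRange(operands);
    }
}

public sealed class SequenceNode : Node {
    public List<Node> Items { get; } = new List<Node>();
    public SequenceNode(params Node[] items) { Items.AddRange(items); }
}

public static class TextWalker {
    // Recursively collects strings found under text-showing operators (Tj, TJ),
    // appending one space after each such operator so fragments stay separate.
    public static void Extract(Node node, StringBuilder builder) {
        switch (node) {
            case OperatorNode op:
                // Any other operator (fonts, positioning, graphics) is skipped.
                if (op.Name != "Tj" && op.Name != "TJ") return;
                foreach (var operand in op.Operands) Extract(operand, builder);
                builder.Append(' ');
                break;
            case SequenceNode seq:
                foreach (var item in seq.Items) Extract(item, builder);
                break;
            case TextNode text:
                builder.Append(text.Value);
                break;
        }
    }
}
```

With this sketch, a TJ operator holding "Hel" and "lo" followed by a Tj operator holding "world" produces "Hello world " (with a trailing space). The trade-off is that a word split across two operators would end up as two tokens, so it's worth trying on your own documents before adopting it.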

The rest of the work is just getting to the document’s stream from the part, then handing it over to PdfSharp and scanning each page:

OnIndexing<DocumentPart>((context, part) => {
    var mediaPart = part.As<MediaPart>();
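    // Compare the extension case-insensitively so that ".PDF" files are indexed too.
    if (mediaPart == null || !String.Equals(Path.GetExtension(mediaPart.FileName),
        ".pdf", StringComparison.OrdinalIgnoreCase)) return;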
    var document = _storageProvider.GetFile(
        Path.Combine(mediaPart.FolderPath, mediaPart.FileName));
    using (var documentStream = document.OpenRead()) {
        var pdfDocument = PdfReader.Open(documentStream, PdfDocumentOpenMode.ReadOnly);
        var text = new StringBuilder();
        foreach (var page in pdfDocument.Pages.OfType<PdfPage>()) {
            var pageContent = ContentReader.ReadContent(page);
            ExtractText(pageContent, text);
            text.AppendLine();
        }
        context.DocumentIndex
            .Add("body", text.ToString()).Analyze();
    }
});

Special thanks to Piotr Szmyd for sharing some of his research on this with me.

UPDATE: I found out, by trying more PDF files, that lots of recent files won’t get read by PdfSharp. I’ll post tomorrow with a new update, and link to it from here.

6 Comments

Did you look at how PDFBox or PDFClown compared to those?

ac - Tuesday, January 6, 2015 9:13:16 AM

Looks like Square PDF == PDFBox.NET
And in PDFClown ISSUES.TXT:
[limitation] Text extraction: column layouts and table layouts haven't been supported yet (text is always grouped by row).
[limitation] Text composition: only left-to-right writing systems are currently supported.

I'm also looking for a way to extract PDF text from columns/tables. I suspect it's non-trivial to get working in a general way. Some PDFs might have text as single characters even if OCR is not required. So some number crunching of the character locations is likely involved to determine if they are related.

ac - Tuesday, January 6, 2015 9:20:44 AM

Thanks for the pointer, I'll check them out.

bleroy - Tuesday, January 6, 2015 8:17:10 PM

One thing to add - be aware of the time it takes for your OnIndexing event to run. It should be quick enough to finish within Orchard's background task interval (1 minute for all tasks to run).

In other words - when you deal with large files - parse the files prior to indexing and retrieve parsed data inside OnIndexing.

Piotr Szmyd - Wednesday, January 7, 2015 3:56:59 AM

@ac Good point about some weird files you may run into.
It will happen at some point that you'll run into files that will get you junk after parsing. Sometimes strings are obfuscated as characters scattered across the whole file (to make copy-pasting problematic), etc. There are lots of border cases, which make a robust solution quite challenging. This is why, IMHO, those commercial libraries are pretty pricey.

In the end it all depends on what kinds of PDFs you are dealing with. PdfSharp is great alone, but if you want a robust solution, it's best to first test all of those mentioned. It may result in having to use more than one library, with some fallback mechanism.

Piotr Szmyd - Wednesday, January 7, 2015 4:16:37 AM

Last resort trick involving no code: use pdftodjvu converter then djvutoxml or djvutext from djvulibre - found when I stumbled across how archive.org does its multi format offer.

ac - Thursday, January 8, 2015 12:48:06 AM
