Toward a Generic XML Content Indexing System
While building our CMS, one of the challenges we faced was indexing XML content. The content input into the CMS can be in any arbitrary XML format, such as:
<kbArticle>
<id>1023</id>
<dateCreated>10/3/2004</dateCreated>
<createdBy>myUser</createdBy>
<title>this is a kb article</title>
<body>some text here...</body>
</kbArticle>
Getting the XML into the database is the easy part, just run an insert and it is done. Unfortunately, SQL does not currently contain any functionality to query xml documents, so providing search capabilities for XML content columns in SQL 2000 just ain't going to happen out of the box. Our solution, which works pretty darn well for XML content like this, was to allow users to markup their content schemas with special indexing attributes that tell the CMS whether and how to index the content. For example, you could index the body element's text by setting the following attribute:
<xs:element name="body" cms:indexed=“true“ >...</xs:element>
Or, if the body text was an XHTML document, you could do somthing like the following:
<xs:element name="body" cms:indexed=”true” cms:indexType=“innerXml“ cms:contentType=“text/html“>...</xs:element>
Our indexing engine will load the content and if there is a schema attached to your template, it will parse that schema and pull out any values that it needs to index and place values to be indexed in an entries in the ContentIndex table. From here, text content can be indexed using SQL server full text indexing (which just so happens to have built in filters for things like HTML content) and other types of content (such as decimal values) can also be stored.
Combined with some nice string parsing / tokenization code and some interesting SQL statements. This allows us to provide some really cool search functionality, so that if I want to find all the kb articles concerning “ASP.NET” from 1/1/2000 to 1/1/2004, I can query the content database like so:
channelManager.FindPostings(“kbArticle“, ”body ~= 'ASP.NET' && dateCreated < DATE(1/1/2000) && dateCreated > DATE(1/1/2004)”);
note: “~= “ is the “contains” operator.
It may not be XQuery, but it is still pretty darn powerful compared to a plain text content index over the output content from a given piece of XML.