XML Performance Checklist, and some issues on XPath evaluation
DonXML pointed out some issues with the Checklist: XML Performance article. I believe the checklist (and the corresponding "full-length" explanations) could have benefited from more space to cover the topic. I agree with most of Don's comments. The only one I'm not so sure about is this assertion:
By implementing #1 (Use XPathDocument to process XPath statements), it forces you to break #2 (Avoid the // operator by reducing the search scope), since XPathNavigator.Select() always evaluates from the root, not from the context of the current cursor location.
This observation is only partially true, because you can reduce the scope of a
search by explicitly addressing the full hierarchy of nodes instead of using
"//", which is a shortcut for the "descendant-or-self" axis. The real cost of
"//" is that the resulting node-set must be duplicate-free, so every matched
node has to be checked against the nodes already collected, and that check
carries an additional cost. For example, say you have an XHTML document and you
want to process all links that appear inside a paragraph. The XPath could be
something like //p//a. As you know, a <p> can be nested inside other
<p> elements, so a single <a> can (initially) satisfy "//a" through two
<p> elements that happen to be parent and child. At that point, the XPath
evaluator must skip the <a> elements that have already been matched, and
this is what makes the process much slower.
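To make the duplicate problem concrete, here is a small sketch in Python with ElementTree (Python standing in for the XPath engine; the idea is language-neutral). It naively matches //p//a on a document with a nested <p> and shows why a deduplication pass is required:

```python
import xml.etree.ElementTree as ET

# Tiny document with a <p> nested inside another <p> (hypothetical markup,
# mirroring the article's XHTML example).
doc = ET.fromstring(
    "<html><body>"
    "<p>outer <p>inner <a href='x'>link</a></p></p>"
    "</body></html>"
)

# Naive evaluation of //p//a: for every <p>, collect every descendant <a>.
# The nested <a> is reached through BOTH <p> ancestors, so it shows up twice.
raw_matches = [a for p in doc.iter("p") for a in p.iter("a")]
assert len(raw_matches) == 2          # one element, matched twice

# An XPath engine must return a duplicate-free node-set, so it pays an
# extra deduplication pass on top of the traversal itself.
unique = []
for a in raw_matches:
    if a not in unique:
        unique.append(a)
assert len(unique) == 1
```

The deeper the <p> nesting, the more redundant matches the engine must detect and discard.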
So, if you know for certain that all your <a> elements appear as direct
children of <p>, and that the <p> elements you want to process always
appear inside <body>, you can get an amazing speed boost by replacing the
query with "/html/body/p/a". And I really mean *amazing*. Try it yourself on
an XHTML version of a long spec, for example XML Schema Part 2.
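A rough way to feel the difference is this Python/ElementTree sketch on a synthetic document (the element names follow the XHTML example above; the sizes are arbitrary, and the absolute timings will vary by machine):

```python
import time
import xml.etree.ElementTree as ET

# Synthetic document: a deep chain of unrelated <div>s, plus the
# <p><a/> pairs we actually care about directly under <body>.
html = ET.Element("html")
body = ET.SubElement(html, "body")
node = body
for _ in range(2000):                 # deep chain the descendant scan must walk
    node = ET.SubElement(node, "div")
for _ in range(500):
    ET.SubElement(ET.SubElement(body, "p"), "a")

def bench(path, n=50):
    start = time.perf_counter()
    for _ in range(n):
        html.findall(path)
    return time.perf_counter() - start

slow = bench(".//a")                  # descendant scan: visits every node
fast = bench("body/p/a")              # explicit steps: visits only body's p children
assert html.findall(".//a") == html.findall("body/p/a")
print(f"descendant scan: {slow:.4f}s, explicit path: {fast:.4f}s")
```

Both queries return exactly the same nodes; the explicit path simply never looks at the parts of the tree that cannot contain a match.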
But going back to the main point Don raised (which helped me remember those ugly days when I stumbled over it myself), the core issue is a conscious design decision that renders the Evaluate overload that receives an XPathNodeIterator as its context absolutely useless. Let me explain (what follows is exactly the use case I described in the public newsgroup).
Let's say you have the Pubs database as XML, and you have selected (for whatever reason) all titles with "//publishers/titles". This yields an XPathNodeIterator over the matching title nodes.
At some point you need to work with all prices from that set of nodes. The navigator exposes an overload of the Evaluate method that receives an XPathNodeIterator as the context to execute the evaluation on. It seems natural, then, to think that evaluating the expression "price" with that iterator as the context would yield the results we expect.
The result I expect is a node-set (XPathNodeIterator) containing the price children of each title I passed as the second argument to Evaluate. Well, that isn't what happens: the "price" expression is evaluated from the document root instead. So what is this overload useful for?
The code Oleg used doesn't exercise the problem, because he iterates over each node (i.e., the nodes variable above) and evaluates on each of them individually, without using the context overload. That works, just as the regular Select method does.
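For reference, here is a Python/ElementTree analogue of that working per-node pattern; the document is a toy stand-in for the Pubs data, with made-up prices, and evaluating a relative path on each node plays the role of calling Select on each cursor position:

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the Pubs data (structure and values assumed for illustration).
pubs = ET.fromstring(
    "<pubs><publishers>"
    "<titles><price>19.99</price></titles>"
    "<titles><price>2.99</price></titles>"
    "</publishers></pubs>"
)

# Select the context nodes, as "//publishers/titles" would.
titles = pubs.findall(".//publishers/titles")

# The pattern that works: evaluate the relative expression against each
# node individually, instead of passing the whole iterator as a context.
prices = [price.text for t in titles for price in t.findall("price")]
assert prices == ["19.99", "2.99"]
```

Evaluating "price" once per context node is exactly what the broken overload looks like it should do for you in a single call.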