XML Performance Checklist, and some issues on XPath evaluation


DonXML pointed out some issues with the Checklist: XML Performance article. I believe the checklist (and the corresponding "full-length" explanations) could have benefited from more space to cover the topic. I agree with most of Don's comments. The only one I'm not so sure about is his assertion:

By implementing #1 (Use XPathDocument to process XPath statements), it forces you to break #2 (Avoid the // operator by reducing the search scope), since XPathNavigator.Select() always evaluates from the root, not from the context of the current cursor location. 

This observation is partially true. I say partially because you can reduce the scope of a search by explicitly addressing the full hierarchy of nodes, instead of using "//", which is a shortcut for the "descendant-or-self" axis. The real cost of "//" is that matched nodes must not be duplicated in the resulting node-set, and eliminating those duplicates incurs an additional calculation cost. For example, let's say you have an XHTML document and you want to process all links that exist inside a paragraph. The XPath could be something like: //p//a. As you know, a <p> can be nested in other <p> elements, so an <a> can (initially) satisfy the "//a" step for two <p> elements that happen to be parent and child. At that point, the XPath evaluator must skip the <a> elements that have already been matched. This is what makes the process much slower.
So, if you positively know that all your <a>s appear as direct children of <p>, and the <p>s you want to process always appear inside <body>, you can get an amazing speed boost by replacing the query with "/html/body/p/a". And I really mean *amazing*. Try it for yourself with an XHTML version of a long spec, for example XML Schema Part 2.
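If you want to measure the difference yourself, a sketch along these lines would do; the file name is just an assumption, and counting the results is only there to force full evaluation of the iterator:

```csharp
// Sketch: timing the abbreviated query against the explicit one.
// "xmlschema-2.html" is an assumed local copy of a large XHTML spec.
using System;
using System.Xml.XPath;

class XPathTiming
{
    static void Main()
    {
        XPathDocument document = new XPathDocument("xmlschema-2.html");
        XPathNavigator nav = document.CreateNavigator();

        Time(nav, "//p//a");          // descendant-or-self: duplicates must be eliminated
        Time(nav, "/html/body/p/a");  // explicit hierarchy: no duplicates possible
    }

    static void Time(XPathNavigator nav, string xpath)
    {
        DateTime start = DateTime.Now;
        XPathNodeIterator nodes = nav.Select(xpath);
        int count = 0;
        while (nodes.MoveNext())
            count++;
        Console.WriteLine("{0}: {1} nodes in {2} ms",
            xpath, count, (DateTime.Now - start).TotalMilliseconds);
    }
}
```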

But, going back to the main point raised by Don (which helped me remember those ugly days when I stumbled over it myself), the core issue is that there's a conscious design decision that makes the Evaluate overload receiving an XPathNodeIterator as its context absolutely useless. Let me explain (what follows is exactly the use case I described in the public newsgroup).

Let's say you have the Pubs database as XML. Now you have selected (for whatever reason) all titles with "//publishers/titles". This will be an XPathNodeIterator with the results:

XPathNavigator nav = document.CreateNavigator();
XPathNodeIterator nodes = nav.Select("//publishers/titles");

At some point, let's say you need to work with all prices from that set of nodes. The navigator exposes an overload for the Evaluate method that receives an XPathNodeIterator object as the context to execute the evaluation on. It seems natural, then, to think that the following code would yield the results we expect:

XPathExpression expr = nav.Compile("price");
object allprices = nav.Evaluate(expr, nodes);

The result I expect is a node-set (XPathNodeIterator) containing the price children of the titles I passed as the second argument to Evaluate. Well, that isn't what happens, because the "price" expression is evaluated from the document root. So, what's this overload useful for?

The code Oleg used doesn't exercise the problem, as he's iterating over each node (i.e. the nodes variable above) and evaluating on each of them, without ever using the other overload. That works, just as the regular Select method does.
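For reference, the per-node pattern that does behave as expected looks roughly like this (a sketch; "pubs.xml" is an assumed file name for the Pubs database as XML):

```csharp
// Sketch: evaluate the compiled expression relative to each node in turn,
// instead of passing the whole iterator as the context.
using System;
using System.Xml.XPath;

class PerNodeEvaluate
{
    static void Main()
    {
        XPathDocument document = new XPathDocument("pubs.xml"); // assumed file
        XPathNavigator nav = document.CreateNavigator();
        XPathNodeIterator nodes = nav.Select("//publishers/titles");
        XPathExpression expr = nav.Compile("price");

        while (nodes.MoveNext())
        {
            // "price" is evaluated with the current <titles> node as context.
            XPathNodeIterator prices =
                (XPathNodeIterator)nodes.Current.Evaluate(expr);
            while (prices.MoveNext())
                Console.WriteLine(prices.Current.Value);
        }
    }
}
```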

2 Comments

  • Finally I got it :)



    Well, that definitely sounds weird. MSDN doc for Evaluate() method says:

    "The expression is evaluated using the Current node of the XPathNodeIterator as the context node." If that's untrue, the documentation should be fixed.



    But after all I don't see any significant difference between

    object allprices = nav.Evaluate(expr, nodes);

    and

    object allprices = nodes.Current.Evaluate(expr, nodes);



    Somehow I always used the second one, thus missing the problem.

  • Well, actually the benefit of the overload would be to use the context and evaluate one node at a time, instead of just the current one (which, as you point out, is pretty useless, and that's why you didn't run into it at all).

    So, the replacement code would actually be:



    // Accumulate prices greater than 10.
    ArrayList allprices = new ArrayList();
    while (nodes.MoveNext())
    {
        object price = nodes.Current.Evaluate(expr, nodes);
        if (price is XPathNodeIterator && ((XPathNodeIterator)price).MoveNext())
        {
            // Only add if it's not null and we actually have a node.
            allprices.Add(((XPathNodeIterator)price).Current);
        }
    }



    And then there's the fact that you can't build an XPathNodeIterator with the results; you end up with an ArrayList or something like that. Compare it with the code I'd like to see work:



    XPathNodeIterator allprices = (XPathNodeIterator) nav.Evaluate(expr, nodes);



    That would retrieve each price bigger than 10 from each node in the "nodes" variable, generating an iterator over them, with a <price> element for each. That'd be way cool...
