High-performance XML (II): XPath execution tips

Thursday, October 9, 2003


While programming an XPath-only implementation of the Schematron specification (soon to be an ISO standard, and a very cool XML validation language, incredibly flexible and powerful), called (surprisingly) Schematron.NET and part of the NMatrix project, I found many interesting things about the internals of XPath execution.
I needed to dig deep into it because my implementation had to outperform the reference implementation, which is based on XSLT. It ended up being 50% faster on average than the reference version running on the fastest XSLT engine. Along the way, I found the following useful tips:

At first I was worried about the number of XPathNavigator.Clone() calls that happen during execution. Further research showed that the method only creates a new object and copies the references to the document, the node and the parentOfNs variable (I don't know where that one is used). So it's really fast and has no noticeable performance impact. So, clone the navigator at will!
The only way to get at the XML contents of a navigator (i.e. its node) is to check whether it implements IHasXmlNode, which is only true if the navigator was constructed from an XmlDocument. If it does, you can access the underlying XmlNode with the following code:
if (navigator is IHasXmlNode) node = ((IHasXmlNode) navigator).GetNode();
When we use an XPathNodeIterator, its Current object is always the same instance: a single object is created, and its internal values are changed to reflect the underlying current node. Therefore, if we want to track already-processed nodes, we can't rely on its hash code or reference. The only (standard) way to compare navigators is their IsSamePosition(XPathNavigator other) method. So, if you need such a mechanism (to process each node only once), your only option (in principle) is to iterate through a collection of previously saved navigators and compare them one by one with the current one. Note that you must save a clone of Current (the XPathNavigator itself), or the saved position will change as the iteration moves on.
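
The tracking approach described above could be sketched roughly as follows. This is only an illustration, not the Schematron.NET code: the class and method names (SeenNodesSketch, MarkIfNew, CountDistinct) are made up for the example, and it uses a generic List, which needs a newer framework version than the one this post was written against.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Hypothetical helper, not part of Schematron.NET.
public static class SeenNodesSketch
{
    // Returns true the first time a node is offered, false if an
    // equivalent position was already saved.
    public static bool MarkIfNew(List<XPathNavigator> seen, XPathNavigator current)
    {
        foreach (XPathNavigator saved in seen)
        {
            // Reference equality and hash codes are useless here,
            // so compare positions one by one.
            if (saved.IsSamePosition(current)) return false;
        }
        // Save a clone: the iterator keeps reusing the Current instance.
        seen.Add(current.Clone());
        return true;
    }

    // Counts the distinct nodes selected by several queries.
    public static int CountDistinct(string xml, params string[] queries)
    {
        XPathNavigator root = new XPathDocument(new StringReader(xml)).CreateNavigator();
        List<XPathNavigator> seen = new List<XPathNavigator>();
        foreach (string query in queries)
        {
            XPathNodeIterator it = root.Select(query);
            while (it.MoveNext())
            {
                MarkIfNew(seen, it.Current);
            }
        }
        return seen.Count;
    }
}
```

The lookup is linear in the number of saved navigators, which is exactly the cost the paragraph above complains about.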

XPathNavigator.Evaluate() moves the cursor! So always remember to clone before doing anything against a navigator, or clone once and later use MoveTo(XPathNavigator original) to return to the original position.
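
Both defensive options mentioned above might look like the following sketch. The class and method names (EvaluateSketch, EvaluateOnClone, EvaluateAndRestore) are invented for the example; only Clone, Evaluate and MoveTo come from XPathNavigator itself.

```csharp
using System.IO;
using System.Xml.XPath;

// Hypothetical helper illustrating the two options from the tip.
public static class EvaluateSketch
{
    // Option 1: evaluate against a throwaway clone, so the caller's
    // navigator is never touched.
    public static object EvaluateOnClone(XPathNavigator nav, string xpath)
    {
        return nav.Clone().Evaluate(xpath);
    }

    // Option 2: clone once as a bookmark, evaluate on the original,
    // then move the original back to where it started.
    public static object EvaluateAndRestore(XPathNavigator nav, string xpath)
    {
        XPathNavigator bookmark = nav.Clone();
        try
        {
            return nav.Evaluate(xpath);
        }
        finally
        {
            nav.MoveTo(bookmark);
        }
    }
}
```

Given how cheap cloning is, the first option is probably the simpler default; the second keeps a single bookmark around when many evaluations run against the same navigator.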

For all but the smallest documents (or positions with very few child nodes), XPathNavigator.SelectChildren and XPathNavigator.SelectDescendants are 35-45% slower than XPathNavigator.Select with an equivalent precompiled expression.
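
To make "equivalent precompiled expression" concrete, here is a small sketch that puts the helper call and the compiled form side by side. The names SelectSketch, CountWithHelper and CountWithCompiled are made up for illustration, and the sketch does not measure anything; the timings quoted above come from my own tests, not from this code.

```csharp
using System.IO;
using System.Xml.XPath;

// Hypothetical comparison helper; names are illustrative only.
public static class SelectSketch
{
    // Helper-method form: child elements with the given local name
    // and no namespace.
    public static int CountWithHelper(XPathNavigator nav, string name)
    {
        return nav.SelectChildren(name, string.Empty).Count;
    }

    // Precompiled form: the expression is compiled once by the caller
    // and reused for every context node.
    public static int CountWithCompiled(XPathNavigator nav, XPathExpression expr)
    {
        return nav.Select(expr).Count;
    }
}
```

A caller would compile once, for example with nav.Compile("item"), and pass the resulting XPathExpression to every Select call. The descendant equivalent of SelectDescendants("item", "", false) would be the expression "descendant::item".
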
Adding the string values (tokens, such as element and attribute names) that are expected in the instance document to the navigator's NameTable property, prior to executing the queries, offers a marginal performance gain of 4-8%.
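
Preloading the name table could be done with something like the sketch below. The helper name (NameTableSketch.Preload) and the idea of passing the expected names as a plain string array are assumptions made for the example; NameTable and XmlNameTable.Add are the framework members involved.

```csharp
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper; not part of Schematron.NET.
public static class NameTableSketch
{
    // Adds the element and attribute names we expect to query
    // to the navigator's name table before any query runs.
    public static void Preload(XPathNavigator nav, params string[] expectedNames)
    {
        XmlNameTable table = nav.NameTable;
        foreach (string name in expectedNames)
        {
            // Add atomizes the string; adding an existing name is harmless.
            table.Add(name);
        }
    }
}
```

In a validator the expected names would typically come from the schema being processed, so they are known before the instance document is queried.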

Check out the Roadmap to high performance XML.

Hi Daniel. I own XPathNavigator and schema validation technologies in the .NET Framework, so it's always a pleasure to read your posts about both the navigator and schema validation. On your specific points:

* Cloning is intended to be cheap because it is done a lot in our implementations of technologies like XSLT.

* You can get the XML for a particular node by walking it with the navigator (or a clone if you do not want to move the navigator) but you are right that there is no easy way to get the XML otherwise unless it implements IHasXmlNode. This will be fixed in Whidbey.

* The comparison issue is a problem and one we've attempted to fix in Whidbey but the bits didn't make it into the PDC build. Hopefully you'll be enrolled in beta 1 and can tell us if the choice we've made satisfies your needs.

* This sounds like a bug. I'll have one of our testers take a look at this.

* Yup, you aren't the only one who's thought that those helper methods don't really help that much.

* Was it worth it or did you expect more? Thanks for the excellent feedback.

PS: Have you ever considered the possibility of writing an article on Schematron.NET for MSDN?

Dare Obasanjo - Thursday, October 9, 2003 7:19:00 PM

Really good entries Daniel -- these are the types of blogs that are worth reading.

chadb - Thursday, October 9, 2003 9:26:00 PM
