An alternative API for performant streaming XML processing: XSE (Xml Streaming Events)
Note: this entry has moved.
As a followup to my previous posts (
part I
and
part II), I have been thinking if there's a way to completely
avoid the cost intrinsically associated with XPath, that is,
the need to load a complete XML document in-memory (be it an
XmlDocument or an XPathDocument,
it's the same problem basically). Of course there's a way, I
almost hear you say, use an XmlReader! There
are a lot of scenarios where it makes perfect sense to do
so. However, I've (unfortunately) found that most developers
get easily tired of writing the same code performing the
Read()
calls, endless switch/if statements checking the current
reader's Depth, maintaining flags to signal processed
elements (basically building a primitive state machine which
is required whenever you need more flexibility than simple
element name/namespace lookup), etc.
Typical code looks pretty ugly. For example, imagine I
wanted to count the number of items I have viewed recently
with
RssBandit
(by querying the file used by its cool Remote Storage
feature). The corresponding XPath expression would be:
count(/r:feeds/r:feed/r:stories-recently-viewed/r:story), where
xmlns:r="http://www.25hoursaday.com/2003/RSSBandit/feeds/". Pretty simple to use with an XPathNavigator or
XmlDocument, right? Well, the unbeatable-fast approach looks
acceptably bad:
In case you doubt my word that this IS the fastest approach,
here are the numbers, for a 77kb file, where I didn't
measure the time it takes to pre-compile the XPath
expression:
XmlTextReader: 28
XPathDocument: 562
Now imagine that you also need to build stats about
dc:creator (where
xmlns:dc="http://purl.org/dc/elements/1.1/"),
you have to process
xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
elements, etc. Well, most probably, you will give up to the
unbeatable-fast approach and load an
XPathDocument (I hope) and start executing
pre-compiled XPathExpression instances
against it. Well, it's my personal experience that not even
the example I showed above is tackled with the XmlReader
approach. Believe me, I've seen developers load an
*XmlDocument* just to issue an
//Error (?!) XPath query just to determine if
there was an error after some processing!! Unfortunately,
except for the most simple cases, using the
XmlReader is pretty difficult and leads to code
that is hard to maintain and extend.
What's more, there are a LOT of scenarios when there's not even a chance to know ahead of time what needs to be processed. For example, think about a webservice that receives a multi-namespaced XML file and processes information items from each namespace with different modules, specified through configuration. You want one example? RSS: there are a buch of namespaces that can go in there. The same goes for any other service that allows extensibility through namespaces, such as SOAP. Another example are XPath-based rules languages ( one such is Schematron ). But not only complex cases apply here, even loading deeply nested configuration files is extremely cumbersome without the help of XPath or XmlSerializer.
Alternatives in alternate worlds: SAX?
Java developers would say: "if you only had SAX...". Well, I'm no experienced Java developer, so if there's anybody out there reading this: first, you're in the wrong place, second, feel free to correct me in this section. In order to get a good feeling of what we are missing with SAX, I downloaded one cool (I thought it was, at least) piece of java code, SAX-creator David Megginson's RdfFilter, which is a concrete use case of SAX2-based processing and extensibility to handling XML. First let me say that I doubt most java developers take the approach of creating such a cool XMLFilterImpl (the SAX filters that expose events in a more usable way to the developer) for regular needs. However, it does show what SAX-based programming looks like. Guess what, it doesn't look very different than XmlReader:
I removed comments to clarify the code. Of course, it has a
more cool state machine than our example's 4 bool variables,
but you get the idea. Looking at this code, however,
I learned a cool thing about SAX: specialized filters can
expose more specific events than the
startElement, and that would make the client
code far easier to write. However, most Java developers
(again, correct me if I'm wrong, but I asked a couple ones'
opinion only) simply create a new empty filter and implement
the method they are interested in, more often with much less
elegancy than the code above. Just the quick and dirty way
to get the job done. That leads, of course, to equally
difficult to maintain/extend code as the
XmlReader approach does.
The idea of having a specialized strategy (filter in SAX) process and expose higher-level events is appealing. The developer of this strategy could have optimized the state machine it uses, has already debugged the code, etc. However, the strategy has to be manually created, and it only serves the single purpose it was created for. For example, there's no way we can use the RDFFilter to aid us in processing anything else than RDF (of course we can still get the lower-level SAX events, but how is that any useful at all?).
Enter the XSE: Xml Streaming Events
The solution I'm thiking of (and activily coding), is based on an improvement of SAX and the XmlReader. Main requirements are:
- Streaming processing: I don't want to pay the price of full XML loading.
- Support for dynamic expressions: I don't want to code a new filter/strategy/handler/whatever class for each new type of element to match.
-
Callbacks for matched/subscribed elements: I just want to
get called when something I'm interested in was found. For
example, it's completely useless for me to get called by
an
ElementStartedevent when the element is not in the namespace I expect (or the path, or has the attribute, or...). - Compiled code: the price to support dynamic expressions shouldn't be performance. I want the same performance I can get by manually coding the ugliest switch statement required to get the job done, FAST.
- Support for pre-compilation and reuse of expressions: I want to increase performance by caching and thread-safely reusing expresions I know ahead of time.
Given the first requirement, it's pretty obvious that the solution will be XmlReader-based. The following are a number of code samples showing usage of XSE in Whidbey (I've back-ported to v1.x too, don't worry):
The "strange" syntax following the delegate keyword is the new SUPER cool anonymous methods feature of C# v2. Equivalent v1.x code only needs to change that single line:
If you want to process all elements in a certain namespace (let's say, the Dublin Core), irrespective of its location, you would use something like this:
The compiled strategy can be cached statically and reused by
multiple threads simultaneously, as a call to
XseReader.AddHandler performs a
Clone
on it.
By now you may have realized that the real secret of all
this are the concrete factories. Each factory is able to
create strategies to match a specific expression language.
The two I showed are XPath-like or subsets of it. You can
even come up with your own expression format, given you
provide an appropriate factory for it.
You may have noted that the expression from the last example above is not valid according to XPath syntax. The equivalent in XPath would be:
dc:*.
If you wanted to process all elements from all namespaces, my
syntax is simply *.
And to process an element from any namespace, you use
*:item, instead of the XPath
The most important piece here is what does the strategy returned from the factory look like. Well, for the two ones above, what I basically did is analyze the expression and emit the ugliest-unbeatable-fast-if-else implementation you can think of. Almost exactly the one I showed at the beginning. Now for the numbers, which is the main reason I started this project:
Everett:
XmlTextReader: 28
XseReader: 38
XPathDocument: 562
Whidbey:
XmlTextReader: 18
XseReader: 23
XPathDocument: 478
XPathDocument2: 234
These are the tests for each approach, against the story
count test expression discussed at the beginning. Quite
impressive, right? Note that even when
XPathDocument2 in Whidbey performs a superb job
at improving the XPathDocument performance,
it's still an order of magnitude slower that the XSE
approach. Here we can also appreciate the good work of the
XML team in improving even the raw XML parser, by an
important percentage (28 vs 18, wow). That directly benefits
XSE too. Tests were as fair as I could make them: all XPath
expressions are precompiled, XmlNameTable pre-cached, etc. I
made the biggest effort to make XPath numbers do better, but
that's honestly its limit. When I post the tests code,
you'll believe me, hopefully ;).
The code is being polished now, and I'm trying to come up
with more factories that may be useful, specially adding
attribute and child element value evaluation, but that
requires a little bit more work to achieve with only a
reader at hand :o). I'm also trying to use direct
Reflection.Emit instead of CodeDom for compiling the
strategy, but I still have to measure the perf. improvement
that would yield in expression compilation. You can infer
that caching the compiled strategy (just like
XPathExpressions, but much more in this case) greately
improves performance. I haven't implemented an
internal strategy cache yet, which would also help there.
I'm doing some major rearrangements in my NMatrix opensource
project so that I can concentrate all my XML-related efforts
in a single NMatrix.Xml assembly, such as Schematron.NET,
XSE and others.
As you can see, flexibility and performance are two cornerstones of this approach. I wanted both, and I think it delivers. In the upcoming posts I will discuss how it's implemented, and I welcome everyone interested in improving it or making suggestions, to make comments to this post, so that we can finally say to Java developers: "If you only had XSE...".
Update: read these follow-up: