Xml Streaming Events: simple streaming Xml handling (and changing) at work.
I presented XSE (Xml Streaming Events) in a previous post. In this post I will show some examples of what can be accomplished with it in a streaming (and therefore performant) way.
IMPORTANT NOTICE: for anyone not playing with Whidbey, every statement of the form delegate { ... //.net code ... } can be replaced with the usual v1.x new EventHandler( yourMethod ), where the method implements the code inside the braces.
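To make the equivalence concrete, here is a minimal sketch that subscribes to the same event both ways. Notifier, its Matched event and HandlerDemo are hypothetical names made up for this illustration; they are not part of XSE.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event source, standing in for whatever raises the event.
public class Notifier
{
    public event EventHandler Matched;

    public void Raise()
    {
        EventHandler handler = Matched;
        if (handler != null) handler(this, EventArgs.Empty);
    }
}

public static class HandlerDemo
{
    public static readonly List<string> Log = new List<string>();

    // v1.x style: the handler body lives in a named method.
    static void OnMatched(object sender, EventArgs e)
    {
        Log.Add("named method");
    }

    public static void Run()
    {
        Notifier notifier = new Notifier();

        // Whidbey: anonymous method; the parameter list may be omitted.
        notifier.Matched += delegate { Log.Add("anonymous method"); };

        // v1.x: the same subscription through an explicit EventHandler.
        notifier.Matched += new EventHandler(OnMatched);

        notifier.Raise();
    }
}
```

Both subscriptions end up as plain EventHandler instances, so the code raising the event cannot tell them apart.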
Upgrading namespaces
Sometimes there's a need to perform some on-the-fly change in a document stream. For example, imagine you have upgraded your schemas to a different namespace (e.g. from xmlns:kzu="http://kzu.aspnet2.com/2003/schematron" to xmlns:kzu="http://kzu.aspnet2.com/2004/schematron"). I know there are several discussions all over the web (see Dare's post, David Orchard's, etc.) on schema versioning, and most (including me) agree that changing the namespace name is not versioning at all. It's a whole new schema. Aside from that, there are concrete cases where this has happened and will happen, as well as your own business requirements. Right now I can think of WXS and SOAP as two concrete examples. With XSE, it can be achieved easily at the reader level (that is, BEFORE you even load a SoapMessage, XPathDocument or whatever):
Note that the transformation feature is layered on top of the base XseReader so that I only have to pay the performance cost for what I use. If I don't need modifications to the InfoSet, I don't have to pay for the cost of checks for transformations. A document loaded with this reader will see an infoset complying with the new namespace. I can hand this reader to an XmlValidatingReader and have it validated against the new schema (remember there's a known bug in the v1.x validating reader that prevents this, but it has been fixed in v2). Note that because we're matching with a wildcard, this works at any level in the document. For example, the following document:
Is upgraded as follows:
Note that a root-element namespace change alone is not enough. So, to achieve similar functionality today, you would have to load the whole document into a string and run a find-and-replace over it. Again, full streaming support is a top priority for XSE.
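For a sense of what a hand-written streaming alternative involves, here is a rough sketch that uses only plain System.Xml (the v2 XmlReader.Create and XmlWriter.Create factories). It does not use the XSE API; NamespaceUpgrade and its methods are hypothetical names. It also copies into a writer instead of exposing a transformed reader, so unlike the XSE approach the result can't be handed directly to a validating reader or an XPathDocument.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of XSE.
public static class NamespaceUpgrade
{
    const string XmlnsNs = "http://www.w3.org/2000/xmlns/";

    static string Map(string ns, string oldNs, string newNs)
    {
        return ns == oldNs ? newNs : ns;
    }

    // Copies the stream node by node, replacing oldNs with newNs on elements,
    // attributes and namespace declarations. The XML declaration, DOCTYPE and
    // entity references are ignored to keep the sketch short.
    public static void Copy(XmlReader reader, XmlWriter writer, string oldNs, string newNs)
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    bool isEmpty = reader.IsEmptyElement; // read before moving to attributes
                    writer.WriteStartElement(reader.Prefix, reader.LocalName,
                        Map(reader.NamespaceURI, oldNs, newNs));
                    while (reader.MoveToNextAttribute())
                    {
                        // For xmlns declarations the namespace is carried in the value.
                        string value = reader.NamespaceURI == XmlnsNs
                            ? Map(reader.Value, oldNs, newNs)
                            : reader.Value;
                        writer.WriteAttributeString(reader.Prefix, reader.LocalName,
                            Map(reader.NamespaceURI, oldNs, newNs), value);
                    }
                    if (isEmpty) writer.WriteEndElement();
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(reader.Value);
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.Comment:
                    writer.WriteComment(reader.Value);
                    break;
                case XmlNodeType.ProcessingInstruction:
                    writer.WriteProcessingInstruction(reader.Name, reader.Value);
                    break;
            }
        }
    }

    // Convenience wrapper for trying the copy loop on an in-memory string.
    public static string Upgrade(string xml, string oldNs, string newNs)
    {
        StringWriter output = new StringWriter();
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            Copy(reader, writer, oldNs, newNs);
        }
        return output.ToString();
    }
}
```

Even this simplified loop has to deal with every node type by hand, which is the kind of plumbing a reader-level transformation is meant to hide.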
Simple element name transformations
Another common use case is simple name changes in a document. For example, an incoming document may have a <customer> element when you expect a <person>, or an <orderDate> when you need an <ordered> element. XSE removes the need for full document loading and XSLT stylesheet creation and processing that would be required for such a simple case:
Note that I changed the element name and the namespace at the same time.
Simple content adaptation
In the above example, I showed changing a <customer> element name and namespace to the expected <person> one. Combined with node skipping, I can adapt (sort-of downgrade in this case) the former element to the desired representation. For example, if the <customer> element includes a <contact> child that our <person> element doesn't expect, I can simply skip it:
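As a point of reference, the underlying skip operation in plain System.Xml looks roughly like the sketch below. It is not the XSE API; SkipDemo and ElementsWithoutContact are hypothetical names, and the method just collects the element names a consumer would see once every <contact> subtree is skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of XSE.
public static class SkipDemo
{
    // Returns the local names of the elements seen when every <contact>
    // subtree is skipped.
    public static List<string> ElementsWithoutContact(string xml)
    {
        List<string> seen = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            bool more = reader.Read();
            while (more)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "contact")
                {
                    // Skip() moves past the whole subtree and leaves the reader on
                    // the node that follows it, so Read() must not be called here.
                    reader.Skip();
                    more = !reader.EOF;
                    continue;
                }
                if (reader.NodeType == XmlNodeType.Element)
                {
                    seen.Add(reader.LocalName);
                }
                more = reader.Read();
            }
        }
        return seen;
    }
}
```

The easy mistake in a loop like this is calling Read() right after Skip(), which silently drops the sibling that follows the skipped element.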
Transparent elements and namespaces
James Clark has proposed what he calls Transparent namespaces in his Namespace Routing Language (NRL) proposal, which may make it into the ISO/IEC 19757 Document Schema Definition Languages (DSDL). He gives examples where it is useful to have an element ignored from the stream, as if it didn't exist at all, but without losing its content. This is different from the XmlReader.Skip() method in that the latter skips the element's children as well. His example is an XSLT stylesheet containing XHTML:
It's impossible to validate the XHTML against the corresponding schema, unless you modify it accordingly to include extension points all over the place. The proposed solution is to make the xsl:* elements transparent for the validation process, while retaining their children. James proposes this "transparentizing" at the namespace level. This can be easily achieved with XSE:
Note that I'm using a special wildcard supported by XSE. Wildcard options are:
- * : matches any element in any namespace. It is equivalent to *:*.
- *:item : matches an element with LocalName="item", irrespective of namespace.
- kzu:* : matches any element in the namespace mapped to the "kzu" prefix by the XmlNamespaceManager.
- :* : matches any element with NamespaceURI="". Note that this is not the same as *:* (first option).
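The four wildcard forms above can be sketched as a small matching function. This is only an illustration of the rules as listed, not the actual XSE matcher; WildcardMatch and Matches are hypothetical names, and treating a pattern without a colon (other than *) as a name in no namespace is my assumption.

```csharp
using System;
using System.Xml;

// Hypothetical helper illustrating the wildcard rules; not the XSE implementation.
public static class WildcardMatch
{
    public static bool Matches(string pattern, string namespaceUri, string localName,
        XmlNamespaceManager namespaces)
    {
        // "*" is shorthand for "*:*".
        if (pattern == "*") pattern = "*:*";

        int colon = pattern.IndexOf(':');
        // Assumption: a pattern without a colon names an element in no namespace.
        string prefix = colon < 0 ? "" : pattern.Substring(0, colon);
        string name = colon < 0 ? pattern : pattern.Substring(colon + 1);

        if (name != "*" && name != localName) return false;

        // "*:" accepts any namespace, including the empty one.
        if (prefix == "*") return true;

        // ":" (empty prefix) only accepts NamespaceURI="".
        if (prefix == "") return namespaceUri == "";

        // Otherwise the prefix is resolved through the XmlNamespaceManager.
        string mapped = namespaces.LookupNamespace(prefix);
        return mapped != null && mapped == namespaceUri;
    }
}
```

For example, with "kzu" mapped to some URI in the XmlNamespaceManager, Matches("kzu:*", thatUri, "rule", namespaces) succeeds for any local name, while Matches(":*", thatUri, "rule", namespaces) fails because the element is not in the empty namespace.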
Therefore, making elements transparent is supported at a more granular level than that proposed by James. If a document is loaded (or a ReadOuterXml() is performed on the reader), the following infoset is seen:
And of course, as the implementation supports streaming scenarios, you can pass it to the next processing hop without ever loading the entire stream. Another example is processing the body of a SOAP message:
Handing this reader to the processing phase results in only the contents of the soap:Body being seen.
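To illustrate how "transparent" differs from "skipped", here is a rough sketch in plain System.Xml that drops the start and end tags of every element in a given namespace while still copying their children. It is not the XSE API (TransparentElements and Strip are hypothetical names), and it writes to a string rather than exposing a reader. Whitespace, comments and processing instructions are dropped for brevity.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of XSE.
public static class TransparentElements
{
    const string XmlnsNs = "http://www.w3.org/2000/xmlns/";

    public static string Strip(string xml, string transparentNs)
    {
        StringWriter output = new StringWriter();
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        // Removing a transparent root can leave several top-level elements.
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Transparent: drop the tag itself, keep reading its children.
                        if (reader.NamespaceURI == transparentNs) break;
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName,
                            reader.NamespaceURI);
                        while (reader.MoveToNextAttribute())
                        {
                            // The writer re-declares the namespaces it needs.
                            if (reader.NamespaceURI == XmlnsNs) continue;
                            writer.WriteAttributeString(reader.Prefix, reader.LocalName,
                                reader.NamespaceURI, reader.Value);
                        }
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.NamespaceURI != transparentNs) writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        writer.WriteString(reader.Value);
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

Calling Strip on a stylesheet with "http://www.w3.org/1999/XSL/Transform" as the transparent namespace leaves only the literal result elements, which is the shape a validator for the XHTML schema would need to see.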
Skip irrelevant content
Finally, if we're processing XML with mixed namespaces, it may be the case that our application only cares about elements from our own namespace. In such cases, loading irrelevant nodes in a document is a clear waste of resources. We can choose to make those other nodes transparent or skip them altogether:
I still have to decide on what syntax would be the most convenient way to say "match everything that is NOT in this namespace". Options I can think of are:
- ^kzu:*
- !kzu:*
- Create another strategy factory that interprets the matches as negative asserts instead of positives, i.e.: IMatchStrategy nonblank = new NegativeRelativePath().Create(":*"); Instead of matching anything with a blank namespace, this would match anything with a non-blank NamespaceURI.
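Whichever syntax wins, the third option boils down to wrapping a positive match in a negation. A minimal sketch of that idea, using the generic Predicate<T> delegate from v2 instead of the XSE interfaces (MatchCombinators and Not are hypothetical names, not part of XSE):

```csharp
using System;

// Hypothetical helper, not part of XSE.
public static class MatchCombinators
{
    // Wraps a positive match so that it succeeds exactly when the original fails.
    public static Predicate<T> Not<T>(Predicate<T> match)
    {
        return delegate(T candidate) { return !match(candidate); };
    }

    // Positive match equivalent to ":*": the namespace URI is blank.
    public static bool IsBlankNamespace(string namespaceUri)
    {
        return namespaceUri == "";
    }
}
```

With this, MatchCombinators.Not<string>(MatchCombinators.IsBlankNamespace) accepts any non-blank NamespaceURI, which is the behavior described for the negative factory.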
As usual, I look forward to your feedback as I finish setting up the open source project for this.