Typed XmlReaders: bridging the gap between streaming and object model APIs.

Friday, March 12, 2004

XML

When dealing with XML in .NET, you're mostly faced with two options:

Streaming API: the XmlReader.
Object model API: either XmlDocument, XPathDocument or an XmlSerializer-aware custom object model.

Several reasons can lead you toward any of the latter ones, such as strong typing (XmlSerializer) or flexibility and XPath querying (XmlDocument and XPathDocument). Any of the three object model approaches, however, requires the entire XML input to be parsed and loaded into memory. Therefore, when you're presented with large documents, or need the fastest processing, all you're left with is the XmlReader. If you have used it for anything but the most trivial XML processing, you know how ugly it can become: lots of string comparisons and endless switch statements, ifs and loops.
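
To make the complaint concrete, here is a minimal sketch of pulling three values out of a purchase order with the raw streaming API. The class name RawReaderSample, the method Extract and the trimmed-down sample document are assumptions made for illustration; only the XmlTextReader and XmlConvert calls are real framework API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
static class RawReaderSample
{
    // Collects a few values from a purchase order using only the streaming API.
    public static List<string> Extract(string xml)
    {
        List<string> found = new List<string>();
        using (XmlReader r = new XmlTextReader(new StringReader(xml)))
        {
            while (r.Read())
            {
                if (r.NodeType != XmlNodeType.Element) continue;

                // Everything hinges on comparing element and attribute names as strings.
                switch (r.LocalName)
                {
                    case "purchaseOrder":
                        // Attribute values arrive as strings and are converted by hand.
                        // Assumes the attribute is present; GetAttribute returns null otherwise.
                        DateTime orderDate = XmlConvert.ToDateTime(r.GetAttribute("orderDate"), "yyyy-MM-dd");
                        found.Add("orderDate=" + orderDate.ToString("yyyy-MM-dd"));
                        break;
                    case "shipTo":
                        if (r.GetAttribute("country") == "US") found.Add("country=US");
                        break;
                    case "name":
                        // ReadString leaves the reader on the </name> end tag.
                        found.Add("name=" + r.ReadString());
                        break;
                }
            }
        }
        return found;
    }
}
```

Even for this tiny document, nothing stops a typo in "orderDate" or "shipTo" from compiling, and the schema's knowledge of types is duplicated by hand in the conversion calls.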

From my point of view, working against a custom object model is best, as it gives you a level of abstraction from the wire format, and you get to work with OO classes and properties, which is far more comfortable than dealing with InnerXml, Value, etc. If you haven't tried the XmlSerializer approach before, you definitely should.

When you move to streaming processing, you lose all that. And you don't lose it because the abstractions of your entities have disappeared, as you most probably have an XML Schema defining what the XML must look like. You just lose it because of the API. You can still use the XML Schema to validate as you read, and get some (very little) extra functionality from the XmlValidatingReader.ReadTypedValue() method. If you're like me, you may be asking: given that I know the schema at design time, isn't there a way to use it to make things easier for me?

And that's not the only issue. Validating against an XML Schema is a really good idea, as it keeps your application data consistent and considerably reduces your own validation code, but it does not come for free. According to tests I've done with the (fairly simple) purchase order schema and instance document in XML Schema Part 0: Primer, XmlValidatingReader is between 10 and 12 times slower than XmlTextReader. Not that this is a bad number, just something you need to keep in mind. And why is it so costly? Mostly because it's a generic XML Schema validator, which means that as it parses, it checks for valid transitions between states, data types, facets, etc. And again, given that I know the schema at design time, isn't there a way to use it to make things easier for the parser?

Typed readers

Just as typed datasets build upon the generic DataSet to bring strong-typing and validation to the game, based on an XML Schema, wouldn't it be great if the same existed for readers?
A typed reader should be built upon the XmlReader and provide the same validation capabilities as XmlValidatingReader, but at a fraction of the cost, because it would already know all the elements, attributes and types, and it would only have to read and validate one specific schema.

Given a purchase order document, I could write code as follows:

    poReader r = new poReader( inputStream );
    if (r.Read())
    {
        // Typed date for the orderDate attribute.
        Console.WriteLine( r.orderDate.ToShortDateString() );

        shipToReader shipto = r.ReadshipTo();
        // Country attribute turned into an Enum
        if (shipto.country == shipToCountry.US)
            Console.WriteLine( "US!!" );

        // An inner simple-typed element is made a property
        // In OO, there's no distinction between this and an attribute.
        Console.WriteLine( shipto.name );
    }

Maybe it should be something more like this:

    poReader r = new poReader( inputStream );
    while (r.Read())
    {
        if (r.TypedReader is shipToReader)
        {
            shipToReader shipto = (shipToReader) r.TypedReader;
            // Work against the typed one now.
        }
        else if (r.TypedReader is itemsReader)
        {
            // Do so for items.
        }
    }

I sort of prefer the latter. The TypedReader property would contain the instance used to read (and validate) the current element content model, which would be the current strategy being applied. With the advent of generics, maybe I should even be allowed to pass the typed reader I want...

r.Read<shipToReader>();

I guess in Whidbey that would be the way to implement it internally, anyway...
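
To give a feel for what a generated typed reader might boil down to, here is a hand-written sketch of the first API shape as a thin wrapper over a regular XmlReader. The type and member names (poReader, shipToReader, shipToCountry, ReadshipTo) follow the snippets in this post, but every line of the implementation is an assumption, including taking a TextReader instead of a stream to keep the sample self-contained. It does no schema validation at all.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical enum a generator might derive from the schema's country attribute.
public enum shipToCountry { US }

// Hypothetical typed view over a <shipTo> element.
public sealed class shipToReader
{
    public shipToCountry country { get; private set; }
    public string name { get; private set; }

    internal shipToReader(XmlReader r)
    {
        // The reader is positioned on the <shipTo> start tag here.
        country = (shipToCountry) Enum.Parse(typeof(shipToCountry), r.GetAttribute("country"));
        if (r.IsEmptyElement) return;

        // Forward-only: child elements are consumed once, in document order.
        while (r.Read() && !(r.NodeType == XmlNodeType.EndElement && r.LocalName == "shipTo"))
        {
            if (r.NodeType == XmlNodeType.Element && r.LocalName == "name")
                name = r.ReadString();
        }
    }
}

// Hypothetical typed reader for the root <purchaseOrder> element.
public sealed class poReader
{
    private readonly XmlReader inner;

    public DateTime orderDate { get; private set; }

    public poReader(TextReader input)
    {
        inner = new XmlTextReader(input);
    }

    // Advances to the <purchaseOrder> element and captures its typed attributes.
    public bool Read()
    {
        while (inner.Read())
        {
            if (inner.NodeType == XmlNodeType.Element && inner.LocalName == "purchaseOrder")
            {
                orderDate = XmlConvert.ToDateTime(inner.GetAttribute("orderDate"), "yyyy-MM-dd");
                return true;
            }
        }
        return false;
    }

    // Advances to the next <shipTo> element; returns null if there is none.
    public shipToReader ReadshipTo()
    {
        while (inner.Read())
        {
            if (inner.NodeType == XmlNodeType.Element && inner.LocalName == "shipTo")
                return new shipToReader(inner);
        }
        return null;
    }
}
```

In this sketch the typed values are captured eagerly as the wrapper moves forward, so properties stay readable after the underlying reader has advanced. A real generator would have to decide how much to buffer, since buffering everything would give up the memory advantage of streaming.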

Another possible use is dynamic run-time generation of these typed readers for a schema. If we can prove that performance will increase, we could use the typed readers not to gain usability but to gain speed. This could be a specialized factory that emits the code (the same you would get at design time) to execute:

    XmlSchema sch = XmlSchema.Read( theFile, null );
    XmlReader r = XmlTypedFactory.CreateReader( sch );

The factory itself would keep cached versions of the Types it has already generated from a certain schema...
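
The caching part could be as simple as the sketch below. TypedReaderTypeCache and GetOrGenerate are hypothetical names, the choice of cache key (for example the schema's target namespace) is an assumption, and the actual code emission is left to a delegate supplied by the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cache a typed reader factory might use; not an existing API.
public static class TypedReaderTypeCache
{
    private static readonly Dictionary<string, Type> cache = new Dictionary<string, Type>();
    private static readonly object sync = new object();

    // Returns the reader Type generated for schemaKey, invoking generate
    // only the first time a given key is seen.
    public static Type GetOrGenerate(string schemaKey, Func<string, Type> generate)
    {
        lock (sync)
        {
            Type readerType;
            if (!cache.TryGetValue(schemaKey, out readerType))
            {
                // Expensive step: emit and compile the typed reader for this schema.
                readerType = generate(schemaKey);
                cache[schemaKey] = readerType;
            }
            return readerType;
        }
    }
}
```

Holding a single lock during generation keeps the sketch simple, at the cost of serializing generation of unrelated schemas.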

So, what do you think about such an idea? Is it useful? Would you use it? What should the API look like?

This may become part of the new Mvp.Xml project that most XML MVPs (including me, of course) are heading up.

I can't wait to see this idea come true. Dealing with XmlReaders has always been a painful experience (mostly with large documents), as you point out at the beginning of your post.

I just wonder how you might attack some problems, such as the “forward-only” nature of some readers, when you “bind” the reader values to its typed properties.

On the other hand, it might be interesting to have some wizard-like helper to create the typed reader class from the schema (I guess this might be one of the strategies being evaluated).

I think it is a good idea, and I will keep an eye on any proposal emerging from this Mvp.Xml project that you and your partners are cooking.

Hernan - Friday, March 12, 2004 3:26:00 PM

Perhaps somebody with the time to experiment could put together some sample code for one of these and we could build a Code Smith template to generate them.

Kenneth LeFebvre - Friday, March 12, 2004 4:53:00 PM

Definitely a good idea. But again, what about the forward-only nature of XmlReader? DataSet's in-memory store...

oleg@tkachenko.com (Oleg Tkachenko) - Sunday, March 14, 2004 9:57:00 AM

Well, Oleg, there's nothing wrong in being forward-only. The comparison with DataSet was just to observe the benefits of strong typing. The typed readers would be nothing more than wrappers of the regular XmlReader, but providing strong typing for easier programmability.

Daniel Cazzulino - Monday, March 15, 2004 11:35:00 AM
