March 2004 - Posts

Do I need SAX for .NET? (or do straight Java ports to C# make sense?)

Note: this entry has moved.

After reading Oleg's post about an upcoming SAX.NET implementation (and while I still look forward to the one the other fellow XML developer is working on), I got quite excited and ran to download it and have a look. I was disappointed, I must say.

When I see a project implementing a class called XmlNamespaces containing methods such as AddMapping, GetPrefixMapping, PushScope and PopScope (among others), which do exactly the same as System.Xml.XmlNamespaceManager with its AddNamespace, LookupNamespace, LookupPrefix, PushScope and PopScope, I start wondering whether straight ports of another platform's libraries really make sense in .NET. The mismatch doesn't end there:

  • There's IAttributes AND IAttributes2, and the corresponding implementations called AttributesImpl and AttributesImpl2 (?!?!). Multiply that by ILocator, IEntityResolver and so on. This is the first port and there are already interface versioning problems?
  • There's an IXMLReader (note the casing) with an EntityResolver property which doesn't try to take advantage of the .NET XmlResolver class, instead reinventing it through the IEntityResolver interface.
  • All the GetFeature/SetFeature/IProperty baggage, which only makes sense when multiple XML parsers are available with varying feature support (which, judging from the deathly silence that met my request for support for such scenarios, isn't going to happen at all in .NET).
  • Non-standard delegates such as OnPropertyChange(IProperty property, object newValue). In the .NET world it would have been OnPropertyChange(object sender, PropertyChangeEventArgs e).
  • Trivial things such as:

        public static string GetString(RsId id)
        {
            string name = Enum.GetName(typeof(RsId), id);
            return rm.GetString(name);
            // Should have been:
            // return rm.GetString(id.ToString());
        }
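To make the XmlNamespaceManager point concrete, here's a minimal sketch of the scoped prefix handling that ships in System.Xml. The class name NamespaceScopeDemo and the urn:example URIs are made up for illustration; the XmlNamespaceManager calls are the built-in ones named above.

```csharp
using System;
using System.Xml;

public static class NamespaceScopeDemo
{
    // Returns the URI bound to "po" in the outer scope, in an inner scope
    // that rebinds the prefix, and again after popping the inner scope.
    public static string[] Run()
    {
        XmlNamespaceManager ns = new XmlNamespaceManager(new NameTable());

        ns.AddNamespace("po", "urn:example:outer");   // illustrative URI
        string outer = ns.LookupNamespace("po");

        ns.PushScope();
        ns.AddNamespace("po", "urn:example:inner");   // shadows the outer binding
        string inner = ns.LookupNamespace("po");

        ns.PopScope();                                // inner binding is discarded
        string restored = ns.LookupNamespace("po");

        return new string[] { outer, inner, restored };
    }

    public static void Main()
    {
        foreach (string uri in Run())
        {
            Console.WriteLine(uri);
        }
    }
}
```

Nothing here needs a ported XmlNamespaces class: prefix mapping and scope push/pop are already covered by the framework type.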

I think copying Java projects over to .NET is not always a good idea, especially if done by people who don't work with C# and .NET on a daily basis. NUnit and Log4Net are examples of well-done ports. Note, however, that it wasn't until v2 that NUnit started using .NET-isms such as custom attributes.

So, do I want SAX.NET? Definitely NOT. I like some of its ideas. We, as .NET developers, should take the best ideas from it, mix them with .NET-friendly APIs, take advantage of built-in infrastructure, and improve on it. So, I still like the Xml Streaming Events (XSE) idea much more than any of these ports. I have to work further on it, develop more use cases, clarify the API and give a second thought to some concepts, but it definitely integrates far better with current and future .NET XML support. What I definitely don't want is to code against a pseudo-.NET/pseudo-Java API.

Visual Studio 2005 Community Technology Preview is here

Just in case you didn't hear it before. Note: you will have to drill down to the Developer Tools tree node to find it. It isn't on the home page under New Downloads yet...

Who failed to validate XML?

Today, you validate XML in .NET v1.x by creating an XmlValidatingReader, setting the schema, and reading:

    // Configure the validating reader.
    XmlValidatingReader vr = new XmlValidatingReader(theinput);
    // Add the schema to the reader (usually the schema is preloaded only once).
    vr.Schemas.Add(theschema);

    while (vr.Read())
    {
        // Do your stuff.
    }

You have two options for handling invalid content in the input document (with regard to the schema or schemas):

  1. Catch the exception thrown at the first error, halting processing:

         try
         {
             while (vr.Read())
             {
                 // Do your stuff.
             }
         }
         catch (XmlException ex)
         {
             // Report the *parse* exception/rethrow.
         }
         catch (XmlSchemaException ex)
         {
             // Report the *validation* exception.
         }

  2. Attach to the ValidationEventHandler event (according to .NET naming conventions this would have been named ValidationError or something like that):

         vr.ValidationEventHandler += new ValidationEventHandler(OnValidationError);

         while (vr.Read())
         {
             // Do your stuff.
         }

         if (_haserrors)
         {
             // Report the errors/throw.
         }

     Here you get a chance of sort of recovering from errors, as you can keep reading and working with the data. Your OnValidationError event handler sets the _haserrors flag and accumulates the error messages.

So far so good. All this is clearly explained in the MSDN documentation. The validation handler signature looks just like what you would expect:

void ValidationCallback(object sender, ValidationEventArgs e) { }

In case 2, what happens to the invalid XML item in the input? Well, it's read anyway, as well as its content. Now, suppose that the element just found doesn't even exist in your schema, and most probably neither does its inner content. Your validation error messages will be filled with errors about each and every single item inside the erroneous element. What's more, I may want my application to work in a "forgiveness" mode and so do something useful with what IS valid so far.

Easy enough, I thought. I have a sender in my validation callback. I bet it's the reader. I just have to cast it back, call the Skip method, accumulate just one error for the current validation failure, and move on:

    private void OnValidationError(object sender, ValidationEventArgs e)
    {
        if (e.Severity == XmlSeverityType.Error)
        {
            // Accumulate error, set flag.
            ((XmlReader)sender).Skip();
        }
    }

Unfortunately, the sender is null in v1.x, so no luck.
The good news is that this has been fixed in the PDC bits. Maybe we can hope for a service pack/hotfix for v1.1...
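
For reference, option 2 can be put together end to end roughly like this. It's only a sketch under a few assumptions: the ValidationErrorCollector class, the inline schema and the sample document are all made up for illustration, and it simply accumulates every error reported, since in v1.x there's no sender to call Skip on.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper: collects validation errors instead of throwing.
public class ValidationErrorCollector
{
    private readonly ArrayList errors = new ArrayList();

    public bool HasErrors { get { return errors.Count > 0; } }
    public ArrayList Errors { get { return errors; } }

    public void OnValidationError(object sender, ValidationEventArgs e)
    {
        // The sender is deliberately not used: it is null in v1.x.
        if (e.Severity == XmlSeverityType.Error)
        {
            errors.Add(e.Message);
        }
    }

    // Reads the whole document, reporting schema errors to a new collector.
    public static ValidationErrorCollector Validate(string xsd, string xml)
    {
        ValidationErrorCollector collector = new ValidationErrorCollector();
        XmlSchema schema = XmlSchema.Read(new StringReader(xsd), null);

        XmlValidatingReader vr =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        vr.ValidationType = ValidationType.Schema;
        vr.Schemas.Add(schema);
        vr.ValidationEventHandler +=
            new ValidationEventHandler(collector.OnValidationError);

        while (vr.Read())
        {
            // Keep reading: errors arrive through the handler, not as exceptions.
        }
        return collector;
    }

    public static void Main()
    {
        // Illustrative schema: an <order> containing a single <item>.
        string xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
            "<xs:element name='order'><xs:complexType><xs:sequence>" +
            "<xs:element name='item' type='xs:string'/>" +
            "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        // Illustrative document with an element the schema doesn't know about.
        string xml = "<order><bogus><child/></bogus></order>";

        ValidationErrorCollector result = Validate(xsd, xml);
        foreach (string message in result.Errors)
        {
            Console.WriteLine(message);
        }
    }
}
```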

</Lagash><Clarius>

For the past year I had the pleasure of working at Lagash Systems SA, a high-end consulting firm in Argentina, run by really cool guys who created a company that is by far the best place you can work in Argentina right now. You won't find Morts there, only Einsteins. It was really an excellent experience, working with clever people, doing interesting and advanced stuff, and sharing knowledge as I had never seen in other companies. I can honestly say that my initial expectations were easily surpassed. The company paid for a diving course (including the initiation trip to a "lake"!) for all of us, where we spent a couple of great days, and they even gave me a beautiful cradle as a gift when my little baby Agustina was born, which will always make me remember them. All I can say is a big "thank you". I have nothing but gratitude for them.

However, it's a fact of life that you always want more. And it was time for me to start my own company. I had been working on my own before (a whole year devoted to .NET research and writing for Wrox, which eventually led me to be a speaker at .NET ONE 2002 in Frankfurt), but couldn't find a partner to share the effort, and then Lagash came. This time, I found such a partner, the brilliant and excellent guy Victor (a.k.a. vga). We share a common view about technology, and the enthusiasm to continuously learn new advanced stuff and play with the latest .NET bits we can get.

So it's now time for Clarius Consulting SA (clariusconsulting.com and clariusconsulting.net in the registration process now), where we expect to further develop our public visibility and share with the community the stuff we learn (mainly with Whidbey now) through our new site aspnet2 (under construction still) and our books (two of them coming out soon from Apress). We officially started the company (that is, we signed the appropriate papers with our lawyer) on March 15, 2004. An important day in our lives, and the beginning of interesting times, I'm sure...

Whidbey Provider Design Pattern pitfalls

A couple weeks ago Rob Howard (from the ASP.NET team) announced the "disclosure" of the Provider Design Pattern they are using in Whidbey ASP.NET (v2).

I've got a couple complaints with this implementation:

  • Naming: Brad Abrams (the guy behind the design guidelines) explicitly discourages the practice of adding a "Base" prefix/suffix to abstract classes intended to be the root of a class hierarchy. Yet, the pattern explained by Rob defines ProviderBase, MembershipProviderBase and so on. Not good. If these guidelines aren't followed by MS, you can't expect independent developers to do so, right?
  • Collection typing: as defined, each feature (i.e. Membership, RoleManager, etc.) exposes a Providers property of type ProviderCollection, defined as follows:

        public class ProviderCollection : IEnumerable
        {
            // Methods
            public void Add(ProviderBase provider);
            public void Clear();
            public void Remove(string name);
            public void SetReadOnly(string name);

            // Properties
            public ProviderBase this[string name] { get; set; }
            public int Count { get; set; }
        }

    Therefore, I need to cast whenever I need to access a particular provider. Now that Whidbey has generics, it seems appropriate to define the collection as:

        public class ProviderCollection<T> : IEnumerable where T : ProviderBase
        {
            // Methods
            public void Add(T provider);
            public void Clear();
            public void Remove(string name);
            public void SetReadOnly(string name);

            // Properties
            public T this[string name] { get; set; }
            public int Count { get; set; }
        }

    Membership, therefore, would only need to define its Providers property as follows:

        public ProviderCollection<MembershipProviderBase> Providers { get; set; }
  • Configuration: as defined, the ProviderBase class contains an Initialize method with the following signature:

        public abstract void Initialize(string name, NameValueCollection config);

    Well, can anybody tell me why the config is not an XmlNode or an XPathNavigator at least? Given the configuration is already in XML form, why resort to converting it to a NameValueCollection?!
    Complex providers may need far more configuration that can't be expressed with attributes in the element. This is a serious issue that may limit the usefulness of the pattern.
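
To show the collection typing point in runnable form, here's a minimal sketch of the generic version. It is not the Whidbey implementation: the ProviderBase, MembershipProviderBase and SqlMembershipProvider classes below are simplified stand-ins I made up, and SetReadOnly is left out to keep it short.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Simplified stand-ins for the real base classes (assumptions, not the ASP.NET types).
public abstract class ProviderBase
{
    private readonly string name;
    protected ProviderBase(string name) { this.name = name; }
    public string Name { get { return name; } }
}

public abstract class MembershipProviderBase : ProviderBase
{
    protected MembershipProviderBase(string name) : base(name) { }
    public abstract bool ValidateUser(string user, string password);
}

public class SqlMembershipProvider : MembershipProviderBase
{
    public SqlMembershipProvider(string name) : base(name) { }
    public override bool ValidateUser(string user, string password)
    {
        return false; // Placeholder: a real provider would hit the database.
    }
}

// The generic collection: the indexer returns T, so callers never cast.
public class ProviderCollection<T> : IEnumerable<T> where T : ProviderBase
{
    private readonly Dictionary<string, T> providers = new Dictionary<string, T>();

    public void Add(T provider) { providers.Add(provider.Name, provider); }
    public void Clear() { providers.Clear(); }
    public void Remove(string name) { providers.Remove(name); }

    public T this[string name] { get { return providers[name]; } }
    public int Count { get { return providers.Count; } }

    public IEnumerator<T> GetEnumerator() { return providers.Values.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

public static class ProviderCollectionDemo
{
    public static void Main()
    {
        ProviderCollection<MembershipProviderBase> list =
            new ProviderCollection<MembershipProviderBase>();
        list.Add(new SqlMembershipProvider("sql"));

        // No cast needed to reach membership-specific members.
        bool ok = list["sql"].ValidateUser("kzu", "secret");
        Console.WriteLine(ok);
    }
}
```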

While the first two are a matter of taste in the end, the last one should be fixed promptly. I didn't hear any voice complaining, however. Am I the only one envisioning complex providers with the need to configure themselves with hierarchical XML information? It's all too common everywhere!
You want a concrete example? Here it goes:

What if I develop a provider that implements automatic DB schema installation and migration? My super provider could allow the full DB schema to be specified in the configuration itself:

    <configuration>
      <system.web>
        <roleManager defaultProvider="MySuperSqlProvider" ...>
          <providers>
            <add name="MySuperSqlProvider"
                 type="Kzu.MySuperSqlProvider, Kzu"
                 description="Auto-deploy provider">
              <schema name="MySuperProvider">
                <table name="TheTable">
                  <column name="ID" type="nvarchar" size="255" />
                  ...other columns...
                </table>
                ...other tables...
              </schema>
            </add>
          </providers>
        </roleManager>
      </system.web>
    </configuration>

The provider can detect the presence of the schema and create it automatically if necessary. I'd go as far as saying that the configuration could even define how to migrate a schema if it's incompatible, or whatever.

Another one: maybe my provider uses a webservice. I may need to pass complex information to the provider, such as credentials, proxy information, SOAP message skeletons, or whatever. None of this is possible with a NameValueCollection.
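
Here's a rough sketch of what a provider could do if it were handed the XML itself. The Initialize(string, XmlNode) signature is the one I'm proposing, not the one in Whidbey, and the SchemaAwareProvider class and its Tables list are hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical provider that reads hierarchical configuration from its <add> element.
public class SchemaAwareProvider
{
    private string name;
    private readonly List<string> tables = new List<string>();

    public string Name { get { return name; } }
    public List<string> Tables { get { return tables; } }

    // Proposed signature: receives the <add> element instead of a NameValueCollection.
    public void Initialize(string name, XmlNode config)
    {
        this.name = name;

        // Child elements are simply not representable as name/value pairs.
        foreach (XmlNode table in config.SelectNodes("schema/table"))
        {
            tables.Add(table.Attributes["name"].Value);
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<add name='MySuperSqlProvider'>" +
            "<schema name='MySuperProvider'>" +
            "<table name='TheTable'>" +
            "<column name='ID' type='nvarchar' size='255' />" +
            "</table>" +
            "</schema>" +
            "</add>");

        SchemaAwareProvider provider = new SchemaAwareProvider();
        provider.Initialize("MySuperSqlProvider", doc.DocumentElement);

        foreach (string table in provider.Tables)
        {
            Console.WriteLine(table);
        }
    }
}
```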

Typed XmlReaders: bridging the gap between streaming and object model APIs.

When dealing with XML in .NET, you're mostly faced with two options:

  • Streaming API: the XmlReader.
  • Object model API: either XmlDocument, XPathDocument or an XmlSerializer-aware custom object model.

Several reasons can lean you towards any of the latter ones, such as strong typing (XmlSerializer), flexibility and XPath querying (XmlDocument and XPathDocument), etc. Any of the three object model API approaches, however, requires the entire XML input to be parsed and loaded into memory. Therefore, when you're presented with large documents, or need the fastest processing, all you're left with is the XmlReader. If you have worked with it doing anything but the most trivial XML processing, you know how ugly it can become. Lots of string comparisons, endless switches, ifs, loops, whatever.

From my point of view, working against a custom object model is best, as it gives you a level of abstraction from the wire format, and you get to work with OO classes and properties, which is far more comfortable than dealing with InnerXml, Value, etc. If you haven't tried the XmlSerializer approach before, you definitely should.

When you move to streaming processing, you lose all that. And you don't lose it because the abstractions of your entities have disappeared, as you most probably have an XML Schema defining what the XML must look like. You just lose it because of the API. You can still use the XML Schema to validate as you read, and get some (very little) extra functionality from the XmlValidatingReader.ReadTypedValue() method. If you're like me, you may be asking: given that I know the schema at design time, isn't there a way to use it to make things easier for me?

And that's not the only issue. Validating against an XML Schema, even though it's a really good idea for keeping your application data consistent and considerably reducing your own validation code, is not free. According to tests I've done with the (fairly simple) purchase order schema and instance document in XML Schema Part 0: Primer, XmlValidatingReader is between 10X and 12X slower than the XmlTextReader. Not that this is a bad number, just that you need to keep it in mind. And why is it so costly? Well, mostly because it's a generic XML Schema validator, which means that as it parses, it checks for valid transitions between states, data types, facets, etc. And again, given that I know the schema at design time, isn't there a way to use it to make things easier for the parser?

Typed readers

Just as typed datasets build upon the generic DataSet to bring strong-typing and validation to the game, based on an XML Schema, wouldn't it be great if the same existed for readers?
A typed reader should be built upon the XmlReader and provide the same validation capabilities as XmlValidatingReader, but at a fraction of the cost, because it would already know all the elements, attributes and types, and it would also be able to read and validate a specific schema.

Given a purchase order document, I could write code as follows:

    poReader r = new poReader( inputStream );
    if (r.Read())
    {
        // Typed date for the orderDate attribute.
        Console.WriteLine( r.orderDate.ToShortDateString() );

        shipToReader shipto = r.ReadshipTo();
        // Country attribute turned into an Enum
        if (shipto.country == shipToCountry.US)
            Console.WriteLine( "US!!" );

        // An inner simple-typed element is made a property
        // In OO, there's no distinction between this and an attribute.
        Console.WriteLine( shipto.name );
    }

Maybe it should be something more like this:

    poReader r = new poReader( inputStream );
    while (r.Read())
    {
        if (r.TypedReader is shipToReader)
        {
            shipToReader shipto = (shipToReader) r.TypedReader;
            // Work against the typed one now.
        }
        else if (r.TypedReader is itemsReader)
        {
            // Do so for items.
        }
    }

I sort of prefer the latter. The TypedReader property would contain the instance used to read (and validate) the current element content model, which would be the current strategy being applied. With the advent of generics, maybe I should even be allowed to pass the typed reader I want...

r.Read<shipToReader>();

I guess in Whidbey that would be the way to implement it internally, anyway...
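
None of this exists yet, so to make the first usage style a bit more tangible, here's a hand-written sketch of the kind of code a generated reader might contain for the orderDate attribute alone. Everything in it is an assumption for illustration: the PurchaseOrderReader name, the OrderDate property and the choice to throw an XmlException when the attribute is missing.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical typed reader: knows at compile time that <purchaseOrder>
// carries an orderDate attribute of type xs:date.
public class PurchaseOrderReader
{
    private readonly XmlReader reader;
    private DateTime orderDate;

    public PurchaseOrderReader(XmlReader reader)
    {
        this.reader = reader;
    }

    // Strongly-typed view of the orderDate attribute.
    public DateTime OrderDate { get { return orderDate; } }

    // Advances to the next purchaseOrder element, checking and converting
    // only what this particular schema requires.
    public bool Read()
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element &&
                reader.LocalName == "purchaseOrder")
            {
                string raw = reader.GetAttribute("orderDate");
                if (raw == null)
                {
                    throw new XmlException("purchaseOrder requires an orderDate attribute.");
                }
                orderDate = XmlConvert.ToDateTime(raw, "yyyy-MM-dd");
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        string xml = "<purchaseOrder orderDate='1999-10-20'></purchaseOrder>";
        PurchaseOrderReader r =
            new PurchaseOrderReader(new XmlTextReader(new StringReader(xml)));

        if (r.Read())
        {
            Console.WriteLine(r.OrderDate.ToShortDateString());
        }
    }
}
```

The point of the sketch is that the checks are hard-coded for one schema rather than driven by a generic validator, which is where I'd expect the savings to come from.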

Another possible use is dynamic run-time generation of these typed readers for a schema. If we can prove that performance will increase, we could use the typed readers not to gain usability but to gain speed. This could be a specialized factory that emits the code (the same you would get at design time) to execute:

    XmlSchema sch = XmlSchema.Read(theFile, null);
    XmlReader r = XmlTypedFactory.CreateReader( sch );

The factory itself would keep cached versions of the Types it has already generated from a certain schema...

So, what do you think about such an idea? Is it useful? Would you use it? What should the API look like?

This may be part of the new Mvp.Xml project most XML MVPs (including me, of course) are heading.

Don't know what SHA1, DPAPI and Initialization Vector are? Crypto made simple at last!

Hernan de Lahitte brings signing, encrypting and hashing to the masses with his Crypto for Everyone post. His work is not pet-project development. I worked with this guy on probably the biggest .NET project in Argentina, and he really knows what security means. The helpers he presents are fully tested, widely used in that project to perform many security-sensitive actions, and are really a time-saver for anyone (like myself) who just wants to call a method and let somebody else worry about the intricacies of cryptography.

Next time you need to encrypt a license key, a password, in-memory (i.e. CallContext) data internal to your framework, etc., be sure to download it.

MSDN XML Dev Center

Dare is announcing the upcoming MSDN XML Dev Center (~ two weeks from launch), and asking for a tagline suggestion. Mine is:

The asphalt for the Information Highway.

About covering newer, work-in-progress and unreleased stuff, I definitely think it makes for more compelling and interesting reading, if mixed with articles on today's technologies. Having the mix means I can read those for today during work, and enjoy the edge-stuff at night at home :). It's a must that the article BEGINS by saying which versions of which draft/beta/platform are used for the discussion, so that whenever those are deprecated/disappear/mutate or go Recommendation/Standard/RTM the reader knows that right from the start.

I also agree with Oleg that they should be more theory/exploratory than the regular material.

XmlNodes from XPathNodeIterator

Every now and then I receive complaints about XPathNodeIterator. You know, it allows iteration where each Current element is an XPathNavigator. Not too useful if you're looking for OuterXml, or are too dependent on the XmlNode-based API (i.e. XmlDocument). The most worrying issue is that people use this argument against using compiled XPath expressions, which are known to significantly improve performance (see the Performant XML (I) and Performant XML (II) articles). The reason is that in order to get an XmlNodeList, you have to use the SelectNodes method of the XmlNode (and therefore XmlDocument), whose signature is as follows:

    public XmlNodeList SelectNodes(string xpath);
    public XmlNodeList SelectNodes(string xpath, XmlNamespaceManager nsmgr);

This means that most developers won't compile their expressions simply because in order to use the XPathExpression, they have to explicitly create a navigator for the node/document and work against the cursor-style API of the XPathNodeIterator and XPathNavigator:

    // Statically compile and cache the expression.
    XPathExpression expr;
    // Init and load a document.
    XmlDocument document;

    // Create navigator, clone expression and execute query.
    XPathNodeIterator it = document.CreateNavigator().Select(expr.Clone());
    while (it.MoveNext())
    {
        // Do something with it.Current which is an XPathNavigator.
    }

This approach generally means that in order to optimize the code by compiling expressions, you actually have to refactor significant pieces of your code. And you don't have any other choice if you need to sort the query by using XPathExpression.AddSort(). There's a solution to this problem, as usual :).

You know that the XPathNavigator is an abstract class that allows multiple underlying implementations to offer the same cursor-style API and gain the instant benefit of XPath querying. Aaron Skonnard has some interesting implementations showing this concept. Therefore, when you're iterating the results of the query, and asking for the current element, you're actually using something that is dependent on the implementation. This object, besides being an XPathNavigator (that is, the XPathNodeIterator.Current property), can also implement other interfaces as part of the underlying implementation. As such, queries executed against an XmlNode-based element will have each Current element implementing IHasXmlNode whereas XPathDocument-based ones will implement IXmlLineInfo. And what is this useful for? Well, just to get access to additional information beyond the standard XPathNavigator API that depends on the concrete implementation. So, inside the while loop above, we can ask:

    while (it.MoveNext())
    {
        if (it.Current is IHasXmlNode)
        {
            XmlNode node = ((IHasXmlNode)it.Current).GetNode();
            // Work with your beloved DOM api ;)
        }
    }

Still, this doesn't solve the problem that you have to iterate differently than you're used to, and that significant rewrites are still needed when you use XPathExpression for querying.
The solution is to use the knowledge about the underlying implementation (i.e. you KNOW you're querying against an XmlDocument) and get an easier API to it. This can be achieved by creating an IEnumerable class that provides iteration over the XPathNodeIterator but exposes the underlying XmlNode. Also, a helper method returning an array of XmlNodes is useful. It would be used as follows:

    XPathNodeIterator it = doc.CreateNavigator().Select(expr.Clone());
    XmlNodesEnumerable nodes = new XmlNodesEnumerable(it);
    foreach (XmlNode node in nodes)
    {
        Response.Write(node.OuterXml);
    }

    // Or use the array directly:
    XmlNode[] list = nodes.ToArray();

Complete code for the custom enumerable object and its internal enumerator implementation follows.
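
As a minimal sketch of what such an enumerable might look like (an assumption about its shape, not necessarily the original listing: it relies on C# 2.0 iterators for brevity, and the XmlNodesDemo class and sample document are made up for illustration):

```csharp
using System;
using System.Collections;
using System.Xml;
using System.Xml.XPath;

// Wraps an XPathNodeIterator and exposes the underlying XmlNode of each result.
public class XmlNodesEnumerable : IEnumerable
{
    private readonly XPathNodeIterator iterator;

    public XmlNodesEnumerable(XPathNodeIterator iterator)
    {
        this.iterator = iterator;
    }

    public IEnumerator GetEnumerator()
    {
        // Work on a clone so every enumeration starts from the same position.
        XPathNodeIterator it = iterator.Clone();
        while (it.MoveNext())
        {
            IHasXmlNode hasNode = it.Current as IHasXmlNode;
            if (hasNode == null)
            {
                throw new InvalidOperationException(
                    "The query was not executed against an XmlNode-based source.");
            }
            yield return hasNode.GetNode();
        }
    }

    // Helper returning all the matched nodes as an array.
    public XmlNode[] ToArray()
    {
        ArrayList list = new ArrayList();
        foreach (XmlNode node in this)
        {
            list.Add(node);
        }
        return (XmlNode[])list.ToArray(typeof(XmlNode));
    }
}

public static class XmlNodesDemo
{
    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root><item id='1'/><item id='2'/></root>");

        XPathNavigator nav = doc.CreateNavigator();
        XPathExpression expr = nav.Compile("/root/item");

        XmlNodesEnumerable nodes = new XmlNodesEnumerable(nav.Select(expr.Clone()));
        foreach (XmlNode node in nodes)
        {
            Console.WriteLine(node.OuterXml);
        }
    }
}
```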

Update: check an even better approach here.

Enjoy!

Check out the Roadmap to high performance XML.
