For every piece of data made available via a schema, thousands more abound...

Matt Wayward got me thinking with his blog article (http://weblogs.asp.net/mattwar/archive/2004/03/24/95790.aspx) on the current usage of schemas to validate incoming data and the possibility that this validation comes at a high cost.  What is the cost of data validation?  Well, the cost of any data validation is possibly data loss.  You see, for every piece of data that fails validation, you have the option to clean the data or the option to throw it out.  If it doesn't pass your schema, then you might be tempted to throw it out.

Well, what about making schemas a bit more tolerant?  I've seen that done quite a few times.  However, the more tolerant the schema becomes, the more work has to be done to process the incoming data anyway.  Take something simple, along the lines of allowing more than one author attribute within a document where you are only going to store one author.  So what happens to the other authors?  Well, they get lost because the schema allowed for many, but you were only prepared to process one, and that is what you did.  If you do decide to process all of the authors, then your back-end has to be prepared for that, possibly creating a performance issue in whatever datastore you are using.
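To make the lost-authors case concrete, here is a minimal sketch in Python.  Everything in it is an assumption for illustration: the document layout, the use of repeated author elements (rather than attributes), and the load_single_author name are all hypothetical and not taken from any real schema or back-end.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that a tolerant schema would accept:
# it allows any number of <author> elements.
SAMPLE = """<document title="Schemas">
  <author>Ann</author>
  <author>Bob</author>
  <author>Cy</author>
</document>"""


def load_single_author(xml_text):
    """Store only the first author and report the ones that were dropped."""
    root = ET.fromstring(xml_text)
    authors = [a.text.strip() for a in root.findall("author") if a.text]
    if not authors:
        raise ValueError("no author found")
    # The assumed back-end has room for exactly one author.
    stored = {"title": root.get("title"), "author": authors[0]}
    lost = authors[1:]
    return stored, lost


stored, lost = load_single_author(SAMPLE)
print("stored:", stored)
print("lost:", lost)
```

The document validates fine, yet two of the three authors never reach the datastore.  Returning the dropped values at least makes the loss visible to the caller instead of silent.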

As humans, we do in our heads the same thing these schemas are doing.  We are processing web pages, processing data streams, comprehending the important parts, and throwing out the rest.  We aren't necessarily doing this with schemas, but rather through a thoroughly trained and very adept parsing engine that we've been enhancing throughout the course of our lives.  As the web gets larger, more and more information will be made available in computer readable formats, while at the same time a much larger amount of information will be made available only in a currently human readable format.  I definitely think some of the mental energies currently being spent normalizing data to make it easier to consume might better be spent coming up with ways to train computers to better parse arbitrary data and create meaningful associations.  They'll also need the ability to repair or enhance their parsing abilities in case the data changes, or more data becomes available that they aren't *aware* of because they've reached some level of consistency (a local minimum).

That's crazy talk!  Of course it is.  However, at the same time, I've spent a great deal of my time creating bots to consume various web pages.  Pages that I'm sure will never be made available in a computer readable format.  Pages maintained by normal people with normal ambitions who simply want to share information.  I mean, why can't I tell my computer here is a page, here is a sample of the data I want, now go get the rest of it?  Well, sometimes I can.  Exploring the possibilities of partial regular expression matching and expression repair functions has made some of this possible.  I would definitely like to see more research in this area.  I know I really enjoy it when I manage to suck someone's hand-generated baseball collection table into my database in a matter of seconds without having to write a single lick of my own regular expression code.  Yay for partial regular expressions and self-repairing techniques ;-)
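The "here is a page, here is a sample" idea can be sketched very roughly in Python.  To be clear, this is not the partial regular expression or expression repair technique itself, just a much simpler stand-in: it takes the text immediately around one sample value and turns it into a pattern.  The page contents, the learn_pattern name, and the context widths are all made up for illustration.

```python
import re

# Hypothetical hand-written page; the real ones are far messier.
PAGE = (
    "<table>"
    "<tr><td>Babe Ruth</td><td>1933</td></tr>"
    "<tr><td>Ty Cobb</td><td>1911</td></tr>"
    "<tr><td>Cy Young</td><td>1909</td></tr>"
    "</table>"
)


def learn_pattern(page, sample, left_width=8, right_width=5):
    """Build a regex from the literal text surrounding one sample value."""
    start = page.find(sample)
    if start < 0:
        raise ValueError("sample not found in page")
    end = start + len(sample)
    left = page[max(0, start - left_width):start]
    right = page[end:end + right_width]
    # Capture whatever sits between the same left and right context.
    return re.compile(re.escape(left) + "(.+?)" + re.escape(right))


pattern = learn_pattern(PAGE, "Babe Ruth")
names = pattern.findall(PAGE)
print(names)
```

With this toy page the learned pattern picks up the first cell of each row.  Fixed-width context is brittle, though: any variation in the markup around a value breaks the match, which is exactly the kind of gap that repair functions would have to cover.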

Published Thursday, March 25, 2004 12:44 AM by Justin Rogers
