How do you store your RSS? A look at XmlDocument, aggregation, sorting, and XPathDocument
Tools that download and use RSS are getting more and more common. I've gone through and done some RSS style aggregation myself, and I have to wonder what tools people use when they implement any form of RSS aggregator. I use .NET and the System.Xml namespace for most of the work, but it can't really solve all of my problems. I'll list what I think would be a few common approaches and why I feel they are clunky or don't scale.
At the root of all RSS aggregation is grabbing the latest copy of the feed. You really get two options for this: a basic WebRequest or an XmlDocument.Load. I tend to use the latter because it handles all of my loading in a single step. You tend not to lose any of the advanced features either, since you can easily investigate the thrown exceptions to find out what went wrong. The biggest issue is that Load builds the request for you, so if you want to take advantage of various HTTP headers, you aren't going to get anything from the API.
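Building the request by hand is what makes the headers reachable. Here is a minimal sketch of that, written in Python rather than .NET so it stays short; the feed address and the helper names build_request and fetch_if_changed are made up for illustration. In .NET the rough equivalent would be an HttpWebRequest with its IfModifiedSince property set, with the response stream then handed to XmlDocument.Load.

```python
import urllib.request
from urllib.error import HTTPError

FEED_URL = "http://example.com/rss.xml"  # hypothetical feed address


def build_request(url, last_modified=None):
    """Build a GET request, adding If-Modified-Since when a cached date is known."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, last_modified=None):
    """Return (body, Last-Modified value), or (None, last_modified) on a 304."""
    try:
        with urllib.request.urlopen(build_request(url, last_modified)) as response:
            return response.read(), response.headers.get("Last-Modified")
    except HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError; treat it as "nothing new".
        if error.code == 304:
            return None, last_modified
        raise


# Only the request is built here; nothing goes over the network.
request = build_request(FEED_URL, "Mon, 05 Jan 2004 12:00:00 GMT")
```

The caller would keep the returned Last-Modified value next to the stored feed and pass it back in on the next poll.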
What kinds of HTTP headers would I be talking about? Well, the ones for NOT grabbing my damn RSS every time you make a request even though I haven't made a post lately (not likely if you are grabbing mine, but hey, some people don't update very often ;-) The Fishbowl has a great article on this. He talks about direct use of the header in order to return nothing if things haven't been modified. I tend to think that the stored value can also be used to return only the new articles, so you could return 3 new elements today rather than the 15-50 that would normally get sent because you had technically modified the document. By setting your Last-Modified header at this point to your latest pubDate you can now enjoy very generous bandwidth savings. One thing to note is that pubDate would really have to be precise enough to dictate the time at which the article was made available for RSS and not necessarily its published date. I've been thinking about this a lot lately, and I have a couple of posts floating around. If you take my idea that a blog is a document management system then you'd have to physically publish your articles onto a feed. When this happens the feed publishing date can get set (aggDate, if you follow my other article). You realize you could save a bunch of bandwidth with the proper caching in place. Note: I've previously read individual rants on this same matter, the only one of which I had bookmarked was Fishbowl. So if you don't get a linkie, publish in my comments.
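The "only return the new articles" idea can be sketched on the serving side like this. It is a rough Python illustration, not how any particular blog engine does it: the function name items_newer_than and the sample feed are invented, and it assumes every item carries an RFC 822 pubDate that really reflects when the item hit the feed. The cutoff would be whatever the client sent in If-Modified-Since.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up three-item feed used only for this illustration.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example</title>
<item><guid>a</guid><pubDate>Mon, 05 Jan 2004 10:00:00 GMT</pubDate></item>
<item><guid>b</guid><pubDate>Tue, 06 Jan 2004 10:00:00 GMT</pubDate></item>
<item><guid>c</guid><pubDate>Wed, 07 Jan 2004 10:00:00 GMT</pubDate></item>
</channel></rss>"""


def items_newer_than(feed_xml, if_modified_since):
    """Return the item elements whose pubDate is later than the client's cutoff."""
    cutoff = parsedate_to_datetime(if_modified_since)
    channel = ET.fromstring(feed_xml).find("channel")
    return [
        item
        for item in channel.findall("item")
        if parsedate_to_datetime(item.findtext("pubDate")) > cutoff
    ]


new_items = items_newer_than(SAMPLE_FEED, "Mon, 05 Jan 2004 12:00:00 GMT")
new_guids = [item.findtext("guid") for item in new_items]
```

If the list comes back empty the server can answer with a plain 304; otherwise it sends a feed holding just those items.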
With the document on the local machine, it is time to talk about aggregation. Since I was thinking about storing an XmlDocument, I'll find some way to load that up. But hey, I already have a local XmlDocument with 500 posts from this feed, so what in the hell do I do now? Darn, I have to go and merge the documents. If you are using any of the above caching this becomes easier since you won't end up merging identical document sets all day long. This process is arduous, and it involves writing a large number of Select statements against the document to find identical nodes. There must be an easier way, right? Well, I could have chosen a different medium, perhaps a Hashtable to store strongly typed objects, or even mini XmlDocuments, one for each item. That doesn't fix my sorting problem, so perhaps a SortedList to do the same thing. Hopefully you start to see the problem here. I have two choices: either fully parse the RSS feed and strongly type it, or go with the XmlDocument. Lately I've implemented a solution that uses both, but in the long term I think I'd convert it to a fully strongly typed version. (Note: Whidbey has my fix in the form of the editable XPathDocument, but that still may not be the best in terms of performance.) Eventually, I end up with a collection or XmlDocument with some new records in it.
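For what it's worth, the merge itself boils down to "append anything whose guid I haven't stored yet". A minimal sketch of that, in Python rather than against XmlDocument, assuming every item has a guid and that the guid is what identifies a post (items without one would all collide); merge_items is a made-up name.

```python
import xml.etree.ElementTree as ET


def merge_items(stored_channel, fetched_channel):
    """Append fetched items whose guid is not already stored; return how many were added."""
    known = {item.findtext("guid") for item in stored_channel.findall("item")}
    added = 0
    for item in fetched_channel.findall("item"):
        guid = item.findtext("guid")
        if guid not in known:
            # The whole element is kept, including tags this code knows nothing about.
            stored_channel.append(item)
            known.add(guid)
            added += 1
    return added


stored = ET.fromstring(
    "<channel><item><guid>a</guid></item><item><guid>b</guid></item></channel>"
)
fetched = ET.fromstring(
    "<channel><item><guid>b</guid></item>"
    "<item><guid>c</guid><title>New</title></item></channel>"
)
added = merge_items(stored, fetched)
merged_guids = [item.findtext("guid") for item in stored.findall("item")]
```

Building the set of known guids once replaces the per-item select against the stored document, which is where most of the pain in the XmlDocument version comes from.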
That brings us to sorting. We could just use the pubDate, but if the user is messing with that value, then it is not really usable. I still propose the use of an aggregation date, and it really doesn't need to affect the document at all. With the aggDate, you can simply sort based on aggDate first, then pubDate second. This means you can have a local timestamp that marks when you imported the node and can use it on top of the previously defined date to allow for powerful anti-hack sorting. The user can still publish a bunch of records with a different Guid, the same content, etc., and attempt to mess with you; however, there are other detection schemes for that moron.
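Keeping the aggDate outside the document can be as simple as a side table keyed by guid. Here is a small Python sketch of that, assuming guids are stable between fetches; the name stamp_new_items and the dictionary layout are my own invention for illustration.

```python
from datetime import datetime, timezone


def stamp_new_items(agg_dates, guids, now=None):
    """Record an aggregation time for guids not seen before; existing stamps are kept."""
    now = now or datetime.now(timezone.utc)
    for guid in guids:
        # setdefault leaves an earlier stamp alone, so a republished item keeps
        # the time it was first imported.
        agg_dates.setdefault(guid, now)
    return agg_dates


agg_dates = {}
first_import = datetime(2004, 1, 5, 9, 0, tzinfo=timezone.utc)
second_import = datetime(2004, 1, 6, 9, 0, tzinfo=timezone.utc)
stamp_new_items(agg_dates, ["a", "b"], now=first_import)
stamp_new_items(agg_dates, ["b", "c"], now=second_import)
```

Because the stamp is set locally at import time, nothing the publisher does to pubDate afterwards can move an item you already have.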
Sorting isn't easy though. Sorting an XmlDocument means using some XSLT. The framework doesn't make it easy to use XSLT with an XmlDocument in order to get a new XmlDocument. In fact you have to jump through some hoops. The equivalent strongly typed code is a bit easier, since you can write an IComparer implementation that knows which fields to sort on. A sample sorter that I wrote for use by a SortedList did a number of comparisons: first checking guid equality (for ContainsKey functionality), followed by aggDate (if not equal then sort by this value), followed by pubDate (if not equal then sort by this value), and finally, since you can't return an equals even if the two elements have all the same dates, you need to sort by guid. That ensures you'll never get a false equals no matter what the pubDate is. That is still a large amount of work, and I'm wondering how many schemes actually go the XmlDocument route. What do the common blogging aggregators use?
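The comparison order described above looks roughly like this. It is a Python sketch of the same idea rather than my actual IComparer; the Entry type and compare_entries name are made up, and it assumes guids are unique within the list being sorted.

```python
from collections import namedtuple
from datetime import datetime
from functools import cmp_to_key

# Hypothetical strongly typed item holding only the fields the sort needs.
Entry = namedtuple("Entry", "guid agg_date pub_date")


def compare_entries(x, y):
    """Compare by guid equality, then aggDate, then pubDate, then guid order."""
    if x.guid == y.guid:
        return 0  # same item, which is what a ContainsKey-style lookup relies on
    for a, b in ((x.agg_date, y.agg_date), (x.pub_date, y.pub_date)):
        if a != b:
            return -1 if a < b else 1
    # Every date matches, so fall back to the guid and never report a false equals.
    return -1 if x.guid < y.guid else 1


entries = [
    Entry("c", datetime(2004, 1, 6), datetime(2004, 1, 1)),  # backdated pubDate
    Entry("b", datetime(2004, 1, 5), datetime(2004, 1, 5)),
    Entry("a", datetime(2004, 1, 5), datetime(2004, 1, 5)),
    Entry("d", datetime(2004, 1, 5), datetime(2004, 1, 4)),
]
ordered = [e.guid for e in sorted(entries, key=cmp_to_key(compare_entries))]
```

Entry "c" claims the oldest pubDate but still sorts last because it was aggregated last, and "a" and "b" only differ by guid, so the guid breaks the tie.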
Why did I assert that I wanted to keep the feed in XML format? Why not just strongly type everything and do away with the entire process of using XML? Well, because I'm not one to lose information. Parsing out all of the optional values isn't something I think most aggregators do. In fact, most aggregators probably only use the data that they are programmed to handle. So when your tool upgrades in the future, all the data that you could have used to provide an extended feature over previously aggregated material is gone. Most likely it is completely gone, since RSS is a rapidly evolving data stream, and most providers don't allow you to get old data. Maybe .Text has some feature in the Rss.aspx that lets me kick up the number of returned items (I know it lets me set between 0 and 25 as the admin), but I don't know about it, so I could never get my entire feed from the beginning in RSS form. Users work around this by having archival categories they use in order to surface a feed over old data, a generic hack at best.
That isn't the only reason for maintaining XML for the data though. XML is after all THE persistence format we are supposed to be using, right? If that is the case, it doesn't necessarily make sense for me to strongly type things. Makes more sense to keep things in XML. A year from now, when my RSS data store of all my favorite blogs is still there in XML format, and my new XML tools become available, I can process it in all new ways. Had I strongly typed the data and spit out a binary stream somewhere, that wouldn't be the case (BTW - does anyone know where RSS Bandit puts all that stuff? I typically lose archives every time I upgrade the client, and sometimes just for spite, it dumps my archives without the upgrade needed... Nice feature).
I guess that brings me to the point. We write software for what is available today, but with an outlook for tomorrow. I would much prefer to write an RSS aggregator in today's technology by using strongly typed objects, really cool sorting mechanisms, and custom collections. The tools of tomorrow are going to be more geared towards XML though, and XML is more geared towards the technology I'm trying to consume. If I keep everything around in XML form, I can make use of the editable XPathDocument. With proper use of XPathNavigators, I can realize a fully strongly typed collection and object set with the XPathDocument behind it as the persistent data store (another article, go check my Whidbey archives). Heck, all of that sorting and merging becomes easier as well with XQuery. Better yet, I don't have to worry about losing any of the tags I don't understand today, because of XML's tolerance for extra data.
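To make the "strongly typed on top, XML underneath" idea concrete, here is a rough Python sketch. It is not the XPathDocument/XPathNavigator API, just the same shape: a typed view that reads through to the stored element instead of copying values out of it. The class name ItemView and the futureTag element are invented for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


class ItemView:
    """Typed accessors over an item element; the element itself stays the store."""

    def __init__(self, element):
        self.element = element

    @property
    def title(self):
        return self.element.findtext("title")

    @property
    def pub_date(self):
        text = self.element.findtext("pubDate")
        return parsedate_to_datetime(text) if text else None


raw = (
    "<item><title>Hello</title>"
    "<pubDate>Mon, 05 Jan 2004 10:00:00 GMT</pubDate>"
    "<futureTag>kept</futureTag></item>"
)
view = ItemView(ET.fromstring(raw))
# Serializing the backing element still includes the tag the view never exposed.
round_tripped = ET.tostring(view.element, encoding="unicode")
```

The view only understands title and pubDate, yet futureTag survives a save, which is exactly the lossless behavior a copy-into-objects parser gives up.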
What do you think? How do you store your RSS today? Do you feel your storage provides you with the appropriate amount of performance (looking at modern RSS aggregators, I'd say no, but maybe that isn't the fault of the storage)? If you make trade-offs between strongly typing your data and the underlying feed what are they (aka, are you doing the lossy method)?
This document focuses primarily on aggregation and storage of a single RSS feed. Scale this same problem out to thousands and you start to see where all of the small features become very important. Another feature left out of this document is the ability to cross aggregate multiple feeds into a single larger feed. I'll talk more about the various algorithms that become extremely important in this scenario in the future.