Large scale RSS aggregation, categorization, and publication dates...

I was reading a post by Darren Neimke in which he said there should be a set of global categories you could apply your blog postings to.  While I think this is a tool that many people would abuse, I do agree that with the right controls it could improve the channeling of information through RSS.  Many people think that the RSS feed and optimizations at the request level (compression and change headers) are pretty much all we can do with syndication, but there are some additional steps.  I'm going to examine one of these steps, a step that some sites have already taken, and talk about a very unfriendly RSS member that doesn't make things easy.

The Scenario
Create a site that does nothing more than aggregate and categorize other sites, providing the output as RSS.  Think Google, but where every search pattern you can think of has its own RSS feed.  With current architectures we can't really think Google, since they seem to need an inordinate number of computers to do what they do, but we can think about aggregating some of the more popular sites that are crucial to our industry.  Think places like DotNetJunkies, Weblogs, and Blogs@MSDN.  We can then take categorization suggestions from these sites (a curiosity of mine is why categories have never shown up as a widely used RSS extension) or categorize them based on content.  The end result is that we have a series of categorized, aggregated feeds.

The Categorization
There are many ways to categorize things, but the most prominent would be the ability to globally categorize our own posts. Something in the XML akin to the following would be a standard categorization based on local categories, and I also show a format for possible global categories.

<dwc:local_cat>Physics,Security,Longhorn</dwc:local_cat>
<dwc:global_cat refer="http://www.amazon.com/BrowseCategories">AmazonCat1,AmazonCat2</dwc:global_cat>

In the first case we have locally scoped categories similar to what we find in .Text.  We already have the ability to provide RSS feeds based on just one category.  What we don't have is the ability for that data to be propagated down to the aggregator on the client, or to some other larger aggregator, so that it can make a decision.  Very unfortunate.  The second case is even more important, since it allows us to scope our articles to a widely accepted format, perhaps Amazon browse categories as I use here, or some other format hosted by sites prominent in each field.  Having the ability to scope my articles on a topic such as causality to some global physics categories would let users follow a path from me, to the site hosting the categorization scheme, and from there back out into the world of physics using resources that the site provides.
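Here is a minimal sketch of how an aggregator might read those two elements, written in Python.  The namespace URI bound to the dwc prefix is an assumption (I haven't defined one above), and read_categories and split_list are names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the post does not define one for the "dwc" prefix.
DWC_NS = "http://example.com/dwc-categories"

ITEM_XML = f"""
<item xmlns:dwc="{DWC_NS}">
  <title>Causality and Longhorn</title>
  <dwc:local_cat>Physics,Security,Longhorn</dwc:local_cat>
  <dwc:global_cat refer="http://www.amazon.com/BrowseCategories">AmazonCat1,AmazonCat2</dwc:global_cat>
</item>
"""

def split_list(text):
    """Split a comma-separated category list, dropping empty entries."""
    return [part.strip() for part in (text or "").split(",") if part.strip()]

def read_categories(item_xml):
    """Return (local, global_cats); global_cats maps each scheme URL to its categories."""
    item = ET.fromstring(item_xml.strip())
    ns = {"dwc": DWC_NS}
    local = []
    for node in item.findall("dwc:local_cat", ns):
        local.extend(split_list(node.text))
    global_cats = {}
    for node in item.findall("dwc:global_cat", ns):
        scheme = node.get("refer", "")
        global_cats.setdefault(scheme, []).extend(split_list(node.text))
    return local, global_cats

local, global_cats = read_categories(ITEM_XML)
print(local)
print(global_cats)
```

Keying the global categories by the refer URL keeps categories from different schemes apart, so an aggregator can decide which schemes it trusts.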

We don't stop there, however.  Once you get the XML feed in your aggregator you have more opportunities than you could imagine.  We can now provide aggregation based on tokenizing the document and determining categories based on content.  This is powerful, since most users don't want to spend the time required to define a good set of categories for their own posts (global categories would let them select from a list that someone else provides, which would probably help), and their categories will never be as granular as the 500 or so words they use within the post.
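Content-based categorization can be sketched very roughly as keyword counting.  The keyword lists, the min_hits threshold, and the function names below are all assumptions for illustration; a real aggregator would want a much better model than this.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real aggregator would load these from a shared scheme.
CATEGORY_KEYWORDS = {
    "Physics": {"causality", "relativity", "quantum"},
    "Security": {"encryption", "exploit", "permissions"},
    "RSS": {"rss", "feed", "aggregator", "pubdate"},
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9#]+", text.lower())

def categorize(text, min_hits=2):
    """Return categories whose keywords occur at least min_hits times, best match first."""
    counts = Counter(tokenize(text))
    scores = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(counts[word] for word in keywords)
        if hits >= min_hits:
            scores[category] = hits
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))

post = "The aggregator reads each RSS feed and checks pubDate. Encryption is mentioned once."
print(categorize(post))
```

The threshold is there so a single passing mention (the lone "Encryption" above) doesn't drag the post into a category it doesn't belong to.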

A final layer does work based on constructs within the post.  Categorize based on things that are located within the post, and you gain the ability to recognize things like whether an article contains code, what the type of the code is, whether or not the article contains pictures, and many other neat things.  Once we've parsed the document at this level, we then go through the task of matching it with pre-defined global categories that it wasn't even assigned to begin with.  Powerful stuff, since I now find my article present in aggregation streams I never thought possible, such as “New C# Code Articles” or “Articles with Code relating to SqlDataProvider”.  This starts to take the work out of properly organizing your post categories in the beginning.
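A rough sketch of this construct layer, assuming the post body arrives as HTML and that code lives inside pre or code tags.  The language markers are a deliberately crude, hypothetical heuristic, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately crude markers for guessing the code language.
LANGUAGE_HINTS = {
    "C#": ("using System", "namespace ", "Console.WriteLine"),
    "SQL": ("SELECT ", "INSERT INTO", "CREATE TABLE"),
}

class ConstructScanner(HTMLParser):
    """Collects the text of <pre>/<code> blocks and notes whether any <img> appears."""

    def __init__(self):
        super().__init__()
        self.has_images = False
        self.code_blocks = []
        self._depth = 0      # nesting depth inside <pre>/<code>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.has_images = True
        elif tag in ("pre", "code"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("pre", "code") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.code_blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

def guess_language(code):
    for language, markers in LANGUAGE_HINTS.items():
        if any(marker in code for marker in markers):
            return language
    return None

def derived_categories(html):
    """Return categories inferred purely from constructs found in the post body."""
    scanner = ConstructScanner()
    scanner.feed(html)
    scanner.close()
    categories = []
    for block in scanner.code_blocks:
        language = guess_language(block)
        label = f"Articles with {language} code" if language else "Articles with code"
        if label not in categories:
            categories.append(label)
    if scanner.has_images:
        categories.append("Articles with pictures")
    return categories

print(derived_categories('<p>Intro</p><pre>using System;\nclass A {}</pre><img src="a.png">'))
```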

People making a Difference
There are some guys making a difference.  We currently have an archival of Weblogs with full-text search ability.  They may have even done the RSS feed per search criterion, but I'm not sure.  A good deal of people must not know about the service they provide because I get very little linkage from them.  I do get a lot of linkage from Google though.  They seem to be doing the right thing in regards to XML, or at least their existing system was capable of parsing it natively without change and without regard to content type.  I've very rarely gotten a Google link to a feed (XML document), and almost always to the page, so it appears to index blogs just like it would a normal site, since the generic structure of a blog is simply linked pages.

Even with the advances, the categorization logic remains to be seen at the level I'm talking about.  I suspect that many users have written their own aggregators that do allow for higher granularity than is present in the current system.  The reason I say this is that I get some interesting link throughs from personal sites where users are aggregating my content.  The link throughs seem to detail a system of categorization that I don't have in place, so they must be doing something extra.

Problems for Large Scale Aggregation
The consistency of the data is only as good as the medium.  I've found some major issues with RSS and the formats that are used in order to propagate it around the net.  The major issue has been pubDate.  I can probably best describe the issues with some examples, so I'll get to those as soon as I describe the apparent conundrum to aggregated sites.

The first issue is: what time-based sorting mechanism do you use for articles?  You can't really sort based on the published time, because then someone could hack your aggregation by posting dates far in the future and keep their articles at the top of the list indefinitely.  I'm not seeing anywhere that you are allowed to change the pubDate as it comes in to the current time of aggregation (the time that you sucked down the file), thus allowing you to normalize all times to your own local system.  This poses a huge issue for large scale aggregation, since you want your lists to be complete, fairly ordered, and indifferent to users' attempts at thwarting the time-based system (look at USENET and the huge issues caused there by future posters).

Scenario 1: You are aggregating 5 sites.  Users in site 5 are posting future dates.  As you retrieve the XML documents you need some method to sort new articles into the mix.  You may only be aggregating once or twice per day, so sorting by publication date would tend to be important so that articles from the same site appear in a specific order.  Even more important, articles from different sites should appear in a particular order such that links from a future (with regards to the system) post are never referencing a past post.  The decision here isn't easy, and you are forced to fall back on <guid> and some sorting mechanism devised by yourself.  Possibly use the pubDate if within an acceptable range, else grant the articles pseudo-dates based on some criterion.  This simply can't be easily fixed.
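One possible shape for that fallback, sketched in Python: trust pubDate when it is within an acceptable range of the fetch time, otherwise grant a pseudo-date equal to the fetch time, and break ties on guid.  The one-hour tolerance, the dictionary item shape, and the function names are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed tolerance for clock skew between the source site and the aggregator.
MAX_FUTURE_SKEW = timedelta(hours=1)

def effective_date(pub_date_text, fetched_at):
    """Use pubDate when it is plausible; otherwise fall back to the fetch time."""
    try:
        published = parsedate_to_datetime(pub_date_text)
    except (TypeError, ValueError):
        return fetched_at                      # missing or unparseable pubDate
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)
    if published > fetched_at + MAX_FUTURE_SKEW:
        return fetched_at                      # future-dated post gets a pseudo-date
    return published

def sort_items(items, fetched_at):
    """Newest first, with guid as a deterministic tie-breaker."""
    return sorted(
        items,
        key=lambda item: (effective_date(item.get("pubDate"), fetched_at), item["guid"]),
        reverse=True,
    )

fetched = datetime(2004, 6, 5, 20, 0, tzinfo=timezone.utc)
items = [
    {"guid": "a", "pubDate": "Sat, 05 Jun 2004 16:11:00 GMT"},
    {"guid": "b", "pubDate": "Fri, 05 Jun 2054 16:11:00 GMT"},  # future poster
    {"guid": "c", "pubDate": "Fri, 04 Jun 2004 09:00:00 GMT"},
]
print([item["guid"] for item in sort_items(items, fetched)])
```

For this to actually defeat future posters, the pseudo-date has to be computed once, when the item is first fetched, and stored.  Recomputing it on every fetch would let the post float back to the top each time.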

Scenario 2: I consider this a .Text bug, but I understand why it exists.  The moment you post, the pubDate is forever set, even if your post is not visible or made public.  It doesn't show up in RSS or on your main page.  In essence the post doesn't exist yet.  Now, when you activate the post, it maintains the original pubDate, so it shows up in the middle of an RSS feed or halfway down your page.  I've had articles that got 0 visibility because I made the mistake of working on them for several days.  In all honesty, the first time the article is seen by the aggregator should be its pubDate as far as whoever is doing the aggregating is concerned.  When a pubDate is from 3 days ago but you haven't seen the post yet, it is pretty obvious what is going on.
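A small sketch of the first-seen idea: remember when each guid was first fetched, and list the post under that time whenever its pubDate lags too far behind.  The one-day lag threshold and all of the names here are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: a pubDate older than this at first sight is treated as a late-published draft.
MAX_LAG = timedelta(days=1)

class FirstSeenIndex:
    """Remembers the first time the aggregator saw each guid."""

    def __init__(self):
        self._first_seen = {}

    def record(self, guid, fetched_at):
        # setdefault keeps the earliest recorded time for a guid.
        return self._first_seen.setdefault(guid, fetched_at)

def listing_date(index, guid, published, fetched_at):
    """Prefer pubDate, but use the first-seen time when pubDate lags far behind it."""
    first_seen = index.record(guid, fetched_at)
    if first_seen - published > MAX_LAG:
        return first_seen
    return published

index = FirstSeenIndex()
fetched = datetime(2004, 6, 5, 20, 0, tzinfo=timezone.utc)
draft_started = fetched - timedelta(days=3)
print(listing_date(index, "post-1", draft_started, fetched))
```

One caveat: on the very first crawl of a feed every older item looks like a late draft, so a real aggregator would probably want to skip this rule until it has seen the feed at least once.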

Conclusion
Problems and solutions are always the name of the game.  RSS definitely has some problems, but also some very easy solutions that would improve it immensely.  The name of the game as of RSS 2.0 is to add your own namespace extensions.  I like this, and I think I'll start working on a couple of namespace extensions that I would find useful and see what everyone else thinks.  I have already defined the categorization extension thoroughly, but I need to check it against existing categorization schemes to see how well it would integrate.  As for things like pubDate, I think there needs to be an extension that augments the process of aggregation by allowing aggregators to apply special namespaced attributes.  An aggDate might be something you get to set as the aggregator, rather than rewriting the pubDate.  Even more cool would be the aggregation trail, so that weblogs.asp.net would actually add a new element stating the original blog that the article came from (even though guid/link does this already), and as more aggregators join the mix, we maintain a full audit trail.

<agg:feedDetails thisFeed="..." srcFeed="..." />

Each aggregator would simply have to add its little tag to the process, and we can now do a blogging tracert to see how far this article has truly travelled, and proper source credentials can be maintained.
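To make the trail concrete, here is a sketch where each aggregator appends one feedDetails element to the item and a reader walks them in order.  The namespace URI for the agg prefix and the feed URLs are hypothetical, as are the add_hop and trace names.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the post does not define one for the "agg" prefix.
AGG_NS = "http://example.com/agg"
ET.register_namespace("agg", AGG_NS)

def add_hop(item, this_feed, src_feed):
    """Append this aggregator's feedDetails element to an RSS <item>."""
    hop = ET.SubElement(item, f"{{{AGG_NS}}}feedDetails")
    hop.set("thisFeed", this_feed)
    hop.set("srcFeed", src_feed)
    return hop

def trace(item):
    """Return the (srcFeed, thisFeed) hops in the order they were added."""
    return [
        (hop.get("srcFeed"), hop.get("thisFeed"))
        for hop in item.findall(f"{{{AGG_NS}}}feedDetails")
    ]

item = ET.fromstring("<item><guid>1</guid></item>")
# Made-up feed URLs: the original blog, a first aggregator, then a second one.
add_hop(item, "http://weblogs.example.com/rss", "http://blog.example.com/rss")
add_hop(item, "http://portal.example.com/rss", "http://weblogs.example.com/rss")

for src, dst in trace(item):
    print(src, "->", dst)
print(ET.tostring(item, encoding="unicode"))
```

Because each hop is appended rather than overwritten, the first element always names the original blog and the last names the feed you actually pulled the item from.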

Published Saturday, June 05, 2004 4:11 PM by Justin Rogers

Comments

Tuesday, June 08, 2004 4:35 AM by TrackBack

# re: Draft posts... what should they display for 'date posted'?

Saturday, June 12, 2004 12:40 PM by TrackBack

# How do you store your RSS? A look at XmlDocument, aggregation, sorting, and XPathDocument

Tuesday, November 02, 2004 9:21 AM by TrackBack

# Blogging Portals - useability and finding stuff
