Creating a generic Site-To-RSS tool

Download the source files.

 

Index:

Creating a generic Site-To-RSS tool

What you’ll need

Summary

Introduction

Planning a site scrape

Regex – a powerful scraping tool

Creating our scraping regular expression

Getting our link

Getting our Title

Getting our description

Getting our category

Getting our publishing date

Our final regex

Making an RSS feed out of it

Validating our feed

Subscribing to our feed

Approaches for a generic tool

Building the generic SiteToRSS Class

Verifying the existence of capture groups in a pattern

Retrieving the site using the WebClient class

Writing the RSS feed to either a file or an in-memory stream

Working with a MemoryStream and the case for Xml encoding

Link prefix

Using the generic class with .NetWire

What’s in the download?

 

What you’ll need

·         Regular expression knowledge. Consider reading the following articles:

o        Introduction to Regular Expressions

o        Practical Parsing Using Groups

·         Expresso – a tool for working with regular expressions

 

Summary

I’ll show how to use regular expressions to parse a web page’s HTML text into manageable chunks of data. That data will be converted and written as an RSS feed for the whole world to consume. Finally, I’ll show how to create a generic tool that automatically generates an RSS feed from any given website, given a small group of parameters. At the end of the day we will have a working RSS feed for www.DotNetWire.com.

 

Introduction

Ah, the joys of RSS. You can get the data you need, as soon as it’s available, and no nagging browsers or popups along the way. If only all sites had RSS feeds, huh? If there’s one thing that would be really nice it would be the ability to generate an RSS feed from any site I want. For example, .NetWire is a very interesting site with lots of useful information. However, the folks maintaining this site hadn’t thought about providing it with an RSS feed, which it so sorely needs.

So I got to thinking: “Hmm, all the data on the site that’s important to me seems to be arranged in an orderly and predictable manner. I should be able to parse it fairly easily and make it into an RSS feed.” So I started trying. It worked out pretty well. So well, that I’ve come up with a way to let you do your own site scraping using a generic tool, providing it with only simple rules expressed as a single regular expression.

Planning a site scrape

“Site scraping” means going over a site’s HTML and “mining” it for any relevant data. All other text is discarded. This is what I intend to show here. For this article, I’ve chosen .NetWire as the site I’ll be scraping, as the outcome of this will be useful to a great many people. In planning the scraping I’ll ignore the specifics of how I actually get the text to parse and leave that topic for the end of the article.

The first thing I did was to open my web browser on the .NetWire site, right click and select “view source”. Notepad shows me the site as my future parser will see it. This raw text is the juice I’ll need to parse in order to get the data I need.

To be honest, it looked quite scary. How on earth was I going to come up with an easy way to parse such an enormous amount of information without losing my head? Scrolling through the text, however, I could start to see patterns in which “important” text, text that was relevant to me, showed up.

There were links inside paragraphs, followed by SPANs and many more attributes. It was a nightmare to parse. Just writing all the rules for finding a specific link or title for the RSS feed that I wanted to create was hard enough, but I also had lots more to contend with. I had to find text inside found text inside found text. It was hardly a job for a few hours on the weekend.

So the next thing I decided to check was whether I could do the job with regular expressions.

 

Note: If you don’t care to find out how we build the regular expression for scraping the site and would rather just move to where we actually use it to create the RSS feed, feel free to jump directly to the “Making an RSS feed out of it” section.

 

Regex – a powerful scraping tool

If you don’t know what regular expressions are, there are loads of articles on the subject. I’ve written a couple myself. They are referenced at the bottom of this article. You’ll need to understand regular expressions before reading how to use them for scraping a site.

Regular expressions enable us to easily extract necessary information from text. Easily. They allow us, through complex expressions provided as plain text, to recover strings that match lots and lots of rules that we provide. The data we receive back after running our expressions on a string can be as complex and as detailed as we’d like. We can even divide it into groups of matched text, with group names attached to them, allowing us to easily program against the regular expression (Regex) interface (see “Practical Parsing Using Groups” for more info).

 

Since a site is ultimately represented as plain text (be it HTML, JavaScript, or anything else), we can apply regular expressions to that text as well, allowing us to search it and filter out any irrelevant information quickly and easily.

 

Creating our scraping regular expression

For our RSS feed, we only need several pieces of data retrieved from the HTML for every “post” we intend to create in our RSS feed:

·         Link: A link the post reader could click to go to the specific information. In .NetWire it’s the link of the news item.

·         Title: The title that will appear in the RSS reader used to read the posts. In .NetWire it’s the title of the news item.

·         Description: The actual text of an individual post. In .NetWire it’s the text of the news item.

·         Publishing date: The date of the post. In .NetWire it’s the publish date of the news item.

 

These various items are buried deep inside the HTML of our website. It is now our job to find a regular expression that retrieves those items, and allows us to easily reference them by code. Using our knowledge of “groups” in Regex, we want to have a group in the resulting regex for every item we want to retrieve. We’ll name them “link”, “title”, “description” and “pubDate” respectively.

In developing our regex, I decided to use Expresso, a tool designed to help with regular expression testing.

 

In developing our regex, I’ll rely on this piece of HTML, taken from the HTML of .NetWire:

 

 

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>
With the explosive growth of the Internet and rapid globalization of the world's
economies, the earth is getting smaller and smaller. The applications that you develop for
a local market may soon be used in another country. If the world used a common language,
that would make the life of developers much easier. However, reality is far from perfect.
The author shows you how to make your applications ready for the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>

 

 This HTML represents one news item on .NetWire, and this is the one we’ll need to focus on.

Our first item of business today is getting the link of the news item. Why the link first? Because it’s the first item in order of appearance, which makes it the least complicated to find.

 

Getting our link

Looking at the piece of HTML that we want to extract the link from:

 

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>

With the explosive…

 

We can easily see that each link (and title) is encapsulated between two items:

<p class="clsNormalText"><a href="

→ Our link

" target="newwindow" class="clsNewsHead">
 

Simply enough, the following regular expression catches all instances of such a link within our HTML file, and presents us with a group named “link” that gives us the actual redirection string of the link:

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")

 

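I’ve used “\s” to avoid declaring exactly how many spaces or tabs reside between the tag definitions and the actual tag attributes. Also notice the “?” that follows the “link” group, just before the “"\s*target="newwindow"” section. My intent is for the expression to close the match on the first occurrence of that section, and not the last one (otherwise it would match everything up to the last link at the end of the file). Strictly speaking, a “?” placed after a group makes the group optional rather than lazy. The match still closes early here because “.” does not match new lines, so “.*” cannot run past the end of the current line. If a page puts several links on one line, use the lazy form “(?<link>.*?)” instead.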
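To see the expression above at work, here is a minimal sketch that runs it against the first two lines of the sample news item and prints the “link” group. The module name LinkDemo and the function ExtractLinks are made up for this illustration; only the pattern itself comes from the article.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LinkDemo
    ' The link pattern from the article; quotes are doubled because
    ' that is how VB escapes them inside a string literal.
    Public Const LinkPattern As String = _
        "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?(""\s*target=""newwindow"")"

    ' Hypothetical helper: returns the "link" group of every match.
    Public Function ExtractLinks(ByVal html As String) As String()
        Dim found As MatchCollection = Regex.Matches(html, LinkPattern)
        Dim links(found.Count - 1) As String
        For i As Integer = 0 To found.Count - 1
            links(i) = found(i).Groups("link").Value
        Next
        Return links
    End Function

    Sub Main()
        ' First two lines of the sample news item.
        Dim html As String = _
            "<p class=""clsNormalText""><a href=""/redirect.asp?newsid=4974"" target=""newwindow""" & vbLf & _
            "class=""clsNewsHead"">Globalizing and Localizing Windows Applications, Part 1</a><br>"

        For Each link As String In ExtractLinks(html)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

For the sample item this should print the relative address /redirect.asp?newsid=4974, which is exactly the text between the two delimiters shown earlier.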

 

Getting our Title

Now that we have the link, we need to get the title for the link. This one is also relatively easy. The title resides between the “>” that closes the opening anchor tag and the link’s closing tag (“</a>”). We also need to allow for new lines and spaces along the way, so we account for these in our regular expression as well.

Here’s the full expression so far. The new part is everything after the “target="newwindow"” section:

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)

 

And we have a group in there as well, called “title”, so we can refer to it later in code. Notice that the title is made up of any number of characters, optionally followed by a single new line and more characters.

 

 

Getting our description

The description is a block of text that can contain new lines, and is terminated by a “<br>”:

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)

 

The end of the expression contains the beginning of the next expression we want to find.

 

Getting our category

The category of the current news item is usually “Article” or “Product Release”. It always starts right after a “>” sign and ends with a period (“.”):

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.

 

Getting our publishing date

The news date follows right after the category’s ending period (with zero or more spaces between them) and finishes with another period, followed by the closing SPAN tag and P tag.

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span></p>)

 

Our final regex

So. We end up with a piece of text that we can use to scan .NetWire’s HTML and retrieve a list of matches, each of which contains groups named “link”, “title” etc. that we can use in our code. Our next step is to transform this pile of data into useful, readable information.
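As a quick check of the final expression, the sketch below runs it over one news item and prints every named group. The module name FullPatternDemo and the function SampleHtml are hypothetical, and the description text is abbreviated from the sample shown earlier to keep the listing short.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module FullPatternDemo
    ' The final pattern from the article, split over several lines
    ' for readability; quotes are doubled for VB.
    Public Const ItemPattern As String = _
        "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?(""\s*target=""newwindow"")" & _
        "(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)" & _
        "(?<description>(.|\n)*?)(<br>(.|\n)*?>)" & _
        "(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span></p>)"

    ' One news item; the description is shortened for this illustration.
    Public Function SampleHtml() As String
        Return "<p class=""clsNormalText""><a href=""/redirect.asp?newsid=4974"" target=""newwindow""" & vbLf & _
               "class=""clsNewsHead"">Globalizing and Localizing Windows Applications, Part 1</a><br>" & vbLf & _
               "With the explosive growth of the Internet and rapid globalization of the world's" & vbLf & _
               "economies, the earth is getting smaller and smaller.<br>" & vbLf & _
               "<span class=""clsSubText"">Article. Sep 16, 2003.</span></p>"
    End Function

    Sub Main()
        For Each item As Match In Regex.Matches(SampleHtml(), ItemPattern)
            Console.WriteLine("link:        " & item.Groups("link").Value)
            Console.WriteLine("title:       " & item.Groups("title").Value)
            Console.WriteLine("description: " & item.Groups("description").Value)
            Console.WriteLine("category:    " & item.Groups("category").Value)
            Console.WriteLine("pubDate:     " & item.Groups("pubDate").Value)
        Next
    End Sub
End Module
```

Each Match exposes its groups by name, so the code never needs to know where in the HTML a value was found. That is what makes it practical to swap in a different pattern for a different site later on.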

 

Making an RSS feed out of it

 

The first step in creating a valid RSS feed is to know how the RSS schema looks. There are several RSS standards out there today. I’ve chosen to implement this using the RSS 2.0 standard. I won’t bore you with the entire schema definition here, but a standard RSS feed using the RSS 2.0 schema should look something like this:

 

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:blogChannel="http://backend.userland.com/blogChannelModule">
    <channel>
        <title></title>
        <link></link>
        <description></description>
        <copyright></copyright>
        <generator></generator>
        <item>
            <title></title>
            <link></link>
            <description></description>
            <category></category>
            <pubDate></pubDate>
        </item>
    </channel>
</rss>

 

 

The easiest way to write XML with the .Net framework is using the XmlTextWriter class. This class abstracts away the need to explicitly write strings that represent XML, and supports writing directly to a file or an IO.Stream object. That stream can represent either a file stream, a memory stream, a response stream or anything else that derives from System.IO.Stream. Pretty powerful.

Here’s a small method that gets all the matches from a site’s HTML, loops through them, and uses an XmlTextWriter to write the XML representing the RSS feed:
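Before looking at the full feed, here is a minimal sketch of XmlTextWriter on its own, writing a single element into a string. The module name XmlWriterDemo and both method names are invented for this example, and the title text is sample data chosen to show that special characters get escaped for us.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module XmlWriterDemo
    ' Writes a tiny XML fragment to whatever TextWriter it is given.
    Public Sub WriteSample(ByVal target As TextWriter)
        Dim writer As New XmlTextWriter(target)
        With writer
            .Formatting = Formatting.Indented
            .WriteStartElement("item")
            ' The ampersand is escaped by the writer, not by us.
            .WriteElementString("title", "Tom & Jerry")
            .WriteEndElement()
            .Flush()
        End With
    End Sub

    ' Same fragment, captured in memory and returned as a string.
    Public Function WriteSampleToString() As String
        Dim buffer As New StringWriter()
        WriteSample(buffer)
        Return buffer.ToString()
    End Function

    Sub Main()
        Console.WriteLine(WriteSampleToString())
    End Sub
End Module
```

Because WriteSample only knows about TextWriter, the same call works with a StreamWriter over a file or any other TextWriter without changing the XML-writing code.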

 

Public Sub WriteRSSToStream(ByVal txWriter As TextWriter)

 

'our pattern to parse the page

Const REGEX_PATTERN as string = "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?(""\s*target=""newwindow"")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span>)"

 

'Get the HTML to parse

Dim DownloadedHtml As String = GetHtml()

'Get the matches using our regular expression

Dim found As MatchCollection = Regex.Matches(DownloadedHtml, REGEX_PATTERN)

Dim writer As New XmlTextWriter(txWriter)

 

With writer

    'make the resulting xml human readable

    .Formatting = Formatting.Indented

 

    'write the document header declaring rss version

    'and channel info

    .WriteStartDocument()

    .WriteComment("RSS generated by SiteToRSS generator at " _

+ DateTime.Now.ToString("r"))

    .WriteStartElement("rss")

    .WriteAttributeString("version", "2.0")

    .WriteAttributeString("xmlns:blogChannel", _

"http://backend.userland.com/blogChannelModule")

 

    .WriteStartElement("channel", "")

    .WriteElementString("title", RSSFeedName)

    .WriteElementString("link", RssFeedLink)

    .WriteElementString("description", RssFeedDescription)

    .WriteElementString("copyright", RssFeedCopyright)

    .WriteElementString("generator", "SiteParser RSS engine 1.0 by Roy Osherove")

 

    'write out the individual posts

    For Each aMatch As Match In found

        Dim link As String = aMatch.Groups("link").Value

        Dim title As String = aMatch.Groups("title").Value

        Dim description As String = aMatch.Groups("description").Value

 

      'format the date as RFC1123 date string (“Tue, 10 Dec 2002 22:11:29 GMT”)

        Dim pubDate As String = _

DateTime.Parse(aMatch.Groups("pubDate").Value).ToString("r")

        Dim subject As String = aMatch.Groups("category").Value

 

        .WriteStartElement("item")

 

        .WriteElementString("title", title)

        .WriteElementString("link", link)

 

      'The description may contain illegal chars

      'so write it out as CDATA

        .WriteStartElement("description")

        .WriteCData(description)

        .WriteEndElement()

     

        .WriteElementString("category", subject)

        .WriteElementString("pubDate", pubDate)

 

        .WriteEndElement()

 

     Next

 

     'close all open tags and finish up

     .WriteEndDocument()

     .Flush()

     .Close()

End With

 

End Sub

 

The code to generate an RSS feed is surprisingly simple. Notice that the method accepts a TextWriter, which can potentially write to a file, a string or lots of other things. We are not bound to any particular target in this implementation. I still haven’t shown how to get the actual HTML from the web, but I’ll explain shortly.
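One detail of the method above that is worth isolating is the date conversion. The sketch below parses a date as it appears on the site and formats it with the “r” specifier. The module name PubDateDemo and the function ToRfc1123 are hypothetical. Passing CultureInfo.InvariantCulture is my own assumption, added so the parse does not depend on the machine’s regional settings; the method above calls DateTime.Parse without it.

```vbnet
Imports System
Imports System.Globalization

Module PubDateDemo
    ' Parses a scraped date such as "Sep 16, 2003" and formats it
    ' using the "r" (RFC1123) format specifier.
    Public Function ToRfc1123(ByVal scrapedDate As String) As String
        Dim parsed As DateTime = _
            DateTime.Parse(scrapedDate, CultureInfo.InvariantCulture)
        Return parsed.ToString("r", CultureInfo.InvariantCulture)
    End Function

    Sub Main()
        ' Prints: Tue, 16 Sep 2003 00:00:00 GMT
        Console.WriteLine(ToRfc1123("Sep 16, 2003"))
    End Sub
End Module
```

Note that “r” only formats the value; it does not convert it to GMT. Since the site gives a date without a time, the time portion comes out as midnight.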

 

Validating our feed

To validate the feed as valid RSS XML, you can use one of the various free RSS validating sites out there (www.FeedValidator.org springs to mind). The site will make sure your feed lives up to the standard it claims to support, and will tell you if you missed anything important.

It’s very helpful to test against such a site to make sure you don’t screw up people’s aggregators that will subscribe to your new feed.

 

Subscribing to our feed

Now that we have a ready-made XML file, we can test it using a real aggregator. I used SharpReader and simply subscribed to a feed located at the path leading to the XML file. In SharpReader, I made sure that there was exactly the same number of posts as there are news items on the site, and that the titles were correct. I also made sure that the “subject” column correctly represented the “category” of each news item.

 

Approaches for a generic tool

Now that we have the basic mechanics of the thing working, we need to understand the power that comes from such a simple technique. What we’ve seen here demonstrates that given a simple regular expression and text to parse, we are basically able to parse any site we want.

It comes to mind that we can build a simple class that receives these parameters and outputs RSS feeds appropriately.

Such a class can later be used to build a much more generic web site or web service, to which sites and expressions can be added dynamically, and that returns valid RSS feeds given a site ID.

But let’s start small.

 

Building the generic SiteToRSS Class

Our class should have several public properties representing the various RSS feed properties (description, generator and so on).

It should also be able to download a site from the web, and write an RSS feed into a file or just return it as a string.

I’ll spare you the entire code of the class, but I’ll refer here to the less trivial methods inside it. Here’s the basic layout:

 

 

 

Public Class RSSCreator

    Public Sub New(ByVal Url As String, ByVal FileName As String)

    End Sub

 

    Public Sub New(ByVal Url As String)

    End Sub

 

    Public Property UrlToParse() As String

    End Property

 

    ''' <summary>

    '''     the file to which the RSS feed will be written

    ''' </summary>

    Public Property FileName() As String

    End Property

 

    ''' <summary>

    '''     returns a string containing the RSS feed xml

    ''' </summary>

    Public Overloads Function GetRss() As String

        Dim ms As New MemoryStream

        Dim sr As New StreamWriter(ms, Encoding.UTF8)

 

        'We send "false" to signal the method to not close the stream automatically in the end

        'we need to close the stream manually so we can get its length

        WriteRSS(sr, False)

        Try

 

            ''we need to explicitly state the length

            'of the buffer we want

            'otherwise we'll get a string as long as ms.capacity

            'instead of the actual length of the string inside

            Dim iLen As Integer = CInt(ms.Length)

            Dim retval As String = _

                Encoding.UTF8.GetString(ms.GetBuffer(), 0, iLen)

 

            sr.Close()

            Return retval

 

        Catch ex As Exception

            Return ex.ToString()

 

        End Try

 

    End Function

 

    ''' <summary>

    '''     writes the resolved RSS feed to a file

    ''' </summary>

    Public Overloads Function WriteRSS() As String

        Dim writer As New StreamWriter(FileName, False, Encoding.UTF8)

        Return WriteRSS(writer, True)

    End Function

 

    ''' <summary>

    '''     Writes the resolved RSS feed to a text writer

    '''     and returns the text that was written (if it was written to a file)

    ''' </summary>

    Public Overloads Function WriteRSS(ByVal txWriter As TextWriter, ByVal closeAfterFinish As Boolean) As String

 

    End Function

 

    ''' <summary>

    '''     writes the beginning of the XML document

    ''' </summary>

    Private Sub WritePrologue(ByVal writer As XmlTextWriter)

        With writer

            .WriteStartDocument()

            .WriteComment("RSS generated by SiteToRSS generator at " + DateTime.Now.ToString("r"))

            .WriteStartElement("rss")

            .WriteAttributeString("version", "2.0")

            .WriteAttributeString("xmlns:blogChannel", "http://backend.userland.com/blogChannelModule")

 

            .WriteStartElement("channel", "")

            .WriteElementString("title", RSSFeedName)

            .WriteElementString("link", RssFeedLink)

            .WriteElementString("description", RssFeedDescription)

            .WriteElementString("copyright", RssFeedCopyright)

            .WriteElementString("generator", "SiteParser RSS engine 1.0 by Roy Osherove")

        End With

    End Sub

 

 

    ''adds a post to the RSS feed

    Private Sub AddRssItem(ByVal writer As XmlTextWriter, ByVal title As String, ByVal link As String, ByVal description As String, ByVal pubDate As String, ByVal subject As String)

 

        writer.WriteStartElement("item")

        writer.WriteElementString("title", title)

        writer.WriteElementString("link", link)

 

        'write the description as CDATA because

        'it might contain invalid chars

        writer.WriteStartElement("description")

        writer.WriteCData(description)

        writer.WriteEndElement()

 

        writer.WriteElementString("category", subject)

        writer.WriteElementString("pubDate", pubDate)

        writer.WriteEndElement()

 

    End Sub

 

    ''' <summary>

    '''     generates a new regular expression

    '''     and retrieves the HTML from the web

    ''' </summary>

    Private Sub ParseHtml()

        m_FoundRegex = New Regex(RegexPattern)

        GetHtml()

 

    End Sub

 

 

    ''' <summary>

    '''     retrieves the web page from the web

    ''' </summary>

    Private Sub GetHtml()

    End Sub

 

    Public Property DownloadedHtml() As String

    End Property