Saturday, April 23, 2005 9:59 AM jerdenn

Web Scraping in ASP.NET

I know that there are other samples of web scraping out there, but here's mine. One of my customers asked me how to scrape our ASP.NET Web application, so I thought that I might post the example code. I like the viewstate regex - it's my first time using lookarounds in a regular expression.

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Net;
using System.IO;

namespace Dennany.WebScrape {

  class MainClass {

    [STAThread]
    static void Main(string[] args) {

      try {
        // Modify as appropriate:
        const string baseUri = "http://remotewebhost/webpagedirectory/";
        const string loginDlgUri = baseUri + "LoginDlg.aspx";
        const string mainConsoleUri = baseUri + "Mainpage.aspx";
        const string username = "myuser";
        const string password = "p@ssw0rd";

        // This cookie container will persist the ASP.NET session ID cookie
        CookieContainer cookies = new CookieContainer();

        // Perform the first HTTP request against
        // the ASP.NET application login dialog.
        HttpWebRequest request =
          (HttpWebRequest)WebRequest.Create(loginDlgUri);

        // Attach the cookie container before sending the request;
        // a container assigned after GetResponse() does not capture
        // the session cookie.
        request.CookieContainer = cookies;

        // Get the response object, so that we may get the session cookie.
        HttpWebResponse response =
          (HttpWebResponse)request.GetResponse();

        // Read the incoming stream containing the login dialog page.
        StreamReader reader =
          new StreamReader(response.GetResponseStream());
        string loginDlgPage = reader.ReadToEnd();
        reader.Close();

     
        // Extract the viewstate value from the login dialog page.
        // We need to post this back,
        // along with the username and password.
        string viewState = GetViewState(loginDlgPage);

        // Build the postback string.
        // This string will vary depending on the page. The best
        // way to find out what your postback should look like is to
        // monitor a normal login using a utility like TCPTrace.
        // The user name and password are URL-encoded so that characters
        // such as '@' or '&' cannot corrupt the form data; GetViewState
        // already returns an encoded value.
        string postback =
          String.Format("__VIEWSTATE={0}&txtUserName={1}" +
            "&txtPassword={2}&txtMessage=&btnOK=OK",
            viewState,
            System.Web.HttpUtility.UrlEncode(username),
            System.Web.HttpUtility.UrlEncode(password));

        // Our second request is the POST of the username / password data.
        HttpWebRequest request2 =
          (HttpWebRequest)WebRequest.Create(loginDlgUri);
        request2.Method = "POST";
        request2.ContentType = "application/x-www-form-urlencoded";
        request2.CookieContainer = cookies;

        // Write our postback data into the request stream.
        StreamWriter writer =
          new StreamWriter(request2.GetRequestStream());
        writer.Write(postback);
        writer.Close();

        request2.GetResponse().Close();

        // Our third request is for the actual webpage after the login.
        HttpWebRequest request3 =
          (HttpWebRequest)WebRequest.Create(mainConsoleUri);
        request3.CookieContainer = cookies;

        reader =
          new StreamReader(request3.GetResponse().GetResponseStream());

        // And read the response.
        string page = reader.ReadToEnd();
        reader.Close();

        // Our webpage data is in the 'page' string.
        Console.WriteLine(page);
      }
      catch (Exception ex) {
        Console.WriteLine(ex);
      }
    }

    // Extract the viewstate data from a page.
    // Note: HttpUtility requires a reference to System.Web.dll.
    private static string GetViewState(string aspxPage) {
      Regex regex =
        new Regex("(?<=(__viewstate\".value.\")).*(?=\"./>)", RegexOptions.IgnoreCase);
      Match match = regex.Match(aspxPage);
      return System.Web.HttpUtility.UrlEncode(match.Value);
    }
  }
}
// EOF
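
To make the lookaround idea concrete, here is a minimal standalone sketch. The class name `LookaroundDemo`, the method name `ExtractViewState`, and the sample markup are assumptions made for illustration, and the pattern is a slightly stricter variant of the one in the listing (explicit whitespace and a lazy quantifier), not the listing's exact regex. The lookbehind and lookahead assert the surrounding markup without including it in the match, so `Match.Value` holds only the viewstate text.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundDemo
{
    // Returns the text that sits between the lookbehind and the lookahead.
    // The markup on either side is checked but is not part of the match.
    // Returns an empty string when the pattern is not found.
    public static string ExtractViewState(string html)
    {
        Regex regex = new Regex(
            "(?<=__viewstate\"\\s+value=\").*?(?=\"\\s*/>)",
            RegexOptions.IgnoreCase);
        return regex.Match(html).Value;
    }
}
```

Given a hypothetical input such as `<input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />`, the method should return just the value attribute's contents, which can then be URL-encoded before being posted back.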

Comments

# re: Web Scraping in ASP.NET

Saturday, April 23, 2005 11:23 AM by AndrewSeven

Have you considered using nUnitAsp? It will let you programmatically access the client-side content.

# re: Web Scraping in ASP.NET

Saturday, April 23, 2005 11:49 AM by Jerry Dennany

Yes, I did consider that, but it didn't fit the customer's requirements.

# re: Web Scraping in ASP.NET

Tuesday, April 26, 2005 8:54 PM by Jeff Atwood

I am perplexed by your regex. Wouldn't it be simpler and faster to capture that in a named group?

string s = "<input type=\"hidden\" name=\"__VIEWSTATE\" value=\"contents here\" />";
string pattern = "__viewstate[^>]+value=\"(?<Value>[^\"]*)";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MessageBox.Show(r.Match(s).Groups["Value"].Value);

but what do I know, I'm just a stupid VB.NET dev..

# re: Web Scraping in ASP.NET

Tuesday, April 26, 2005 10:37 PM by Jerry Dennany

Hi, Jeff -

I'd like to say that I did it the other way just because I wanted to practice lookarounds.

However, I'm only moderately competent at Regex. The reason I didn't use a named group is because I didn't know better.

# re: Web Scraping in ASP.NET

Wednesday, April 27, 2005 1:21 AM by Jeff Atwood

Not a problem at all; no offense intended. Regex groups, particularly when you use the named (?<MyGroup>xxx) form, make a lot of regex operations MUCH easier than they would otherwise be.
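
As an illustration of why named groups are convenient, the sketch below pulls every hidden form field out of a page in one pass. It is not code from the post: the class name `HiddenFields`, the method name `Extract`, and the group names `Name` and `Value` are hypothetical, and the pattern assumes the attributes appear in the order type, name, (optional others), value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HiddenFields
{
    // Collects name/value pairs from <input type="hidden" ...> tags.
    // Assumes the type attribute precedes name, and name precedes value.
    public static Dictionary<string, string> Extract(string html)
    {
        Regex regex = new Regex(
            "<input[^>]+type=\"hidden\"[^>]+name=\"(?<Name>[^\"]+)\"" +
            "[^>]*value=\"(?<Value>[^\"]*)\"",
            RegexOptions.IgnoreCase);

        Dictionary<string, string> fields = new Dictionary<string, string>();
        foreach (Match m in regex.Matches(html))
        {
            // Each named group is read back by name rather than by position.
            fields[m.Groups["Name"].Value] = m.Groups["Value"].Value;
        }
        return fields;
    }
}
```

Because each capture is addressed by name, adding another attribute to the pattern later does not shift any group indexes in the calling code.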

# re: Web Scraping in ASP.NET

Wednesday, May 04, 2005 10:29 AM by Kary

Is there any way I can scrape the HTML data that is posted by the server to the client after a POST?

# re: Web Scraping in ASP.NET

Wednesday, May 04, 2005 2:58 PM by Jerry Dennany

Kary -

You may continue to intercept the http stream using the pattern in the above code.
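
One way to read that suggestion is sketched below: instead of closing the response to the POST, read its stream the same way the listing reads the final page. The class name `PostScrape` and the helpers `ReadAll` and `PostAndRead` are hypothetical names introduced here, and UTF-8 is an assumed response encoding.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PostScrape
{
    // Reads a stream to the end as text (UTF-8 is an assumption).
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Sends URL-encoded form data with POST and returns the HTML that
    // the server sends back in response to that POST.
    public static string PostAndRead(string uri, string formData,
                                     CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        using (StreamWriter writer =
                   new StreamWriter(request.GetRequestStream()))
        {
            writer.Write(formData);
        }

        using (WebResponse response = request.GetResponse())
        {
            return ReadAll(response.GetResponseStream());
        }
    }
}
```

Passing the same `CookieContainer` to each call keeps the session cookie, so a later request is treated as part of the same logged-in session.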

# re: Web Scraping in ASP.NET

Thursday, May 05, 2005 11:58 AM by Kary

Jerry,

How do I do that?

# re: Web Scraping in ASP.NET

Friday, October 19, 2007 9:55 AM by kanif

I want to read links from an ASP page and write them on an .aspx page.

Please give me an exact example with code.

# re: Web Scraping in ASP.NET

Friday, April 11, 2008 12:19 AM by pham tran phu

Thanks, it's very helpful

# re: Web Scraping in ASP.NET

Sunday, June 08, 2008 11:17 AM by PEPPE

I too want to read links from an ASP page and write them on an .aspx page. Please give me an exact example with code.

regards

# re: Web Scraping in ASP.NET

Tuesday, September 09, 2008 2:59 AM by Carry

It is working fine in the case of an ASP.NET site, thanks.

Could you please also provide code for scraping a generic password-protected site (that is, one that is not an ASP.NET site)?

Thanks in advance.
