Howto Web Search with .NET

Here is my initial thinking on the database tables in my Web Search with .NET application.  This is a first design and by no means my final one.  I chose this setup initially because I wanted to perform full-text lookups on as few rows as possible and to perform as few operations against the Search Results table as possible.  There are two tables in this application:

  1. Search Urls.  The Search Urls table is the list of candidate URLs to search through.  This table contains a bigint primary key, the Url, ServerName, HashCode for the URL, the date the row was entered, and the date the row was last updated.
  2. Search Results.  The Search Results table contains a bigint primary key, the Url, contents of URLs that have been retrieved, server name, HashCode for the URL, date the row was entered, and the date the row was last updated.
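
As a rough illustration of the two tables, here is a minimal sketch using Python's built-in sqlite3 module instead of the .NET stack and database the application actually runs on.  The exact table and column names, the index on HashCode, the choice of CRC32 as the hash, and the add_search_url helper are all assumptions made for the example.  The post only lists the kinds of columns, not their definitions.

```python
import sqlite3
import zlib
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Hypothetical schema: names and types are guesses based on the column
# list in the post. INTEGER PRIMARY KEY stands in for the bigint key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS SearchUrls (
    SearchUrlId INTEGER PRIMARY KEY,
    Url         TEXT    NOT NULL,
    ServerName  TEXT    NOT NULL,
    HashCode    INTEGER NOT NULL,
    DateEntered TEXT    NOT NULL,
    DateUpdated TEXT    NOT NULL
);
CREATE INDEX IF NOT EXISTS IX_SearchUrls_HashCode ON SearchUrls (HashCode);

CREATE TABLE IF NOT EXISTS SearchResults (
    SearchResultId INTEGER PRIMARY KEY,
    Url         TEXT    NOT NULL,
    Contents    TEXT    NOT NULL,
    ServerName  TEXT    NOT NULL,
    HashCode    INTEGER NOT NULL,
    DateEntered TEXT    NOT NULL,
    DateUpdated TEXT    NOT NULL
);
CREATE INDEX IF NOT EXISTS IX_SearchResults_HashCode ON SearchResults (HashCode);
"""


def url_hash(url: str) -> int:
    """Assumed hash function: CRC32 of the UTF-8 encoded URL."""
    return zlib.crc32(url.encode("utf-8"))


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_search_url(conn: sqlite3.Connection, url: str) -> bool:
    """Insert url into SearchUrls unless it is already there.

    The indexed integer HashCode narrows the lookup to a few rows, and the
    Url comparison rules out hash collisions. Returns True if a row was added.
    """
    code = url_hash(url)
    found = conn.execute(
        "SELECT 1 FROM SearchUrls WHERE HashCode = ? AND Url = ?",
        (code, url),
    ).fetchone()
    if found is not None:
        return False
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO SearchUrls (Url, ServerName, HashCode, DateEntered, DateUpdated)"
        " VALUES (?, ?, ?, ?, ?)",
        (url, urlsplit(url).hostname or "", code, now, now),
    )
    return True
```

Looking up the integer hash first and comparing the full Url second is one plausible reason to store a HashCode column at all: an index on a small integer is cheaper to search than an index on long URL strings.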

Here is how things go:

  1. Get a value from the Search Urls table.
  2. Flip the value of the UrlStatus to “SEARCHING” for that Url.
  3. Check to see if that Url has already been searched.  If so, go get another Url and start over.
  4. Retrieve the contents of that Url.  Currently, this is done using the .NET WebClient class.  This was a bad choice on my part, which will become apparent in my ThreadPool post, which is up next.
  5. Parse the retrieved contents for links and put them into the Search Urls table.  Insertion is done by a stored procedure that checks whether the URL already exists.
  6. Put the contents of that Url into the Search Results table.
  7. Repeat until the entire Web is searched (or somebody hits the “StopSearching” button).  ;-)
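
The steps above can be sketched as a single-threaded loop.  This is a language-neutral illustration in Python, not the .NET code from the application: two dictionaries stand in for the two tables, the fetch function is passed in so the loop can run without a network, and the "PENDING", "DONE", and "FAILED" status values are invented for the example (only "SEARCHING" comes from the steps above).

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects absolute http(s) links from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def fetch_with_urllib(url, timeout=10):
    """One possible fetch function; decodes the body leniently as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, fetch, max_pages=100, should_stop=lambda: False):
    search_urls = {url: "PENDING" for url in seed_urls}  # stands in for Search Urls
    search_results = {}                                  # stands in for Search Results

    while len(search_results) < max_pages and not should_stop():
        # Step 1: get a URL that has not been processed yet.
        url = next((u for u, s in search_urls.items() if s == "PENDING"), None)
        if url is None:
            break
        # Step 2: flag it so it is not picked up again.
        search_urls[url] = "SEARCHING"
        # Step 3: skip URLs whose contents are already stored.
        if url in search_results:
            search_urls[url] = "DONE"
            continue
        # Step 4: retrieve the contents.
        try:
            contents = fetch(url)
        except OSError:  # urllib's URLError is a subclass of OSError
            search_urls[url] = "FAILED"
            continue
        # Step 5: parse out links and insert only the ones not seen before.
        parser = LinkParser(url)
        parser.feed(contents)
        for link in parser.links:
            search_urls.setdefault(link, "PENDING")
        # Step 6: store the contents.
        search_results[url] = contents
        search_urls[url] = "DONE"
        # Step 7: loop until the queue is empty, the cap is hit, or a stop is requested.

    return search_urls, search_results
```

Calling crawl(["http://example.com/"], fetch_with_urllib, max_pages=10) would run it against a live site.  The should_stop callback plays the role of the “StopSearching” button.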

On my laptop, this runs pretty well.  It saturates my cable modem at home, and the DSL line at the hotel here in Washington DC, fairly quickly.  Given the limits I have and the environment I have to work with, I think performance is pretty good.  For example, this morning, when the world was fairly quiet, I was pulling back about 250 URLs per minute for insertion into the Search Results table and adding about 2400 URLs per minute to the Search Urls table.  This is with about 250,000 rows in the Search Results table and about 3.8 million in the Search Urls table.  Adding a single URL to the Search Urls table takes about 10-20 msec per insert.
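
A quick back-of-the-envelope check shows how those figures relate to each other.  The sketch below uses only the numbers quoted above and assumes that the inserts run one after another and that each of the 2400 URLs costs exactly one insert call.  Neither assumption is stated in the post.

```python
# Figures quoted in the paragraph above.
pages_per_minute = 250          # rows added to Search Results
new_urls_per_minute = 2400      # rows added to Search Urls
insert_ms_low, insert_ms_high = 10, 20


def insert_seconds_per_minute(inserts_per_minute, ms_per_insert):
    """Seconds out of each minute spent inserting, if inserts are sequential."""
    return inserts_per_minute * ms_per_insert / 1000


low = insert_seconds_per_minute(new_urls_per_minute, insert_ms_low)
high = insert_seconds_per_minute(new_urls_per_minute, insert_ms_high)
new_urls_per_page = new_urls_per_minute / pages_per_minute
pages_per_second = pages_per_minute / 60

print(f"Insert time per minute: {low:.0f}-{high:.0f} s")
print(f"New URLs found per retrieved page: {new_urls_per_page:.1f}")
print(f"Pages retrieved per second: {pages_per_second:.1f}")
```

Under those assumptions, between 24 and 48 seconds of every minute go to inserting into the Search Urls table, and each retrieved page contributes roughly 9.6 new URLs, so the URL insert path would account for a large share of the running time.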

Wally

2 Comments

  • Could you release the code as closed source, so that some people can install it on their PCs and FTP you the results?  Similar to a P2P network.

  • Stefan,

    Right now, I am trying to fix a few problems I have with the Search. It works pretty well, but there are some rough spots and mistakes I have made along the way. I haven't even looked at licensing issues. Don't think it would be a big money maker, but I want to make sure I do things correctly. I have to clean this stuff up, and wait on MS to release a couple of goodies. :-)

    Wally
