Why searching the Web is slow.

The Web is nothing more than a really large graph.  The problem is that when you process a node, you don't want to walk over to another node that you have already visited (within some constraint).  As I watch SQL Profiler, I see that my stored procedure for adding nodes to my Search Url table is literally being called hundreds of times per minute, yet I don't see URLs added at anywhere near the same rate.  Then **doink** it hit me.  Within my sproc, I check whether the URL already exists in my Search Url table, and if it does, I don't add it again.  The end result is that my routine expends significant processing power checking whether the URL already exists in the Search Url table before it ever adds the entry.  Why do things this way?  Well, I don't want to blindly add entries and only filter out duplicates when I pull them into the Search Results table.  If I did things that way, the spider would keep re-adding and re-visiting the same URLs, and I would most likely end up with infinite recursion, which would be bad.
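To give an idea of what the sproc is doing, here is a rough sketch of the check-then-insert logic; the table and column names are just placeholders, not my actual schema:

    -- Rough sketch only: table and column names are placeholders.
    CREATE PROCEDURE dbo.AddSearchUrl
        @Url NVARCHAR(450)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Only add the URL if it is not already in the table; this existence
        -- check is where the processing time goes.
        IF NOT EXISTS (SELECT 1 FROM dbo.SearchUrl WHERE Url = @Url)
        BEGIN
            INSERT INTO dbo.SearchUrl (Url, DateAdded)
            VALUES (@Url, GETDATE());
        END
    END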

Wally

4 Comments

  • What kind of index are you using for the URL field? How many records do you have in your table? I would think with a good index and a tuned database, it could do simple exist checks rather quickly. I mean, even if you had, say, 1 BILLION records in your database table, a good index would require, what, say 20 lookups? Granted, it's going to take some disk accesses, but you'd think with a fast disk and oodles of RAM it would still be fast....
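
    If the URL column doesn't already have one, something along these lines would give that exists check an index to seek on (names are placeholders, and the URL column has to be short enough to fit in an index key):

        -- A unique index both speeds up the existence check and keeps
        -- duplicate URLs from slipping in.
        CREATE UNIQUE NONCLUSTERED INDEX IX_SearchUrl_Url
            ON dbo.SearchUrl (Url);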

  • I am wondering: if you only check a page once, how do you cope with updates?



    Imagine you spider site X, and two days later they replace a big part of their site. The URL for site X is in your 'already spidered' table, so you won't check it again, but now the info in your 'search results' table is totally out of sync with the actual site, and you will never know, because even if the opportunity comes up to revisit the site, your algorithm will ignore it.

  • David,



    I used the term "within some constraint" to mean that there is some reason to go back, such as "not been checked in the last two weeks" or some other reason to go check it (something like the sketch at the end of this comment). At this time, I don't want to worry about re-tracing steps; I just want to fill up my tables with a bunch of records and start testing things.



    Wally
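
    Something like this is the shape of the constraint I have in mind (the LastChecked column is a placeholder, not my real schema):

        -- Hand back only URLs that have never been checked, or that have not
        -- been checked in the last two weeks.
        SELECT Url
        FROM dbo.SearchUrl
        WHERE LastChecked IS NULL
           OR LastChecked < DATEADD(WEEK, -2, GETDATE());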

  • do it in batches/asynchronously to insert fast:



    1) store your urls from your web spidering in some tab-delimited text file(s)

    2) bcp into your db into a temp or staging table(s) without indexes/triggers/keys

    3) schedule a sproc to move "good" values from the staging table to your main table every few minutes in one transaction
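
    Roughly like this; the server, file, table, and column names here are just made-up examples:

        -- Step 2: a bare staging table (no indexes, triggers, or keys) that bcp can load quickly.
        CREATE TABLE dbo.SearchUrlStaging (Url NVARCHAR(450) NOT NULL);

        -- Load the tab-delimited file from the command line, e.g.:
        --   bcp MyCrawlerDb.dbo.SearchUrlStaging in "c:\crawl\urls.txt" -c -T -S MYSERVER

        -- Step 3: a scheduled sproc that moves only new URLs into the main table
        -- in one transaction, then clears the staging table.
        CREATE PROCEDURE dbo.MoveStagedUrls
        AS
        BEGIN
            SET NOCOUNT ON;
            BEGIN TRANSACTION;

            INSERT INTO dbo.SearchUrl (Url, DateAdded)
            SELECT DISTINCT s.Url, GETDATE()
            FROM dbo.SearchUrlStaging AS s
            WHERE NOT EXISTS (SELECT 1 FROM dbo.SearchUrl AS u WHERE u.Url = s.Url);

            TRUNCATE TABLE dbo.SearchUrlStaging;

            COMMIT TRANSACTION;
        END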
