Attention: We are retiring the ASP.NET Community Blogs. Learn more >

Text searching fun

I mentioned the other day I wanted to figure out a good way to search volumes of text in my forum without the full-text engine of SQL Server. I didn't really find a lot, other than the way that SQL Server does it (or apparently does it, I've not seen a really thorough explanation).

The obvious path is to break up every piece of text and create a table full of words, filtering out the "junk" words like "the" first. So I copied the posts from CoasterBuzz (about 440k posts) and gave that a shot with a quick prototype.

I started to get bored with the indexing process and stopped it at about 2 million rows. I'm not sure why, as I had no reason, but I was skeptical that this was going to be very fast. With an index on the Words column, I started running some queries in Query Analyzer, and what do you know, the searches were nearly instantaneous. Huh. Not the results I expected. I did some AND's and OR's, still fast.

So now I'm scratching my head wondering why I didn't try something like this, I don't know, years ago. I expected an endless tweaking effort to get performance to an acceptable level, and it just works. Stuff is never this easy! I assume that boards like vBulletin do something like this as well.

So now I just need to figure out some kind of word ranking scheme. For that I think I need to just look at existing topics that I as a human understand as relevant to a word, then apply that to some goofy algorithm.

Hooray for things being easy for a change, and hooray for SQL Server!

2 Comments

  • I'm a little confused on how you think you are reiventing the wheel with this.....



    isn't what you describe what SQL's Full-Text does? (and already does rankings)

  • SQL Server's ranking has no regard for the whether or not a word appears in the topic's title, the author's name or if the word appears in the first post in the topic.

Comments have been disabled for this content.