Text searching fun
I mentioned the other day
I wanted to figure out a good way to search volumes of text in my forum without the full-text engine of SQL Server. I didn't really find a lot, other than the way that SQL Server does it (or apparently does it, I've not seen a really thorough explanation).
The obvious path is to break up every piece of text and create a table full of words, filtering out the "junk" words like "the" first. So I copied the posts from CoasterBuzz
(about 440k posts) and gave that a shot with a quick prototype.
I started to get bored with the indexing process and stopped it at about 2 million rows. I'm not sure why, as I had no reason, but I was skeptical that this was going to be very fast. With an index on the Words column, I started running some queries in Query Analyzer, and what do you know, the searches were nearly instantaneous. Huh. Not the results I expected. I did some AND's and OR's, still fast.
So now I'm scratching my head wondering why I didn't try something like this, I don't know, years ago. I expected an endless tweaking effort to get performance to an acceptable level, and it just works. Stuff is never this easy! I assume that boards like vBulletin do something like this as well.
So now I just need to figure out some kind of word ranking scheme. For that I think I need to just look at existing topics that I as a human understand as relevant to a word, then apply that to some goofy algorithm.
Hooray for things being easy for a change, and hooray for SQL Server!