I checked to see how CoasterBuzz's search
index was going today, and was pretty shocked to see that it now as
about 15 million rows. Can't say that I've personally written anything
that generated that much data on its own before.
Once upon a time, POP Forums
used SQL's FullText service to index posts. The frustration that came
out of that black box was two-fold: It was grinding the disk and CPU,
and frankly the results sucked. Long before I ever got v8 into a usable
state, I prototyped the current search engine against data from the old
version.
Building something like that is one of those exercises
where you worry about all kinds of scalability issues instead of just
trying to build something and refactor when you fail the first time.
That's why it took me so long to just try something. Eventually I
passed that brain block and wrote something, which predictably sucked.
I kept refactoring until I had a workable solution.
The solution
went like this. Find all of the words in a topic, toss out junk words
like "the" and other things people won't search for, and score them
based on frequency. Bonus points if it lived in the topic's title. Then
save the words, along with their topic ID and score.
To search,
simply find the topics that rank highest by averaging the rank for each
word. There is an absolutely horrible query built ad-hoc in the data
query that does it. It's partly ugly because it has to page the data,
so there are some weird common table expressions being formed. It's so
hard to read that I'm not even sure where I'd start to refactor it! But
despite this, it works surprisingly well.
I think the one thing
I'd tweak is the scoring, but aside from that the searching part works
pretty well. I'm sure that I'm not the first to think of it. The joy
comes from the fact that SQL Server is fast enough to get the work
done. One of these days I'll see if I can get a guru to look harder at
it and see how it can be made even faster.