Attention: We are retiring the ASP.NET Community Blogs. Learn more >

Forum search win

I checked to see how CoasterBuzz's search index was going today, and was pretty shocked to see that it now as about 15 million rows. Can't say that I've personally written anything that generated that much data on its own before.

Once upon a time, POP Forums used SQL's FullText service to index posts. The frustration that came out of that black box was two-fold: It was grinding the disk and CPU, and frankly the results sucked. Long before I ever got v8 into a usable state, I prototyped the current search engine against data from the old version.

Building something like that is one of those exercises where you worry about all kinds of scalability issues instead of just trying to build something and refactor when you fail the first time. That's why it took me so long to just try something. Eventually I passed that brain block and wrote something, which predictably sucked. I kept refactoring until I had a workable solution.

The solution went like this. Find all of the words in a topic, toss out junk words like "the" and other things people won't search for, and score them based on frequency. Bonus points if it lived in the topic's title. Then save the words, along with their topic ID and score.

To search, simply find the topics that rank highest by averaging the rank for each word. There is an absolutely horrible query built ad-hoc in the data query that does it. It's partly ugly because it has to page the data, so there are some weird common table expressions being formed. It's so hard to read that I'm not even sure where I'd start to refactor it! But despite this, it works surprisingly well.

I think the one thing I'd tweak is the scoring, but aside from that the searching part works pretty well. I'm sure that I'm not the first to think of it. The joy comes from the fact that SQL Server is fast enough to get the work done. One of these days I'll see if I can get a guru to look harder at it and see how it can be made even faster.

8 Comments

  • Indexing can be one of the most challenging things I've come across when trying to come up with a "home-brew" method. If you'd really like to run it past some guru's, we have a couple of SQL Server MVP's (along with some other very talented SQL folk) over the LessThanDot arena (http://forum.lessthandot.com/viewforum.php?f=17) who would love the challenge of looking over it and looking for improvements (I'd be very surprised if they couldn't tweak it!). I hope you consider posting it there as a bit of a challenge for us!

  • Isn't that algorithm basicly doing what full text search indexing is doing as well?

    I don't have problems with full text search in sqlserver, though you might want to give the sqlserver engine enough system memory to keep the index in-memory completely.

    There's been a lot of research in this area in the past decades, as well as many ideas how to do the perfect search engine. IMHO, re-doing it with a home-brew algo is nice for times when you've nothing better to do or are just interested, but if you have to solve the problem for a system to be released, I'd say pick one algo from the shelve, or an engine ready to use and use that one instead.

  • Well, just because you don't have problems with it doesn't make my experience go away. The results were consistently irrelevant and it was a resource hog.

    I'm all for "getting real" and deploying product, but exercises like this (which amounted to a day or two of work three years ago) to me seems like time well spent.

  • Which SQL Server version were you using? I believe 2008 has FT enhancements.

  • When I started it was 2000, but these days I use 2005. I don't encounter a lot of people using 2008 yet.

  • Did you try Lucene.NET?

  • Nope. Again, I have a solution that I'm content with.

  • I realize you have a solution that you like, but I think SSIS's term generation deserves a quick plug here. It has several options for what to consider a term and how to score them. We use term frequency / inverse document frequency for our automatic tag cloud term generation. Hmm, maybe I should throw up a post about this.

Comments have been disabled for this content.