December 2004 - Posts
I mentioned the other day that I wanted to figure out a good way to search volumes of text in my forum without the full-text engine of SQL Server. I didn't really find a lot, other than the way that SQL Server does it (or apparently does it; I've not seen a really thorough explanation).
The obvious path is to break up every piece of text and create a table full of words, filtering out the "junk" words like "the" first. So I copied the posts from CoasterBuzz (about 440k posts) and gave that a shot with a quick prototype.
I started to get bored with the indexing process and stopped it at about 2 million rows. I'm not sure why, as I had no reason, but I was skeptical that this was going to be very fast. With an index on the word column, I started running some queries in Query Analyzer, and what do you know, the searches were nearly instantaneous. Huh. Not the results I expected. I did some ANDs and ORs, still fast.
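For what it's worth, here's a minimal sketch of that word-table idea. It's Python with an in-memory SQLite database standing in for SQL Server, and the table name, the tiny stop-word list, and the function names are all made up for illustration; none of it is the actual prototype.

```python
import re
import sqlite3

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "it"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop the junk words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def build_index(conn, posts):
    """Fill a (word, post_id) table and put an index on the word column."""
    conn.execute(
        "CREATE TABLE words (word TEXT NOT NULL, post_id INTEGER NOT NULL)"
    )
    conn.execute("CREATE INDEX ix_words_word ON words (word)")
    for post_id, text in posts.items():
        conn.executemany(
            "INSERT INTO words (word, post_id) VALUES (?, ?)",
            [(w, post_id) for w in tokenize(text)],
        )
    conn.commit()


def search_any(conn, terms):
    """OR search: posts that contain at least one of the terms."""
    terms = sorted(tokenize(" ".join(terms)))
    if not terms:
        return []
    marks = ",".join("?" * len(terms))
    rows = conn.execute(
        f"SELECT DISTINCT post_id FROM words WHERE word IN ({marks}) "
        "ORDER BY post_id",
        terms,
    )
    return [row[0] for row in rows]


def search_all(conn, terms):
    """AND search: posts that contain every one of the terms."""
    terms = sorted(tokenize(" ".join(terms)))
    if not terms:
        return []
    marks = ",".join("?" * len(terms))
    rows = conn.execute(
        f"SELECT post_id FROM words WHERE word IN ({marks}) "
        "GROUP BY post_id HAVING COUNT(DISTINCT word) = ? ORDER BY post_id",
        terms + [len(terms)],
    )
    return [row[0] for row in rows]
```

The AND query leans on each word being stored once per post, so counting distinct matching words per post tells you whether all of the terms were found.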
So now I'm scratching my head wondering why I didn't try something like this, I don't know, years ago. I expected an endless tweaking effort to get performance to an acceptable level, and it just works. Stuff is never this easy! I assume that boards like vBulletin do something like this as well.
So now I just need to figure out some kind of word ranking scheme. For that I think I need to just look at existing topics that I as a human understand as relevant to a word, then apply that to some goofy algorithm.
Hooray for things being easy for a change, and hooray for SQL Server!
Don't hate me because I'm ignorant... remember that I don't have a formal programming background. If I'm to understand generics correctly, using System.Collections.Generic.List<T> will be ridiculously faster than using an ArrayList, because everything is boxed/unboxed to and from an object in an ArrayList, right? Whereas List<T> is a collection of objects that are a predictable type?
In a related question, is BinarySearch() still the best method to find objects that have a particular property value? And I assume that the type I'm searching has to implement IComparer?
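A rough sketch of the idea behind that question, in Python rather than .NET, so treat it as an analogy and not the framework's API. The `Member` and `MemberIndex` names are hypothetical. The point is that a binary search only works when the list is already sorted by the same property you search on, using the same ordering for both.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Member:
    member_id: int
    name: str


class MemberIndex:
    """Keeps members sorted by member_id so lookups can use binary search."""

    def __init__(self, members):
        # Sort once by the property we intend to search on.
        self._members = sorted(members, key=lambda m: m.member_id)
        # Parallel list of keys, built once, that bisect can search directly.
        self._keys = [m.member_id for m in self._members]

    def find(self, member_id):
        """Return the member with this id, or None if there isn't one."""
        i = bisect.bisect_left(self._keys, member_id)
        if i < len(self._keys) and self._keys[i] == member_id:
            return self._members[i]
        return None
```

If you need lookups by several different properties, a dictionary keyed on each property is usually simpler than keeping several sorted copies around.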
I was happy to see that a federal court upheld the notion that states shouldn't be taxing voice-over-IP service. This is the correct decision.
States like New York are trying to make the case that it's just like any other phone service, but that's not even remotely true. Heavy regulation of traditional phone service is justified because it's essentially a limited resource and a natural monopoly. It's not like every company can string up a phone network. Cable TV is regulated under the same premise.
Have you looked at your phone bill lately? I'm absolutely astounded at the number of taxes. They take up one entire page of the bill now. It's ridiculous. The only thing that has kept me from flipping to Vonage is that, for the moment, I can't carry over my phone number... yet.
I'm looking for articles that explore text searching strategies. I've read a lot of general ideas about creating word indexes and giving the words rank based on frequency, and referencing those indexed words to the database records that contain the full text. I'd like to read something with a little meat to it regarding performance and such.
Anyone personally take a shot at this kind of thing? And no, I'm not looking for, "I just use SQL Server's full-text indexing." :)
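To make the "rank based on frequency" idea concrete, here's a small Python sketch. The scoring rule (matches divided by total words in the post) is just an assumption picked for illustration, and the function names are hypothetical; a real scheme would probably weigh things like titles and rare words differently.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word appears in a piece of text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def rank_posts(posts, query):
    """Return post ids ordered by how large a share of each post the
    query words make up, best match first. Posts with no match are left out."""
    terms = re.findall(r"[a-z0-9']+", query.lower())
    scores = {}
    for post_id, text in posts.items():
        counts = word_counts(text)
        total = sum(counts.values())
        if total == 0:
            continue
        score = sum(counts[t] for t in terms) / total
        if score > 0:
            scores[post_id] = score
    # Highest score first; ties fall back to the lower post id.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))
```

Dividing by the post length keeps a long, rambling post from outranking a short one that is actually about the word.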
Related to that, I'm curious if anyone has advice on having a Web application communicate with a thread it launched. Please, let's not go into the case against launching new threads from a Web app. The kind I'm thinking of (like indexing text) would be run periodically on a timer from an HttpModule, not user initiated stuff.
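One common shape for this, sketched in Python instead of ASP.NET: the worker thread runs a job on a timer and writes its status into fields guarded by a lock, and the web side only ever reads a snapshot of those fields or sets a stop flag. The `PeriodicWorker` class and its method names are hypothetical, and this is not how an HttpModule would be wired up.

```python
import threading


class PeriodicWorker:
    """Runs a job every interval_seconds on a background thread."""

    def __init__(self, interval_seconds, job):
        self._interval = interval_seconds
        self._job = job
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._runs = 0
        self._last_result = None
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def _loop(self):
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop runs the job once per interval until asked to stop.
        while not self._stop.wait(self._interval):
            result = self._job()
            with self._lock:
                self._runs += 1
                self._last_result = result

    def status(self):
        """Called from request code: returns a snapshot, never live state."""
        with self._lock:
            return {
                "runs": self._runs,
                "last_result": self._last_result,
                "running": self._thread.is_alive(),
            }

    def stop(self):
        """Signal the loop to exit and wait for the thread to finish."""
        self._stop.set()
        self._thread.join()
```

The request code never touches the thread directly. It asks for `status()` or calls `stop()`, which keeps the shared state small and easy to reason about.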
Your opinions and knowledge are, as always, appreciated.
In my last post about open-source and documentation, Chris Martin makes the assertion that: "If you don't like the way something is implemented, you do it yourself and it ends up in the distribution of said software... Instead of complaining, you should contribute."
I'm not even sure where to begin with that one. I would say more than half of the projects I've ever encountered on SourceForge don't have an ounce of documentation. I'm the last person in the world who believes every project needs a scope document and a stack of use cases, but if you can't at least write some basic documentation to get me started, why should I be interested in continuing or improving upon your work?
Furthermore, this "contribute instead of complain" nonsense is laughable. The biggest open-source zealots say this kind of thing all of the time, and I can't help but wonder what their time is worth. I can do 30 to 40 hours a week for work, but the rest of the time goes to family, friends, and even getting my ass kicked on Halo 2 from time to time. If those work hours aren't generating revenue in some way, I'm really not interested. The good cause of freesoftwaredom is not even on my radar. If there's not a good open-source implementation of something, I'm perfectly fine with paying someone to do the work and hold them accountable.
S Dot One says: "You got the ULTIMATE documentation with open-source... The source code itself."
If I had a dollar for every time I heard that in an agile workshop, I'd live somewhere more tropical than Cleveland. It's the worst cop-out in software development today. Yes, it's true that in an agile/XP environment the code is generally simple enough that you should be able to read it and understand what it does. I get that. We heard that over and over again in a stint I had at The Second Largest Auto Insurance Company. What no one ever explained is how that translated to some kind of context for the use of said code.
The truth is that no matter how narrowly focused a piece of code is, it's rarely something that you can consider language/platform neutral, and there's almost always a different way to do exactly the same thing. So when it comes time to revisit the code, revise it, change it, whatever, you're left scratching your head because you have no idea what context the code was running in, or what business problem it was trying to solve. I saw it happen even in short iterations from an agile team.
I guess what bugs me the most is that people don't speak up and form their own opinions about things like open-source or agile. I don't know if it's fear of retaliation, fashion addiction, consultant backlash, or what.
Am I against open-source software? Of course not. I've been giving away POP Forums for about a year. I also document all 600+ classes, properties and methods. I don't remember what I wrote and what my own reasoning was; I certainly don't expect someone else to guess.
When you get bounced out of Hotmail you of course land on the MSN page, which has links to top stories on MSNBC. So they were plugging the amateur video from the tsunamis that sadly have killed nearly 30,000 people at this point, and I was curious to see just what this looked like.
I bounce on over there and launch the video clip. Wouldn't you know it, you can only view it with Internet Explorer. Could that be any more lame? I can deal with having to use Windows Media Player, but MSNBC wants to force me to use IE? And if I'm using a Mac I should do what?
For all of the crap that Microsoft has taken for exercising its monopoly power to dominate certain markets, and given the pending litigation in Europe, I can't believe they would allow something this lame to occur.
Sure, it's their business, and they can require users to use whatever software they choose. However, when you start doing this with news media, you're doing little but giving your critics the fuel they need to blast you some more. Controlling the distribution of news media in this manner is a real kick in the nuts to the journalists that risk their lives to get the story.
I have to say that the .NET world is fortunate to have a lot of open-source stuff available. I can't even tell you how much I like NUnit.
The problem I find, however, with a lot of open-source software is the complete lack of reasonable documentation. That drives me nuts, although I'm not entirely surprised. It's one thing to give away and share your work, but that scenario doesn't exactly provide a ton of incentive to document it properly.
NUnit ended up being useful to me I think because it was in use at a project I was on, and it essentially has its own book. Most stuff I've encountered doesn't have that luxury.
I'm not suggesting even for a moment that the world would be a better place without these projects, it's just that the price of entry is kind of high for something that is monetarily free.
Yep, the more I think about it, the more I think I want to write another book. I've got an idea that I think will sell, I can write it reasonably faster than the last one, and I'm feeling some enthusiasm for it. I'm going to put together a proposal and see if anyone is interested.
As my role in Maximizing ASP.NET winds down leading to release, I have to say that working with the folks at Addison-Wesley has been a really good experience. They bend over backward to give you the support you need. I felt even before this that they had the best ASP.NET titles on the shelf, and I'm really excited that it was them that picked up the project.
On a side note, I noticed at Borders last week that there aren't really as many ASP.NET books on the shelf as there used to be (casual observation, not a scientific statement). Before Wrox went down the tubes and was sold, there were a lot of really quality niche books that covered specific areas (threading, text manipulation, performance, etc.), and I don't think those areas are being served now. Granted, with such a narrow focus, I don't know what the market is for those books. Did they even sell three or four thousand copies? I get the impression that it's hard to justify publication of anything that doesn't at least get into the five-digit count.
I love ArrayLists. I find them to be among the most useful collections in the .NET Framework. I remember seeing a discussion somewhere a few months ago about making ArrayLists into strongly-typed collections. This was achieved by simply inheriting ArrayList and overriding the Add/Insert methods to make sure the objects being added were a particular type.
I don't know enough about what's going on under the hood to know if there's a performance penalty involved with this. Is checking the type of an object an expensive process?
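Here's the shape of that inherit-and-check approach as a small Python analogy. It is not .NET code, and `TypedList` is a made-up name. The add and insert methods are overridden to check the type of each item before handing it to the base class.

```python
class TypedList(list):
    """A list that only accepts items of one type through append/insert.

    Sketch only: other ways of changing a list (extend, slice assignment,
    +=) are not guarded here.
    """

    def __init__(self, item_type, items=()):
        super().__init__()
        self._item_type = item_type
        for item in items:
            self.append(item)

    def _check(self, item):
        # A single isinstance call per added item is the whole cost of the check.
        if not isinstance(item, self._item_type):
            raise TypeError(
                f"expected {self._item_type.__name__}, "
                f"got {type(item).__name__}"
            )

    def append(self, item):
        self._check(item)
        super().append(item)

    def insert(self, index, item):
        self._check(item)
        super().insert(index, item)
```

The check runs once per added item and nothing extra happens on reads, so whatever penalty there is lands on the write path only.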
I mentioned the other day that I was going to revisit the text parsing engine of POP Forums and essentially start over. What a difference that made. In less than a day I turned years of crap upon crap into something much leaner, about a third less code. I got there with about twice the unit tests that I originally had. Around a dozen regular expressions took care of all of my line break and blockquote woes that I kind of alluded to in that last post. I started with the first test, and kept working through them until they all passed. I don't know if it's the most elegant thing ever, but it appears to work. I dropped it into two of my production sites and so far, so good (yeah, TDD makes you that confident).
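To give a flavor of the kind of rules involved, here's a tiny Python sketch with a handful of regular expressions. The `[quote]` tag syntax, the output markup, and the `format_post` name are assumptions for illustration; the real POP Forums parser is C# and handles far more cases.

```python
import html
import re


def format_post(raw):
    """Turn raw forum text into HTML with line breaks and blockquotes."""
    # Escape markup characters first so user text can't inject HTML.
    text = html.escape(raw, quote=False)
    # Normalize Windows and old Mac line endings to a single newline.
    text = re.sub(r"\r\n?", "\n", text).strip()
    # Collapse runs of three or more newlines down to two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Convert [quote]...[/quote] pairs (nested quotes are not handled).
    text = re.sub(
        r"\[quote\](.*?)\[/quote\]",
        r"<blockquote>\1</blockquote>",
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    # Drop whitespace around blockquote tags so no stray breaks surround them.
    text = re.sub(r"\s*(</?blockquote>)\s*", r"\1", text)
    # Whatever newlines remain become explicit line breaks.
    return text.replace("\n", "<br />")
```

Writing the expected output for a few ugly inputs first, then adding one rule at a time until they all pass, is the same test-first loop described above.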
Since this entire exercise is really about arriving at the next version, I can now think about features. The big question is, do I want to endeavor into the world of allowing different text colors, and perhaps text sizes? On the pro side, it would be something other forums already offer. That's the entire list for the pro side.
On the con side, I have to deal with different implementations of rich text editors, decide how best to present the changes (span tags, probably), decide if the various heading tags make the most sense, and above all, know that I'll be responsible for some forum where I see something like: o my f***ing god!!!!!!!111 u suX0rZ!!!!111
One must be a responsible code monkey, after all!