
Archives / 2004 / January
  • Scott Mitchell Articles on MSDN

    I had a chance to sit down and read Scott Mitchell's articles on MSDN regarding data structures in .NET.  First off, I like their content.  While I have been programming professionally for 14.5 years, I have a BS and MS in Electrical Engineering, not in Computer Science.  As a result, I sometimes miss certain basic items, and it was good to read the information in these articles.  Secondly, I like the fact that he spent some time focusing on algorithms and how long operations take.  Algorithmic running time is an area that is very near and dear to my heart, as I see a lot of programmers implementing algorithms that are suboptimal.
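
    In that spirit, here is a minimal C# sketch (my own illustration, not from the articles) contrasting an O(n) lookup with an O(1) lookup; the collection sizes are arbitrary:

        using System;
        using System.Collections;

        class LookupCost
        {
            static void Main()
            {
                // O(n) per lookup: ArrayList.Contains walks the list front to back.
                ArrayList list = new ArrayList();
                // O(1) per lookup on average: Hashtable hashes the key straight
                // to a bucket.
                Hashtable table = new Hashtable();

                for (int i = 0; i < 100000; i++)
                {
                    list.Add(i);
                    table[i] = true;
                }

                // Checking 1,000 values against the ArrayList costs roughly
                // 1,000 * 100,000 comparisons; against the Hashtable, about 1,000.
                for (int i = 0; i < 1000; i++)
                {
                    bool inList  = list.Contains(i);      // linear scan
                    bool inTable = table.ContainsKey(i);  // single hash probe
                }
            }
        }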

  • Why Microsoft .NET Presentation

    Just thought I would share with everyone a Microsoft PowerPoint presentation that I did a few months ago regarding what Microsoft .NET is to me and why a company should be interested in it.  While it doesn't preach the Web Services Everywhere manifesto that I hear from many people, it does seem to hit the major issues that organizations have.

  • Server Side Cursors

    One of the things that I always found interesting when looking at someone else's code in Classic ADO 2.x was the number of developers who misused the cursor and locking options on a Classic ADO Recordset when running against Sql Server.  There are some situations where there is a genuine need for a scrollable, updatable server-side cursor, but I would say about 50% of the time that I see one, it is not necessary.  Well, with .NET Whidbey, ADO.NET will have a scrollable, updatable server-side cursor in the framework.  The advantage of the .NET version is that it is not directly associated with the SqlDataReader or the SqlDataAdapter, so it will be harder to misuse.  This is unlike Classic ADO 2.x, where creating a scrollable, updatable server-side cursor was just one of several options on the Recordset object.
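
    For reference, here is a rough C# sketch of the two choices in Classic ADO 2.x, via the ADODB COM interop; the connection string and table are just placeholders:

        using ADODB;  // COM interop reference to the Classic ADO 2.x library

        class CursorDemo
        {
            static void Main()
            {
                Connection cn = new ConnectionClass();
                cn.Open("Provider=SQLOLEDB;Data Source=.;Initial Catalog=pubs;" +
                        "Integrated Security=SSPI", "", "", 0);

                Recordset rs = new RecordsetClass();
                rs.CursorLocation = CursorLocationEnum.adUseServer;

                // The expensive combination: a scrollable, updatable server-side
                // cursor.  Sql Server holds state for this cursor for as long as
                // the Recordset stays open.
                rs.Open("SELECT * FROM authors", cn,
                        CursorTypeEnum.adOpenKeyset,
                        LockTypeEnum.adLockOptimistic,
                        (int)CommandTypeEnum.adCmdText);

                // The cheap default most code should have used: a forward-only,
                // read-only "firehose" cursor.
                // rs.Open("SELECT * FROM authors", cn,
                //         CursorTypeEnum.adOpenForwardOnly,
                //         LockTypeEnum.adLockReadOnly,
                //         (int)CommandTypeEnum.adCmdText);

                rs.Close();
                cn.Close();
            }
        }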

  • Managed Threads (MT) vs. ThreadPool Threads (TP)

    As I was first building my Web Spider, I figured that the easiest thing to build the spider with would be the TP.  So based on my previous ramblings, I was disappointed by the fact that the WebClient also used the TP to retrieve its results, even when used in a synchronous fashion.  This effectively cut my possible performance in half.  Add to this the fact that the TP in .NET only supports 25 threads per CPU at any one moment, and I was doubly frustrated.  The result was that I could effectively run only 12.5 downloads per CPU on my development system, since each one consumed two pool threads.  I just knew that if I could switch to managed threads, I would be able to run 25 threads per CPU (based on the WebClient in System.Net).  While I am also constrained by the bandwidth at my office, I knew that the additional threads would let me smooth out the dips that occurred when the TP version wasn't able to access the network due to other work that was going on.  I just knew that I could outsmart the TP scheduling mechanism, which will only allocate a specific number of threads based on system resources.
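
    Here is a minimal sketch of the managed-thread version I have in mind; GetNextUrl and StoreResult are hypothetical stand-ins for the real database plumbing:

        using System;
        using System.Net;
        using System.Threading;

        class SpiderHost
        {
            const int ThreadCount = 25;  // the target: 25 threads per CPU

            static void Main()
            {
                Thread[] workers = new Thread[ThreadCount];
                for (int i = 0; i < ThreadCount; i++)
                {
                    // Dedicated managed threads; nothing here competes with
                    // the 25-per-CPU ThreadPool limit.
                    workers[i] = new Thread(new ThreadStart(FetchLoop));
                    workers[i].IsBackground = true;
                    workers[i].Start();
                }
                for (int i = 0; i < ThreadCount; i++)
                    workers[i].Join();
            }

            static void FetchLoop()
            {
                WebClient client = new WebClient();
                string url;
                while ((url = GetNextUrl()) != null)
                {
                    byte[] data = client.DownloadData(url);
                    StoreResult(url, data);
                }
            }

            // Hypothetical stand-ins for the database plumbing.
            static string GetNextUrl() { /* pull from the Search URL table */ return null; }
            static void StoreResult(string url, byte[] data) { /* insert into Search Results */ }
        }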

  • Full-Text Search with Yukon

    While I was sitting here fiddling with things this evening, I decided to do a little test to see just how well full-text search works in Yukon.  Man, I was blown away.  Granted, I don't have millions of rows in my table to search through, but I do have a system with about 70,000 rows set up for full-text search, and I am adding to them at a rate of about 40 URLs per minute (hey, I am bandwidth constrained at my office).  I decided to do a full-text lookup for 'President Bush' on the Yukon database while simultaneously running the spider, having already set up a full-text index in Yukon.  In just a few seconds, I got about 200 rows back.  The same test on my other system, which is running Sql Server 2000 and is still taking hours to build its full-text index, resulted in a query that took several orders of magnitude longer to search a table with about 700,000 rows in it.  Now, I realize that this is not a fair comparison, for several reasons.  Once the full-text index is built on my Sql Server 2000 system, I am going to run a real comparison.  No, I am not going to post the complete results.
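
    For the curious, the lookup boils down to a CONTAINS query.  A hedged sketch follows; the SearchResults table and PageText column names are illustrative guesses, not my actual schema:

        using System;
        using System.Data.SqlClient;

        class FullTextLookup
        {
            static void Main()
            {
                // A phrase search: the double quotes inside the single quotes
                // tell CONTAINS to match the exact phrase.
                string sql = "SELECT UrlAddress FROM SearchResults " +
                             "WHERE CONTAINS(PageText, '\"President Bush\"')";

                using (SqlConnection cn = new SqlConnection(
                    "Data Source=.;Initial Catalog=WebSearch;Integrated Security=SSPI"))
                {
                    cn.Open();
                    SqlCommand cmd = new SqlCommand(sql, cn);
                    SqlDataReader rdr = cmd.ExecuteReader();
                    while (rdr.Read())
                        Console.WriteLine(rdr.GetString(0));
                    rdr.Close();
                }
            }
        }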

  • Size Limits of Sql Server Indexed columns

    FYI, there is a limit on the size of a column that can be used for an index with Sql Server (and I assume other databases have similar limits).  With Sql Server, a column over 900 bytes in size cannot be indexed.  The limit actually applies to the entire index key: the combined size of all of the columns in the key cannot exceed 900 bytes.  I tried to index my UrlAddress field, which is defined as varchar(4096), and I got a nice message box saying that this was not possible.
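
    A minimal sketch of the failing case; the SearchUrl table name and the connection string are placeholders:

        using System;
        using System.Data.SqlClient;

        class IndexLimit
        {
            static void Main()
            {
                using (SqlConnection cn = new SqlConnection(
                    "Data Source=.;Initial Catalog=WebSearch;Integrated Security=SSPI"))
                {
                    cn.Open();
                    // UrlAddress is varchar(4096): its maximum size alone blows
                    // past the 900-byte index key limit, so Sql Server refuses
                    // the index (or at the very least warns about it, depending
                    // on the tool you use).
                    SqlCommand cmd = new SqlCommand(
                        "CREATE INDEX IX_SearchUrl_UrlAddress " +
                        "ON SearchUrl (UrlAddress)", cn);
                    try
                    {
                        cmd.ExecuteNonQuery();
                    }
                    catch (SqlException ex)
                    {
                        Console.WriteLine(ex.Message);  // the 900-byte complaint
                    }
                }
            }
        }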

  • First thoughts on Yukon

    It seems that MS Sql Server Yukon, even at this early stage, makes better use of memory than Sql Server 2000.  I have been working with my Web Search routines, and the database engine just seems to use less memory than Sql Server 2K.  Granted, I am running on two separate machines, but both systems have 1 GB of RAM.  Who would have thunk it: a new piece of software, from anyone, that uses less memory than the previous version.

  • Why searching the Web is slow.

    The Web is nothing more than a really large graph.  The problem is that when you reach a node, you don't want to walk over to another node that you have already visited (within some constraint).  As I watch Sql Profiler, I see that my stored procedure for adding nodes to my Search URL table is literally being called hundreds of times per minute, yet I don't see URLs added at anywhere near the same rate.  Then **doink** it hit me.  Within my sproc, I check whether the URL already exists in the Search URL table, and I only add it if it does not.  The end result is that my routine expends significant processing power searching for the URL in the Search URL table before it ever adds an entry.  Why do things this way?  Well, the alternative would be to add entries unconditionally and only check for duplicates before moving them into the Search Results table.  If I did things that way, I would most likely end up with infinite recursion, which would be bad.
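
    Here is a rough sketch of the shape of that sproc; the names (AddSearchUrl, SearchUrl) are illustrative, not my actual code:

        using System;
        using System.Data.SqlClient;

        class UrlDeduplication
        {
            // Sketch of the sproc body; table and procedure names are assumptions.
            const string AddUrlSproc = @"
                CREATE PROCEDURE AddSearchUrl @UrlAddress varchar(4096) AS
                    -- The expensive part: every single call pays for this
                    -- existence check, whether or not the URL ends up
                    -- being inserted.
                    IF NOT EXISTS (SELECT 1 FROM SearchUrl
                                   WHERE UrlAddress = @UrlAddress)
                        INSERT INTO SearchUrl (UrlAddress) VALUES (@UrlAddress)";

            static void Main()
            {
                using (SqlConnection cn = new SqlConnection(
                    "Data Source=.;Initial Catalog=WebSearch;Integrated Security=SSPI"))
                {
                    cn.Open();
                    // Run once to create the procedure.
                    new SqlCommand(AddUrlSproc, cn).ExecuteNonQuery();
                }
            }
        }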

  • The Ballad of Clayton Homes - (Warning: Not Technical)

    I picked up my copy of FastCompany from the mail drop today, as I have been out of town for two weeks at our customer's site in Washington, DC.  I came across this article in it.  It is a very interesting read.  My wife and I were invited to a private Christmas party at Jim & Kay Clayton's house back before Christmas.  There were about 30 people there, and we were treated to Jim's guitar playing about 9:00 that evening.

  • Modified Web Search Algorithm

    With new file space in tow, I updated my search algorithm.  Instead of having a single-threaded URL dispatcher, each thread is now responsible for grabbing its own set of URLs from the database to search.  With that change, I appear to have gotten a bump in performance, in that I no longer have periods of downtime while waiting for URLs.  I am going to let it run for the next few hours and see how things go.  So far, I like what I see.
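
    A sketch of the new per-thread batching loop; GrabBatch and the batch size of 50 are illustrative guesses:

        using System;
        using System.Collections;
        using System.Net;
        using System.Threading;

        class BatchingWorker
        {
            static void Main()
            {
                for (int i = 0; i < 10; i++)
                    new Thread(new ThreadStart(WorkLoop)).Start();
            }

            // Each worker pulls its own batch, so no thread ever sits idle
            // waiting on a central dispatcher.
            static void WorkLoop()
            {
                WebClient client = new WebClient();
                while (true)
                {
                    ArrayList batch = GrabBatch(50);
                    if (batch.Count == 0)
                        break;
                    foreach (string url in batch)
                    {
                        byte[] data = client.DownloadData(url);
                        // ... store the page for full-text indexing ...
                    }
                }
            }

            // Hypothetical: claim the next N unsearched URLs in one round trip,
            // e.g. a SELECT TOP n that also marks the rows as taken.
            static ArrayList GrabBatch(int size)
            {
                return new ArrayList();
            }
        }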

  • Sql Server Bottleneck Resolved

    I fixed my Sql Server problem with my Web Spider today.  I had run out of hard disk space.  I bought a 250 GByte FireWire hard drive today, hooked it up to my laptop, copied my Sql Server files over, reattached my database, and bang, it ran just fine.  The amazing thing is that I figured I'd have some type of speed problem.  While FireWire is “supposed” to run at 400 Mbits/sec or some speed like that, I still figured that somehow it would be somewhat slow.  Man, was I wrong.  Everything seems to be running just fine, with no letdown in speed.

  • A Sql Server bottleneck in my Web Search Project?

    Ok, not necessarily a bottleneck, but something I wanted to mention.  Indexes are a great thing when used properly.  Overusing indexes hurts insert and update performance, since every index has to be maintained on each write, and it also causes file space trouble.  Since I am running my Web Search on my laptop at this time, disk space is at a premium, even on a system with 60 gigs of drive space.  Indexes take up a significant amount of space.  Be careful about using too many indexes.
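
    One quick way to see what your indexes are costing you is sp_spaceused, which reports data and index sizes separately.  A minimal sketch, with the table name assumed:

        using System;
        using System.Data.SqlClient;

        class IndexSpaceCheck
        {
            static void Main()
            {
                using (SqlConnection cn = new SqlConnection(
                    "Data Source=.;Initial Catalog=WebSearch;Integrated Security=SSPI"))
                {
                    cn.Open();
                    // sp_spaceused returns: name, rows, reserved, data,
                    // index_size, unused.
                    SqlCommand cmd = new SqlCommand(
                        "EXEC sp_spaceused 'SearchUrl'", cn);
                    SqlDataReader rdr = cmd.ExecuteReader();
                    if (rdr.Read())
                    {
                        Console.WriteLine("data:  " + rdr["data"]);
                        Console.WriteLine("index: " + rdr["index_size"]);
                    }
                    rdr.Close();
                }
            }
        }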

  • Socket Connections use the ThreadPool, don't they?

    I tried a suggestion to use the Socket class to get around the ThreadPool problem I mentioned yesterday.  It seems the Socket class, at least when used asynchronously, also ends up using the ThreadPool.  Does anyone have a way to retrieve HTTP content from a web site (no user interface is allowed, and I may switch to a Windows Service soon) without getting the ThreadPool involved?
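
    The best candidate I can think of so far is a plain blocking socket on a managed thread; as far as I can tell, the synchronous Socket calls block the calling thread rather than borrowing from the pool.  A minimal HTTP/1.0 sketch, with error handling omitted:

        using System;
        using System.Net;
        using System.Net.Sockets;
        using System.Text;

        class RawHttpGet
        {
            static string Fetch(string host, string path)
            {
                // Synchronous connect/send/receive: these calls block the
                // current (managed) thread instead of using pool threads.
                Socket s = new Socket(AddressFamily.InterNetwork,
                                      SocketType.Stream, ProtocolType.Tcp);
                IPHostEntry entry = Dns.Resolve(host);
                s.Connect(new IPEndPoint(entry.AddressList[0], 80));

                string request = "GET " + path + " HTTP/1.0\r\n" +
                                 "Host: " + host + "\r\n\r\n";
                s.Send(Encoding.ASCII.GetBytes(request));

                StringBuilder response = new StringBuilder();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = s.Receive(buffer)) > 0)
                    response.Append(Encoding.ASCII.GetString(buffer, 0, read));

                s.Close();
                return response.ToString();  // headers + body, unparsed
            }

            static void Main()
            {
                Console.WriteLine(Fetch("www.example.com", "/"));
            }
        }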

  • My next bottleneck on Web Search with .NET: The Mysterious ThreadPool

    When I started mapping this “Web Search with .NET” project out, I figured that using the ThreadPool was the right thing to do.  It made a lot of sense in concept.  “Doing a small amount of work that doesn't depend on anyone else” sounds like exactly what the ThreadPool is for.  Well, let's look underneath the covers.  The ThreadPool is limited by design in .NET to 25 threads per process per CPU, and there is nothing you can do to change it (except use your own thread pool, as several people have pointed out, including my buddy Scott Sargent).  Now, what happens when you use an object within a ThreadPool thread that itself uses the ThreadPool?  You get an error, that's what you get.  Yes, boys and girls, using an object within the ThreadPool that itself uses the ThreadPool is a bad idea.  In my case, I used the WebClient object and got an error back from it saying that there were not enough ThreadPool threads to complete the request.  Well, that was bad.  Thanks to Dave Wanta for suggesting Async Sockets for this.  He said they use the I/O Completion Ports, which have 1,000 threads available in .NET.
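
    Here is a minimal sketch of the failure mode, boiled down: pool work items that wait on other pool work items.  A naive version like this simply starves itself (WebClient at least detects the situation and throws); the numbers here are just for illustration:

        using System;
        using System.Threading;

        class PoolStarvation
        {
            static void Main()
            {
                // Saturate the pool with work items that each need *another*
                // pool thread in order to finish.
                for (int i = 0; i < 100; i++)
                    ThreadPool.QueueUserWorkItem(new WaitCallback(OuterWork));

                Thread.Sleep(2000);

                int workers, ioThreads;
                ThreadPool.GetAvailableThreads(out workers, out ioThreads);
                Console.WriteLine("worker threads left: " + workers);  // ~0
                Console.WriteLine("I/O (IOCP) threads left: " + ioThreads);
            }

            static void OuterWork(object state)
            {
                ManualResetEvent done = new ManualResetEvent(false);
                // This inner item may never run: the 25 worker threads per CPU
                // are all occupied by OuterWork instances waiting on it.
                ThreadPool.QueueUserWorkItem(new WaitCallback(InnerWork), done);
                done.WaitOne();  // deadlock-prone wait
            }

            static void InnerWork(object state)
            {
                ((ManualResetEvent)state).Set();
            }
        }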

  • Howto Web Search with .NET

    Here was my initial thinking regarding database tables in my Web Search with .NET application.  This database setup is my initial design; it is by no means my final design.  I chose this setup initially because I wanted to perform full-text lookups on as few rows as possible and wanted to perform as few operations against the Search Results table as possible.  There are two tables in this application: a Search URL table, which holds the URLs that have been discovered and are waiting to be spidered, and a Search Results table, which holds the downloaded content that gets full-text indexed.
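
    Here is a rough sketch of the kind of schema I am describing.  Only the UrlAddress varchar(4096) column is from the actual design; everything else (names, keys, the Searched flag) is an illustrative guess:

        using System;
        using System.Data.SqlClient;

        class SchemaSketch
        {
            const string Schema = @"
                CREATE TABLE SearchUrl (
                    UrlId      int IDENTITY PRIMARY KEY,
                    UrlAddress varchar(4096) NOT NULL,
                    Searched   bit NOT NULL DEFAULT 0
                )
                CREATE TABLE SearchResults (
                    ResultId   int IDENTITY PRIMARY KEY,
                    UrlId      int NOT NULL REFERENCES SearchUrl(UrlId),
                    PageText   text NOT NULL  -- the full-text indexed column
                )";

            static void Main()
            {
                using (SqlConnection cn = new SqlConnection(
                    "Data Source=.;Initial Catalog=WebSearch;Integrated Security=SSPI"))
                {
                    cn.Open();
                    new SqlCommand(Schema, cn).ExecuteNonQuery();
                }
            }
        }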

  • Database Indexing - SQL Server Tools

    Proper database indexing is important in any application.  It is even more important in the applications that I am typically involved with, due to their large amount of data, number of transactions, and number of users.  As I have been working on my Web Search Engine with .NET, I have learned about the importance of proper database indexing once again.  First, let's look at the tools that come with Sql Server.  Sql Server comes with two really good tools for performance tuning: the Sql Profiler, which captures the statements actually hitting the server, and the Index Tuning Wizard, which suggests indexes based on a captured workload.