So, I've been 'under the radar' recently, as I've just started a new job with Eclipsys Corporation. I work on our interface engine, eLink. You could think of it as a very specialized BizTalk server.
Well, eLink is high-throughput and high-availability, and so we leverage Microsoft Windows 2000 Advanced Server with Cluster Services. This is my first foray into a clustered environment in the Windows world, though I’ve set up a Linux cluster before.
Microsoft Clustering for High Availability is not terribly impressive. (This is one area where Microsoft is still playing catch-up to the UNIX and mainframe world.) The basic concept is that two machines share a disk array and usually run in active/passive mode. If the active server fails, the passive server takes control of the shared drive array and steps up to the plate, starting the services that failed on the previous node.
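To make the active/passive idea concrete, here is a toy sketch in Python. It is not how Cluster Services is implemented or configured; the class names, node names, and service list are all made up for illustration. It only models the handoff: one node owns the shared disk, and when that node fails, the other takes ownership and restarts the services.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node; not a real Cluster Services object."""
    name: str
    healthy: bool = True
    running: list = field(default_factory=list)


class ActivePassiveCluster:
    """Toy model: two nodes, one shared disk array, one owner at a time."""

    def __init__(self, active, passive, services):
        self.active = active
        self.passive = passive
        self.services = list(services)
        self.disk_owner = active.name          # only the active node owns the disk
        active.running = list(self.services)

    def check_and_fail_over(self):
        """Fail over if the active node is down. Return True if a failover happened."""
        if self.active.healthy:
            return False
        if not self.passive.healthy:
            raise RuntimeError("both nodes are down")
        failed = self.active
        failed.running = []                    # services on the failed node are gone
        self.active, self.passive = self.passive, failed
        self.disk_owner = self.active.name     # new active node takes the shared disk
        self.active.running = list(self.services)  # and restarts the services
        return True


cluster = ActivePassiveCluster(Node("node-a"), Node("node-b"), ["MSMQ", "eLink"])
cluster.active.healthy = False                 # simulate a crash of the active node
failed_over = cluster.check_and_fail_over()
print(failed_over, cluster.disk_owner, cluster.active.running)
```

Notice that the sketch only moves disk ownership and restarts services. Anything a service kept in memory on the failed node is simply lost, which is why cluster awareness matters.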
Well, one of the problems is that this works somewhat well for cluster-aware applications, and poorly for non-cluster-aware applications. No problem - we'll just write our applications to be cluster aware.
Well, if only it were that simple - the problem isn't with our software, which works fine in a clustered environment. It's Windows 2000 Advanced Server itself! Entire portions of the operating system, and several of its tools, are not cluster aware.
Perfmon seems to be especially difficult to work with in a cluster, and our event viewers quickly fill up with spurious Perfmon-related messages.
IIS isn't really cluster aware - although the process itself will fail over, you lose any config information in the metabase unless you manually replicated it beforehand. You also lose ASP.NET session state, unless you use a session state server on a third machine or you store session state in SQL Server. Not a big deal, but still something to pay attention to.
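The session state point is easy to see in a toy sketch. This is plain Python, not ASP.NET, and the `WebNode` class and its methods are hypothetical names I made up for illustration. A node that keeps sessions in its own memory loses them on failover, while two nodes that share an external store (standing in for a state server or SQL Server) do not.

```python
class WebNode:
    """Hypothetical web node. Sessions live in-process unless a store is passed in."""

    def __init__(self, name, external_store=None):
        self.name = name
        # A private dict models in-process state; a shared dict models an
        # out-of-process state server or database on another machine.
        self.store = external_store if external_store is not None else {}

    def set_session(self, session_id, key, value):
        self.store.setdefault(session_id, {})[key] = value

    def get_session(self, session_id, key):
        return self.store.get(session_id, {}).get(key)


# In-process state: the standby node knows nothing about the user's session.
node_a = WebNode("node-a")
node_b = WebNode("node-b")
node_a.set_session("abc123", "user", "alice")
in_process_after_failover = node_b.get_session("abc123", "user")

# External state: both nodes read and write the same store, so it survives.
shared_store = {}
node_c = WebNode("node-c", external_store=shared_store)
node_d = WebNode("node-d", external_store=shared_store)
node_c.set_session("abc123", "user", "alice")
external_after_failover = node_d.get_session("abc123", "user")

print(in_process_after_failover, external_after_failover)
```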
The biggest problems we had were actually MSMQ related. First, we had a non-cluster-related MSMQ issue where MSMQ would consume 80% of kernel memory and then stop, refusing to allocate any more memory. Windows is actually designed to garbage collect kernel memory at 90%, but we never got there because MSMQ hung before reaching that point. MSKB 811308 describes this problem, and the solution.
A bigger problem was that MSMQ would not always successfully fail over to the backup node. This was actually reproducible in our lab whenever we got the MSMQ storage up around 800 or 900 MB.
After banging my head against a brick wall for several days, I put in a support call to Microsoft, and ended up talking to Muhammed Ismail from the MSMQ team. Let me tell you, he knew his MSMQ backwards and forwards. Still, it wasn’t anything that we were doing wrong, so he sent us the latest version of MSMQ 2.0, we installed it in our test labs, and noticed an immediate resolution to all of our MSMQ problems.
Advice – if you are going to be architecting solutions that rely on Microsoft’s High Availability using Microsoft Cluster Services, be certain to research (and prototype) solutions before implementing them. We noticed several things that simply aren’t supported in a clustered environment, and several things that were supposed to be supported that just plain don't work well. (For example, MSMQ Triggers are a cluster no-no on Windows 2000, though they are available on Windows 2003 clusters.)
So, I've come out of all of this wishing I could cluster myself and become a High-Availability Developer. Then I could get some sleep while I worked!