Dogfooding and Showstoppers

Friday, February 21, 2003

At Microsoft we use the terms "dogfood" and "showstoppers" a lot when describing the "end game" of a product cycle.

Dogfooding is slang for self-hosting mission critical systems on top of beta or release candidate software (as in "our dogfood tastes so good I eat it myself"). It is something we do internally pretty aggresively to help chase out the last few production bugs in a server release. We've found that running your payroll, email, source control and health benefits system on top of your own beta software can be a highly motivating way to drive quality into the product early in the development cycle. ;-)

A showstopper is slang for a critical bug that needs to be fixed before a product can be released. As a product enters the end game, the respective "war team" (a group of senior leaders on the team) will continuously watch and triage bug reports to identify and fix these showstopping bugs as soon as possible. By definition, a showstopper bug prevents a final release.

Over the last two weeks my team has been working with a number of internal teams and external companies to do a final dogfood deployment of our new release of ASP.NET V1.1 that will ship as part of Windows Server 2003. Ensuring that major sites and applications are up and running without problems is obviously key to making sure the bits are ready to go.

The past two weeks we've had two interesting firedrills with one of our ASP.NET customers -- http://www.msn.com (which receives 800 million customer visits to the site each month). They've run on ASP.NET V1.0 for about a year with excellent stability, and so are a great test to give us feedback on V1.1 (they have a very high bar to evaluate us on). They kindly let us replace some of their production servers with ASP.NET V1.1 builds to see how things compared.

Different servers, networks and applications exhibit different load characteristics -- which can in turn cause previously unseen bugs to appear. Sure enough, after a few days of running in production they discovered a rare bug in our performance monitor counters -- where all ASP.NET V1.1 counters occasionally dropped to zero (even though the box was running fine). They weren't able to repro this bug in their testing stress lab -- only the production farm seemed to cause the problem (and then only ocassionally and under non deterministic scenarios). Even more mysteriously, the counters would suddenly start working again after 90 minutes of deadtime.

After a few days of frustrating investigation (it ended up being fairly painstaking for us because debug private drops of the perf counter DLL changed the timing in the system enough that the problem went away -- it only repro'd on retail builds with no debugger attached), we finally tracked down the problem. It was an obscure race condition that occured in our perfmon listener when the perfmon monitor was running on a remote machine, and an I/O completion occured while the perfmon was about to decrement a certain counter, and no other I/O thread was available to service the call. When that happened the named pipe connection became corrupted and dropped. After 90 minutes the perfmon system automatically attempted to reconnect again -- which was why the counters would mysteriously restart.

Although rare, the bug clearly fell into the "showstopper" bucket -- since it could prevent administrators from correctly monitoring their servers (a major no-no for mission critical applications). We put together a private fix, had MSN run on it for a few days to verify it worked, and got it approved by the war team last week for checkin.

All was then great in the world (we thought) until we had a rude problem wake us up at 5:30am this morning -- when one of the MSN servers running our V1.1 build fell over in a sudden way. These types of failures tend to put an ugly knot in your stomach when you are in the last few days before shipping...

Figuring out what has happened in situations like this is both an art and a science. Since the server farm is located somewhere other than Redmond, you have to investigate everything remotely. All you have to figure out what went wrong are the perfmon logs and a raw memory process dump before it terminated (since these production servers had no debuggers attached). Analyzing them is a bit like a TV episode of CSI -- involving forensic science (what perfmon counter showed what and when) and a bit of detective work (trying to figure out what happened over the course of a day's run -- in particular in the last few minutes and seconds before things started going bad).

For the first two hours we sat around slicing and dicing the perfmon data. The more we looked at it, the more confused we were. Typically when a server has problems there are hints of things to come for minutes (or at least seconds) of the data prior to the actual problems starting to show themselves externally. With ASP.NET we look at the steady state of a number of key perfmon counters: ASP.NET RPS (request/sec), CPU usage, private bytes allocated, num GCs (to reverse calculate garbage collection/sec), cache size entries, memory allocations/sec, GC Gen2 size (to see how many objects are promoted to the long lived object memory heap), ASP.NET Request Queue depth (to see how many requests are stacking up), and ASP.NET rejection rate (to see if any requests are being turned away because the system is under too much load).

This particular server did not exhibit any problems before the failure occured. Specifically, CPU time was holding steady around 60% (so still plenty of capacity), and both the private bytes and Gen2 GC rate seemed fairly constant (indicating this wasn't a memory leak or cache item leak situation). The process also did not suffer any access violations (AVs) prior or during the time the incident happend. Everything looked absolutely normal until 5:30am when things suddenly went very, very bad.

When the Windows War team asked us at 10:00 for a status update -- we had to confess that we still didn't know what the problem was (not a good answer) and that we needed more time to investigate further (an even worse answer).

As the morning went on, we started to hypothesis that the problem kind of looked like the system starting blocking on something -- but what exactly we couldn't understand from the application we were looking at. There was some data access -- but the database server logs indicated no problems in the SQL database around the time of the issue. The other servers in the farm also didn't have any of the same problems (which seemed to rule out both SQL and network connectivity as the cause).

We finally came accross an interesting entry within the system log about 20 minutes minutes before the problem occured that warned that the drive C: was running low on disk space. After a few frantic calls to the operations team, we were able to confirm that drive C: was indeed the system drive for the server (as well as the drive where the OS memory pagefile lived). And after careful inspection of the machine they discovered that it had filled its very last byte of free space approximately 1 second before the web server started failing all requests.

The cause of the filled disk turned out to be a custom ISAPI filter built for this application that was also running on the server -- and that had some logging logic that was accidentally left in the "verbose log" mode when the server was put together. On each authentication request it logged a small bit of data to its log file on the C: drive -- and its write logic didn't take into account of out of disk space errors and had caused the problems the second after it filled up the disk (blocking all I/O threads as to repeatedly retried to write to its logfile).

Thankfully our bits were fine -- which was a huge relief. The Windows War team enjoyed a good laugh when they heard about the final resolution.

Meanwhile, as I type up this entry we continue to sit in a holding pattern watching more live production sites dogfood the final builds and wait to see if there are any remaining showstoppers that rear their heads. The process takes time -- but time well spent. Every additional site that sucessfully upgrades to run live on the final builds helps stress test things in a new and unique way, and helps guarentee that the release is really solid. Once we are confident of the quality, we'll release things to the world.

P.S. For more information on war rooms, builds and the Windows Server end-game, check out this online article: http://www.winsupersite.com/reviews/winserver2k3_gold2.asp. For those interested in a bit of software history, I'd also recommend the classic Showstopper! The Breakneck Race to Create Windows NT and the Next Generation at Microsoft

Cool... Someone else bitten by resources becoming full. I had a similar problem with the event log filling up while using the Exception Application Building Blocks from MS. By default the EL does not overwrite.

Adam Hill - Friday, February 21, 2003 4:38:00 PM

It's really cool to hear how things work "behind the scenes" at Microsoft. Keep up the cool blog!

Steve

Steven Smith - Sunday, February 23, 2003 5:19:00 AM

This is great, what a great way to get news about the goings on of a Microsoft product release. As someone releasing 5 projects on the Windows Server 2003 this is informative and entertainment reading. Thanks for sharing.

Joseph Shook - Sunday, March 2, 2003 2:10:00 AM

A look at the behind-the-scenes activities at Microsoft!!

Good reading. Enjoyed it and the others as well.

Keep it up!!!

Srinivas - Monday, March 31, 2003 9:11:00 AM

Really it is a good reading...Thank You.

selva - Monday, June 7, 2004 12:21:00 PM

I am deligthed to know how people at Microsoft deal with these kind of issues especially right before the product release.

Arif Mosharraf - Thursday, October 5, 2006 3:26:09 AM

6 Comments