At Microsoft we use the terms "dogfood" and "showstoppers" a lot when describing the "end game" of a product cycle.
Dogfooding is slang for self-hosting mission-critical systems on top of beta or release candidate software (as in "our dogfood tastes so good I eat it myself"). It is something we do internally pretty aggressively to help chase out the last few production bugs in a server release. We've found that running your payroll, email, source control and health benefits systems on top of your own beta software can be a highly motivating way to drive quality into the product early in the development cycle. ;-)
A showstopper is slang for a critical bug that needs to be fixed before a product can be released. As a product enters the end game, the respective "war team" (a group of senior leaders on the team) will continuously watch and triage bug reports to identify and fix these showstopping bugs as soon as possible. By definition, a showstopper bug prevents a final release.
Over the last two weeks my team has been working with a number of internal teams and external companies to do a final dogfood deployment of our new release of ASP.NET V1.1 that will ship as part of Windows Server 2003. Ensuring that major sites and applications are up and running without problems is obviously key to making sure the bits are ready to go.
The past two weeks we've had two interesting fire drills with one of our ASP.NET customers -- http://www.msn.com (which receives 800 million customer visits each month). They've run on ASP.NET V1.0 for about a year with excellent stability, so they make a great test bed for giving us feedback on V1.1 (they have a very high bar to evaluate us against). They kindly let us replace some of their production servers with ASP.NET V1.1 builds to see how things compared.
Different servers, networks and applications exhibit different load characteristics -- which can in turn cause previously unseen bugs to appear. Sure enough, after a few days of running in production they discovered a rare bug in our performance monitor counters -- where all ASP.NET V1.1 counters occasionally dropped to zero (even though the box was running fine). They weren't able to repro this bug in their stress testing lab -- only the production farm seemed to cause the problem (and then only occasionally and under non-deterministic scenarios). Even more mysteriously, the counters would suddenly start working again after 90 minutes of dead time.
After a few days of frustrating investigation (it ended up being fairly painstaking for us because debug private drops of the perf counter DLL changed the timing in the system enough that the problem went away -- it only repro'd on retail builds with no debugger attached), we finally tracked down the problem. It was an obscure race condition in our perfmon listener that occurred when the perfmon monitor was running on a remote machine, an I/O completion occurred while the perfmon was about to decrement a certain counter, and no other I/O thread was available to service the call. When that happened the named pipe connection became corrupted and dropped. After 90 minutes the perfmon system automatically attempted to reconnect -- which was why the counters would mysteriously restart.
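Bugs in this class are so timing-sensitive that merely attaching a debugger makes them vanish. As a purely illustrative sketch (Python, not the actual perfmon listener code), here is the general shape of an unsynchronized read-modify-write race on a shared counter -- with a barrier used to force the bad interleaving deterministically, since in real life it only happens once in a blue moon:

```python
import threading

counter = {"v": 0}
barrier = threading.Barrier(2)

def racy_increment():
    # Unsynchronized read-modify-write: read the counter...
    snapshot = counter["v"]
    # ...then force both threads to have read before either writes.
    # In production this window is hit only under rare timing.
    barrier.wait()
    # Both threads now write snapshot + 1, so one update is lost.
    counter["v"] = snapshot + 1

threads = [threading.Thread(target=racy_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["v"])  # 1, not 2 -- a lost update
```

The fix is the usual one: hold a lock (or use an atomic operation) around the entire read-modify-write so no other thread can interleave between the read and the write.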
Although rare, the bug clearly fell into the "showstopper" bucket -- since it could prevent administrators from correctly monitoring their servers (a major no-no for mission critical applications). We put together a private fix, had MSN run on it for a few days to verify it worked, and got it approved by the war team last week for checkin.
All was then great in the world (we thought) until a rude problem woke us up at 5:30am this morning -- one of the MSN servers running our V1.1 build suddenly fell over. These types of failures tend to put an ugly knot in your stomach when you are in the last few days before shipping...
Figuring out what happened in situations like this is both an art and a science. Since the server farm is located somewhere other than Redmond, you have to investigate everything remotely. All you have to work with are the perfmon logs and a raw memory dump of the process taken before it terminated (since these production servers had no debuggers attached). Analyzing them is a bit like a TV episode of CSI -- part forensic science (which perfmon counter showed what, and when) and part detective work (trying to reconstruct what happened over the course of a day's run -- in particular in the last few minutes and seconds before things started going bad).
For the first two hours we sat around slicing and dicing the perfmon data. The more we looked at it, the more confused we were. Typically when a server has problems there are hints of trouble in the data for minutes (or at least seconds) before the actual problems show themselves externally. With ASP.NET we look at the steady state of a number of key perfmon counters: ASP.NET RPS (requests/sec), CPU usage, private bytes allocated, num GCs (to reverse-calculate garbage collections/sec), cache size entries, memory allocations/sec, GC Gen2 size (to see how many objects are promoted to the long-lived object memory heap), ASP.NET request queue depth (to see how many requests are stacking up), and ASP.NET rejection rate (to see if any requests are being turned away because the system is under too much load).
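Several of those numbers are derived rather than read directly -- "num GCs", for example, is a cumulative counter, so garbage collections/sec is reverse-calculated by differencing two samples and dividing by the sampling interval. A small sketch of that calculation (the counter names here are hypothetical, for illustration -- not the real perfmon counter paths):

```python
def derived_rates(prev, curr, interval_secs):
    """Turn cumulative perfmon-style counter snapshots into per-second rates.

    prev/curr are dicts of cumulative counter values sampled
    interval_secs apart.
    """
    return {name: (curr[name] - prev[name]) / interval_secs for name in curr}

# Two snapshots taken 60 seconds apart (made-up numbers):
prev = {"requests_total": 10_000, "gen2_collections": 40}
curr = {"requests_total": 16_000, "gen2_collections": 46}

rates = derived_rates(prev, curr, interval_secs=60)
print(rates["requests_total"])    # 100.0 requests/sec
print(rates["gen2_collections"])  # 0.1 Gen2 GCs/sec
```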
This particular server did not exhibit any problems before the failure occurred. Specifically, CPU time was holding steady around 60% (so still plenty of capacity), and both the private bytes and Gen2 GC rate seemed fairly constant (indicating this wasn't a memory leak or cache item leak situation). The process also did not suffer any access violations (AVs) prior to or during the incident. Everything looked absolutely normal until 5:30am, when things suddenly went very, very bad.
When the Windows War team asked us at 10:00 for a status update -- we had to confess that we still didn't know what the problem was (not a good answer) and that we needed more time to investigate further (an even worse answer).
As the morning went on, we started to hypothesize that the system had started blocking on something -- but on what exactly we couldn't tell from the application we were looking at. There was some data access -- but the database server logs indicated no problems in the SQL database around the time of the issue. The other servers in the farm also didn't exhibit any of the same problems (which seemed to rule out both SQL and network connectivity as the cause).
We finally came across an interesting entry in the system log, about 20 minutes before the problem occurred, warning that drive C: was running low on disk space. After a few frantic calls to the operations team, we were able to confirm that drive C: was indeed the system drive for the server (as well as the drive where the OS memory pagefile lived). And after careful inspection of the machine they discovered that it had filled its very last byte of free space approximately 1 second before the web server started failing all requests.
The cause of the filled disk turned out to be a custom ISAPI filter built for this application that was also running on the server -- and that had some logging logic accidentally left in "verbose log" mode when the server was put together. On each authentication request it logged a small bit of data to its log file on the C: drive -- and its write logic didn't account for out-of-disk-space errors. The second after the disk filled up, it brought the server down by blocking all of the I/O threads as it repeatedly retried writing to its logfile.
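The general lesson: logging code running on request threads should degrade gracefully when a write fails, rather than retrying in a loop. A hedged sketch of that pattern (illustrative Python, not the actual ISAPI filter -- here a directory path stands in for a full disk to trigger the write failure):

```python
import os
import tempfile

class SafeLogger:
    """A logger that disables itself on write failure instead of retrying
    forever and tying up I/O threads (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.enabled = True

    def log(self, line):
        if not self.enabled:
            return  # logging already failed once; don't block request threads
        try:
            with open(self.path, "a") as f:
                f.write(line + "\n")
        except OSError:
            # e.g. disk full -- trip the breaker and keep serving requests
            self.enabled = False

# A good path logs normally; a bad path (a directory, standing in for a
# full disk) raises OSError on open and trips the breaker.
tmp = tempfile.mkdtemp()
good = SafeLogger(os.path.join(tmp, "auth.log"))
good.log("auth request ok")

bad = SafeLogger(tmp)  # opening a directory for append raises OSError
bad.log("auth request ok")
print(good.enabled, bad.enabled)  # True False
```

A fancier version might periodically re-enable itself or fall back to a secondary location -- the key point is simply that a failed write must never block the request path.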
Thankfully our bits were fine -- which was a huge relief. The Windows War team enjoyed a good laugh when they heard about the final resolution.
Meanwhile, as I type up this entry we continue to sit in a holding pattern, watching more live production sites dogfood the final builds and waiting to see if any remaining showstoppers rear their heads. The process takes time -- but time well spent. Every additional site that successfully upgrades to run live on the final builds helps stress test things in a new and unique way, and helps guarantee that the release is really solid. Once we are confident of the quality, we'll release things to the world.
P.S. For more information on war rooms, builds and the Windows Server end-game, check out this online article: http://www.winsupersite.com/reviews/winserver2k3_gold2.asp. For those interested in a bit of software history, I'd also recommend the classic Showstopper! The Breakneck Race to Create Windows NT and the Next Generation at Microsoft
Thomas Marquardt, one of our developers on ASP.NET, recently published an article on MSDN about ASP.NET performance monitoring. Along with the article are a number of great utilities (all complete with C# source code).
We developed them internally to help automate diagnosing customer issues with production apps, and they are also used as part of our ASP.NET stress testing program (I'll provide more details about our stress lab in a future blog entry). These source utilities include:
Snap.exe -- a command-line tool for logging performance data for processes.
HttpClient.exe -- a simple client that records time to last byte (TTLB) for HTTP requests.
ErrorHandler.dll. This is an IHttpModule that you can add to the HTTP pipeline to log unhandled exceptions to the event log. It is better to log errors to a SQL Server database (that is what we do in our stress lab -- since it makes reporting across hundreds of servers easier), but the example uses the event log for simplicity.
QQQ.exe -- a command-line tool for stress testing an ASP.NET application. When used in combination with a stress client, such as Homer or ACT, this tool will attach debuggers to the worker process and monitor certain performance counters. It can be tuned to automatically break into the debuggers when performance degrades.
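For the curious, "time to last byte" means the clock stops only after the final byte of the response body has been read -- not at first byte or at headers. A toy sketch of the measurement (illustrative Python; the stream source is kept generic, so in real use you would pass something like `lambda: urllib.request.urlopen(url)`):

```python
import io
import time

def time_to_last_byte(open_stream, chunk_size=8192):
    """Measure TTLB for a single request.

    open_stream is any callable returning a readable object; the timer
    stops only once the body has been drained to its last byte.
    Returns (elapsed_seconds, total_bytes_read).
    """
    start = time.perf_counter()
    stream = open_stream()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # last byte read -- now the clock can stop
        total += len(chunk)
    return time.perf_counter() - start, total

# Usage with an in-memory "response body" standing in for a live server:
elapsed, nbytes = time_to_last_byte(lambda: io.BytesIO(b"x" * 20_000))
print(nbytes)  # 20000
```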
QQQ is probably my favorite utility -- especially given that its name is complete nonsense. "QQQ" stemmed from the fact that Dmitry (ASP.NET dev manager) likes to name all of his test ASP.NET pages with those initials, usually under various temporary directories scattered across his hard drive.
QQQ is usually used in combination with the windbg and cdb debuggers (which you can download from: http://www.microsoft.com/ddk/debugging/installx86.asp). These debuggers are more appropriate for system-level debugging than the VS debugger, in that they have less effect on the process they are attached to (the VS debugger tends to impact timings within the target process -- making stress investigations harder).
QQQ is also typically used in conjunction with a .NET Framework debugger extension we call "sos.dll". Strike.dll was the parent debugger extension we used in V1 of the .NET Framework -- "sos.dll" affectionately stands for "son of strike". You can download son of strike at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnbda/html/DBGch03.asp.
If you are doing hard-core production ASP.NET development, and want to build up an infrastructure for testing and monitoring your web applications, then I'd definitely recommend spending an hour browsing through the above code. There are a lot of cool tips/tricks you can pick up and re-use.
Quick Update: A few people have pinged me as to how they can get the source to the above utilities. Note that a link to them is in the top article above. The direct url is: http://download.microsoft.com/download/3/c/8/3c820668-3095-40e2-b462-0bc8a0b6690f/monitoring_tools.exe
Paul Wilson has put together a pretty darn cool Viewstate decoder that enables you to see what the viewstate on an ASP.NET page looks like: http://www.wilsondotnet.com/Demos/ViewState.aspx
He has also written a good article that describes the ASP.NET viewstate process: http://www.aspalliance.com/PaulWilson/Articles/?id=7.
Note that ViewState is MAC (message authentication code) protected to prevent tampering by remote clients. As such, while you can decipher the viewstate settings using the above program, you can't actually change them and repost them to the server. You can also take it a step further and prevent the above decoder from working at all by encrypting ViewState -- change your validation and encryption key settings in either machine.config or web.config.
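The idea behind the MAC is that the state is readable but tamper-evident: the server appends a keyed hash, and rejects any repost whose payload no longer matches it. A hedged sketch of the general technique (illustrative Python using HMAC-SHA1 -- not the actual ASP.NET ViewState format or key handling):

```python
import base64
import hashlib
import hmac

SECRET = b"machine-key"  # stands in for the server-side validation key

def encode(payload):
    """Append an HMAC so the payload is readable but tamper-evident."""
    mac = hmac.new(SECRET, payload, hashlib.sha1).digest()
    return base64.b64encode(payload + mac)

def verify(blob):
    """Return the payload if the MAC checks out, else None."""
    raw = base64.b64decode(blob)
    payload, mac = raw[:-20], raw[-20:]  # a SHA1 digest is 20 bytes
    expected = hmac.new(SECRET, payload, hashlib.sha1).digest()
    return payload if hmac.compare_digest(mac, expected) else None

blob = encode(b"page-state")
print(verify(blob))  # b'page-state' -- anyone can decode it...

# ...but swapping in a modified payload while keeping the old MAC fails:
old_mac = base64.b64decode(blob)[-20:]
tampered = base64.b64encode(b"evil-state" + old_mac)
print(verify(tampered))  # None -- the repost would be rejected
```

Without knowing the server's key, a client can read the state (as the decoder above demonstrates) but cannot forge a valid MAC for altered state.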
This weekend has been a pretty busy one for my team (ASP.NET and the Visual Studio web tools) as we drive towards hitting code complete (CC) for our current V2 feature milestone. Our target date for hitting CC was this past Friday (the 14th) -- which really means late Sunday night. About half of my team has been here at some point this weekend finishing up -- we look to be on track for hitting our goal.
Clarification: Note that the Code Complete we hit this past weekend wasn't for the entire product -- just the particular coding milestone we are currently in (see more details below).
One of the frequent questions people ask me is what the methodology is at Microsoft for developing software. The process varies by team and division, but is usually organized around several fixed feature development milestones (typically 5-7 milestones for a V1 product, 2-4 milestones for a product upgrade).
Feature milestones are usually split into three components: planning, coding and stabilization. During the planning phase the PM team drives finalizing the milestone feature list and completes detailed design specs for each feature. During the coding phase (usually 8 weeks) the development team implements the features and completes checkin suites to provide basic verification coverage of them. During the stabilization phase (usually 6-10 weeks) the QA team works to implement automation tests that provide detailed coverage of the milestone features -- and the dev team fixes the corresponding bugs that are found.
Our goal when exiting a milestone is to leave it with a stable product that contains a compelling feature set, and that has enough automation tests written against it to ensure that we can maintain "self-host" quality coverage of the features on a nightly basis going forward.
The automation of these tests is critical. Because we typically localize products into 8-32 languages (VS.NET is localized into 8 languages, ASP.NET and the .NET Framework into all 30+ Windows languages) we can't rely on manual testing. It is simply not possible to manually run tens of thousands of tests 32 different times (one for each language) every day. Instead our QA team builds all tests to plug into an automation framework (affectionately named "MadDog"). The automation framework is then integrated into our test lab -- and allows a tester to kick off a test run in a matter of minutes.
A tester -- sitting in their office -- can configure the operating system and hardware they want to use (for example: Windows 2000 Advanced Server on a 2-processor hardware configuration with a German language build) as well as the test suite they want to run (for example: all session state tests). The automation framework will then find a machine that meets the selected specification, reformat and re-image the operating system accordingly, and then execute the test suite run. All pass/fail results can then be viewed by the tester from their own machine in their office (fairly fancy -- we are quite proud of this testing framework). On my team the test organization typically runs 20% of their automated tests every day (varying the system configuration). We then do a full automation run of the remaining test cases every 2 weeks.
After we finish our feature milestones, we'll typically have a long stabilization period (8-14 extra weeks) and then begin our formal beta process. The goal with the betas is twofold -- 1) help find bugs, and 2) start getting people excited about the release.
Most products at Microsoft these days have either 2 or 3 widespread beta releases. As we get close to the end, we'll also usually cut a few release candidate (RC) releases that we hand out to a small list of our most active beta customers to sign-off that all their bugs are fixed and that the release is ready. Once we feel all the release criteria are met, we'll declare the product RTM (released to manufacturing) -- after which the product will be formally launched and released to customers.
And of course after a product is released, the whole process repeats itself again (2-4 new feature milestones, 2-3 betas, and RTM again)....
Scott Watermasysk was kind enough to help set me up with this new blog -- and this is my first entry in it. So far, I've been very impressed with the system he has built -- it looks very polished!
And best of all (of course) it is built with ASP.NET. :-)