Staying focused over such a long time is hard, but our team held up really well and did a super job shipping a high-quality release. I’m particularly proud of the folks on my team who run our stress program. They keep a lab of ~1000 machines running 24/7 throughout the year, and delivered a punishing level of stress load on those machines while looking for hard-to-find production bugs.
We have about 40 different stress variations that we run constantly -- some that do normal web operations (data access, security, output caching, session management, etc.), some that do lots of less typical things (lots of compilation changes under load, changing configuration files under load to cause app-domain restarts), and some that do things that are deliberately downright nasty (memory leaks, AV crashes, deadlocks -- where the goal is to ensure that ASP.NET automatically recovers from them).
We hook up debuggers and watch for first-chance exceptions, monitor memory usage in the worker processes to watch for leaks, and ensure that performance stays within a consistent RPS range throughout the runs. Any deviations trigger our debuggers to automatically break into the process and halt the run for analysis and investigation.
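To make the idea concrete, here is a minimal sketch of that kind of deviation check. It is not our actual lab tooling: the `Sample` type, the `find_deviation` function, and all of the thresholds are hypothetical names and values chosen purely for illustration. It only shows the three signals described above (first-chance exceptions, worker process memory growth, and RPS drifting out of range) being turned into a "halt the run" decision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One monitoring interval from a stress run (hypothetical shape)."""
    rps: float                    # requests per second observed in the interval
    worker_memory_mb: float       # worker process memory at the end of the interval
    first_chance_exceptions: int  # first-chance exceptions seen in the interval


def find_deviation(
    samples: List[Sample],
    rps_low: float,
    rps_high: float,
    max_memory_growth_mb: float,
) -> Optional[str]:
    """Return a description of the first deviation, or None if the run is clean.

    The thresholds are assumptions for this sketch; a real lab would tune them
    per variation and per hardware configuration.
    """
    if not samples:
        return None
    baseline_mb = samples[0].worker_memory_mb
    for i, s in enumerate(samples):
        if s.first_chance_exceptions > 0:
            return f"sample {i}: first-chance exception"
        if not (rps_low <= s.rps <= rps_high):
            return f"sample {i}: RPS {s.rps} outside [{rps_low}, {rps_high}]"
        growth_mb = s.worker_memory_mb - baseline_mb
        if growth_mb > max_memory_growth_mb:
            return f"sample {i}: memory grew {growth_mb} MB over baseline"
    return None


healthy = [Sample(1000.0, 200.0, 0), Sample(990.0, 205.0, 0)]
rps_drop = [Sample(1000.0, 200.0, 0), Sample(400.0, 201.0, 0)]
leaking = [Sample(1000.0, 200.0, 0), Sample(1000.0, 400.0, 0)]

healthy_result = find_deviation(healthy, 900.0, 1100.0, 50.0)
rps_result = find_deviation(rps_drop, 900.0, 1100.0, 50.0)
leak_result = find_deviation(leaking, 900.0, 1100.0, 50.0)
```

In a real harness, a non-`None` result is the point where the debugger would break into the process and the run would be halted for investigation.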
We basically repeat this process over and over again, 24 hours a day, until each of the 40 different variations passes 300 10-hour runs in a row without incident on each hardware and OS configuration we support (Windows 2000 Single Proc, Windows 2000 2P, Windows 2003 1P, Windows 2003 2P -- with WS03 repeated for x86, x64 and IA64 processors). We then also do longer-haul runs on higher-end 4P hardware that last for a week under extreme load, using a combination of different variations (on the IIS side we also do long-haul runs that take a full 21 days and put extremely heavy load on the worker process to test its reliability).
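The sign-off rule above boils down to a streak check: a single failure resets the count for that variation and configuration. The sketch below is an illustration only, with hypothetical names (`current_streak`, `signed_off`, and the shape of `history`) rather than anything from our real reporting system.

```python
from typing import Dict, List, Tuple

REQUIRED_STREAK = 300  # consecutive clean 10-hour runs, per the rule above


def current_streak(results: List[bool]) -> int:
    """Count consecutive passes at the end of a run history (oldest first)."""
    streak = 0
    for passed in reversed(results):
        if not passed:
            break
        streak += 1
    return streak


def signed_off(
    history: Dict[Tuple[str, str], List[bool]],
    required: int = REQUIRED_STREAK,
) -> bool:
    """True only if every (variation, configuration) pair has a long enough streak.

    `history` maps a hypothetical (variation, configuration) key to its
    pass/fail results in chronological order.
    """
    return bool(history) and all(
        current_streak(results) >= required for results in history.values()
    )


clean = {("session-state", "WS03 x64 2P"): [True] * 300}
late_failure = {("session-state", "WS03 x64 2P"): [True] * 299 + [False]}
recovered = {("session-state", "WS03 x64 2P"): [False] + [True] * 300}
```

Here `clean` and `recovered` both qualify, while `late_failure` does not: a failure on the most recent run puts that pair back at a streak of zero, which is why a bug that only surfaces after 200+ runs is so expensive.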
Needless to say, it can take a while to get everything passing. In the early days of the stabilization for Beta2 it was pretty easy to find issues. As we locked down changes in the overall stack, it took longer and longer for them to surface. Sometimes we had to do 200-250 runs for a bug to surface, and even then it might take multiple “hits” before we could figure out what exactly was causing it. The stress team did an awesome job driving this process forward and chasing down the final bugs through long days and late nights (it was not uncommon for them to be in the lab until 4-5am and then have to be back by 9am to give updates to our war team).
We report our stress numbers every few days as all of the runs for a particular build complete in the lab. Over the last month the numbers have steadily risen by a few percentage points a run as each remaining stress issue was found and fixed. About two weeks ago we knew we were getting close -- with hardware architectures starting to report 100% one by one (x86 single proc and multi-proc first, then x64, etc). A week ago we kicked off the final long-haul run -- it completed with 100% passing around 3:30pm on Thursday. We officially signed off as a division on Beta2 a little later that night.
Yesterday we held our obligatory ship party to celebrate. It has been a long road together, and it was really cool to see the 1500+ people who have worked on the project all in one place kicking back. Dmitry and I chipped in and bought some fun t-shirts that arrived just in time to hand out at the party. They immortalize one of the last stress failure stack traces that we fixed.
The official Beta2 bits should show up on MSDN shortly. Along with the bits we’ll be releasing the “Go-Live” license, which means that you’ll be able to go live with production applications on top of the beta (one of the reasons we’ve been so hardcore about fixing all stress issues the last few months). We are also in the process of updating key Microsoft internal and external sites to run on top of the final Beta2 build, starting this week. The final result of this work will be the most reliable, most scalable and fastest web platform out there.
All of us on the team are looking forward to seeing all the cool apps built on top of it. Good luck with it!