What doesn’t kill you makes you stronger (nostalgic BVT war story)
Well, eventually it will, but at least that takes some time ;-) After shipping Speech Server I felt nostalgic for a couple of days and noticed that I had started forming sentences like "When I was young and we started this project..." It’s not so bad now, but here’s one relatively fresh war story I sometimes tell people. About 1.5 years ago I was complaining about and criticizing almost everything in our existing BVT system. I’m just one of those people who enjoy criticizing and picking on little things. My private fear is that most of my coworkers find this very annoying at times, but I like to say that if Thomas Alva Edison had been satisfied with candlelight, the world would be very different today ;-) At some point we reached a situation where I volunteered to own the entire process myself and "make things right". Call it rookie foolishness/optimism ;-)
It turned out to be a very "exciting" decision. During the first three months after I started owning the BVT process for the entire Speech organization I was just scared every day ;-) Every morning I got up and repeated "I must not fear. Fear is the mind-killer." For those of you outside Microsoft - BVT-s (build verification tests) mean a great deal to us, and we care passionately about the BVT results and about getting them on time. One of the major things I immediately realized was that my manager was right again - perception is everything. Almost the only criterion by which people judged the trustworthiness of the entire BVT process was this: if they came to work at 09:00 AM and there wasn’t an e-mail with BVT results for every applicable platform waiting for them, then with high probability it was the BVT team’s fault until proven otherwise. I understood very, very quickly that solid evidence and statistics were the only way to improve the current process, save my skin, and stop receiving e-mails from everyone on the team repeating the question "What went wrong again?"
There’s one person on our team whose entire job is to make sure that the infrastructure functions, all the BVT-s are executed, problems are solved outside conventional work hours, results are reported, machines are available for investigation, etc. After looking at all the problems we were having, we simply started writing them down and documenting them on a daily basis. We used the following categories:
- Build break: the build is kicked off at some time during the night, and all the different editions of the product are built and published to the build server by the early morning hours. If the build is broken, then usually the first thing that happens in the morning is that a developer comes in and fixes the problem ASAP, and something we call a "point build" is made. But a broken build also means that you can’t run BVT-s against that specific build. Most people don’t read build e-mails, and the first thing they see when arriving in the morning is the BVT e-mail full of red lines. The general perception was - "those BVT people did something wrong again." I was certainly unhappy with this, so I started setting my alarm clock for 05:30 AM. When the alarm clock made its horrible noise I got up and looked at the build result e-mail; if the build had succeeded, I went back to sleep. If the build had failed, I replied to the e-mail saying that we had a build break, opened a blocking bug in our bug database, or sent e-mail to the relevant people saying that we wouldn’t be getting BVT results this morning because of the build break. After that I went back to sleep ;-) This actually worked pretty well because it was kind of hard to argue against this specific fact - no build, therefore no BVT-s.
- Setup break: same thing here. If there’s a problem with setup - some files aren’t installed/registered properly, setup hangs during the automated installation, or something else goes wrong - then the probability of your product’s BVT-s passing is very low. In fact, the opposite question should be asked: how can your BVT-s pass if setup is broken? We got very good at shifting blame ;-) in this area too. First thing in the morning, the person responsible for BVT-s looked at the installation process and setup logs, and if he noticed that something was wrong (a nice error message on the screen, a number of scary entries in the NT Event Log, etc.) he went ahead and opened a bug and followed up with a quick e-mail to the BVT alias. In this case too it was kind of hard to argue against the fact - no proper installation, therefore no BVT-s.
Now we’re reaching the point where other, more interesting problems started to happen. If build and setup succeeded and BVT-s failed, then the problem could generally be classified as one of the following:
- Bug in product: something is wrong with the product itself and basic user scenarios don’t work. The test organization can celebrate the fact that BVT-s caught a problem ;-) while the development organization tries to figure out how they broke the product and how they can fix it ASAP.
- Test tools are not in sync with product code: this is one of those embarrassing cases where either something in the product functionality changes and test is not aware of the change, or test is aware of the change but the code changes are not coordinated properly, and things get out of hand during the next build.
- Human error: somebody restored the wrong image, pulled out the network cable, sent out the wrong e-mail, terminated test execution by mistake, etc. Yeah, this was bad, and you can’t let it happen too often ;-(
- Problems with the tests themselves: this is the most embarrassing case that can happen to you if you own any of the BVT-s. False alarm, people are running around, and suddenly everyone understands that it’s actually the test that is incorrect. Quis custodiet ipsos custodes?
- Problems with the BVT infrastructure: nothing to say about this. No justification of any kind, just humble nodding, investigating the root cause, and making sure that the same problem won’t ever happen again.
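The triage order described above (build first, then setup, then the remaining causes) can be sketched as a small classifier. This is only an illustrative sketch in Python, not the tooling the team actually used: the `NightlyRun` record, its field names, the `classify` function, and the `UNCLASSIFIED` placeholder are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Cause(Enum):
    BUILD_BREAK = "build break"
    SETUP_BREAK = "setup break"
    PRODUCT_BUG = "bug in product"
    TOOLS_OUT_OF_SYNC = "test tools not in sync with product code"
    HUMAN_ERROR = "human error"
    TEST_PROBLEM = "problem with the tests themselves"
    INFRASTRUCTURE = "problem with the BVT infrastructure"
    # Assumption: a placeholder for failures nobody has investigated yet.
    UNCLASSIFIED = "not yet investigated"


@dataclass
class NightlyRun:
    """Hypothetical record of one night's run; the field names are assumptions."""
    build_ok: bool
    setup_ok: bool
    bvt_ok: bool
    # Filled in by a human after investigation; code cannot guess it.
    investigated_cause: Optional[Cause] = None


def classify(run: NightlyRun) -> Optional[Cause]:
    """Return the failure category for a run, or None if everything passed."""
    if not run.build_ok:
        return Cause.BUILD_BREAK  # no build, therefore no BVT-s
    if not run.setup_ok:
        return Cause.SETUP_BREAK  # no proper installation, therefore no BVT-s
    if run.bvt_ok:
        return None
    # Build and setup succeeded but BVT-s failed: only an investigation
    # can tell which of the remaining categories applies.
    return run.investigated_cause or Cause.UNCLASSIFIED
```

The ordering is the point of the sketch: a build break or setup break is decided before anyone looks at test results, which is exactly why those two categories were hard to argue with.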
Needless to say, with this amount of responsibility, stress, and tension on my shoulders I got pretty passionate about documenting everything, trying to understand what was causing the problems, and fixing the root cause as fast as possible. It took some months of dialogs like the following to restore confidence in the BVT team:
Somebody: BVT-s are broken again.
Me (looking at my notes): In fact the problem was caused yesterday by code change #123456, which caused a build break this morning. We notified the build team at XX:YY AM and here’s a copy of the e-mail.
Somebody: BVT-s are broken again.
Me (looking at my notes): Truth be told, the problem was caused by bug #654321, which we opened at YY:XX AM; the positive hand-off e-mail was sent to the development team at ZZ:WW AM. An e-mail with the status will be sent out by triage time.
Somebody: BVT-s are broken again.
Me (sighing deeply): Yes, this time it’s our fault. One of the tools wasn’t updated properly yesterday. The relevant developer is working on the task; ETA for the fix is 11:45 AM.
Every second week I sent the team a detailed e-mail with the number of problems in each category and the action items (another example of "The beatings will continue until morale improves." ;-)). After digging into every single failure for months, we finally reached a point where the number of problems started to decrease noticeably. After a while people started to trust the BVT system (at least I hope so), and when something failed it was, with very high probability, a problem with the product. Just to be clear, we’re all human, mistakes still happen, and we have to accept this fact. I tend to be a perfectionist, but I still have some understanding of the real world ;-)
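The per-category counts in those biweekly e-mails amount to a simple tally over the daily notes. Here is a minimal sketch, assuming the notes are kept as (date, category) pairs; the `summarize` helper and the sample entries are made up for illustration and are not real data from the team.

```python
from collections import Counter
from datetime import date


def summarize(notes, start, end):
    """Count documented problems per category for dates in [start, end]."""
    return Counter(category for day, category in notes if start <= day <= end)


# Made-up sample notes, for illustration only.
notes = [
    (date(2000, 1, 3), "build break"),
    (date(2000, 1, 4), "setup break"),
    (date(2000, 1, 6), "build break"),
    (date(2000, 1, 11), "test tools not in sync"),
    (date(2000, 1, 20), "bug in product"),  # falls outside the two-week window below
]

report = summarize(notes, date(2000, 1, 3), date(2000, 1, 16))

# Print the most frequent category first, as a report e-mail might.
for category, count in report.most_common():
    print(f"{category}: {count}")
```

Comparing these tallies from one two-week window to the next is what shows whether the number of problems is actually going down.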
To summarize: though this was a very stressful and quite tough experience, I don’t regret a moment of it. After owning the BVT process for more than a year (especially when you’re shipping a server product) in addition to all your other duties, there’s not much left in the software product lifecycle that can scare you ;-) The next scariest thing is probably the build process.