Gunnar Kudrjavets

Paranoia is a virtue

Adventures in software root cause analysis

As we look back on what it took to build Microsoft Speech Server 2004 (MSS), we’re also wondering whether it could have been done better, cheaper, and faster. Everyone involved in software development knows that discovering bugs, finding the cure for them, and retesting the relevant parts of the product takes lots of time and money. The MSS root cause analysis (RCA) I’m involved with is focused on bugs only - we’re not doing general postmortems about what possibly went wrong with the design, whether the project management decisions were optimal, how the work-life balance was, etc. That is a subject for other studies. Currently we’re focusing on just four categories of bugs:

  1. Bugs found and fixed late in the product cycle. In short: take the last n bugs you fixed in your product and try to understand why these issues weren’t found earlier and what caused them. Was it a feature which was added late in the cycle? Was it something in test coverage which we missed? Is there a bug pattern in the way we do exception handling? The catch is knowing how to pick the value of n. After some thinking, and taking into account the relative complexity and size of our product, bug trends, and their impact, I decided that n = 100 will be good enough for me (128 would probably be more “geekish” though ;-)). The rationale is that if we decided to fix something, it was apparently important enough to fix. Important usually means that the majority of our customers would be affected if we didn’t fix the problem. The goal is to find out how it happened and how to prevent it from happening in the next milestone/version.
  2. Bugs reported by our customers and fixed during Alpha, Beta, Epsilon, and Gamma releases ;-) These are the bugs we didn’t find ourselves during our internal testing and which were important enough to be fixed. There’s lots of useful feedback coming from our customers, ranging from typos in the documentation to "Hey guys, you have a major performance bottleneck while exercising this precise scenario!"
  3. In the future: all the QFEs we will be issuing. QFE stands for Quick Fix Engineering, as everyone already guessed. For a QFE to be issued something pretty serious has to be going on: a security hole in some component, a major deployment blocker for our customers, constant service instability while running some scenarios, etc. For QFEs there’ll be a couple of learning points: a) if we knew about the bug earlier, then we should’ve fixed it instead of postponing it (pay $1 now or pay $10 later); b) if we didn’t know about the specific problem earlier, then why not?
  4. In the future: Dr. Watson analysis. As optimistic as I may be, this will happen. Either our product or something running in our process space will cause problems and we’ll see an indication of it. Here we want to behave like a greedy algorithm: go through the Dr. Watson crash dumps, figure out where the problems are, and fix first the issues which are causing the biggest number of problems.
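
To make categories 1 and 4 a bit more concrete, here is a minimal Python sketch. It is only an illustration and not our actual tooling: the Bug record, its fields, the function names, and the bucket labels are all made up for this example, and a real bug database or crash-dump store would look different.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Bug:
    """Hypothetical bug record; a real bug database has many more fields."""
    bug_id: int
    resolution: str      # e.g. "Fixed", "By Design", "External", "Won't Fix"
    resolved_on: date


def last_fixed_bugs(bugs, n=100):
    """Return the n most recently fixed bugs, newest first (category 1)."""
    fixed = [b for b in bugs if b.resolution == "Fixed"]
    fixed.sort(key=lambda b: b.resolved_on, reverse=True)
    return fixed[:n]


def rank_crash_buckets(crash_reports):
    """Greedy ordering for category 4: most frequently hit bucket first.

    crash_reports is an iterable of bucket labels, one per crash report.
    Returns a list of (bucket, report_count) pairs.
    """
    return Counter(crash_reports).most_common()


if __name__ == "__main__":
    # Placeholder bucket labels, assumed for illustration only.
    reports = ["bucket_a", "bucket_b", "bucket_a", "bucket_c", "bucket_a"]
    for bucket, hits in rank_crash_buckets(reports):
        print(bucket, hits)
```

The greedy part is just the sort order: whichever bucket accounts for the most reports gets investigated first.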

During the last couple of weeks I’ve mainly been working on getting solid base data, i.e., going through every single bug which meets the criteria specified above. As we all know, without solid data all the analysis we do will be meaningless. Data in my case means bugs and everything associated with them: lines of code which fixed something, test cases related to a specific area, bug history, triage’s decision-making process, etc. Here are a couple of basic things I’m re-discovering (2 + 2 = 4, really?) even before starting any “real analysis”:

  • Bugs must always contain accurate and very specific information. Sounds like a truism which every developer/tester knows. True, but sometimes simple things are hard to follow. All the bug attributes like priority, severity, and title need to match the actual contents. The bug resolution type (‘By Design’, ‘External’, ‘Fixed’, ‘Won’t Fix’ etc.) needs to be used correctly. Generally, the more concise and detailed your bugs are, the happier the person doing RCA will be ;-) For example, I’m looking at some bugs I edited 4-5 months ago, and sometimes it quite puzzles me what a particular comment means or how we made some decisions ;-)
  • One code change should preferably fix only one bug. This is actually a double-edged sword. Here’s one scenario - I make one change to the code base which includes a) fixing one priority 1 issue; b) fixing three different GUI glitches; c) refactoring a couple of methods. As making a code change is usually a rigorous process (buddy testing, code reviews, submitting your change to “Gauntlet”, etc.), this tends to happen quite often. The problem: are you able to determine four months later exactly which lines of code fixed the priority 1 issue? With high probability, no. On the other hand, we don’t want to waste too much of the developers’ time on bug fixing because of some artificial process and blindly following the rules. Requiring a separate code change for every bug would IMHO be overkill. Anyhow, for the next version we need to work out something that lets us perform RCA and at the same time doesn’t make everyone feel that their life is like Dilbert ;-)
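
Just to illustrate the traceability problem from the second point (this is not something we have decided on or built), one possible middle ground would be to allow a combined change but record which lines belong to which bug. The sketch below is hypothetical: the CodeChange class, its methods, the bug IDs, and the file names are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CodeChange:
    """Hypothetical record of one submitted change covering several bugs."""
    change_id: int
    # Maps a bug ID to the (path, first_line, last_line) spans that fix it.
    fixes: dict = field(default_factory=dict)

    def attribute(self, bug_id, path, first_line, last_line):
        """Note that the given line span was changed to fix bug_id."""
        self.fixes.setdefault(bug_id, []).append((path, first_line, last_line))

    def lines_for(self, bug_id):
        """Return the spans attributed to bug_id, or an empty list."""
        return self.fixes.get(bug_id, [])


if __name__ == "__main__":
    # One change that fixes a priority 1 issue and a GUI glitch together.
    change = CodeChange(change_id=1)
    change.attribute(101, "engine.cpp", 40, 52)   # made-up priority 1 bug
    change.attribute(202, "dialog.cpp", 10, 12)   # made-up GUI glitch
    print(change.lines_for(101))
```

Four months later, answering "which lines fixed the priority 1 issue?" becomes a lookup instead of an archaeology project, at the cost of a little bookkeeping at submit time.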

The next couple of months should be quite interesting with regard to RCA.

Posted: Apr 14 2004, 10:23 PM by gunnarku