Measuring the value of preventing bugs
One of the things I’m constantly thinking about is how to measure the value of someone’s actions which prevented bugs from reaching our customers. How do I quantify something like "If Alice hadn’t fixed this buffer overflow which Bob discovered, then three months from now we would have had to issue a security bulletin and spend $X as a result of it all"? It’s very hard (if not impossible) to prove that if some bug hadn’t been discovered by us, then somebody in the outside world would have discovered it and we’d have a disaster on our hands. Well, when your utility accidentally overwrites the boot sector, it’s quite clearly a bad thing ;-) But what about bugs which don’t have such a clear impact? Not every memory leak is a showstopper. Not every buffer overflow is a security hole.
A fairly large percentage of bugs have a root cause which is relatively easy to fix: variables not initialized properly, a missing call to release some kind of resource, a string not terminated properly, etc. Here are a few situations we ran into while developing both external and internal tools:
- A missing call to closesocket(), and the resulting leak of socket descriptors, caused us and our partners to spend days diagnosing the root cause of the issue. The fix was very simple, just one line of code.
- Occasionally some of our BVTs crashed for no apparent reason. Every time we spent hours troubleshooting the problem, unable to understand why it happened. Finally we had an opportunity to capture the crash information and found out that the problem was a string which wasn’t properly terminated.
- We had to spend a significant amount of time while shipping our first version because one of the components we were using had a tiny memory leak. Just a couple of bytes, but over a couple of days it accumulated so much that the OS started running out of virtual memory.
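To make the "string not terminated properly" case concrete, here is a minimal sketch in C of how such a bug can look and how small the fix is. This is not the code from our BVTs; the function names copy_name_buggy and copy_name_fixed are hypothetical, and I'm assuming the string was copied into a fixed-size buffer with strncpy, which is one common way to end up with a missing terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * strncpy() writes no terminating '\0' when src holds dst_size or more
 * characters, so dst may be left unterminated. Anything that later treats
 * dst as a C string can read past the end of the buffer. */
void copy_name_buggy(char *dst, size_t dst_size, const char *src)
{
    strncpy(dst, src, dst_size);          /* BUG: dst may lack a terminator */
}

/* Same helper with the fix: copy at most dst_size - 1 characters and
 * always terminate the buffer explicitly. */
void copy_name_fixed(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';             /* the one-line fix */
}
```

The buggy version works fine for every short input, which is why crashes like this show up only occasionally and take hours to track down.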
Probably every programmer can continue this list with hundreds of examples; there’s nothing new here. And note that I’m being very modest when talking about the cost of bugs: I’m not talking about bugs which caused products to be late, or things like Code Red or Nimda. If you have some time on your hands, check out "Collection of Software Bugs".
What tends to happen quite often in real life is that after we’ve hit some kind of blocking issue, somebody spends day and night chasing a bug down in his code and fixing it, and we pat him on the back and say "Good work, that’s the spirit!" instead of asking the question "If Trent had asked Eve to review his code before checking it in, would it still have happened?"
Or let’s take a simplified example with two hypothetical development teams: team A and team B. Team A stores all their string constants in resource files because they think it’s the right thing to do going forward. Team B thinks "We’re US English only, let’s not bother." At some point the decision is made that the product needs to ship to international markets. Team A doesn’t have to do much when it comes to these string constants. Team B spends the weekend fixing the code, testers test it during the night, and finally it’s ready. Guess who looks like a hero in the public eye? Team B, of course, when the actual prize should have gone to team A, because they followed proper engineering practices from the beginning.
But how do you effectively measure this? Do you take notes about everything during the entire year and analyze it all later? Can you even compare one person’s commitment to spending the entire weekend fixing bugs against another person’s thorough approach of using proper engineering practices and preventing these bugs from happening in the first place? Lots of things to philosophize about ;-)