Gunnar Kudrjavets

Paranoia is a virtue

Measuring the value of preventing the bugs

One of the things I’m constantly thinking about is how to measure the value of someone’s actions which prevented bugs from reaching our customers. How do I quantify something like "If Alice hadn’t fixed this buffer overflow which Bob discovered, then three months from now we would have had to issue a security bulletin and spend $X as a result of it all"? It’s very hard (if not impossible) to prove that if some bug hadn’t been discovered by us, then somebody in the outside world would have discovered it and we’d have a disaster on our hands. Well, when your utility accidentally overwrites the boot sector, it’s quite clearly a bad thing ;-) But what about bugs whose impact isn’t so clear? Not every memory leak is a showstopper. Not every buffer overflow is a security hole.

A fairly large percentage of bugs have a root cause which is relatively easy to fix: variables not initialized properly, a missing call to release some kind of resource, a string not terminated properly, etc. Here are a couple of situations we ran into while developing both external and internal tools:

  • A missing call to closesocket(), and the resulting leak of socket descriptors, caused us and our partners to spend days diagnosing the root cause of the issue. The fix was very simple, just one line of code.
  • Occasionally some of our BVTs crashed for no apparent reason. Every time, we spent hours troubleshooting the problem, unable to understand why it happened. Finally we had an opportunity to capture the crash information and found out that the problem was a string which wasn’t properly terminated.
  • We had to spend a significant amount of time while shipping our first version because one of the components we were using had a tiny memory leak. Just a couple of bytes, but over a couple of days it accumulated so much that the OS started running out of virtual memory.
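To make the second situation a bit more concrete, here is a minimal sketch of the kind of one-line mistake involved and a defensive alternative. It is not the code from our tools; the helper name copy_string and its return convention are made up for illustration. The underlying fact is standard C, though: strncpy(dst, src, dst_size) leaves dst without a terminating NUL whenever src is at least dst_size characters long.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from any real library): copies src into dst
 * and ALWAYS NUL-terminates the result, truncating if necessary.
 *
 * Returns 1 if the whole string fit, 0 if it had to be truncated.
 *
 * Compare with strncpy(dst, src, dst_size), which does not write a
 * terminator when strlen(src) >= dst_size.
 */
int copy_string(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);

    size_t len = strlen(src);
    /* Reserve one byte of dst for the terminating NUL. */
    size_t n = (len < dst_size - 1) ? len : dst_size - 1;

    memcpy(dst, src, n);
    dst[n] = '\0';

    return len == n;
}
```

The return value lets the caller notice truncation instead of silently carrying on with a shortened string, which is the kind of check a reviewer would typically ask for.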

Probably every programmer can continue this list with hundreds of examples; there’s nothing new here. Note also that I’m being very modest when talking about the cost of bugs: I’m not talking about bugs which caused products to be late, or things like Code Red or Nimda. If you have some time on your hands then check out "Collection of Software Bugs".

What tends to happen quite often IRL is that after we’ve hit some kind of blocking issue, somebody spends day and night chasing a bug down in his own code and fixing it, and we pat him on the back and say "Good work, that’s the spirit!" instead of asking the question "If Trent had asked Eve to review his code before checking it in, would it still have happened?" Or let’s take a simplified example with two hypothetical development teams: team A and team B. Team A stores all their string constants in resource files because they think it’s the right thing to do going forward. Team B thinks "We’re US English only, let’s not bother." At some point the decision is made that the product needs to ship in international markets. Team A doesn’t have to do much when it comes to these string constants. Team B spends the weekend fixing the code, testers test it during the night, and finally it’s ready. Guess who looks like a hero in the public’s eye? Team B, of course, when the actual prize should have gone to team A because they followed proper engineering practices from the beginning.

But how do you effectively measure this? Do you take notes about everything during the entire year and analyze them all later? Can you even compare one person’s commitment to spend the entire weekend fixing bugs against another person’s thorough approach of using proper engineering practices and preventing these bugs from happening in the first place? Lots of things to philosophize about ;-)

Posted: Jul 31 2004, 08:16 PM by gunnarku | with 1 comment(s)

Comments

Balaji said:

Recently something happened in my company on a release date which created a few heroes here, just like you mentioned. There was a connection leak in the db code, and once the system went into production it came to a grinding halt within a couple of hours because of that leak. Everyone jumped on the issue, worked throughout the weekend and fixed it. The management saw these people as heroes and rewarded them with a certificate of accomplishment!

I sincerely appreciate the hard work put in by those guys during the weekend, but the first thought that came to my mind was how this could have been prevented. We have so many QA releases and staging releases before a production release. Why was it not caught in those? If these people receive accolades for fixing something which was an error, what about the rest of us who broke our backs throughout the release cycle developing features?

When I spoke to my manager about this, he said he appreciates the "sacrifice" those guys made through the weekend and hence they were "rewarded". I understand that bugs do creep into our products no matter how much testing we do, but rewarding someone for fixing the bugs he created is a bit too much for me, IMHO! :-)

I sincerely believe that code reviews and stress testing can certainly help catch such bugs much earlier in the process.
# August 3, 2004 11:53 AM