July 2004 - Posts
Sometimes it’s good to associate people's faces with the blog entries they write. Then when you see them walking in the cafeteria you can say “Hey, you're the person who got it all wrong in your blog ;-)” Here are two pictures of me during last week’s vacation in San Francisco. My error in judgment (not using any sunblock) meant that after the second day my face was in constant pain and I looked very unhappy ;-) That’s what happens when you’re used to Seattle and only occasionally visit California.
San Francisco Bay Area rules!
One of the things I’m constantly thinking about is how to measure the value of someone's actions which prevented bugs from reaching our customers. How do I quantify something like "If Alice hadn’t fixed this buffer overflow which Bob discovered, then three months from now we would have had to issue a security bulletin and spend $X as a result of it all." It’s very hard (or rather impossible) to prove that if we hadn’t discovered some bug then somebody in the outside world would have discovered it and we would have had a disaster on our hands. Well, when your utility accidentally overwrites the boot sector then it’s quite clearly a bad thing ;-) But what about bugs whose impact isn’t so clear? Not every memory leak is a showstopper. Not every buffer overflow is a security hole.
A fairly large percentage of bugs have a root cause which is relatively easy to fix: variables not initialized properly, a missing call to release some kind of resource, a string not being terminated properly etc. Here are a couple of situations we had while developing both external and internal tools:
- A missing call to closesocket(), and the resulting leak of socket descriptors, caused us and our partners to spend days diagnosing the root cause of the issue. The fix was very simple, just one line of code.
- Occasionally some of our BVTs crashed for no apparent reason. Every time we spent hours troubleshooting the problem, unable to understand why it happened. Finally we had an opportunity to capture the crash information and found out that the problem was a string which wasn’t properly terminated.
- We had to spend a significant amount of time while shipping our first version because one of the components we were using had a tiny memory leak. Just a couple of bytes at a time, but over a couple of days it accumulated so much that the OS started running out of virtual memory.
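To make the "one line of code" point concrete, here is a minimal sketch of the unterminated-string case. It is not the code from our tools; the function name copy_terminated and its signature are made up for illustration. The only library fact it relies on is that strncpy does not write a terminating '\0' when the source is at least as long as the count it is given.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from any real code base): copies at most
// dst_size - 1 characters of src into dst and always terminates the result.
// Returns the length of the string that ends up in dst.
std::size_t copy_terminated(char* dst, std::size_t dst_size, const char* src) {
    assert(dst != nullptr && src != nullptr);
    if (dst_size == 0) {
        return 0;  // no room even for the terminator
    }
    // strncpy leaves dst unterminated when src has dst_size - 1 or more
    // characters, which is exactly the kind of bug described above.
    std::strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';  // the one-line fix
    return std::strlen(dst);
}
```

Without the marked line, a long enough src leaves dst without a terminator, and whatever reads it later runs off the end of the buffer, often far away from the place where the copy happened.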
Probably every programmer can continue this list with hundreds of examples. There’s nothing new here, and note that I’m being very modest when talking about the cost of bugs: I’m not talking about bugs which caused products to be late, or things like Code Red or Nimda. If you have some time on your hands then check out "Collection of Software Bugs".
What tends to happen quite often IRL is that after we’ve hit some kind of blocking issue, somebody spends day and night chasing down a bug in his code and fixing it, and we pat him on the back and say "Good work, that’s the spirit!" instead of asking the question "If Trent had asked Eve to review his code before checking it in, would it still have happened?" Or let’s take a simplified example with two hypothetical development teams: team A and team B. Team A stores all their string constants in resource files because they think it’s the right thing to do going forward. Team B thinks "We’re US English only, let’s not bother." At some point the decision is made that the product needs to ship in international markets. Team A doesn’t have to do much when it comes to these string constants. Team B spends the weekend fixing the code, testers test it during the night, and finally it’s ready. Guess who looks like a hero in the public’s eye? Team B, of course, when the actual prize should have been team A’s because they followed proper engineering practices from the beginning.
But how do you effectively measure this? Do you take notes about everything during the entire year and later analyze all this? Can you even compare one person’s commitment to spend the entire weekend fixing bugs against another person’s thorough approach of using proper engineering practices to prevent these bugs from happening? Lots of things to philosophize about ;-)
Here’s one everlasting problem from real life that my colleagues and I are constantly trying to solve efficiently. The problem statement is very simple: when is the right time to stop looking for bugs in a specific component if we know that the component will be obsolete in a few months? Yes, I know that all software becomes obsolete after some period of time, but not all releases are ground-up rewrites, i.e., a noticeable amount of the code base stays the same. Also, please note that this is different from the classical question about when it's the right time to stop testing at all ;-)
Let me give you a manifestation of this situation IRL. Assume that you have a component C that you’re trying to test, and you know that during the next milestone it’ll be rewritten, or major design changes affecting the majority of its code base will be applied to it. At the same time you’re also shipping this component to customers with the current release and you’ll need to support it for years to come.
The question is simple. Should you either:
- Continue to hammer at the current implementation even after shipping and try to find all the places where access violations, race conditions, resource leaks etc. may occur. Being informed and knowing about the existing bugs and their impact is always better than sailing in the dark. The downside here is that probably all the bugs you open will be resolved as "Won’t Fix" (unless it’s something groundbreaking), the triage team’s time will be wasted, and people will start questioning whether you’re working on the right things.
- Use the following rationale: "The code will change anyway, resources are always scarce, and we should test the component so that it’s good enough for this release (Well, how do you quantify this, and how is "good enough for this release" any different from our usual quality bar?). If customer data indicates that they’re having problems with the old implementation then we’ll fix the bugs which are discovered." The downside here is that if everyone testing the component knows that they’re working on something which will be obsolete in a couple of months then it’s very clear that nobody will be spending long hours trying to discover every little bug.
There are no easy answers to this question. In my experience every situation is pretty much unique and there are lots of factors to consider. Would anyone care to share how they approach this particular decision and what things are involved in making it?
Suddenly the realization came to me that I’ve been composing posts for this blog for the last 6 months, but I haven’t mentioned Speech Server much. The main reason for this is that we’re a brand new product, just now appearing in the price lists. Therefore, compared to other servers, we don’t have books written about us, knowledge base entries, tons of traffic in newsgroups, established user groups etc. Being a new product also means that we don’t have any security vulnerabilities which have been discovered outside Microsoft. Yet ;-)
A couple of newsgroup related things you need to know about Speech Server:
- The best newsgroup to ask any questions about Speech Server is microsoft.public.netspeechsdk. Don’t let the name SDK confuse you, this is a perfectly valid place to ask questions, report bugs, and post feature requests. A number of people in our team monitor this newsgroup on a daily basis, so you should get a relatively fast response. If you don’t then please use my "Contact" link and I’ll see what I can do to speed things up ;-)
- There are two additional newsgroups: microsoft.public.speech_tech and microsoft.public.speech_tech.sdk. These newsgroups are mainly for SAPI and Speech SDK related questions.
Everyone can use Google, but here’s the PC Magazine (an authoritative enough source?) review of Microsoft Speech Server to save you some time. Hopefully the probability of somebody reading this post and using the freshly released Speech Server is greater than zero ;-)
Every one of us probably has his own passions when it comes to software engineering. Mine are assertions, design by contract, root cause analysis, and static source code analysis. It’s my sad belief that this is as close to a Silver Bullet as we’ll get during the next decade.
Based on that I also have to admit that I enjoy pretty much everything John Robbins has ever written, especially the first part of "Debugging Applications for Microsoft® .NET and Microsoft Windows®". Robbins has this thrilling style of writing where he mixes technical content with his peculiar sense of humor, untypical of most of the technical books I’ve ever read. Here are a couple of quotes from him (quotes without the context are sometimes pretty hard to understand, so if you're looking for more then pick up the book):
To avoid bugs, however, I verify everything. I verify the data that others pass into my code, I verify my code’s internal manipulations, I verify every assumption I make in my code, I verify data my code passes to others, and I verify data coming back from calls my code makes. If there’s something to verify, I verify it. (Page 84)
My stock answer when asked what to assert is to assert everything. (Page 86)
Without assertions, I felt like I was programming naked, and I knew I had to do something about it. (Page 104)
Well, I think that John Robbins is definitely on my list of cool people ;-)
[GK, 07/10/2004] Please apply 's/Not a bug/Invalid/g' while reading this post. The resolution type is meant to describe the bug quality not the correctness of application's behavior (Thanks, Larry Osterman!) Lesson learned: read and reread the stuff you post ;-)
We’re currently in the process of restructuring our bug database and here’s one thing I have wanted to do for a very long time: add a new resolution type called “Not a bug”. I seriously doubt that this proposal will be accepted, but there’s no harm in trying ;-) The current set of values we’re using for resolution is the following:
- By Design
- Not Repro
- Won’t Fix
When I put on my triage hat I’m quite passionate about making a very clear distinction between different resolution types. The only way to get any meaningful statistics and take action based on them is to make sure that your data is correct. For years we’ve been encountering bug entries which we just can’t classify under the existing resolution types. They are either bug entries which just state some basic facts without specifying the expected and actual results; bug entries with content which doesn’t make sense to anyone in the room; general suggestions which are too broad to classify under the ‘Suggestion’ type etc.
Typically we just assign the active bug we don’t understand back to the person who opened it and send a follow-up e-mail asking for additional information. There are a couple of problems with that:
- People working in the test organization tend to care more about resolved bugs than active bugs (your test organization may vary of course). In my experience it takes more time to get a response regarding an active bug than a resolved bug.
- Development leads and managers constantly monitor active bugs, and then we run into the "Why does an SDE/T or STE have an active product bug assigned to him? Is he/she going to fix it?" discussion.
- If we used any other resolution like "Won’t Fix" or "Not Repro" then this would imply that there actually is a bug in the product which we decided not to fix, or that we desperately tried to reproduce the problem but couldn’t. This would of course start playing tricks with my beloved figures ;-)
Personally I would like to resolve any bug the triage team doesn’t understand as “Not a bug” because this will keep everyone honest. Also I would assume that the term ‘Not a bug’ will have a bigger psychological impact than “Other” or something like this ;-)
The main justification I’ve been using for this is efficiency. Let’s say that there are 20 people in a triage meeting and in every triage we spend 3 minutes discussing bugs nobody understands, deciding whether we should resolve them, whether we should ask for additional information, to whom to assign them etc. In practice we just spent 1h in total of people’s time (20 people times 3 minutes) instead of making a quick decision and moving on.
It’s a bug world and I think a good way to mock my post is to say that we should also be using tabs instead of spaces, because this will help us save hard disk space ;-) This is what I call self-critical cynicism ;-)
If you know this already then feel free to ignore this post, but I discovered just today that Borland has their Blog Central at http://blogs.borland.com/. This makes me very nostalgic. I spent the first two years of my career as a professional programmer writing applications mainly with Delphi 1.0 and Delphi 2.0 ;-)
In the spirit of trying to be modern, here’s my GnuPG public key.
.plan files and home pages are obsolete. Trendy people push their public keys right into their RSS feed ;-)
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.2.2 (MingW32)
-----END PGP PUBLIC KEY BLOCK-----
This is the key's fingerprint:
E31F 29FE 74F6 60C9 9774 041A 6B16 7D90 C702 F853.
When it comes to books, IMHO "The Pragmatic Programmer: From Journeyman to Master" by Andrew Hunt and David Thomas is one of the masterpieces that anyone taking software development seriously must read. I’m pretty sure that every one of us has a personal mandatory reading list somewhere, and this one is definitely on mine. Hunt and Thomas describe the "Don't Live with Broken Windows" attitude. You can read the entire article, so I don’t need to retell the contents. Believe me, it’s worth reading.
My personal story with broken windows is the following: the first task I had shortly after starting at Microsoft in October 2000 was to develop an execution environment which gave the user a way to define different scenarios involving speech recognition engines and text-to-speech engines and execute them. In practice it was a silly little XML-based programming language and its interpreter. Needless to say, I was full of energy and excitement and took the entire thing very seriously. I tried to religiously follow some basic rules like:
- All member variables are prefixed with
- When it came to indenting the code, I used only spaces.
- Every method had to have assertions as pre- and postconditions and also a number of invariants. I used only one specific macro for assertions. If you know something about Microsoft Speech SDK then SPDBG_ASSERT should sound familiar.
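For readers who haven't worked this way, here is a rough sketch of what "assertions as pre- and postconditions plus invariants" can look like. It is not taken from the code base I'm describing: the BoundedStack class is invented for illustration, and it uses the standard assert macro as a stand-in rather than SPDBG_ASSERT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example class; every public method checks its contract.
class BoundedStack {
public:
    explicit BoundedStack(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);  // precondition
        assert(invariant());
    }

    void push(int value) {
        assert(invariant());
        assert(items_.size() < capacity_);  // precondition: not full
        const std::size_t old_size = items_.size();
        items_.push_back(value);
        assert(items_.size() == old_size + 1);  // postcondition: grew by one
        assert(items_.back() == value);         // postcondition: value on top
        assert(invariant());
    }

    int pop() {
        assert(invariant());
        assert(!items_.empty());  // precondition: something to pop
        const std::size_t old_size = items_.size();
        const int value = items_.back();
        items_.pop_back();
        assert(items_.size() == old_size - 1);  // postcondition: shrank by one
        assert(invariant());
        return value;
    }

    std::size_t size() const { return items_.size(); }

private:
    // Class invariant: never hold more items than the declared capacity.
    bool invariant() const { return items_.size() <= capacity_; }

    std::size_t capacity_;
    std::vector<int> items_;
};
```

The checks vanish in builds that define NDEBUG, so the contract costs nothing in release builds. Sticking to a single assertion macro across the code base keeps that behavior consistent everywhere.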
Occasionally some people changed the code (added some features, fixed some bugs) and then we always had the following interesting situation: whenever somebody checked in code indented with tabs I changed it to spaces, whenever somebody used ATLASSERT I went and changed it to SPDBG_ASSERT, and whenever somebody just commented out some block of code I went ahead and removed it from the code base to keep it clean. It’s been three years since I touched this code base, but some of my old coworkers still occasionally remind me of my obsessive-compulsive behavior when they see me in the hallway ;-)
For me these little things represented the act of "breaking the window". To this day I haven’t figured out whether the effort I threw into this was worth it or not. Sometimes I comfort myself with the thought that "Yeah, RefactorMercilessly, go-go-go, I was so cool (and young and clueless)!" Sometimes I wonder whether these little things really matter and whether I wasted my time. Common sense, most of my colleagues, and all the articles and books written about good-enough software tell me that of course this was wasted effort, perfection is impossible in commercial software, focus on things in priority order etc.
For me this is like a programmer's Zen koan. Hopefully one beautiful day the enlightenment will come to me.