April 2004 - Posts
First of all, back to basics. Here's my list of the three most useful books I've ever read about assertions (not to mention the CRT source code ;-)). I intentionally don't include any books about design by contract (DBC), formal methods, or formal verification.
- "Debugging Applications for Microsoft® .NET and Microsoft Windows®" by John Robbins. Open the book on page 85 ("Assert, Assert, Assert, and Assert").
- "Writing Solid Code" by Steve Maguire. Open the book on page 13 ("Assert Yourself").
- "No Bugs!: Delivering Error-Free Code in C and C++" by David Thielen. Open the book on page 39 ("Assert the World"). This book is interesting in the sense that I picked it up for about $4 at a used bookstore in Bellevue. It was published in 1992 (I was in 10th grade then ;-)) and gives me a rather cool personal view of how software development happened at Microsoft back then.
Let's see what development would look like if all assertions were enabled in release builds, what problems that would cause, and what questions we would need to answer. One of the first questions I would have is: what should happen when an assertion fires? Should we:
- Show the user a detailed, cryptic error message à la "NULL != p->Foo->Bar" and present the classic "Abort, Retry, Ignore" choice? I bet this would confuse everyone and make them either complain about the software or call support. And what about non-interactive software, like services?
- Not show the user anything technical, and instead use logic similar to Dr. Watson: a nice dialog explaining that some internal problem happened and asking whether we can send the information related to this incident to Microsoft?
- Log every assertion failure to the NT Event Log (or wherever the user specifies?) and provide some actionable information if applicable (and what would that be?)
- Just terminate the process, because internal assumptions have been violated and we can't guarantee that the software produces correct results?
I totally understand the point Niels Ferguson and Bruce Schneier are trying to make: if internal assumptions are violated in (cryptographic) code, then it's no longer safe to continue, and it's safer to just terminate the current process/operation. For example, let's assume we have a client and a server, both written by some financial institution. The server programmer writes an assertion that the encryption stream cipher can't be RC2, but doesn't implement a specific check in the code, because his buddy next door is implementing the client software, which would never use RC2. Under the current model, where assertions are only enabled in debug builds, if somebody implements a fake client and there's nothing in the server code preventing the use of RC2, we may have introduced a security hole. You can replace "this RC2 thing" with any other assumption you're making but not writing a specific error handler for.
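To make the RC2 example concrete, here's a minimal C++ sketch (the function name and cipher list are hypothetical, not from any real server): the assertion documents the assumption for debug builds, while the explicit check enforces it even when assertions are compiled out.

```cpp
#include <cassert>
#include <string>

// Hypothetical server-side cipher negotiation. The assertion documents
// "clients never offer RC2" for debug builds, but the explicit check
// below stays in release builds too, so a fake client can't sneak
// RC2 past us.
bool AcceptCipher(const std::string& cipher)
{
    assert(cipher != "RC2" && "clients are never supposed to offer RC2");
    if (cipher == "RC2")
        return false;                          // real, always-on check
    return cipher == "AES" || cipher == "3DES";
}
```

The point is that the assertion and the error handler are complements, not alternatives: one catches the contract violation loudly during development, the other refuses the unsafe operation in production.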
Nothing is free, especially in the server world. The very simplistic model for an assertion is: evaluate a condition that should always be true, and if it isn't, report the violation and halt.
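In C/C++ terms, that simplistic model is just the classic assert-macro expansion: check the condition, report file and line on failure, halt, and compile away entirely in release builds. A rough sketch (not the actual CRT source):

```cpp
#include <cstdio>
#include <cstdlib>

// Rough model of an assertion macro: check, report, halt.
// Compiles away entirely when NDEBUG is defined (release builds).
#ifndef NDEBUG
#define MY_ASSERT(expr)                                            \
    do {                                                           \
        if (!(expr)) {                                             \
            std::fprintf(stderr, "Assertion failed: %s, %s:%d\n",  \
                         #expr, __FILE__, __LINE__);               \
            std::abort();                                          \
        }                                                          \
    } while (0)
#else
#define MY_ASSERT(expr) ((void)0)
#endif
```

Everything that follows in this post is really about the cost and consequences of that `if` staying in the shipped binary.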
A typical use of assertions is, for example, an internal method that validates all its arguments. We're pretty sure the caller honors the contract, but at the same time we're paranoid and want to make sure that, at least in debug builds, a violation of the contract is caught immediately:
private bool SomeInternalMethod(String s, Int32 i)
{
    Debug.Assert(null != s, "null != s");
    Debug.Assert(0 < s.Length, "0 < s.Length");
    Debug.Assert(0 < i, "0 < i");
    // ... actual work ...
}
Now let's assume for a moment that this is code in an XML parser and the function is called for every node in the parse tree. Are we still willing to have all this error checking in place and pay the price for these function calls? It gets even better: I've written enough code in my life that performs complex calculations or pointer manipulation, and I usually want to make sure that after each iteration of the loop the invariant still holds.
for (Int32 i = 0; i < nNodeCount; ++i)
{
    // ... complex calculations or pointer manipulation ...
    Debug.Assert(true == VerifyTreeStructure());
}
Would I still be willing to pay this performance price?
I don't pretend to be an expert in programming or in designing secure software, but I already see a couple of things I don't like:
- If we established an absolute rule that every assertion must be enabled in production code, then for every precondition, invariant, or postcondition a programmer writes, he/she would face this set of questions: "Can I afford this performance-wise?" and "If this assertion is violated, should my application really terminate?" I bet there would be a number of instances where people just wouldn't use assertions so liberally anymore.
- Probably after doing this for some time we'd be back at square one, once somebody proposed that we should have "release assertions" and "debug assertions", which wouldn't be much different from the current situation, where every decent developer handles all the cases that may go wrong and writes assertions to validate the contracts. Then somebody would eventually raise the Orwellian question: should all assertions be treated equally? Are there some assertions we should mark as "more important" than others?
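If we ever went down the "release assertions" vs. "debug assertions" road, the mechanics would be trivial; the hard part is the policy of deciding which check goes into which bucket. A hypothetical two-tier sketch (the macro names are made up for illustration):

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical two-tier scheme: VERIFY stays on in release builds for
// the few assumptions whose violation makes continuing unsafe; ASSERT
// compiles away as usual. Deciding which assertion belongs in which
// bucket is exactly the Orwellian question above.
#define VERIFY(expr)                                                 \
    do {                                                             \
        if (!(expr)) {                                               \
            std::fprintf(stderr, "Fatal check failed: %s\n", #expr); \
            std::abort();                                            \
        }                                                            \
    } while (0)

#ifndef NDEBUG
#define ASSERT(expr) VERIFY(expr)
#else
#define ASSERT(expr) ((void)0)
#endif
```

Note that once such a scheme exists, a `VERIFY` is really just explicit error handling with terse syntax, which is more or less where we started.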
These are my thoughts on the subject, and I'll be very interested to hear other people's opinions.
P. S. Criticizing kind of feels good; you can just bash everything and not offer any solutions. I think I'll soon start writing unskilled book reviews ;-)
Let's start with the fact that I have forgotten most of the mathematics I learned in university, and therefore before going to sleep I'm able to read only non-mathematical books about cryptography, like "Practical Cryptography" by Niels Ferguson and Bruce Schneier. The book has a chapter called "Implementation Issues (I)" with a section "Quality of Code" and a subsection named "Assertions" (page 148). It starts with a very pleasant quote:
"When implementing cryptographic code, adopt an attitude of professional paranoia."
As I spent some years before joining Microsoft writing crypto-related code for financial institutions, I of course wholeheartedly agree with this idea. Later the subsection gets even more interesting. Here's another set of quotes:
"... There are some programmers who implement assertion checking in development, but switch it off when they ship the product. Who thought that up? ...
... Why would anyone ever switch off the assertion checking on production code? That is the only place where you really need it! If an assertion fails in production code, then you have just encountered a programming error. Ignoring the error will most likely result in some kind of wrong answer, because at least one assumption the code makes is wrong. ..."
People who know me professionally are familiar with the fact that my middle name is "Mr. Assertions" ;-) That's right, G "Mr. Assertions" K. This subject is so provocative to me that if I weren't so wiped out from the last couple of days (shipping server software is hard), I would write a long essay based on these quotes. But I have to get some sleep first and wait till tomorrow ;-)
If you don't know this yet: the book is available at Amazon.com for pre-ordering. The relevant link is here. I think it will be $16.99 very well spent ;-)
The mandatory disclaimer: I’m not in any way associated with Joel Spolsky or Fog Creek Software. I just think that it’ll be very beneficial to have all his essays in hard cover.
The subject of poor-quality bugs has been touched on here before: I wrote a big rant about this particular problem a month ago, and Larry Osterman covered other bug-related problems in one of his posts. In this entry I'll describe how we handled, and are handling, poor bug quality in my team (Telephony Application Services). I won't claim our bugs are perfect, but at least the participants in daily triage are way less sarcastic than they were about a year ago ;-) IMHO, making sure that people in your team report quality bugs isn't rocket science; it's just a question of basic engineering discipline and persistence.
This is the first thing to do. It needs to be very clearly communicated to everyone what information you expect to see in bugs and how it must be presented. If your bug management tool supports templates, then provide one and make sure everyone knows they must use it. If you can't use bug templates, then at least give a list of topics that must be covered in a bug. For junior people it's always helpful to explain why you need specific parts of the bug report (product build number, operating system information, log files, stack trace, etc.); people with years of development/testing experience usually know what this is all about. A typical bug entry consists of the following parts:
- General problem summary.
- Steps for reproducing the problem.
- Expected result.
- Actual result.
- Customer impact.
This may sound a little harsh, but IMHO if a person isn't able to follow specific instructions to report a bug, or isn't able to explain in concise technical terms what's wrong with something and why he/she thinks the behavior is incorrect, then that person shouldn't be working in the software industry at all. Being able to write a decent bug report that everyone else can understand seems to me like a basic skill and a precondition for a (successful) career in software quality assurance or testing.
Making it happen
I like to flatter myself with the thought that at some point I was the most unpopular person in the eyes of anyone who opened a Telephony Application Services bug ;-) Achieving good bug quality isn't so complicated: whenever somebody does something wrong, just send them an e-mail, explain why their bug doesn't meet your quality bar, and ask them not to do it again.
Our daily component triage meeting has usually taken place at 11:00 AM in my office for the last two years or so. Around 10:30 AM I start looking through all the bugs assigned to the triage alias, and for every bug that I think doesn't have enough information, I send an e-mail to the person who opened it and ask him/her to fix the problem. Our bug-tracking software has a wonderful "Send Mail..." choice in the "File" menu, which helps greatly ;-) The usual problems are:
- Steps for reproducing the bug aren't clear enough.
- The bug is missing the product build number.
- The URI for the log files referenced in the bug is incorrect.
- An AV (access violation) bug doesn't have enough information attached.
- A GUI bug doesn't have the related screenshot attached.
- The problem description isn't specific enough.
Practically, this is just a lot of basic "bug police" work. I don't like doing it, but it's part of my job responsibilities, and ultimately somebody has to perform this task. Of course, in real life it's not so easy to make bug quality problems disappear. I personally divide people who violate bug quality guidelines, and the corresponding actions, into four categories:
- The person just didn't know that some information was needed, or it wasn't clearly communicated what information a bug requires. Perfectly understandable; I've been in this position myself. The solution is simply to send a polite e-mail, point the person to the existing bug guidelines, and explain why we need this information.
- The person has already been informed of bug etiquette but continues to ignore it. In this case I send a polite e-mail explaining why we really need, for example, the log files from the moment the server crashed, sign the e-mail as "TAS triage team", and insist that the bug guidelines be followed.
- The third category is the phase when development leads start losing patience and making very cynical remarks, or developers start sending e-mails complaining that based on the information in the bug they can't do much of anything. In this case we use the powers given to us: send a polite e-mail, include the bug opener's manager on the 'Cc:' line, and strongly insist that bugs follow the guidelines. It usually ends here.
- We've never reached the fourth category, but for our triage team it would mean no longer accepting bugs from a person who constantly opens poor-quality ones unless his/her manager signs off on them. I hope we never have to do this.
Like I said before, IMHO it's just a question of discipline and holding to a certain quality bar. Poor-quality bugs simply waste everyone's time. Anyone who has read "The Mythical Man-Month" remembers the line: "How does a project get to be a year late? One day at a time!"
I have to start with the fact that lately, whenever I talk with someone about writing code in C++, I feel like a Neanderthal, or at least a very stagnant person. C# 2.0 is on the horizon, and it seems there are only a few people left who get their dose of adrenaline by playing dangerous games with null-terminated strings and the possibility of introducing another double-free bug ;-) Actually I'm just ranting; it's not so bad. But I promise, honest, I'll fully switch to C# really soon!
Last weekend I was writing some "weekend code" that involved tricky pointer manipulation, and inevitably I made a mistake: some destructors didn't get called when the application exited. Of course the operating system cleans up after the process, but I never feel good about leaving even the slightest memory leak in the code. Fortunately I remembered my old friend _CrtSetDbgFlag, found the problem in less than five minutes, and as a bonus got the motivation to write about the CRT debug heap.
Hopefully there isn't a Windows C/C++ programmer on the planet who doesn't know about the existence of the debug CRT (DCRT). The famous Bugslayer John Robbins dedicates an entire chapter of his "Debugging Applications for Microsoft® .NET and Microsoft Windows®" to this subject (Chapter 17, "The Debug C Run-Time Library and Memory Management", page 667). It tells you more about the DCRT than I ever would, but to get you started here are a couple of fundamental links:
This is just the simplest form of usage:
int tmpFlag = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG); // get the current flags
tmpFlag |= _CRTDBG_LEAK_CHECK_DF;                  // request a leak dump at process exit
_CrtSetDbgFlag(tmpFlag);                           // set the new flags
Love the debug CRT, cherish it, and use it in every one of your C/C++ applications (unless you use some other, more advanced way to manage the heap, or don't use the CRT at all). If you're a developer, go add a couple of these lines to your code; believe me, it'll make your life easier. If you're a tester, go open a bug for every C/C++ application that doesn't use the CRT's built-in memory leak detection. But make sure you read MSDN first and set the flags suitable for your specific needs. For example, _CRTDBG_CHECK_ALWAYS_DF is quite a big performance hit and may render your debug builds totally useless for testing purposes.
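Putting it together, here's a minimal self-contained sketch (the wrapper name is mine, not from any SDK); the guards let it compile anywhere, but the leak report itself only appears in a Visual C++ debug build:

```cpp
#include <cstdlib>
#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>
#endif

// Turn on automatic leak reporting at process exit. Outside a Visual C++
// debug build this compiles to a no-op, so the same code builds everywhere.
static void EnableLeakChecking()
{
#if defined(_MSC_VER) && defined(_DEBUG)
    int flags = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG); // read current flags
    flags |= _CRTDBG_LEAK_CHECK_DF;                  // dump leaks at exit
    _CrtSetDbgFlag(flags);                           // write them back
#endif
}
```

Call EnableLeakChecking() first thing in main(); in a VC++ debug build, anything still allocated at exit (say, a malloc'ed block you forgot to free) shows up in the debugger Output window with its allocation number.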
The subject of text editors is the most controversial theme I've ever touched in my blog. Just for fun I took a quick look around to see what the developers/testers in my team use. No big surprises, just lots of different editors:
- Epsilon Programmer’s Editor
- Source Insight
- Visual SlickEdit
- Visual Studio
What do I personally use? It depends on the situation. Mainly the following three editors:
- Visual Studio - for writing non-trivial amounts of code. I love IntelliSense and the integrated help; I've already learned too many shortcut keys and I'm too lazy to re-learn all of this in some new editor just to reach a similar level of productivity.
- Source Insight - IMHO a very good tool for browsing and understanding large code bases, or for quickly finding out where and how something is defined. For example, I have one project that consists of all the ATL, CRT, MFC, and Platform SDK header files. You wouldn't believe how useful reading the header files is ;-)
- gVim - I use it for everything else programming-related: quickly editing files, writing snippets of code, etc.
I've been considering switching to Emacs or XEmacs (hopefully this would also force me to learn Lisp) to feel more like a "real geek" ;-), but there's never enough time to spend the couple of days needed to reach even average efficiency.
As we look back on what it took to build Microsoft Speech Server 2004 (MSS), we're also wondering whether it could have been done better, cheaper, and faster. Everyone involved in software development knows that discovering bugs, finding the cure for them, and retesting the relevant parts of the product takes lots of time and money. The MSS root cause analysis (RCA) I'm involved with is focused on bugs only; we're not doing general postmortems about what possibly went wrong with the design, whether the software project management decisions were optimal, how the work-life balance was, etc. That's a subject for other studies. Currently we're focusing on four categories of bugs:
- Bugs found and fixed late in the product cycle. To summarize briefly: take the last n bugs you fixed in your product and try to understand why these issues weren't found earlier and what caused them. Was it a feature added late in the cycle? Was it something our test coverage missed? Is there a bug pattern in the way we do exception handling? The catch is knowing how to pick the value of n. After some thinking, and taking into account the relative complexity and size of our product, bug trends, and their impact, I decided that n = 100 will be good enough for me (though 128 would probably be more "geekish" ;-)). The rationale is that if we decided to fix something, it was apparently important enough; important usually means that the majority of our customers would be affected if we didn't fix the problem. The goal is to find out how each bug happened and how to prevent it from happening in the next milestone/version.
- Bugs reported by our customers and fixed during the Alpha, Beta, Epsilon, and Gamma releases ;-) These are bugs we didn't find ourselves during internal testing that were important enough to fix. There's lots of useful feedback coming from our customers, starting with typos in the documentation and ending with "Hey guys, you have a major performance bottleneck in this precise scenario!"
- In the future: all the QFEs we will be issuing. QFE stands for Quick Fix Engineering, as everyone has already guessed. For a QFE to be issued, something pretty serious has to be going on: a security hole in some component, a major deployment blocker for our customers, constant service instability in some scenario, etc. For QFEs there will be a couple of learning points: a) if we knew about the bug earlier, we should have fixed it instead of postponing it (pay $1 now or pay $10 later); b) if we didn't know about the problem earlier, why not?
- In the future: Dr. Watson analysis. As optimistic as I may be, this will happen: either our product or something running in our process space will cause problems, and we'll see an indication of it. Here we want to behave like a greedy algorithm: go through the Dr. Watson crash dumps, figure out where the problems are, and fix first the issues causing the largest number of crashes.
During the last couple of weeks I've mainly been working on getting a solid base of data, i.e., going through every single bug that meets the criteria above. As we all know, without solid data any analysis is meaningless. Data in my case means bugs and everything associated with them: the lines of code that fixed something, the test cases related to a specific area, bug history, triage's decision-making process, etc. Here are a couple of basic things I'm re-discovering (2 + 2 = 4, really?) even before starting any "real analysis":
- Bugs must always contain accurate and very specific information. Sounds like a truism every developer/tester knows. True, but sometimes simple things are hard to follow. All the bug attributes, like priority, severity, and title, need to match the actual contents, and the bug resolution type ('By Design', 'External', 'Fixed', 'Won't Fix', etc.) needs to be used correctly. Generally, the more concise and detailed your bugs are, the happier the person doing RCA will be ;-) For example, I'm looking at some bugs I edited 4-5 months ago, and sometimes it quite puzzles me what a particular comment means or how we made some decision ;-)
- One code change should preferably fix only one bug. This is actually a double-edged sword. Here's one scenario: I make one change to the code base that includes a) fixing one priority 1 issue; b) fixing three different GUI glitches; c) refactoring a couple of methods. As making a code change is usually a rigorous process (buddy testing, code reviews, submitting your change to "Gauntlet", etc.), this tends to happen quite often. The problem is: are you able, four months later, to determine exactly which lines of code fixed that priority 1 issue? With high probability, no. On the other hand, we don't want to waste too much developer time on bug fixing because of some artificial process and blind rule-following; requiring a separate code change per bug would IMHO be overkill. Anyhow, for the next version we need to work out something that lets us perform RCA without making everyone feel their life is a Dilbert strip ;-)
The next couple of months should be quite interesting with regard to RCA.
This month's issue of Dr. Dobb's Journal has already inspired some posts. I also have to admit that the entire issue was enjoyable reading. One piece I particularly enjoyed was "The Irony of Extreme Programming" by Matt Stephens and Doug Rosenberg, the same people who wrote "Extreme Programming Refactored: The Case Against XP". I haven't read that particular book, but the article inspired me to write my personal rant about XP.
First of all, a disclaimer: I haven't participated in any software projects involving XP or any other Agile software development methodology, so I have no firsthand experience; I'm just another theoretical critic who has no clue what he's talking about ;-) My theoretical knowledge is based mainly on the following four books from my personal bookshelf:
and the countless hours I've spent during the last two years reading related newsgroups and wikis.
XP has a number of very good ideas (constant refactoring, quality coding, unit testing, etc.) with which I wholeheartedly agree. One of the things I disagree with most in XP ideology is the common programming area (communal workspace, facilities strategy, or whatever you want to call it). Yes, I understand the reasons behind it ("Extreme Programming Explained: Embrace Change" by Kent Beck, Chapter 13, "Facilities Strategy", page 77 provides an overview), but I can't possibly see how it would make, for example, me (or my team) happy and more productive. "Peopleware" by DeMarco and Lister spends a significant portion of its pages explaining how workspace quality and product quality are tied together. Also, there is a reason why Microsoft tries to make sure that everyone has their own office ;-)
I've done my time in a communal workspace for a couple of years, and here are some things I don't like:
- Lack of concentration. Try, for example, to design a detailed cryptographic protocol or to debug a critical issue in some multithreaded code while three people are muttering about something in the same room. Design and development are activities that require deep mental concentration. I also like to take 30 minutes to an hour of uninterrupted "thinking time" every day, when I don't read e-mail or talk with anyone and just think in peace. It seems kind of funny to have to walk to a cubby to get some amount of "virtual" privacy.
- Basic privacy issues. After four hours of intense meetings I like to close my office door and listen to Depeche Mode, 50 Cent, or Paul Oakenfold for half an hour. Yes, I have heard of things called headphones ;-) Let's add to this the ability to make phone calls in private, talk with people in private, and the general concept of "my office is my castle".
To summarize: I'll believe it when I see it. If anyone has positive experiences with XP projects and communal workspaces, please let me know and educate me ;-) As always, Google gives a bunch of useful reading material for the basic queries "extreme programming criticism" or "extreme programming problems".
In the first part of this post I described the situation where I was forced to cut my own couch into pieces to get it out of the apartment. This time it's not about chopping something; it's about transporting a couch. Everything started about a week after the series of events described in the first post. A friend of mine called me, and we had approximately the following short conversation:
He: I’m moving this weekend. Can you help me with the couch?
On Saturday morning we happily gathered at my friend's place. There were four of us (before the move there were rumors that the couch was big, so some additional human resources were needed). After arriving at the scene I discovered a couple of things:
- The apartment was located on the second floor, so transporting anything would definitely involve some unconventional maneuvers on the stairs.
- The couch was big. Actually, as we eventually discovered, it was big enough not to fit through the door in any position.
- The couch was white, so getting it dirty wasn't an option either.
My cynical suggestion of slicing the couch to avoid the work apparently wasn't considered a viable option ;-) The main problem was still ahead of us: how to get this thing out of the apartment? The only solution was to take it out through the second-floor balcony. I don't even want to know how this couch was transported into the apartment in the first place; I suspect it had to come in through the balcony as well. Fine, the algorithm was simple: a) take the couch; b) get it out somehow; c) load it onto the truck and do the opposite after arriving at our destination.
Thanks to Murphy's Law, we started having complications: it began raining, and the grass was slowly getting muddier and muddier. As three of us are software engineers, we quickly came up with the 'perfect solution': let's wrap the couch in plastic. A roll of plastic appeared magically from somewhere, and we started wrapping. The wrapping process was nontrivial; we had two people constantly holding some part of the couch in the air and two people doing the wrapping. Finally we managed to wrap the couch satisfactorily and reached the next part: actually moving it.
The moving process itself was quite tricky: we had two ropes, which we connected to the couch in some complicated ways ;-) The algorithm was quite simple: all four of us helped raise the couch and move it to the balcony; then two people stabilized the couch while it was hanging in the air, while the other two quickly ran downstairs to bring it to the ground. I had enough common sense to resist being a person on the ground, but unfortunately somebody had to do it ;-( So, here we were: the wobbling couch in the air, ropes securing it, and two of us trying to land it safely. Fortunately luck was on our side and nobody got hurt. After we landed one end of the couch, the rest of the 'couch team' ran down and helped to stabilize it and carry it into the truck.
Anyhow, if you live in the Seattle area and have any plans to cut/move your couch, let me know and we can form a non-profit organization of couch geeks ;-) Why? Because we can ;-)
P. S. One thing they never tell you when asking for help with anything is that there's always a catch ;-) This time the catch was also helping to move some home gym exercise equipment. Let me tell you something: these machines are heavy! Initially we expected it would be possible to carry them through the door (positive thinking). Unfortunately this proved impossible because of the way the machines were designed. The good thing about exercise equipment is that it's possible to deconstruct the apparatus into smaller pieces and then transport the pieces through the door. I would never feel secure standing under a couple hundred pounds of wobbling steel...
When people used to ask me what I do for a living, my typical answer was: "I solve problems." Apparently, after taking a look at my short haircut and listening to my accent, they were inclined to draw lots of wrong conclusions ;-) After some experiences with this I changed my answer to "I spend my time mainly on thinking", but that made it even worse. Currently I just say that I'm a software engineer, and this seems to be good enough.
Anyhow, lately I've been thinking about two main things: defect prevention and root cause analysis. One of the things I looked at this weekend was the effectiveness of different development/test tools during the last six months of our product lifecycle. Fortunately for the AppVerifier team, I reached the conclusion that based on a cost vs. benefit analysis, AppVerifier was the clear winner!
There have already been some blog entries related to AppVerifier; I'll use this post to list the first three I found:
Two main documents to read for getting started with AppVerifier:
If you're doing driver development, use GFlags from the Windows DDK. A quote from its page: "GFlags (gflags.exe), the Global Flag Editor, enables and disables advanced internal system diagnostic and troubleshooting features. It is most often used to turn on indicators that other tools track, count, and log."
Based on our experience, AppVerifier proved extremely useful for discovering and tracking down a number of bugs from different categories: problems with critical sections, general heap corruption issues, invalid and leaked handles, everyone's favorite NULL DACL issues, etc. There were also some scary moments, à la one of my direct reports successfully doing device driver debugging at the kernel level while I looked on, finally realizing that I was a programmer once, but now I'm just a worthless PHB ;-) To summarize: AppVerifier just rules!
Word of the day: perlonked (source: Salva-Man). ‘Perlonked’- "As indicated or forced by the related, but unspecified, issue." Sample usage: "We're totally perlonked into this non-swooby solution."