May 2004 - Posts
So is vacation. At least that’s what I thought about 3-9 years ago ;-) Currently I’m getting either more mature or just softer, but tomorrow morning I’m leaving for vacation till June 1st. As a sign of personal commitment I’m not taking my laptop with me, but if my girlfriend is busy with something else then I hope to occasionally escape, find some Internet Café in New York City, and start single-mindedly reading Slashdot, the latest RSS feeds, and personal e-mail ;-) Or perhaps not. Yes, probably not. We’ll see.
What this means in practical terms is that with high probability I won’t be able to respond to your comments or e-mail.
First of all I would like to thank everyone who shared their solutions to the out-shuffle problem with the rest of us or submitted comments related to it. Looking back, it was actually three years ago that we tried to find an efficient answer to this particular programming challenge. After we had been puzzled for some time without making any progress, I contacted the most knowledgeable person in my circle of acquaintances when it comes to algorithms and data structures. His name is Ahto Truu and he is GodOfAlgorithmsAndDataStructures ;-)
Ahto writes a regular column about programming puzzles for one of the Estonian IT magazines and as a natural result he composed an article about this particular problem. The original article in Estonian is published here. On Monday Ahto sent me an updated English translation of this article. The PDF file can be downloaded from here. Here’s the table of contents:
- The Problem
- The Setup
- A Memory-Hungry Solution
- A Time-Hungry Solution
- A Divide’n’Conquer Solution
- A Combinatorial Solution
- Acknowledgments
As also pointed out in Ahto’s article, this paper by Ellis, Krahn, and Fan describes an algorithm that solves the out-shuffle problem in O(N) time and O(log N) space.
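To give a rough feeling for what a divide'n'conquer approach to the out-shuffle (rearranging a1..an, b1..bn into a1, b1, ..., an, bn) can look like, here is a minimal Python sketch. It is my own illustration and is not taken from Ahto's article or from the paper; the function names are made up. The idea: rotate the middle of the array so that each half holds its own a's and b's, then shuffle the two halves separately. It runs in O(N log N) time, but note that the recursion uses O(log N) stack, so it is not strictly constant-memory.

```python
def _reverse(items, lo, hi):
    """Reverse items[lo:hi] in place."""
    hi -= 1
    while lo < hi:
        items[lo], items[hi] = items[hi], items[lo]
        lo += 1
        hi -= 1


def _rotate_left(items, lo, hi, shift):
    """Rotate items[lo:hi] left by shift positions using three reversals."""
    _reverse(items, lo, lo + shift)
    _reverse(items, lo + shift, hi)
    _reverse(items, lo, hi)


def out_shuffle_in_place(items, lo=0, hi=None):
    """Rearrange items[lo:hi] from a1..an, b1..bn to a1, b1, ..., an, bn."""
    if hi is None:
        hi = len(items)
    if (hi - lo) % 2 != 0:
        raise ValueError("expected an even number of elements (2N)")
    n = (hi - lo) // 2
    if n <= 1:
        return  # zero or one pair is already in the right order
    k = n // 2
    # Current layout: A1 (k) | A2 (n-k) | B1 (k) | B2 (n-k).
    # Rotating the middle block A2 B1 gives A1 B1 | A2 B2.
    _rotate_left(items, lo + k, lo + n + k, n - k)
    out_shuffle_in_place(items, lo, lo + 2 * k)
    out_shuffle_in_place(items, lo + 2 * k, hi)


cards = ["a1", "a2", "a3", "b1", "b2", "b3"]
out_shuffle_in_place(cards)
print(cards)  # ['a1', 'b1', 'a2', 'b2', 'a3', 'b3']
```

Splitting at k = n // 2 keeps both halves even-sized, so the sketch does not need N to be a power of two.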
The other day a fellow lead of mine and I were chatting about general code quality issues. The person I was talking with brought up a very good point - every change you make to the code base must do what it’s supposed to do and, in addition, something else that makes the general quality of the code better. This "something else" can be very simple: rename a method to make its intent clearer, extract duplicated code into a separate function, "clean up" one class/file, or apply practically any of the other refactorings. Some of you may realize that this is basically kaizen and refactoring mixed together ;-)
The problem during a typical product cycle is that there’s something we call "open coding season", during which changes to the code base don’t need triage’s approval, and then there’s a period during which every change to the code base has to get a blessing from the triage team and go through lots of scrutiny to be accepted. There are a couple of reasons why there’s so much bureaucracy involved in getting something changed late in the product cycle. I’ll mention just some of them:
- The main rationale is that software itself is extremely complex and any little change in one module can cause a butterfly effect which could render some feature of the product totally useless. Having a triage team with development, program management, and test representatives from every related component team ensures, to some level, that there’s at least some oversight of what’s happening with the product. If one team wants to change something which could theoretically break a major scenario, affect a number of our customers, affect some other team, carry a high risk, have a big testing impact etc., then the probability that one of these people will catch it is pretty good.
- Backwards compatibility. For example, you don’t want to force all your Beta customers to rewrite all their applications because we just radically decided to change all the API-s or rename a number of fields in the DOM. A change that is very simple from the developer’s point of view - refactor the names of some fields to make them comply with new naming conventions, do a quick search’n’replace on all the test cases, change them - can cause major trouble later.
- Possibility of breaking Windows Logo requirements. If you’ve read the logo certification document you’ll understand that writing under the wrong registry key or installing some files in the wrong location may ruin your product’s chance to conform to logo certification standards.
- The list goes on: the possibility of breaking Windows user interface guidelines, violating the accessibility standards, introducing unwanted dependencies etc.
Cool, now we’ve reached the stage where everyone is afraid to make any changes because any change could possibly break everything ;-)
Fortunately it’s not as bad IRL as you may have deduced from the information above. Common sense still has a place in our triage room and some of us even know what EmbraceChange and RefactorMercilessly mean ;-) The above was just something to think about when choosing the proper strategy depending on your specific product and the phase of the milestone. There’s no absolute truth (à la "after the second RC you must not change anything which has to do with locking"), but one needs to find something in the middle between being extremely cautious and inflexible and having huge code churn every day. If you can’t change anything at all then you’re doomed anyway. I personally lean more towards the "taking a risk and changing something" side than towards being very careful. Possibly because I have never been significantly bitten by late code changes and I’m a strong believer in common sense over process. Your mileage may vary.
The good thing about being in between milestones is that life isn’t so stressful and, in addition to the planning and postmortems, there’s at least some amount of time for self-education. I discovered today that Alistair Cockburn (TheGodOfAgileSoftwareDevelopment) has a draft version of his new book "Crystal Clear - A Human-Powered Methodology For Small Teams, including The Seven Properties of Effective Software Projects" available here. Though this isn’t the final version, the very fact that more and more respected writers in the software development community publish their books for review is noteworthy. Other recent examples are Steve McConnell’s "Code Complete 2" and Keith Brown's "A .NET Developer's Guide to Windows Security".
P. S. On the same page Cockburn also has his Dr. Philos. dissertation available for viewing. In fact the entire site (articles and talks) is worth studying carefully if you want to understand AllCoolAgileRelatedStuff in depth.
Almost all the professional literature I've been reading lately is written by the following authors: Jon Bentley, Donald Knuth, and Robert Sedgewick. In the process I've also been going through a number of different programming problems. Here's one old problem which puzzled me and a number of my colleagues from the Speech Component Group (the team which is responsible for the core speech recognition engine and SAPI) about two years ago. We never came up with an efficient solution though ;-) The problem statement is pretty trivial, but be careful - it's not as simple as it seems.
The problem. An array which contains 2N elements needs to be rearranged from
a1, a2, a3, ..., an, b1, b2, b3, ..., bn
to
a1, b1, a2, b2, a3, b3, ..., an, bn.
Of course this needs to be done as efficiently as possible in terms of both computational complexity and memory usage. The best solution known to me (an old coworker of mine from Estonia came up with the algorithm) has a complexity of O(N log(N)) and uses no more than a constant amount of memory.
Can you code up a solution which has the same characteristics as the best solution known to me? Can you do better? Is it even possible to do better? If you can solve this problem under these constraints or prove mathematically that there's no better solution then you should definitely send your CV to Microsoft ;-)
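Just to pin down the target arrangement, here is a deliberately memory-hungry baseline in Python. It is only a reference sketch (the function name is made up for illustration): it uses O(N) extra memory, so it does not meet the constraints above, but it is handy for checking the output of a smarter in-place solution.

```python
def out_shuffle_copy(items):
    """Return a new list with the two halves of items interleaved."""
    if len(items) % 2 != 0:
        raise ValueError("expected an even number of elements (2N)")
    n = len(items) // 2
    result = []
    # Pair up a_i (first half) with b_i (second half) and emit them in turn.
    for a, b in zip(items[:n], items[n:]):
        result.append(a)
        result.append(b)
    return result


print(out_shuffle_copy(["a1", "a2", "a3", "b1", "b2", "b3"]))
# ['a1', 'b1', 'a2', 'b2', 'a3', 'b3']
```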
I’ll post the best solution known to me in 1.5 weeks with full credits to the original author.
Update
Apparently I'm not the only one who has been thinking about this problem lately. One former teammate of mine pointed out some related articles:
Well, it finally will, but at least this will take some time ;-) After shipping Speech Server I felt nostalgic for a couple of days and noticed that I had started forming sentences like "When I was young and we started this project..." It’s not so bad now, but here’s one relatively fresh war story I sometimes tell people. About 1.5 years ago I was complaining about and criticizing almost everything in our existing BVT system. I’m just one of those people who enjoy criticizing and picking on little things. It’s my internal fear that most of my coworkers find it very annoying at times, but I usually like to say that if Thomas Alva Edison had been satisfied with candlelight then the world would be very different today ;-) At some point we reached a situation where I volunteered to own the entire process myself and "make things right". That was rookie foolishness/optimism ;-)
It turned out to be a very "exciting" decision. During the first three months after I started owning the BVT process for the entire Speech organization I was just scared every day ;-) Every morning I got up and repeated "I must not fear. Fear is the mind-killer." For those of you outside Microsoft - BVT-s (build verification tests) mean a great deal to us and we passionately care about the BVT results and getting them on time. One of the major things I immediately realized was that my manager was right again - perception is everything. Practically the only criterion by which people judged the trustworthiness of the entire BVT process was this: if they came to work at 09:00 AM and there wasn’t an e-mail with the results of BVT-s for every applicable platform waiting for them, then with high probability it was the BVT team’s fault until proven otherwise. I understood very-very quickly that having solid statistics as evidence is the only way to improve the current process, save my skin, and stop receiving e-mails from all the people in the team in which the question "What went wrong again?" was repeated.
There’s one person in our team whose entire job is to make sure that the infrastructure functions, all the BVT-s are executed, problems are solved outside conventional work hours, results are reported, machines are available for investigation etc. After looking at all the problems we were having, we just started writing them down and documenting them on a daily basis. We used the following categories:
- Build break: the build is kicked off at some time during the night, and all the different editions of the product are built and published to the build server by the early morning hours. If the build is broken then usually the first thing that happens in the morning is that a developer comes in and fixes the problem ASAP and something we call a "point build" is made. But the fact that the build is broken also means that you can’t run BVT-s against that specific build. Most people don’t read build e-mails and the first thing they see when arriving in the morning is the BVT e-mail containing lots of red lines. The general perception was - "those BVT people did something wrong again." I was certainly unhappy with this and therefore I started setting my alarm clock for 05:30 AM. When the alarm clock made its horrible noise I got up, looked at the build result e-mail, and if the build had succeeded I went back to sleep. If the build had failed then I either replied to this e-mail saying that we have a build break, opened a blocking bug in our bug database, or sent e-mail to the relevant people saying that we won’t be getting BVT results this morning because there was a build break. After that I went back to sleep ;-) This actually worked pretty well because it was kind of hard to argue against this specific fact - no build therefore no BVT-s.
- Setup break: same thing here. If there’s a problem with setup - some files aren’t installed/registered properly, setup hangs during the automated installation, or something else goes wrong - then the probability of your product’s BVT-s passing is very low. The opposite question should be asked: how can your BVT-s pass if setup is broken? We got very good at shifting blame ;-) in this area also. First thing in the morning the person responsible for BVT-s looked at the installation process and setup logs, and if he noticed that something was wrong (a nice error message on the screen, a number of scary entries in the NT Event Log etc.) then he went ahead and opened a bug and followed up with a quick e-mail to the BVT alias. In this case, too, it was kind of hard to argue against the fact - no proper installation therefore no BVT-s.
Now we’re reaching the point where other, more interesting problems started to happen. If build and setup succeeded and BVT-s failed then the problem could generally be classified as one of the following:
- Bug in product: something is wrong with the product itself and basic user scenarios don’t work. The test organization can celebrate the fact that BVT-s caught a problem ;-) while the development organization tries to figure out how they broke the product and how they can fix the problems ASAP.
- Test tools are not in sync with product code: this is one of those embarrassing cases where either something in product functionality changes and test is not aware of the change, or test is aware of the change but the code changes are not coordinated properly and therefore things get out of hand during the next build.
- Human error: somebody restored a wrong image, pulled out the network cable, sent out wrong e-mail, terminated test execution by mistake etc. Yeah, this was bad and you can’t do it too much ;-(
- Problems with the tests themselves: this is the most embarrassing case which may happen to you if you own any of the BVT-s. False alarm, people are running around, and suddenly everyone understands that it’s actually the test which is incorrect. Quis custodiet ipsos custodes?
- Problems with the BVT infrastructure: nothing to say about this. No justification of any kind, just humble nodding, investigating the root cause, and making sure that the same problem won’t ever happen again.
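The daily bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the category names come from the lists above, but the log format, the `summarize` helper, and the sample entries are all made up and have nothing to do with the tooling we actually used.

```python
from collections import Counter
from enum import Enum


class FailureCategory(Enum):
    BUILD_BREAK = "Build break"
    SETUP_BREAK = "Setup break"
    PRODUCT_BUG = "Bug in product"
    TOOLS_OUT_OF_SYNC = "Test tools are not in sync with product code"
    HUMAN_ERROR = "Human error"
    TEST_PROBLEM = "Problems with the tests themselves"
    INFRASTRUCTURE = "Problems with the BVT infrastructure"


def summarize(failure_log):
    """Count logged failures per category, most frequent first."""
    counts = Counter(category for _day, category, _note in failure_log)
    return counts.most_common()


# Hypothetical entries, purely for illustration: (day, category, note).
failure_log = [
    ("day 1", FailureCategory.BUILD_BREAK, "no build, therefore no BVT-s"),
    ("day 2", FailureCategory.PRODUCT_BUG, "basic user scenario fails"),
    ("day 3", FailureCategory.BUILD_BREAK, "point build needed"),
]

for category, count in summarize(failure_log):
    print(f"{category.value}: {count}")
```

Writing down one category per failure is what makes a per-category count possible later; the exact storage (notebook, spreadsheet, bug database) matters much less.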
Needless to say, with this amount of responsibility, stress, and tension on my shoulders I got pretty passionate about documenting everything, trying to understand what was causing the problems, and fixing the root cause as fast as possible. It took some months of dialogs like the following to restore confidence in the BVT team:
Somebody: BVT-s are broken again.
Me (looking at my notes): In fact the problem was caused by yesterday’s code change #123456, which caused a build break this morning. We notified the build team at XX:YY AM and here’s a copy of the e-mail.
Somebody: BVT-s are broken again.
Me (looking at my notes): If truth be told the problem was caused by bug #654321, which we opened at YY:XX AM; the positive hand-off e-mail was sent to the development team at ZZ:WW AM. An e-mail with the status will be sent out by triage time.
Somebody: BVT-s are broken again.
Me (sighing deeply): Yes, this time it’s our fault. One of the tools wasn’t updated properly yesterday. The relevant developer is working on the task; the ETA for the fix is 11:45 AM.
Every second week I used to send out these detailed e-mails to the team with the number of problems in every category (another example of "The beatings will continue until morale improves." ;-)) and action items. After digging into every single failure for months we finally reached a point where the number of problems started to decrease noticeably. After some time people started to trust the BVT system (at least I hope so), and when something failed then with very high probability it was a problem with the product. Just to be clear, we're all human, mistakes still happen, and we have to accept this fact. I tend to be a perfectionist, but I still have some understanding of the real world ;-)
To summarize: though this was a very stressful and quite tough experience I don’t regret a moment of it. After owning the BVT process for more than a year (especially when you’re shipping a server product) in addition to all your other duties, there’s not much which can scare you anymore during the software product lifecycle ;-) The next scariest thing is probably the build process.
A couple of weeks ago one of the developers reporting to me had a problem with a 3rd party tool which shall remain nameless. In the middle of the execution the following error message was displayed: "Internal Error XXXX". After that the tool terminated. The person trying to understand what the problem was became furious when he realized that there is no reference to internal errors anywhere in the documentation. What does "Internal Error XXXX" mean? Why is there nothing in the tool’s log files? Why is there nothing in the NT Event Log? Who on earth would display an error message like this?
In the end it took him more than a day to figure out the root cause of the problem and solve it, but that’s not the point. After both of us had done a certain amount of complaining and whining we suddenly understood that during our careers we’ve been guilty of identical things:
- Returning just plain E_FAIL for all the different error conditions.
- Not logging clear and detailed error messages.
- Writing code which gives users error messages that are difficult to understand.
- Not documenting all the things which may go wrong.
- ...
Therefore instead of further complaining we felt regret, shame, and some feeling of depression ;-) Something one can call "programmer’s poetic justice". At least I made a promise to myself that I’ll commit to working on my "error handling/reporting skills" ;-) We’ll see.
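For what it's worth, here is a small Python sketch of the opposite habit: a distinct, documented error per failure condition, a log entry with details, and a message a user can act on. Everything in it (the `ToolError` hierarchy, the `TOOL0001`-style codes, the `load_config` function) is hypothetical and made up for illustration; it is not the API of the tool mentioned above or of any real library.

```python
import logging

logger = logging.getLogger("tool")


class ToolError(Exception):
    """Base class: every failure has its own documented code and a clear message."""
    code = "TOOL0000"

    def __init__(self, detail):
        super().__init__(f"{self.code}: {detail}")


class ConfigFileMissing(ToolError):
    code = "TOOL0001"  # documented as: the configuration file does not exist


class ConfigFileEmpty(ToolError):
    code = "TOOL0002"  # documented as: the configuration file has no content


def load_config(path):
    """Read a configuration file, reporting each failure condition separately."""
    try:
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
    except FileNotFoundError as exc:
        # Log the details for whoever investigates, then raise a specific error.
        logger.error("Cannot load configuration: %s does not exist", path)
        raise ConfigFileMissing(f"configuration file '{path}' was not found") from exc
    if not text.strip():
        logger.error("Cannot load configuration: %s is empty", path)
        raise ConfigFileEmpty(f"configuration file '{path}' is empty")
    return text
```

The point is not the particular class layout but that "what went wrong" survives all the way to the log and to the person staring at the screen.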
When the first version of a product ships it’s very natural to be extremely happy and celebrate at first, but soon you start thinking about how to improve the process or even better - how to use some cool technologies and tools to make everyone’s life easier. Here are a couple of things we’re planning to do to prevent bugs and increase code quality in the near future:
- From the personal point of view I finally decided to accept the fact that we live in the managed world. The majority of our server code has been written in C#. Most developers who move around in the C# world know that FxCop is a good thing. One improvement we decided to make is to ensure that our system which processes every addition to the code base will automagically run FxCop on the resulting assemblies, compare the results of the run against a previously specified subset of FxCop rules, and bark on every code change which causes unexpected warnings to appear. This is one of the things which everyone can incorporate into their software development process. FxCop is available for download and has an active community behind it. A couple of notes though: 1) you must spend time figuring out which rules you care about, i.e., using absolutely all the rules is almost never a good thing; 2) you must make sure that everyone in your team agrees to these rules or that at least they’re enforced by The Powers That Be. If FxCop lacks some checks you desire or you want to enforce some additional standards then feel free to use System.Reflection to do some additional checking ;-)
- The same goes for unmanaged static source code analysis tools: PREfast and PREfix. Again, there are a number of similar tools provided by different vendors which everyone can incorporate into their development process. What we currently do is run these tools on a weekly basis on our code base and analyze the results. Some diligent people run them every time before changing any line of code, some people run these tools every time after they see somebody changing some lines of code, but we don’t have a very formal and strict process yet. This will change soon.
- Test-Driven Development and unit testing are things everyone is talking about. For months now all I hear is how TDD and writing unit tests will make the world a better place. Everyone seems to be discussing the "Test Driven Development in Microsoft .NET" book in particular. Fine, I had to get a copy myself and read through it last week. Now whenever I listen to 50 Cent I catch myself humming "NUnit!" instead of "G-Unit!" Well, let’s see how this TDD thing turns out. Is it the next silver bullet or not?
A curious reader may ask why we haven’t implemented all these processes already. As the Nike slogan says - "Just Do It". The very simple answer is basic risk analysis - changing the process in the midst of coding or lockdown demands resources, increases risk, and requires time (training, experimenting, increased development/testing estimates because of new requirements etc.) which nobody ever has. A ship which is an ice-breaker is good, but occasionally people dream of a motor-boat and vice versa ;-)
William D. Bartholomew asked me in a private e-mail exchange about practices and processes for software testing or software quality assurance in a small company. Software testing and software quality assurance (QA) are two different things, but for the sake of brevity I’ll use QA when referring to either of them. While thinking about this question during the weekend I also came up with a number of related philosophical questions:
- From what size onward does a software company need dedicated QA people? Do you always need QA? Do you need it when the number of developers is greater than n? Do you need it because your software engineering professor at university said so? Before MS I worked for 2 years in one software company with fewer than 10 employees and for 2.5 years in another software company with fewer than 30 employees. During the time I worked in these companies we never had anything resembling, for example, MS QA practices. Everything was pretty simple - you designed something, you coded it up, you played around with it for some hours (sic!), and to the customer it went ;-) Surprise, surprise, we stayed in business and these companies are still in business. (GK, 05/02/2004: Just to make it crystal clear. I'm not saying that not having dedicated QA was a bad decision, probably summa summarum it was even the right one in these circumstances. That's the interesting aspect of not playing by the standard rules ;-)) I think it’s the rule rather than the exception in the software industry, where the life of small software companies is mostly about survival.
- Ethics or cost vs. benefit? The "Software Engineering Code of Ethics and Professional Practice" states very clearly that one must "Ensure adequate testing, debugging, and review of software and related documents on which they work." But what does this mean? If I have 4 developers working on a product, must I hire a dedicated QA person or can I rotate my developers to role-play QA for n days a week? Someone may now point out that I'm totally slipping - don't I know that "developers can't test" or "people who're developing have a different mindset than people who're testing". I have to admit that I never understood where these clichés came from. The best QA people I've seen are the ones who write code on a daily basis. IRL it still comes down to cost vs. benefit. Unless you measure exactly how much time developers spend fixing bugs or going to customer sites to investigate problems related to your product, or how much negative feedback you are receiving, it’s very hard to quantify how beneficial having QA will be.
Cool, here I am, questioning the role of QA. Possibly today is just a very controversial day.
But to get back to William’s question. Giving any specific advice without knowing exact details is a very slippery road. The best thing I can do is to recommend reading this book - "Under Pressure and On Time" by Ed Sullivan. It’s the best book I’ve read so far when it comes to building up something (a startup software company, a development/QA team) from scratch or starting to implement a certain process (daily builds, daily tests, quality assurance, release management, source control etc.). It covers pretty much everything I would have to say and backs it up with examples from real life. I even have to admit that I revisit certain chapters of this book about every three months ;-)
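The "bark on unexpected warnings" idea from the FxCop bullet above boils down to a simple comparison, sketched here in Python. This is only an illustration of the gating logic under my own assumptions: it does not invoke FxCop or parse its real report format, and the rule identifiers (`R001` etc.), the `(rule_id, target)` tuples, and the function name are hypothetical.

```python
def unexpected_warnings(reported, enforced_rules, baseline):
    """Return warnings for enforced rules that are not already in the baseline.

    reported and baseline are collections of (rule_id, target) tuples;
    enforced_rules is the subset of rule ids the team agreed to care about.
    """
    new_warnings = set(reported) - set(baseline)
    return sorted(w for w in new_warnings if w[0] in enforced_rules)


# Hypothetical data: one known warning, one new enforced one, one ignored rule.
baseline = [("R001", "Speech.Server.Foo")]
reported = [
    ("R001", "Speech.Server.Foo"),  # already known, accepted earlier
    ("R002", "Speech.Server.Bar"),  # new and enforced: should bark
    ("R003", "Speech.Server.Baz"),  # new, but the rule is not enforced
]
enforced_rules = {"R001", "R002"}

for rule_id, target in unexpected_warnings(reported, enforced_rules, baseline):
    print(f"Unexpected warning {rule_id} on {target}")
```

Keeping the enforced subset explicit is what makes note 1) above workable: rules nobody agreed to never block a check-in.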