Think first, 'doing' is for later

In the comments section of Ayende's blog, I recently debated the usage of principles like the ones in SOLID and argued that these principles aren't really the important thing to focus on: instead, people should focus on thinking. In the Netherlands we have an old saying, "Bezint eer ge begint", which roughly translates to "Think everything through before you start". Now, before I wake up the anti-Waterfall people, I'd like to add that this post isn't about Waterfall at all. Instead, I'd like to outline how I write my software and how thinking is an essential part of every step of the whole process, and illustrate with an example that some extra time spent on the thought process before writing any code is very valuable.

The main thing to understand about this post is that it's not so much a guide to designing a whole system as it is about how to approach and successfully implement features. Features are pieces of functionality the software needs to have, at various abstraction levels. TDD people might describe them as user stories; I stick with 'feature' because I don't use TDD, so I use a different name to avoid confusion. The example feature I'd like to use is the following: say you are working on an O/R mapper designer and you have the ability to define value types. Value types are types which have one or more attributes (fields) but don't have an identity of their own like Entities do.

A good example is 'Address', with the fields 'StreetName', 'City', 'Zipcode' and 'HouseNumber', but you could also define a value type with just one field, like 'EmailAddress' with the field 'Value'. A value type with just one field can be useful if you want to place logic related to that single field (like, in this case, the check whether the email address value is valid) with that field, and do so just once: every time you need to define an email address field in an entity, you can set its type to 'EmailAddress' instead of 'string' and validation is built in, as you already wrote it once for the value type 'EmailAddress'.
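As a rough illustration (a hypothetical hand-written sketch, not the output of any particular designer), such a single-field value type could look like this:

using System;
using System.Text.RegularExpressions;

// Hypothetical single-field value type: the validation lives with the
// type, so every entity field typed as EmailAddress gets it for free.
public struct EmailAddress
{
    private readonly string _value;

    public EmailAddress(string value)
    {
        // Deliberately naive stand-in for real email validation.
        if(value == null || !Regex.IsMatch(value, @"^[^@\s]+@[^@\s]+\.[^@\s]+$"))
        {
            throw new ArgumentException("Not a valid email address.", "value");
        }
        _value = value;
    }

    public string Value
    {
        get { return _value; }
    }
}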

This feature raises a couple of problems, one of which is pretty complex: if you add the ability to add new fields to a value type definition, you run the risk of creating an infinite loop: a field in Address which itself is of type Address. Or worse, you could have a field Foo inside Address which has a field which is of type Address. How do you prevent that from happening? I'll show you how to solve this in a general way, and also how to get there.
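To see why such a loop is fatal, consider what hypothetical generated classes for these value types would look like (an illustrative sketch, not actual designer output):

// Address has a field of type Foo, and Foo has a field of type Address:
// constructing either type recurses forever (a StackOverflowException in
// practice), so the designer has to reject such a definition up front.
public class Address
{
    public Foo ExtraInfo = new Foo();
}

public class Foo
{
    public Address Location = new Address();
}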

Let's go back to our feature: adding a field with a valid type to a value type. Which steps should we take to add this feature successfully? They're the same steps I take with every feature; I've summed them up briefly in the list below:

  1. Think first. 'Doing' is for later. Think everything through, use reasoning to learn more about the feature, try to think of all possible problems related to the feature and possible solutions.
  2. Analyze what you need based on step 1. Try to find generic, proven algorithms and data-structures which might be what you need, and build on top of that.
  3. Document the results of step 1 and your analysis result from step 2: write which alternatives you've investigated and which one you've picked and most importantly, why.
  4. Prove your algorithms first. This is not that hard and doesn't require any code.
  5. Implement algorithms / data-structures as designed. Then check whether you implemented the algorithm correctly. This is doable by simply looking at what you wrote. How you implement it is dictated by the algorithm: it defines which steps to take.
  6. Test the implementation of what you wrote. As the algorithm is already proven, the tests prove the implementation. If you want, you can also start with this before 'Implement algorithms' by working towards the implementation using mocking and tests. That's not the point; the point is that the code is the end result, not the start.

I numbered the list above so I can easily refer to the steps. Let's address them one by one and along the way see how our example plays a part in this.

Step 1: Think first

This might sound like pretty common sense, but it's the most important step. You should spend considerable time in this step, thinking things through: what is the feature all about? Which side effects can be recognized? Should you split it up into various sub-features/parts? If you don't feel comfortable with what you have discovered during this step, if you don't really have a good feeling about how to approach it further, you shouldn't proceed yet. This step is the foundation of what you will do in the future steps. What's also important to notice: no coding. This is all thinking, not doing. The 'doing is for later' isn't there for nothing. The biggest mistake people can make is to act before thinking things through. You'll see why in a bit.

In the feature at hand, adding a field to a value type, we can discover several things: we have to validate the field's name within the value type, and the field has to be correct by itself (e.g. it has to have a valid type, name, etc.). We also recognize the necessity of avoiding an infinite loop within a value type, as described above: we can't add a field F to value type VT if by adding that field an infinite loop is created inside VT, or indirectly in a containing type of VT. A field can also be given a different type later on, which likewise must not create such a loop. We can furthermore recognize the requirement that when we add a field to a value type VT, this field also has to be mapped onto a target, but for the sake of simplicity we don't go further into the mapping scenarios here.

I also leave a notification system for other subsystems, and undo/redo, outside the scope of this post, but the requirements for these aspects are the same: they too have to be thought through. How do we solve them, have we solved them before, and if so, how did we do that? An experienced software engineer often solves things the same way, as s/he already knows what to do when various kinds of small problems occur. Software engineers who aren't that experienced are often faced with these decisions and don't really know what to do. If you find yourself in such a position, don't be arrogant and deny it, but simply take the steps of thinking it through, and check (based on theory!) whether you have made previous attempts to solve the same problem elsewhere.

After thinking the feature through, we look at our findings and see that for the most part there's some straightforward validation needed for the name and the correctness of the field, and also a complex validation for the infinite loop protection. As we have thought about our feature and its implications, we can proceed with the next step. Though don't have the illusion that the thinking is now done; the next steps still require a lot of thought.

Step 2: Analyze what you need

When analyzing the results of step 1, we see there are three kinds of validation we have to solve:

  1. Validation of the field itself (does it have mandatory properties set?)
  2. Validation of the field within the value type it is added to (does it have a unique name within the value type?)
  3. Validation across all value types (does it directly or indirectly create an infinite loop?).

The validation for 1) is straightforward, and we have several standard approaches for it: add a validation method to the class, or a property which signals that the field is valid, where the field internally maintains that value by checking itself whenever it changes. There might be other approaches as well; all have their strong points and weak points. As this is a common problem, it's likely that it has already been solved in an earlier feature, so we can leverage that analysis and see whether it applies here too.
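A minimal sketch of the property variant (hypothetical names, and with the value computed on read instead of maintained on write, a simplification of what a real designer class would do):

using System;

public class FieldDefinition
{
    public string Name { get; set; }
    public Type FieldType { get; set; }

    // Signals whether the field itself is valid: all mandatory
    // properties have to be set.
    public bool IsValid
    {
        get { return !string.IsNullOrEmpty(this.Name) && (this.FieldType != null); }
    }
}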

The validation for 2) is similar to 1); we simply apply it at the value type level instead of at the field level. We can re-use our analysis results and verify whether they apply here too. Validation 3) is something else though: how to approach this? It depends on what you like most: visual approaches on a whiteboard, or reasoning in theory without images. The key is that you avoid writing code. This is perhaps going to be considered a bad thing in some people's eyes, but it's essential here. Writing code makes you fall into the trap of thinking that the code is what you are trying to express in executable form, but it's not: you haven't yet thought of what you want to express in executable form (as you haven't decided which algorithm / data-structure to use!), so you can't possibly write code which represents that.

The problem of 3) looks pretty straightforward at first, until you discover that it's actually less simple, because you have to investigate a lot of different paths and different situations: the value type VT1 you add the field to might be inside another value type VT2, which is inside another value type VT3, so the field to add isn't allowed to be of type VT1, VT2 or VT3. Finding all those paths requires an algorithm which keeps track of all kinds of things, which makes it less easy to implement, maintain and test. So what's there to do?

We first step away from the idea that we're the first to solve this kind of problem. To find a general, theoretic solution which is already proven to be correct, we have to make our own problem more generic and realize that others before us have already solved this same problem with theory (not code! This isn't about copying pieces of code from Google, it's about re-using theory). In this case, we have to analyze when an infinite loop occurs in our situation; in short: what are the criteria for that? It comes down to: which types should a field F not have if we add it to value type VT1? Obviously VT1, but which value types should we also avoid? If you draw a picture of this on the whiteboard, you'll see that every value type known in the system (so every value type the field F can have) which directly or indirectly refers to VT1 is not allowed, as setting the type of F to one of these value types will automatically create an infinite loop.

If you don't see it directly, draw a picture on a piece of paper or a whiteboard of the following: VT3 points to VT2. VT2 points to VT1. VT4 also points to VT2. VT5 points to VT6. How do we find the value types referring to VT1 directly or indirectly? Look at your drawing on the whiteboard. It will likely look like a graph, a directed graph. Graphs are one of the most important data-structures you'll need in software engineering. One of the key aspects which makes graphs so great is that for graphs (directed, non-directed) a lot of algorithms have been discovered and described (and proven to be correct) which we can re-use without any effort. The nice thing about these algorithms is that they often solve problems we face every day (like ordering elements which are related to each other, or this one: which paths are there?).

What we need here is the transitive closure of the graph. The transitive closure gives all pairs of vertices ('nodes') in the graph which are directly or indirectly connected to each other. So in our graph example above, it will give VT3 - VT2, VT3 - VT1, VT2 - VT1, VT4 - VT2, VT4 - VT1, VT5 - VT6. This means that I can travel from VT3 to VT2 and VT1, from VT2 to VT1, from VT4 to VT2 and VT1, and from VT5 to VT6. This also means that all value types from which I can travel to VT1 are not allowed, as these all refer directly or indirectly to VT1! This leaves us with VT5 and VT6, which are the only value types allowed for a field F added to VT1.

We're not done yet. If you dare to look at the wikipedia page I linked to above, you'll likely stall at the complex math formulas you're faced with. Don't worry, these are part of the mathematical background of the concept. As we're software engineers, we need to look for theory about an algorithm which computes the transitive closure of a graph G, so we can implement it. As this is a problem which has been solved before us, some very clever people have already come up with what we're looking for: there's a simple, proven algorithm for transitive closure: the Floyd-Warshall algorithm.

You might now wonder how you would have gotten to the same conclusion. One of the most important steps is to avoid home-brew algorithms unless you really, really have to. A typical bad habit of software engineers (and don't worry, I also still make that same mistake once in a while) is that they think they're the first to face a problem, and also that nothing generic has been discovered yet which could possibly help them out, so they have to cook something up themselves: the home-brew algorithm. The downside of home-brew algorithms is that they're not yet proven to be correct. A general algorithm like Floyd-Warshall is proven to be correct, so we can skip that important step: we don't have to think about whether it will work with whatever graph we throw at it; it will. We just have to type in the three nested loops and we're done.
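To make those three nested loops concrete before we get to the real implementation in step 5, here's a minimal, self-contained sketch which runs the Floyd-Warshall transitive closure over a boolean reachability matrix for the VT1..VT6 example above (illustrative only, not the designer's implementation):

using System;

class TransitiveClosureDemo
{
    static void Main()
    {
        // Vertices 0..5 stand for VT1..VT6. reach[a, b] == true means
        // "VT(a+1) refers directly or indirectly to VT(b+1)".
        const int n = 6;
        bool[,] reach = new bool[n, n];
        reach[2, 1] = true; // VT3 -> VT2
        reach[1, 0] = true; // VT2 -> VT1
        reach[3, 1] = true; // VT4 -> VT2
        reach[4, 5] = true; // VT5 -> VT6

        // Floyd-Warshall: after the k-th outer iteration, reach[a, b] is
        // true iff a path from a to b exists using only intermediate
        // vertices 0..k. Three nested loops, O(n^3).
        for(int k = 0; k < n; k++)
        {
            for(int a = 0; a < n; a++)
            {
                for(int b = 0; b < n; b++)
                {
                    reach[a, b] = reach[a, b] || (reach[a, k] && reach[k, b]);
                }
            }
        }

        // A type is allowed for a new field in VT1 if it isn't VT1 itself
        // and doesn't reach VT1. This prints VT5 and VT6.
        for(int v = 0; v < n; v++)
        {
            if((v != 0) && !reach[v, 0])
            {
                Console.WriteLine("VT" + (v + 1) + " is allowed as field type in VT1");
            }
        }
    }
}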

Now that we've done our analysis and have come up with good candidates for solving the problems this feature poses, we can proceed with the next step.

Step 3: Document the results

Software engineers tend to dislike documenting their work. However, it's very important that we document our thought process, and especially why we took decision A and not B. The point of documenting the design decisions, including the alternatives that weren't chosen, is that when we face a similar choice again (for example, when in two years we have to alter this piece of code and look up why the feature was designed with a graph and a transitive closure algorithm), we'll learn from the design decisions that the graph approach was a good one because it was a proven path: we don't have to worry about whether the algorithm gives us all the paths we're interested in, it does. So we can keep the implementation and don't have to worry about alternatives being better: we can learn from the documentation we made that the alternatives are not better. The documentation for a feature like this doesn't have to take ages to write; it might even be half a page, as long as it contains the information that explains why which alternative was chosen.

After documenting our findings from the first two steps, we proceed with the next step.

Step 4: Prove your algorithms first

Here we'll see that what we have invested in pays off. We use a proven algorithm, so we don't have to do any work: it is already proven to be correct. Had we chosen an algorithm we designed ourselves, we would have to prove that it works. This can be a bit time consuming and, I'll admit, boring, but it is worth it. One of the key aspects of proving an algorithm is that you think each step through: you define the pre-/post conditions and what will make it go wrong, and perhaps you'll even spot flaws in the algorithm and have to start over. It's easier to do this without code, because code can contain bugs due to a bad implementation. If you write tests for your code (or start with tests when implementing code), your tests not only test the implementation but also the algorithm. If a test fails, is that due to an algorithm error, an implementation error, or both? If you prove your algorithms first, you know it's an implementation error.

Some say that this step is not doable, or even worthless. However, it's easier than you might think: write down pre/post conditions for each step, and reason about the algorithm you designed: when will each step break, and why won't it? It doesn't have to result in hard-core math; it's often enough to think through, for each step of the algorithm, what the pre/post conditions are and when the step will fail, to spot problems.
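As an illustration, here's what such reasoning could look like for the closure-based check of our feature, written down as a hypothetical helper (all names are made up for this sketch):

/// <summary>
/// Hypothetical helper illustrating the pre/post condition reasoning.
/// Pre: 'closure' is the transitive closure of the value type reference
/// graph: closure[a, b] == true iff type a refers directly or indirectly
/// to type b.
/// Post: returns false iff giving a field of 'owner' the type 'candidate'
/// would create a reference cycle.
/// </summary>
public static bool IsAllowedFieldType(bool[,] closure, int candidate, int owner)
{
    // candidate == owner would close a cycle of length 1; an entry
    // candidate -> owner in the closure would close a longer one.
    return (candidate != owner) && !closure[candidate, owner];
}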

As we've already proven our algorithm by pointing to the work of others (Floyd and Warshall), we can move on to the next step.

Step 5: Implement algorithms / data-structures as designed

This is the step where we actually write code. For the people using TDD: you will likely combine this step with the next one by writing tests first and using mocking to work towards a working implementation, but that's really semantics. In step 2, we decided to use a directed graph algorithm, so we need a graph data-structure which can handle directed edges. Writing a graph class is straightforward; the only real decision is how to store which vertices are connected: using adjacency lists or an adjacency matrix. Both have strong points and weak points; you can learn more about them by reading the linked wikipedia articles. You can also decide to use a prefab graph class, as there are several graph classes already written for .NET; it's up to you.
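For reference, a bare-bones adjacency list version could look like this (a hypothetical minimal sketch; a production class such as Algorithmia's DirectedGraph also tracks edge objects, raises change events and so on):

using System.Collections.Generic;

// Minimal directed graph using adjacency lists: each vertex maps to the
// set of vertices it has an outgoing edge to.
public class SimpleDirectedGraph<TVertex>
{
    private readonly Dictionary<TVertex, HashSet<TVertex>> _adjacency =
        new Dictionary<TVertex, HashSet<TVertex>>();

    public void AddVertex(TVertex vertex)
    {
        if(!_adjacency.ContainsKey(vertex))
        {
            _adjacency[vertex] = new HashSet<TVertex>();
        }
    }

    public void AddEdge(TVertex from, TVertex to)
    {
        AddVertex(from);
        AddVertex(to);
        _adjacency[from].Add(to);
    }

    public bool ContainsEdge(TVertex from, TVertex to)
    {
        HashSet<TVertex> successors;
        return _adjacency.TryGetValue(from, out successors) && successors.Contains(to);
    }

    public IEnumerable<TVertex> Vertices
    {
        get { return _adjacency.Keys; }
    }
}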

Once we have the graph class, we can implement the Transitive Closure algorithm. As this algorithm is really just three nested loops, it's straightforward. Below is the Transitive Closure implementation of Algorithmia's DirectedGraph class:

/// <summary>
/// Returns the transitive closure of this graph using the Floyd-Warshall algorithm.
/// See http://en.wikipedia.org/wiki/Transitive_closure and http://en.wikipedia.org/wiki/Floyd-Warshall_algorithm.
/// </summary>
/// <returns>The transitive closure of this graph.</returns>
public DirectedGraph<TVertex, TEdge> TransitiveClosure()
{
    if(this.EdgeProducerFunc == null)
    {
        throw new InvalidOperationException("The EdgeProducerFunc isn't set to a value.");
    }
    DirectedGraph<TVertex, TEdge> result = new DirectedGraph<TVertex, TEdge>(this, this.EdgeProducerFunc);

    // 'i' is the intermediate vertex: when edges j -> i and i -> k exist,
    // the edge j -> k is added, which is exactly the Floyd-Warshall step.
    foreach(TVertex i in this.Vertices)
    {
        foreach(TVertex j in this.Vertices)
        {
            foreach(TVertex k in this.Vertices)
            {
                if(!j.Equals(i) && !k.Equals(i))
                {
                    if(result.ContainsEdge(j, i) && result.ContainsEdge(i, k) && !result.ContainsEdge(j, k))
                    {
                        result.Add(this.EdgeProducerFunc(j, k));
                    }
                }
            }
        }
    }
    return result;
}

Is this implementation correct? We can of course test it with some unit tests. In addition to those tests, we can check whether we have implemented everything correctly by looking at the steps in our algorithm and then going to the code to see whether we made a proper projection of each step to its executable form: the code. If not, we have to change the code, not the algorithm. Don't think lightly of this, and be careful not to cut corners by 'assuming' it is OK. A human isn't a very good source code interpreter, but the developer of the code should at least try hard to check whether the implementation is indeed how it should be. That alone will catch obvious bugs and mistakes; as the algorithm is correct, we only have to worry about the implementation.

Several implementations are possible for a given algorithm. It depends on how you interpret each part of the algorithm and how you think it's best to implement those parts. This might cause problems, but these are caught by this step as well, because you have to verify that what you wrote is indeed what you should have written. This is also where code reviews come into the picture, and the reason why they work: other people will look into how an algorithm is implemented exactly, and as each reviewer makes that same projection from algorithm to code themselves, any difference between what they would have written and what they're reviewing will trigger discussion, and will in the end result in better code.

One might wonder whether this is really worth the effort. Yes, it is. The exact same algorithm implementation above, for example, is also used to find inheritance loops when inheritance is defined between entities (it's in fact the same problem), and you can solve other problems with it as well. Furthermore, the more time you spend on making sure the code is actually of high quality and correct, the less time you'll spend later on maintenance or bug-fixing, as the code, and more importantly the reasoning behind it, is well understood, debated, analyzed, documented and considered the best alternative from the list of possible options.

For this particular feature we picked a data-structure and algorithm which were already well-known. Now that we've implemented them in our code base, we can re-use these general building blocks whenever we need to solve similar problems. The advantage of well-known data-structures and algorithms is that they're not subject to change, because they don't evolve: their definition is well-formed and the problems they solve are well defined, so it's safe to base your own code on precisely these kinds of building blocks instead of on home-brew data-structures and algorithms. Always try to use general, well-known data-structures and algorithms. The more the better, and once implemented, you can re-use them without worry: a transitive closure is always a transitive closure, a topological sort is always a topological sort, and so on.

Let's move on to the final step.

Step 6: Test your implementation

If you're using TDD, you likely already have the tests written as part of the previous step. If you're not using TDD, you should now test your implementations of the parts from the previous step. As the algorithms and data-structures are already proven to work, we can focus on whether we have written the right code to implement them. Because of the work in the previous steps, we have in-depth knowledge of how things should work, what each step in the algorithm should do, and thus what happens in each step of the feature. This leads to tests that target the cases where we can expect problems, for example around the pre/post conditions we found. Keep in mind that in our example there is an unlimited number of possible graphs, and you can't write tests for every single one of them. Using proven algorithms helps tremendously here, as the proof of the algorithm already saves you from the quest of testing every possible input. Let me violate DRY and repeat myself: always try to use general, well-known data-structures and algorithms. The more the better.
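As an illustration, a test for the closure kernel could look like this (NUnit-style, hypothetical; it repeats the boolean matrix sketch from step 2 so it stands on its own):

using NUnit.Framework;

[TestFixture]
public class TransitiveClosureTests
{
    // Same Floyd-Warshall kernel as in the earlier sketch, duplicated
    // here so the test is self-contained.
    private static bool[,] Closure(bool[,] edges, int n)
    {
        bool[,] reach = (bool[,])edges.Clone();
        for(int k = 0; k < n; k++)
        {
            for(int a = 0; a < n; a++)
            {
                for(int b = 0; b < n; b++)
                {
                    reach[a, b] = reach[a, b] || (reach[a, k] && reach[k, b]);
                }
            }
        }
        return reach;
    }

    [Test]
    public void IndirectReferencesAreFoundAndUnrelatedTypesAreNot()
    {
        // VT3 -> VT2, VT2 -> VT1, VT4 -> VT2, VT5 -> VT6 (0..5 = VT1..VT6)
        bool[,] edges = new bool[6, 6];
        edges[2, 1] = edges[1, 0] = edges[3, 1] = edges[4, 5] = true;

        bool[,] reach = Closure(edges, 6);

        Assert.IsTrue(reach[2, 0]);  // VT3 reaches VT1 indirectly
        Assert.IsTrue(reach[3, 0]);  // VT4 reaches VT1 indirectly
        Assert.IsFalse(reach[4, 0]); // VT5 doesn't reach VT1: allowed
        Assert.IsFalse(reach[5, 0]); // VT6 doesn't reach VT1: allowed
        Assert.IsFalse(reach[0, 0]); // no self-loop was introduced
    }
}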

When our tests succeed, we can be pretty confident that our feature works. The way we implemented it, from start to end, has given us a lot of valuable assets: we have documented design decisions for later use, for ourselves but also for others who have to maintain our work; we have written generic data-structure and algorithm classes based on proven algorithms, which we can re-use in a lot of other situations; and we have a working feature based on theory, where the reason the code is the way it is can be traced back to the theory and information we formulated and collected in the early steps.

Conclusion

In this post I tried to shed some light on a different form of software engineering, one which uses a thought process to base code on a solid theoretic foundation. I deliberately avoided any pattern, fancy methodology or principle like the ones from SOLID, because I wanted to focus on the thought process behind writing code: why we write the code we write, and the fact that this process doesn't start in a code editor, it ends there; it starts somewhere else. I have tried to illustrate how I write my software on a daily basis and hope it is valuable to others.

Further reading

For those interested in reading more about algorithms and data-structures, I've compiled the small list of links below.

14 Comments

  • Frans, you've made a rather stupid mistake here:
    - You're using a "shortest path" algorithm here, which requires t ~= O(N^3). I definitely can't imagine why this idea came to you.
    - You just need "topological sorting" (dependency finding) or anything similar, which requires t ~= O(N). I.e. it is ~ N^2 times faster on N nodes.

    Hopefully you don't use the same approach while detecting the sequence of persistence of entities on commit, where count of entities to sort can be much higher than typical N = 10...20 here.

    And... We also have a folk saying in Russia related to the subject of this article: measure seven times, cut once ;)

    P.S. DO4 has the same feature - Structures.

  • Forgot to add - if the idea is just to detect the possibility to declare a VTX field in VTY, topological sort is not necessary. Depth-first or breadth-first search suits well here, with the same O(N) metric.

  • @Alex: shortest path is one of the things you can use Floyd-Warshall for; another is transitive closure (see also the wikipedia article). Topological sorting for this? I don't see how that would work, or you'd have to create a special depth first search (DFS) graph traverser for it. Topological sorting just gives back an order which shows dependency, which could be used for this, however you also have to make sure VT5 and VT6 aren't seen as possible elements in the same subgraph. With the topological sorting results alone you aren't there; you have to detect it along the way (so you can't look at the result, you have to look at the graph while sorting) and I'm not sure that's sufficient. Transitive closure is sufficient; it fits the problem precisely.

    For ordering of entities to save, sure I use a topological sort on a graph :) No worries :)

    I like the russian saying btw. :) Good luck with DO4. If you want us to support DO4 in LLBLGen Pro v3's designer, let me know.

  • Do I understand you correctly: you're saying you just need the TC (transitive closure) result because it suits.

    But the original problem is: "We also recognize the necessity of avoiding an infinite loop within a value type as described above: we can't add a field F to value type VT if by adding that field, an infinite loop is created inside VT, or indirectly in a containing type of VT."

    The solution of exactly this problem is DFS. You add a field and use DFS to find a loop. TS (topological sort) will work here as well - with the same algorithmic efficiency. All you need is to find out if at least one loop is returned by TS.

    Obviously, if you want to maintain a kind of map indicating the possibility to add VT(X) to VT(Y), you'll need O(N^3) to build it (= O(N^2) * O(N), where O(N^2) is the time you need to try all pair combinations of VT(X) and VT(Y)). But actually I don't see ANY reason to build such a map instead of checking a particular case on demand. In fact, you're just wasting time on the 99.9...% of cases the user simply won't try (but you do).

  • > If you want us to support DO4 in LLBLGen Pro v3's designer, let me know.

    This looks attractive, taking into account our "code first, and only code first" approach. I must think about this. What are the supposed conditions?

  • @Alex: the whole point of the article isn't to find the most efficient algorithm for the infinite loop problem; it's to illustrate how to design software. The infinite loop problem is an example to illustrate the point. In step 2, it might turn out that additional algorithms (e.g. a DFS based algorithm with path collecting) are also suitable. That's perfectly fine; in fact, one should dig up these algorithms, check whether they are better suited or not, and document that in step 3. For this particular case, I can imagine one would still pick TC instead of a DFS based algorithm, because _all_ types which are valid have to be found (to show them in a list, for example ;)) and the TC algorithm gives them right away, while the DFS based algorithm requires more coding and is more complex (including the DFS searcher with vertex coloring etc.)

    About v3 of LLBLGen Pro: I'll mail you about that later this year. V3 is a designer which will do model first (as well as db first) and supports all kinds of runtime frameworks, like ours, NHibernate, Entity Framework, and we also want to support other O/R mapper frameworks.

  • Frans, why don't you publish my later comment?

  • @Alex: what later comment? In this piece of j*nk weblog software control panel, all comments are published? (I see just 5 comments by you in the control panel... and I checked whether they're indeed all published which they are)

  • > what later comment?

    Frans, sorry - they're all published now. Likely, I simply missed this.

    > "You made a rather stupid mistake" might not be the optimal way to start a civilised discussion...

    Yes, thanks. Frans, I apologize for this. I understand it was really not a good tone to start from.

    Concerning supporting other frameworks: maybe you know that we have a very solid SQL DOM part in DO4. Actually I don't know any alternative to it that is as comprehensive. In particular, there is a database model & schema extractor. So if you need it, we can discuss this.

  • @Alex: no worries about the tone :)

    About the sql dom: I already wrote that. The point is more that you can manage your model at an abstract level where you both produce code/mappings from and also the relational model. Wait and see, if you want to have us add support for your framework later on, no problem.

  • @Alex: It's all about design time, so not runtime, as it generates mapping + code (if you want code) for the framework of the project; so if you have an EF project, it creates the EF xml file and classes. The designer is about the entity model (at the abstraction level of NIAM), the meta-data (pure schema data) and the mappings between them. You have that in code, but not in a designer. But as I said, wait for what we'll release, and it might make working with DO easier for people who don't want to write code to design their models. ;)

  • Thanks for a great post!

  • Hi Frans,

    "Bezint eer ge begint" :-) Thanks for this post about how to design software!

  • @Jeff: exactly!

    Yeah, Alex is a bit enthusiastic, it's not the first time this pops up ;). The point of selling a product is to let the product do the selling. He doesn't seem to grasp that concept.
