Frans Bouma's blog

Generator.CreateCoolTool();

December 2007 - Posts

Codebase size isn't the enemy

In Steve Yegge's latest blog post, he argues that the size of a code base is the code's worst enemy. Today, Jeff Atwood wrote a follow-up with the same sentiments. Now, both bloggers are great writers and their articles are almost always insightful. However, this time they both disappointed me a bit: neither really gives a set of reasons why a big code base is particularly bad, and more importantly: what is too big?

Yegge sits on a codebase of 500,000 lines of Java code, written in 9 years. He finds it way too big to maintain. Reading his blogpost, I got the feeling this conclusion was based on a "This will take too much time, at least more time than I want to spend on it" kind of measurement. Atwood, not sitting on a codebase of 500,000 lines of code as far as I can tell from his last article, adds to that a list of rules you should be using in your daily work. Let me quote them below first.

  • If you personally write 500,000 lines of code in any language, you are so totally screwed.
  • If you personally rewrite 500,000 lines of static language code into 190,000 lines of dynamic language code, you are still pretty screwed. And you'll be out a year of your life, too.
  • If you're starting a new project, consider using a dynamic language like Ruby, JavaScript, or Python. You may find you can write less code that means more. A lot of incredibly smart people like Steve present a compelling case that the grass really is greener on the dynamic side. At the very least, you'll learn how the other half lives, and maybe remove some blinders you didn't even know you were wearing.
  • If you're stuck using exclusively static languages, ask yourself this: why do we have to write so much damn code to get anything done-- and how can this be changed? Simple things should be simple, complex things should be possible. It's healthy to question authority, particularly language authorities.

It might not come as a surprise, but I don't like 'gut-feeling'-based science. If someone claims 500,000 lines of code is way too big and you should run away screaming or to quote Atwood: "You're so totally screwed", I find that interesting but more importantly I want to know why these people claim 500,000 lines of code is so incredibly bad (and on that scale, what's good?)

Apparently Yegge and Atwood have found some mysterious threshold which judges the goodness factor of the size of a codebase. 500K lines of code (LoC) is apparently way too big, but what's 'OK'? 150K LoC? If so, what research proved that that threshold is better? Let's assume 150K is 'better' and start from there to see if that can be a good threshold or not.

If you've never seen a big codebase, I can tell you 150K LoC is a truckload of code. LLBLGen Pro's total codebase (drivers, runtime libs, designer, code generator engines, plugins) is roughly 300K LoC. Add to that 11.5MB of template code and you're looking at a codebase which is likely to be called 'rather big'. So I have a bit of an idea how big 150K LoC is. With codebases like that, if you don't keep proper documentation of what each and every part of that code means, why it's there etc., 150K LoC is too big. But so is 20K LoC. The thing is: if you have to read every single line of code to understand what it does, then 20K LoC is still a lot of code to read and understand; it will likely take you weeks.

However, if you understand what the meaning is of a piece of code in your project, why it's there, so in short: what the intent is, what the code represents, 20K LoC isn't big at all, nor is 150K and if I may say so, neither is 1 million lines of code. The question therefore isn't "What's a good threshold for a bad codebase size", but "When does a codebase become unmaintainable".

Do I think that the 300K LoC I've written for the LLBLGen Pro project, together with the massive amount of templates, is unmaintainable? No, on the contrary. The thing is that I do know why piece of code X is there, what its meaning is and what its intention is. I can look up the design decisions that explain why X was made and not the alternative Y. The core question is: can I make a change to the codebase without unforeseen side effects? If you have a codebase at hand which you don't understand in full, size doesn't matter. It can be 1K LoC and you can still mess things up, badly, when making a change to it. However, if it's 10 million lines of code and the documentation of it is good enough, making a change to it shouldn't be that much of a challenge: you know where to change what and can predict what the effects are because you understand what the code does. Not line by line, but block by block, class by class, because that's properly documented.

Note: For the people who'll overheat for 10 seconds when they read the word 'Documentation', you should read 'theoretical base' instead of 'documentation'. With 'documentation' I mean a description of what the code does and why it's there. If that's described in a model, in a pile of BDD stories, be my guest, as long as what you have as descriptions represents the code you're looking at.

What I found particularly sad about the two articles mentioned is that neither mentions the real disadvantages of having to work with a big codebase, and both avoid giving proper advice. Instead they come up with the, sorry to say it, lame conclusion to use a dynamic language. According to the articles, the core reason to use a dynamic language is that it gives fewer lines of code in a lot of cases. Oh wow, we'll go from 500K LoC to 150K-200K LoC. Now things suddenly became maintainable again!

The thing is, if you still don't have the code properly documented, why that code is there, what it represents, 150K LoC is still 2500 printed pages, with 60 lines on a page. Therefore, going to a dynamic language doesn't solve a thing. You only change the language, but the root problem remains.

Attack of the clones
The true problem with large codebases is the clone. A clone is a routine or class or code snippet which roughly does the same as another piece of code somewhere else in the codebase. A clone isn't always bad; sometimes clones are intentional: in LLBLGen Pro for example I have a clone of a multi-value hashtable class: in the designer and in the runtime library, the same class exists (more or less). The main reason is that both projects are completely separated; they share zero libraries, except .NET. The reason I chose to use a clone and not a shared library is that I could change the class for the runtime library if I wanted to without affecting the designer and vice versa. (For the people interested, a multi-value hashtable class is a class where you can add multiple values under the same key. In .NET 3.5, you can easily create one like: MultiValueHashTable<TKey, HashSet<TValue>> and a couple of lines of code in the Add methods.)
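As a rough illustration of the multi-value hashtable mentioned above, here is a minimal sketch. It is not the class from LLBLGen Pro: deriving from Dictionary<TKey, HashSet<TValue>> and the Add overload shown here are assumptions about how such a class could look.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the LLBLGen Pro class: stores multiple values per key.
public class MultiValueHashTable<TKey, TValue> : Dictionary<TKey, HashSet<TValue>>
{
    // Adds a value under the key, creating the value set on first use.
    // Returns false when the value was already stored under that key.
    public bool Add(TKey key, TValue value)
    {
        HashSet<TValue> values;
        if (!TryGetValue(key, out values))
        {
            values = new HashSet<TValue>();
            base.Add(key, values);
        }
        return values.Add(value);
    }
}

public static class MultiValueDemo
{
    public static void Main()
    {
        var ordersPerCustomer = new MultiValueHashTable<string, int>();
        ordersPerCustomer.Add("ALFKI", 10643);
        ordersPerCustomer.Add("ALFKI", 10692);
        ordersPerCustomer.Add("ALFKI", 10643); // duplicate, ignored by the HashSet
        Console.WriteLine(ordersPerCustomer["ALFKI"].Count); // prints 2
    }
}
```

Using a HashSet per key means duplicate values under one key are silently dropped; a List<TValue> would be the alternative if duplicates have to be kept.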

Often however, clones are unintentional and even hard to recognize as a clone. Clones make a codebase less maintainable, as they have the side effect of duplicating code. In several Computer Science departments across the globe, people are researching how to detect clones, and more importantly: how to remove them without human intervention, for example by abstract syntax tree (AST) transformations inside the compiler or code editor, using refactoring tools or special analysis tools. Even with a codebase which is considered rather small, e.g. 10K LoC, you can have clones which make the code less maintainable. It doesn't matter if the code is 10K LoC or 1 million LoC: if the clones are in the piece of code of 1K LoC you have to maintain, you have to deal with them.

The bigger a codebase becomes, the more you'll likely ask yourself, when writing code in an editor, "Is there already a method/class etc. in the codebase which does what I have to write?" It's a valid question and if the answer is "No", and the programmer doesn't do any research to base that "No" on, other than "It can't be"-gutfeeling-science, the programmer is likely to introduce a clone to the codebase if that "No" should have been "Yes". Still, that clone doesn't have to be bad. Re-using code means dependencies. Dependencies make codebases also less maintainable, because a change could affect a lot of code if a piece of code you're changing is code a lot of other methods/classes depend on. The core point is realizing that you're introducing a clone. So next time you add a class or method, do realize that what you're adding could be a clone.

Clones aren't always full methods. Often a series of checks which is repeated over and over again in various methods is a good example of a clone, for instance a series of guard clauses for nulled input parameters. Take for example this paper. It's about detecting clones in the 10 million lines of C code in ASML's wafer stepper machines. They used 19,000 lines of code to learn the code base and extrapolated that result to the complete code base. The paper discusses various approaches to clone detection in that 19K LoC and also the different categories of clones, and their relation to the various concerns in a typical code base.
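To make the guard-clause kind of clone concrete, here is a small sketch. The Guard and OrderFormatter names are made up for illustration; the point is only that the repeated null check lives in one place instead of being copied into every method.

```csharp
using System;

// Hypothetical helper: one shared null check instead of a copy in every method.
public static class Guard
{
    public static void NotNull(object value, string parameterName)
    {
        if (value == null)
        {
            throw new ArgumentNullException(parameterName);
        }
    }
}

public static class OrderFormatter
{
    // Without the helper, each method would repeat:
    //   if (customerId == null) { throw new ArgumentNullException("customerId"); }
    //   if (productName == null) { throw new ArgumentNullException("productName"); }
    public static string Describe(string customerId, string productName)
    {
        Guard.NotNull(customerId, "customerId");
        Guard.NotNull(productName, "productName");
        return customerId + " ordered " + productName;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("ALFKI", "Chai")); // prints ALFKI ordered Chai
    }
}
```

As said earlier, removing the clone this way isn't free: every method now depends on Guard, so a change to Guard.NotNull ripples through all of them.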

Creating the 'paper-trail'
When I started at the university back in 1988 as a freshman, we only had text-based editors, monochrome monitors, and 80x24 resolutions to work with. When you wrote a piece of C code or Pascal code, to keep overview you couldn't rely on the editor in front of you: you had just 24 lines of code to look at, tops. As we didn't know better, we didn't mind. It was also not a big problem, because we would approach writing software differently than some of us do today:

  • Analyze the problem
  • Break it into sub-problems
  • Find solutions for the sub-problems in the form of algorithms and abstract models
  • Decide per algorithm and model how to write it into code
  • Write the code

The advantage of this is that you get a 'paper-trail' to the code you'll write: it's not based on an idea that popped into your head when you were hammering out code in some editor, it was a result of a thinking process without any code in sight. Make no mistake, this isn't waterfall. It's applicable to any problem you might face, be it e.g. a way to read all lines in a textfile in reverse order, or an order editing screen: the problem is whatever you have to solve. A paper-trail doesn't have to involve dead trees nor word docs. A paper-trail is semantically used in this context: it's a trail which started with the initial analysis of the problem and ended as the representation of the solution in executable form, the code and contains every step made in between. How you formulate that trail is up to you. Whatever rocks your boat. The key point is that you can follow back the trail to make a different turn at step X, or change a decision made at step Y. From there you then create a new path, back to the code, which can involve the old path but slightly changed.

Why is the paper-trail so important? Well, because you have a theoretical foundation of the various pieces of your code. This is essential in your quest not to introduce clones, and more importantly: to keep codebases maintainable. I've written before about the essence of proper documentation. The idea is still the same: create a theoretical base for your code, so you can answer the question "Why is class XYZ in your code?". That's all there is to it: a new feature which changes XYZ can then be properly implemented and the ripple effect of changes can be controlled, because you know and understand what you're changing.

Having solid documentation, having a proper overview of what functionality is there, and following from that, which code is there (and not vice versa! Code follows functionality, not the other way around; code represents functionality in an executable form), can help you maintain codebases, be it 10K LoC, 1 million LoC or even bigger. Don't fall into the trap of swapping languages because they seem to be more expressive so you can write the same functionality in fewer lines of code. A 100K LoC codebase in Ruby is still 100K LoC. That's still a very thick book if you print it out on paper.

So in other words: measuring the maintainability of a codebase in Lines of Code alone is pretty silly. One should look at other elements to measure maintainability of a codebase, like the form in which the theoretical base of the code is defined: is there an easy way to get an overview of the code and why that code is there? Only then can you decide to run away screaming or, if you really insist, switch languages and re-write the whole application.

Posted Monday, December 24, 2007 12:41 PM by FransBouma | 24 comment(s)

Developing Linq to LLBLGen Pro, part 10

(This is part of an on-going series of articles, started here)

Whoa, almost a month without an update! The truth is that I wanted to finish GroupBy support before posting another article in this ongoing series, and it took almost 3 weeks to get it right. But more on that later, first some easy stuff to get things started up again.

OrderBy, ThenBy, OrderByDescending, ThenByDescending
Linq has four different sort clause extension methods. When you use a normal query in C# or VB.NET without using extension methods, you won't notice that there are actually four methods. These four methods are a little weird, because Linq expression trees and also extension method usage in Linq to Objects do have an order in which the statements/expressions are defined and to be used. But perhaps I'm overlooking a tiny detail which made Microsoft include four methods instead of one with a sort direction parameter.

Sorting is pretty easy with expression trees. You simply convert OrderBy and ThenBy to ascending operations and the other two to descending operations. Each operation is its own 'query', with no projection nor anything else but a sort operation on a single field in the projection. In the previous article I talked about aliases and alias scopes. A sort clause isn't a scope on its own, so you can merge them all back together afterwards, by merging the parent in front of the source to keep the right order. As scope determination earlier in the process already defines which 'query' definitions can be merged and which can't (and thus become a derived table in their parent's source (FROM clause)), you won't have any problem with misplaced order by statements. Misplaced order by statements are order by statements before a where for example: in SQL an ORDER BY is always placed after a WHERE. The main reason is that the ORDER BY is executed on the projection result of the rest.
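To see the four methods at work, here is a small Linq to Objects sketch (the City class and the sample data are made up for illustration). The C# compiler translates an orderby clause with several keys into OrderBy for the first key and ThenBy or ThenByDescending for the following ones, so both methods below express the same sort.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class City
{
    public string Country { get; set; }
    public string Name { get; set; }
}

public static class SortDemo
{
    // Query syntax: one orderby clause, two keys, the second one descending.
    public static List<string> QuerySyntax(IEnumerable<City> cities)
    {
        var q = from c in cities
                orderby c.Country, c.Name descending
                select c.Country + "/" + c.Name;
        return q.ToList();
    }

    // Method syntax: what the compiler turns the query above into.
    public static List<string> MethodSyntax(IEnumerable<City> cities)
    {
        return cities
            .OrderBy(c => c.Country)
            .ThenByDescending(c => c.Name)
            .Select(c => c.Country + "/" + c.Name)
            .ToList();
    }

    public static void Main()
    {
        var cities = new List<City>
        {
            new City { Country = "NL", Name = "Utrecht" },
            new City { Country = "DE", Name = "Berlin" },
            new City { Country = "NL", Name = "Amsterdam" },
            new City { Country = "DE", Name = "Munich" }
        };
        // Both lines print: DE/Munich, DE/Berlin, NL/Utrecht, NL/Amsterdam
        Console.WriteLine(string.Join(", ", QuerySyntax(cities).ToArray()));
        Console.WriteLine(string.Join(", ", MethodSyntax(cities).ToArray()));
    }
}
```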

Aggregates without group by
After I implemented the sort operations in a couple of hours, I hoped the rest would be as easy as these, after I had seen the problems Matt Warren had in his last post about order by operators in his Linq provider example. Two of the biggest hurdles in our Linq provider were yet to be taken and the sooner these were out of the way the better. One of them was Aggregates, the other its ugly step-brother GroupBy.

Linq defines 5 (actually 6) aggregate functions: Average, Count, Sum, Min and Max. It also contains a LongCount, which is actually simply Count but with an Int64 typed result value. SQL Jedis will immediately spot that this list is rather small. Where are the Standard Deviation and Variance aggregates? And where are the distinct variations of these aggregate functions? And why is there just a Count(filter), but not a Count which is usable on a field or expression? Questions which popped into my head pretty quickly. For standard deviation and variance, two simple extension methods would do the trick, so I added Queryable.StandardDeviation and Queryable.Variance, both accepting a field specification and an optional boolean for Distinct usage.

First a word about distinct aggregate functions. Say you want a list of all CustomerIDs from Northwind and added to that in a second column the number of different employees who have handled the orders of each customer. This requires a distinct count. To do that in linq you'd do:
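To give an idea of the shape of such extension methods, here is an in-memory sketch for IEnumerable. It is not the Queryable.StandardDeviation / Queryable.Variance code from Linq to LLBLGen Pro (those get translated to SQL); the class name, the signatures and the choice for the sample (n-1) form, which is what SQL Server's STDEV and VAR compute, are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory counterparts, for illustration only.
public static class AggregateExtensions
{
    // Sample variance (n-1) of the selected values; null when fewer than two values remain.
    public static double? Variance<T>(this IEnumerable<T> source,
        Func<T, double> selector, bool distinct = false)
    {
        IEnumerable<double> values = source.Select(selector);
        if (distinct)
        {
            values = values.Distinct();
        }
        List<double> list = values.ToList();
        if (list.Count < 2)
        {
            return null;
        }
        double mean = list.Average();
        double sumOfSquares = list.Sum(v => (v - mean) * (v - mean));
        return sumOfSquares / (list.Count - 1);
    }

    // Sample standard deviation: the square root of the variance above.
    public static double? StandardDeviation<T>(this IEnumerable<T> source,
        Func<T, double> selector, bool distinct = false)
    {
        double? variance = source.Variance(selector, distinct);
        return variance.HasValue ? Math.Sqrt(variance.Value) : (double?)null;
    }
}

public static class AggregateDemo
{
    public static void Main()
    {
        var quantities = new List<double> { 1, 3, 3, 5 };
        Console.WriteLine(quantities.Variance(q => q));                // 8/3, all four values
        Console.WriteLine(quantities.StandardDeviation(q => q, true)); // distinct set {1, 3, 5}
    }
}
```

With the distinct flag set, the duplicate 3 is dropped before aggregating, which is the same idea as AVG(DISTINCT ...) or COUNT(DISTINCT ...) in SQL.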

// Linq to SQL
NorthwindContext nw = new NorthwindContext();
var q = from c in nw.Customers
        select new { c.CustomerID, NumberOfDifferentEmployees = c.Orders.Select(
                  o => o.EmployeeID).Distinct().Count() };

This gives this SQL:

-- SQL from Linq to SQL
SELECT [t0].[CustomerID], 
       (
           SELECT COUNT(*) 
           FROM 
           (
               SELECT DISTINCT [t1].[EmployeeID] 
               FROM [dbo].[Orders] AS [t1] 
               WHERE [t1].[CustomerID] = [t0].[CustomerID]
           ) AS [t2]
       ) AS [NumberOfDifferentEmployees] 
FROM [dbo].[Customers] AS [t0]

A COUNT(DISTINCT EmployeeID) would have been easier. That's not to say this is less efficient, but it's a little different from what you'd expect. It took me a while before I realized how to write the query with Count(). Of course, that was after I had already added Queryable.CountColumn, which accepts a field/expression specification and an optional boolean for Distinct usage.

With Sum and Average it could get more complicated so I added another overload to Queryable.Sum and to Queryable.Average for applying the aggregate on a Distinct set, by adding an optional boolean for Distinct usage to my versions of Sum and Average.

Aggregates aren't difficult to implement in Linq per se. That is, in most very straightforward cases. However, there are cases where things get out of hand. The thing is that aggregate functions are typical scalar functions which represent a single value, but the source of what they aggregate is always a set. In the query above you see a typical example of it: a scalar query is placed inside the projection and the scalar query itself fetches a set and applies the aggregate (Count(*) in this case) on that set. Let's look at a tough example of this. I'll first give you the LLBLGen Pro query using Linq and then the produced SQL query. The main element to see is that aggregates applied on aggregated values need a set to aggregate, so this set has to be created from the nested aggregate, otherwise the RDBMS will complain that you're on crack and should go home to get some sleep.

Ready?

// Linq to LLBLGen Pro nasty multi-aggregate query
[Test]
public void GetAllCustomersWithAnOrderTotalHigherThan5000()
{
    using(DataAccessAdapter adapter = new DataAccessAdapter())
    {
        LinqMetaData metaData = new LinqMetaData(adapter);
        var q = from c in metaData.Customer
                where c.Orders.Sum(o => o.OrderDetails.Sum(    
                   od => od.Quantity * od.UnitPrice)) > 5000
                select c;

        int counter = 0;
        foreach(var v in q)
        {
            counter++;
        }
        Assert.AreEqual(54, counter);
    }
}

Looks pretty simple? The query is pretty small; most of the code given is test code or initialization stuff. As you can see, we have disconnected the meta-data for query construction from the target the query is executed on. So with Linq to LLBLGen Pro you can write db generic queries without knowing up front which db the query is executed on later on.

So, what SQL does this baby produce?

SELECT    DISTINCT [LPLA_1].[CustomerID] AS [CustomerId], [LPLA_1].[CompanyName], 
          [LPLA_1].[ContactName], [LPLA_1].[ContactTitle], [LPLA_1].[Address], [LPLA_1].[City],
          [LPLA_1].[Region], [LPLA_1].[PostalCode], [LPLA_1].[Country], [LPLA_1].[Phone], 
          [LPLA_1].[Fax] 
FROM [Northwind].[dbo].[Customers] [LPLA_1]  
WHERE 
(
    SELECT SUM([LPLA_5].[LPAV_]) AS [LPAV_] 
    FROM 
    (
        SELECT DISTINCT [LPLA_2].[CustomerID] AS [CustomerId], 
               (
                    SELECT SUM([LPLA_4].[LPAV_]) AS [LPAV_] 
                    FROM 
                    (
                        SELECT  DISTINCT [LPLA_3].[OrderID] AS [OrderId], 
                                [LPLA_3].[Quantity] * [LPLA_3].[UnitPrice] AS [LPAV_] 
                        FROM    [Northwind].[dbo].[Order Details] [LPLA_3]  
                        WHERE   [LPLA_2].[OrderID] = [LPLA_3].[OrderID]
                    ) LPLA_4
                ) AS [LPAV_] 
        FROM     [Northwind].[dbo].[Orders] [LPLA_2]  
        WHERE     [LPLA_1].[CustomerID] = [LPLA_2].[CustomerID]
    ) LPLA_5
) > @LPFA_31

(The DISTINCT keywords are sometimes still misplaced; this is a small glitch still left to be fixed, but not important now.) As you can see, the aggregate functions are converted into scalar queries. One thing to notice is that the reference to a related set, e.g. c.Orders, doesn't result in a join in this case. It results internally in what I dubbed a correlation relation. This relation is the connection of a set reference to another entity reference somewhere else in the query. In this particular context, the correlation relations end up in a WHERE clause inside the scalar queries in the projections of the derived tables which are the sources of the two SUM aggregates.

The mechanism behind this is quite complicated at first, but if you start with a couple of simple queries and build them out you see the pattern pretty quickly and it then also makes perfect sense. It's then not hard to write code for it which handles all cases. That is, if the aggregate targets a non-grouped set. Yes, people, aggregates like in the query above are first-grader material compared to what's unleashed when GroupBy enters the room.

GroupBy
Let's start with GroupBy and then go back to Aggregates, as GroupBy deserves its own introduction. GroupBy is Linq's weakest link (pun intended). What's the problem? Well, grouping operators in SQL have a nasty side-effect: they have a fixed location where they can be. Furthermore, they have a limitation: boolean values. SQL normally doesn't understand boolean values in projections. Take this example query from the MSDN documentation:

// Linq to Objects query
var booleanGroupQuery =
    from student in students
    group student by student.Scores.Average() >= 80; //pass or fail!

Now try to rewrite this in a Linq to db system variant, e.g. Linq to Sql. You won't get the result the Linq to Objects query will produce. This isn't the fault of Linq to Sql, it's the ease with which a user can write a query in Linq without thinking about what the target will be when the query is executed. This is a weak spot in Linq, as it shows that the developer may never ever forget what the target of the query is: a set of in-memory objects or a database system. So the mantra 'a query is the same, no matter what the source is' isn't really something you should live by.

GroupBy has some overloads, some useful in a database scenario, some not. With the various overloads, you can create pretty awkward queries which result in garbage on a database system. So if you're using GroupBy in a Linq query targeting a database system, pay attention to what you want to achieve, what your intention is with the query. This is valid for Linq to Sql but also for Linq to LLBLGen Pro and I'd be surprised if it wasn't true for other Linq implementations to the various O/R mappers out there. So a query which works on in-memory objects might not work on a database, even though it's not really clear why.
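For reference, this is what the boolean grouping does when it runs in memory. The sketch below wraps the MSDN-style query in a complete program; the Student class and the sample scores are made up for illustration. In Linq to Objects the key is simply a .NET bool, so you get at most two groups, one for true and one for false.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Student
{
    public string Name { get; set; }
    public List<int> Scores { get; set; }
}

public static class BooleanGroupDemo
{
    // Groups student names under true (average >= 80) or false.
    public static Dictionary<bool, List<string>> GroupNamesByPass(IEnumerable<Student> students)
    {
        var booleanGroupQuery =
            from student in students
            group student by student.Scores.Average() >= 80; // pass or fail

        return booleanGroupQuery.ToDictionary(
            g => g.Key,
            g => g.Select(s => s.Name).ToList());
    }

    public static void Main()
    {
        var students = new List<Student>
        {
            new Student { Name = "Anna", Scores = new List<int> { 90, 85 } },
            new Student { Name = "Bob", Scores = new List<int> { 70, 60 } },
            new Student { Name = "Carla", Scores = new List<int> { 80, 80 } }
        };
        var groups = GroupNamesByPass(students);
        Console.WriteLine("pass: " + string.Join(", ", groups[true].ToArray()));  // pass: Anna, Carla
        Console.WriteLine("fail: " + string.Join(", ", groups[false].ToArray())); // fail: Bob
    }
}
```

On a database the grouping key would have to be expressed as something the RDBMS can group on, for example a CASE expression producing 0 or 1; that is only one possible translation and an assumption here, not a description of what any particular provider does.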

Aggregates with group by
The last three weeks I've spent on making aggregates work with GroupBy, and on making ordering and WHERE clauses work with GroupBy keys (which can be anonymous types with multiple fields!), and it wasn't a picnic. What's causing the delay? Well, when an aggregate is targeting a groupby it simply has a parameter which represents the GroupBy expression as its source. This sounds like a dull detail, but it's essential. The earlier mentioned aggregates which don't target a groupby could be handled and placed at the spot where they appeared: simply create a scalar query at the spot the aggregate was seen and apply some algorithm on the source of the query the aggregate is placed in to handle aggregates on aggregates, and that was it. But not with aggregates targeting GroupBy expressions: these aggregates have to be injected into the projection of the GroupBy query. At the same time, at the spot where the aggregate appears in the expression tree (e.g. in the projection, in a filter) a field referencing the field in the GroupBy query has to be placed. It makes sense actually: aggregates targeting a GroupBy have to be executed on the grouped set so they have to appear in the SELECT clause of the same query scope where the GROUP BY is located as well.

But you're not there yet. Multiple aggregates in the query which target the same GroupBy can each require a different set to be applied on. As an example of such a complicated monster, I'll show you a bogus query which nevertheless illustrates the point: it has multiple aggregates on the same GroupBy and all operate on their own subset of data, so, you guessed it, we need a similar algorithm for folding the source of the GROUP BY clause into a derived table. This algorithm is quite a bit more complicated, as it has to take into account correlation relations, which might target fields on a derived table which are actually from a table inside the derived table.

var q = from o in metaData.Order
        group o by o.Customer.Country into g
        orderby g.Key
        where g.Sum(n => n.OrderDetails.Count()) > 10
        select new { Country = g.Key, 
            Num = g.Average(n => n.OrderDetails.Count(od => od.ProductId == 3)) };

Looks complicated? Wait till you see the SQL. The complicated elements of this query aren't the misplaced orderby statement. The complicated elements are the g.Sum and the g.Average at two different locations, both targeting a different set of data to work on. For kicks, I also threw in a filter on the Count, plus a grouping on a related entity's field.

-- Linq to LLBLGen Pro query output
SELECT [LPLA_3].[Country], [LPLA_3].[LPAV_1] AS [Num] 
FROM 
(
    SELECT [LPLA_11].[Country], SUM([LPLA_11].[LPAV_]) AS [LPAV_], 
           AVG([LPLA_11].[LPAV_1]) AS [LPAV_1] 
    FROM 
    (
        SELECT [LPLA_10].[OrderId], [LPLA_10].[Country], [LPLA_10].[LPAV_], 
               (
                    SELECT     COUNT(*) AS [LPAV_] 
                    FROM     [Northwind].[dbo].[Order Details] [LPLA_4]  
                    WHERE     [LPLA_10].[OrderID] = [LPLA_4].[OrderID] 
                            AND [LPLA_4].[ProductID] = @ProductId1
               ) AS [LPAV_1] 
        FROM 
        (
            SELECT  [LPLA_1].[OrderID] AS [OrderId], [LPLA_2].[Country], 
                    (
                        SELECT     COUNT(*) AS [LPAV_] 
                        FROM     [Northwind].[dbo].[Order Details] [LPLA_4]  
                        WHERE     [LPLA_1].[OrderID] = [LPLA_4].[OrderID]
                    ) AS [LPAV_] 
            FROM    [Northwind].[dbo].[Customers] [LPLA_2] 
                            RIGHT JOIN [Northwind].[dbo].[Orders] [LPLA_1]  
                    ON  [LPLA_2].[CustomerID]=[LPLA_1].[CustomerID]
        ) LPLA_10
    ) LPLA_11 
    GROUP BY [LPLA_11].[Country]
) LPLA_3 
WHERE [LPLA_3].[LPAV_] > @LPAV_2
ORDER BY [LPLA_3].[Country] ASC

The query isn't really interesting for the result it gives, but it's interesting to see what has to be done to make it valid SQL. If you try the query in Linq to Sql you'll get a similar looking query. The complicated issue is, at least it was for me, the alias rewriting when a query part was folded into a derived table during the evaluation of the expression tree. Don't make the mistake of doing that at a point in time which is too late, because you'll then have no other option than to revert to re-aliasing the complete query, however that gives a problem with joins with self (Employee join Employee, which one are you referring to?).

I must say that I'm quite pleased with the result, even though it was quite a struggle to get everything lined up to do their thing in every situation thinkable: the main point was to fully understand which elements had to be moved around in the query and why, so you'll get some context for reasoning when it might fail.

So am I done now? No, there's still a lot of work to be done; however, the biggest hurdles are behind me: I've covered all parts of a SQL query now. There are still some nasty Queryable extension methods which might cause some head-scratching, like the All() method which results in a NOT EXISTS query, and the vast amount of database functions which can be called by mapping extension methods of .NET classes onto these functions, as well as converting some of these functions to predicates like IN and LIKE. One thing that already strikes me is that Microsoft didn't opt for making the extension method -> DB function mapping an extensible system, while it is so simple to do, as the query is evaluated at runtime anyway. That's definitely something I'll do differently: make it pluggable which DB function is called when an extension method is used in the Linq query.

Anyway, let's first enjoy the holidays and a few days with low work pressure.
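Why All() ends up as NOT EXISTS is easiest to see in memory: "all elements match the predicate" is the same statement as "there is no element which doesn't match". The sketch below shows that equivalence with Linq to Objects; the class and method names are made up for illustration and this is not code from the Linq to LLBLGen Pro provider.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AllAsNotExistsDemo
{
    // All(predicate), phrased directly.
    public static bool AllDirect<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        return source.All(predicate);
    }

    // The same answer, phrased the way a NOT EXISTS (... WHERE NOT predicate) query reads.
    public static bool AllAsNotExists<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        return !source.Any(item => !predicate(item));
    }

    public static void Main()
    {
        var quantities = new List<int> { 12, 15, 30 };
        Console.WriteLine(AllDirect(quantities, q => q > 10));      // prints True
        Console.WriteLine(AllAsNotExists(quantities, q => q > 10)); // prints True
    }
}
```

Both forms also agree on an empty set, where the answer is true: there is no element which violates the predicate.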

Posted Friday, December 21, 2007 3:18 PM by FransBouma | 4 comment(s)

Contact form for emailing me has been disabled for now

I've disabled the contact form on this blog to email me, as spammers have found a way to spam me through that form and as I don't like spam, I have disabled that form till Telligent patches this hole (if ever).

Posted Monday, December 10, 2007 9:47 AM by FransBouma | 7 comment(s)
