Codebase size isn't the enemy

In Steve Yegge's latest blog post, he argues that the size of a code base is the code's worst enemy. Today, Jeff Atwood wrote a follow-up with the same sentiments. Now, both bloggers are great writers and almost always produce insightful articles. However, this time they both disappointed me a bit: neither really gives a set of reasons why a big code base is particularly bad, and more importantly: what is too big?

Yegge sits on a codebase of 500,000 lines of Java code, written over 9 years. He finds it way too big to maintain. Reading his blog post, I got the feeling this conclusion was based on a "This will take too much time, at least more time than I want to spend on it" kind of measurement. Atwood, who as far as I can tell from his latest article isn't sitting on a codebase of 500,000 lines of code himself, adds to that a list of rules you should be using in your daily work. Let me quote them below first.

  • If you personally write 500,000 lines of code in any language, you are so totally screwed.
  • If you personally rewrite 500,000 lines of static language code into 190,000 lines of dynamic language code, you are still pretty screwed. And you'll be out a year of your life, too.
  • If you're starting a new project, consider using a dynamic language like Ruby, JavaScript, or Python. You may find you can write less code that means more. A lot of incredibly smart people like Steve present a compelling case that the grass really is greener on the dynamic side. At the very least, you'll learn how the other half lives, and maybe remove some blinders you didn't even know you were wearing.
  • If you're stuck using exclusively static languages, ask yourself this: why do we have to write so much damn code to get anything done-- and how can this be changed? Simple things should be simple, complex things should be possible. It's healthy to question authority, particularly language authorities.

It might not come as a surprise, but I don't like 'gut-feeling'-based science. If someone claims 500,000 lines of code is way too big and you should run away screaming, or to quote Atwood: "You're so totally screwed", I find that interesting, but more importantly I want to know why these people claim 500,000 lines of code is so incredibly bad (and on that scale, what's good?).

Apparently Yegge and Atwood have found some mysterious threshold which judges the goodness factor of the size of a codebase. 500K lines of code (LoC) is apparently way too big, but what's 'OK'? 150K LoC? If so, what research proved that threshold is better? Let's assume 150K is 'better' and start from there to see whether it can be a good threshold or not.

If you've never seen a big codebase, I can tell you 150K LoC is a truckload of code. LLBLGen Pro's total codebase (drivers, runtime libs, designer, code generator engines, plugins) is roughly around 300K LoC. Add to that 11.5MB of template code and you're looking at a codebase which is likely to be called 'rather big'. So I have a bit of an idea how big 150K LoC is. With codebases like that, if you don't keep proper documentation of what each and every part of that code means, why it's there etc., 150K LoC is too big. But so is 20K LoC. The thing is: if you have to read every single line of code to understand what it does, then 20K LoC is still a lot of code to read and understand; it will likely take you weeks.
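For the curious: the raw counts quoted here are physical lines, comments and blanks included. A minimal sketch of such a counter, written in Java here for illustration (the directory layout and extension are hypothetical, not LLBLGen Pro's):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Counts every physical line of every file with the given extension under
// a root directory: comments and blank lines included, matching the
// "text the developer actually reads" metric used in this post.
public class LineCounter {
    public static long countLines(Path root, String extension) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(p -> p.toString().endsWith(extension))
                        .mapToLong(p -> {
                            try (Stream<String> lines = Files.lines(p)) {
                                return lines.count();
                            } catch (IOException e) {
                                return 0; // unreadable file: count nothing
                            }
                        })
                        .sum();
        }
    }
}
```

A logical-line metric (as VS.NET 2008 also offers) would report far fewer lines; which one you prefer depends on whether you count comments as part of the code, as I do.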

However, if you understand the meaning of a piece of code in your project and why it's there, in short: what the intent is, what the code represents, then 20K LoC isn't big at all, nor is 150K, and if I may say so, neither is 1 million lines of code. The question therefore isn't "What's a good threshold for a bad codebase size?", but "When does a codebase become unmaintainable?".

Do I think that the 300K LoC I've written for the LLBLGen Pro project, together with the massive amount of templates, are unmaintainable? No, on the contrary. The thing is that I do know why piece of code X is there, what its meaning is and what its intention is. I can look up the design decisions that explain why X was written and not the alternative Y. The core question is: can I make a change to the codebase without unforeseen side effects? If you have a codebase at hand which you don't understand in full, size doesn't matter. It can be 1K LoC and you can still mess things up, badly, when making a change to it. However, if it's 10 million lines of code and the documentation of it is good enough, making a change to it shouldn't be that much of a challenge: you know where to change what and can predict what the effects are, because you understand what the code does. Not line by line, but block by block, class by class, because that's properly documented.

Note: For the people who'll overheat for 10 seconds when they read the word 'Documentation', you should read 'theoretical base' instead of 'documentation'. With 'documentation' I mean a description of what the code does and why it's there. If that's described in a model, in a pile of BDD stories, be my guest, as long as what you have as descriptions represents the code you're looking at.

What I found particularly sad about the two articles mentioned is that neither article mentions the real disadvantages of having to work with a big codebase, and both avoid giving proper advice. Instead they come up with the, sorry to say it, lame conclusion to use a dynamic language. According to the articles, the core reason to use a dynamic language is that it results in fewer lines of code on a lot of occasions. Oh wow, we'll go from 500K LoC to 150K-200K LoC. Now things suddenly became maintainable again!

The thing is, if you still don't have the code properly documented, why that code is there, what it represents, 150K LoC is still 2500 printed pages, with 60 lines on a page. Therefore, going to a dynamic language doesn't solve a thing. You only change the language, but the root problem remains.

Attack of the clones
The true problem with large codebases is the clone. A clone is a routine, class or code snippet which roughly does the same as another piece of code somewhere else in the codebase. A clone isn't always bad; sometimes they're intentional: in LLBLGen Pro for example I have a clone of a multi-value hashtable class: in the designer and in the runtime library, the same class exists (more or less). The main reason is that both projects are completely separated, they share zero libraries, except .NET. The reason I chose to use a clone and not a shared library is that I could change the class for the runtime library if I wanted to without affecting the designer, and vice versa. (For the people interested: a multi-value hashtable class is a class where you can add multiple values under the same key. In .NET 3.5, you can easily create one from a Dictionary<TKey, HashSet<TValue>> and a couple of lines of code in the Add method.)
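To make the multi-value hashtable concrete: the idea is just a map from a key to a set of values. A minimal sketch, written in Java here rather than C# (the class and method names are mine, not the actual LLBLGen Pro code):

```java
import java.util.*;

// A multi-value map: one key can hold several values, stored as a set.
// In .NET terms this wraps a Dictionary<TKey, HashSet<TValue>>; in Java
// it wraps a Map<K, Set<V>>. The "couple of lines in the Add method"
// mentioned in the post is the create-set-on-first-use logic below.
public class MultiValueMap<K, V> {
    private final Map<K, Set<V>> storage = new HashMap<>();

    // Add a value under the given key, creating its set on first use.
    public void add(K key, V value) {
        storage.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }

    // Return the values stored under the key, or an empty set.
    public Set<V> get(K key) {
        return storage.getOrDefault(key, Collections.emptySet());
    }
}
```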

Often however, clones are unintentional and even hard to recognize as clones. Clones make a codebase less maintainable, as they have the side effect of duplicating code. In several Computer Science departments across the globe, people are doing research into how to detect clones, and more importantly: how to remove them without human intervention, for example by AST (Abstract Syntax Tree) transformations inside the compiler or code editor, using refactoring tools or special analysis tools. Even with a codebase which is considered rather small, e.g. 10K LoC, you can have clones which make the code less maintainable. It doesn't matter if the codebase is 10K LoC or 1 million LoC: if the clones are in the piece of code of 1K LoC you have to maintain, you have to deal with them.
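For a flavor of what such detection tools do, here is a deliberately naive text-based sketch: normalize each line's whitespace and report any window of consecutive lines that occurs more than once. Real research tools compare ASTs or token streams and can catch clones with renamed identifiers; this toy version only finds literal duplicates:

```java
import java.util.*;

// Toy clone detector: slides a fixed-size window over the source lines,
// normalizes indentation, and reports every fragment seen more than once.
// A much simpler cousin of the AST-based approaches mentioned above.
public class CloneDetector {
    public static Set<String> findClones(List<String> lines, int window) {
        Map<String, Integer> seen = new HashMap<>();
        Set<String> clones = new HashSet<>();
        for (int i = 0; i + window <= lines.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + window; j++) {
                sb.append(lines.get(j).trim()).append('\n'); // normalize indentation
            }
            String fragment = sb.toString();
            if (seen.merge(fragment, 1, Integer::sum) > 1) {
                clones.add(fragment); // second (or later) occurrence: a clone
            }
        }
        return clones;
    }
}
```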

The bigger a codebase becomes, the more often you'll ask yourself, when writing code in an editor, "Is there already a method/class etc. in the codebase which does what I have to write?" It's a valid question, and if the answer is "No" while it should have been "Yes", and the programmer doesn't do any research to base that "No" on other than "It can't be" gut-feeling science, the programmer is likely to introduce a clone into the codebase. Still, that clone doesn't have to be bad. Re-using code means dependencies, and dependencies also make codebases less maintainable, because a change could affect a lot of code if the piece of code you're changing is code a lot of other methods/classes depend on. The core point is realizing when you're introducing a clone. So next time you add a class or method, do realize that what you're adding could be a clone.

Clones aren't always full methods. Often a series of checks repeated over and over again in various methods is a good example of a clone, for example a series of guard clauses for nulled input parameters. Take for example this paper. It's about detecting clones in the 10 million lines of C code in ASML's wafer stepper machines. They used a 19,000-line sample to learn about the code base and extrapolated that result to the complete code base. The paper discusses various approaches to clone detection in that 19K LoC and also the different categories of clones, and their relation to the various concerns in a typical code base.
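To illustrate the guard-clause clone: the same null checks repeated at the top of every method are a clone, and extracting them into one shared helper removes it. A hypothetical sketch in Java (class and method names are mine, not from the paper):

```java
// Before-and-after illustration of a guard-clause clone. The 'ship'
// method repeats the null checks inline; 'cancel' delegates to a single
// shared helper, so the clone is gone and the policy lives in one place.
public class OrderService {
    // Before: each method repeats the same guard clauses verbatim.
    public void ship(String orderId, String address) {
        if (orderId == null) throw new IllegalArgumentException("orderId");
        if (address == null) throw new IllegalArgumentException("address");
        // ... shipping logic would go here
    }

    // After: one helper owns the check.
    public void cancel(String orderId, String reason) {
        requireNonNull(orderId, "orderId");
        requireNonNull(reason, "reason");
        // ... cancel logic would go here
    }

    private static void requireNonNull(Object value, String name) {
        if (value == null) throw new IllegalArgumentException(name);
    }
}
```

Note the trade-off discussed above: the helper removes the clone but introduces a dependency every caller now shares.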

Creating the 'paper-trail'
When I started at university back in 1988 as a freshman, we only had text-based editors, monochrome monitors, and 80x24 character screens to work with. When you wrote a piece of C or Pascal code, you couldn't rely on the editor in front of you to keep an overview: you had just 24 lines of code to look at, tops. As we didn't know better, we didn't mind. It was also not a big problem, because we approached writing software differently than some of us do today:

  • Analyze the problem
  • Break it into sub-problems
  • Find solutions for the sub-problems in the form of algorithms and abstract models
  • Decide per algorithm and model how to write it into code
  • Write the code

The advantage of this is that you get a 'paper-trail' to the code you'll write: it's not based on an idea that popped into your head while you were hammering out code in some editor, it's the result of a thinking process without any code in sight. Make no mistake, this isn't waterfall. It's applicable to any problem you might face, be it e.g. a way to read all lines in a text file in reverse order, or an order editing screen: the problem is whatever you have to solve. A paper-trail doesn't have to involve dead trees nor Word docs. 'Paper-trail' is used in a semantic sense here: it's a trail which started with the initial analysis of the problem, ended as the representation of the solution in executable form (the code), and contains every step made in between. How you formulate that trail is up to you. Whatever floats your boat. The key point is that you can follow the trail back to make a different turn at step X, or change a decision made at step Y. From there you then create a new path back to the code, which can involve the old path but slightly changed.

Why is the paper-trail so important? Well, because it gives you a theoretical foundation for the various pieces of your code. This is essential in your quest not to introduce clones, and more importantly: to keep codebases maintainable. I've written before about the essence of proper documentation. The idea is still the same: create a theoretical base for your code, so you can answer the question "Why is class XYZ in your code?". That's all there is to it: with that in place, a new feature which changes XYZ can be properly implemented and the ripple effect of the changes can be controlled, because you know and understand what you're changing.

Having solid documentation, having a proper overview of what functionality is there, and following from that, which code is there (and not vice versa! Code follows functionality, not the other way around; code represents functionality in an executable form), can help you maintain codebases, be they 10K LoC, 1 million LoC or even bigger. Don't fall into the trap of swapping languages because they seem more expressive so you can write the same functionality in fewer lines of code. A 100K LoC codebase in Ruby is still 100K LoC. That's still a very thick book if you print it out on paper.

So in other words: measuring the maintainability of a codebase in Lines of Code alone is pretty silly. One should look at other elements to measure the maintainability of a codebase, like the form in which the theoretical base of the code is defined: is there an easy way to get an overview of the code and why that code is there? Only then can you conclude to run away screaming or, if you really insist, switch languages and rewrite the whole application.


  • I know it's another discussion, but it might be relevant as well: when you use AOP in .NET via Policy Injection, you might have a very small code base, but what does LoC say when a lot of other stuff is injected?

    I wonder how the .NET BCL-team manages the whole clone story...

    - Alex

  • Ah, what's another day without some CodeBetter schmuck coming in and proselytizing static typing and testing, like they're the only two things that matter?

    Atwood is simply wrong. He's never actually written anything all that big, so he really has no idea about the issues faced by large applications.

    As for Miller, he's a flake.

  • Also, just need to add: Atwood and the merry band of groupthinkers at CodeBetter are truly stupid. Let's look at some really, really obvious examples of applications that are really big:

    - World of Warcraft

    - MS Word

    - Hell, Open Office

    - The .NET framework.

    Apparently, all these things should be totally unmaintainable. And they should've all been failures because they didn't use static typing and had "large codebases". Doubtless the amount of revenue these products generate is minuscule compared to the perfect code created by Team CodeBetter.

  • "foobar", please next time post a real name or I wont publish the comments. I don't need a flame fest in the comments here. Thanks :)

  • I came from dynamic languages to Java and .NET and prefer their verbose syntax for the simple reason that I find it easier to maintain.

    Of course, I think when people praise dynamic languages, they probably should just say "Ruby". I've worked with a lot of Perl and PHP code and I sure wouldn't call their terseness and clever syntax tricks easier to maintain. If you asked me to maintain a 10K LoC Perl codebase vs. a 100K C# codebase, I'll take C# any day of the week.

  • Clearly I think Atwood has missed the point about complexity. I work with an 80K-line code base that is pretty easy to work with (even though it's pretty tightly coupled to the DAL), because it's consistent and has a clear separation of concerns. Another application I maintain is only 5000 lines of code and is an absolute nightmare, because it's an inconsistent, hacked-up, tightly coupled, procedural nightmare.

    To an extent Atwood is right that LOC can be a problem. Given equally well designed codebases, the one with more LOC is going to be more difficult just because there's more noise and code to sift through - dynamic language or not. I'm sure checked exceptions in Java make you write more code when you open a file, but they certainly don't make the code more difficult to understand. Although LOC are a factor, I think it's a pretty small piece of the overall problem.

    Dynamic languages have their place, but they're not a silver bullet for complexity. I like my static checking most of the time; it makes me feel safe (especially with ReSharper). Sometimes I hate static checking, like when I change several interfaces so several classes won't compile and I just want to build one class in the project to unit test it.

  • Frans, I don't see why this needs a rant. The writers of the articles are more of the "organic" programmers, and you describe a more mathematical/scientific way of developing software. I can not say that one is better than the other.

    The scientific way of developing software, in my eyes, isn't mature enough. Software development in its current state is more like an art. We have a gazillion ways of solving problems, but we are not able to scientifically prove which one is the best. Yes, of course we can develop in a structured way, but no one really knows what's really good.

    The organic way of developing software is kind of like moulding software. It's the brute-force way of developing software. You hack and hack and hack until it is what you want. Test Driven Development is one of those organic ways, but so is the way Agile development is used in practice. This organic way became popular over the last couple of years because of IDEs which have improved a lot, and dynamic languages which support this way of programming. This type of development is harder to manage (ah, who needs those damn managers anyway) because you rely more on programmer skill than on the process at hand.

    I agree with you, Frans, that in order to know whether 500,000 lines of code are maintainable you have to have some kind of analysis. Maybe 400,000 LoC are generated with a DSL tool, maybe it's one big copy-and-paste nightmare.

  • Obviously the guys recommending moving a large codebase to a dynamic language are primarily static language guys with little experience in dynamic languages, or at least little experience with large projects in a dynamic language.

    I have been developing projects in Python since the early 90's, and in other dynamic languages before that. I can tell you from experience there are reasons why no huge project uses these languages exclusively. There are many who prototype in such languages or implement certain aspects in them, but very few if any that strictly use a dynamic language.

    One reason is obviously the performance of interpreted languages. But also, for large projects the strengths of a dynamic language turn against you and end up making the code harder to maintain and code standards harder to enforce. If you have one or two developers this is not a huge problem; on a large project with many teams of developers it is an absolute nightmare.

  • Great post! Stevey's original blog entry got a pretty high rating, which made me wonder if I was missing the point. Thanks for your common-sense rebuttal.

    A) "Codebase size" is not the enemy but the consequence of larger projects.

    B) I can't see how you write a large project with a dynamic language (such as JavaScript!?). How are you going to memorize all the variables? Dynamic typing removes the information you need to correctly identify each and every variable.

    C) Again: JavaScript does not have classes. Building objects is simply a nightmare. But hey, I saved a couple of lines of code. JavaScript and the other dynamic languages are probably more readable - but are they writable as well? It's not so much the reading but adding on to it that's the key!?

  • Thank you Frans for having the cojones to point out that some of the Emperors have no clothes (at least some of the time).

  • @Patrick: I calculated the lines of code with a simple line counter for text files. I know that the metric of counting logical lines gives a much lower # of lines, but I find that a bit misleading. (VS.NET 2008 has the same metric btw.) I use a lot of comments to make the code more clear, so the reader has no room for guessing WHY a particular statement is there. I find that comments are part of the code, so they should be counted with the code. Also, when reading the code, what the developer reads is the full text, not the set of logical lines.

    But if you look solely at the # of lines you might have to change, then indeed the metric will give you the # of lines, which can (and probably will) be much lower than the textually measured # of lines :)

  • Think before you code, and capture the process of thought and its outcomes in whatever form fits your environment. To structure this process, do capture:


    LoC and languages are completely irrelevant if you're not paying attention to the basics. Frans, you did a good job of pointing this out (again).

  • Great post!
    I can't hear anymore how dynamic languages will change the world. The claim lacks good arguments. I think it won't change anything since, as you said, it doesn't address the real problem.

  • I'm sitting on top of 350 KLoC of injected ActiveRecord, I-test-therefore-I-am code right now. It's an unmaintainable steaming pile: the tests have fractional coverage, the cyclomatic average is 40+, and visible intent is the exception, not the rule. It's actually one of the more intriguing codebases I've worked with recently: they used the cool-kid tools, they get the cool-kid concepts, but they don't know the old boring maxim about making code lucid.

    Which was what I took from your post, Frans (and what I can confirm from having poked through the LLBLGen SDK code before and having fallen into goodness with the runtime...). I didn't take away waterfall/BDUF/etc.; I took away that you believe intent must be expressed and captured somewhere, and that nothing short of that will save you. And I absolutely agree with that. I work someplace where we routinely deal with very large C# codebases.

    It's absolutely not the language that's the driving problem. It's people wielding it to make big balls of mud.

    I just got done wandering through an ASP classic app (ooh, it's not statically typed, that means it's cool, right?) with a few hundred forms. And yes, it was the exact same steaming pile most appreciable ASP classic apps ended up becoming.

    I'm pretty sure I can take IronWhatever to some shops and some parts of the world and turn it into many thousands of lines of IronSteamingPile.

    Also, Jeremy D. Whatever's alt.comment about whatever he was wrinkled up about was pretty off-base imo. I've been reading Frans' posts for a number of years now; I think the Dynamic SQL classics were how I originally found my way here. My take is that Frans has strong opinions, but I never took this for omniscience. Good post, Frans; both Atwood and Yegge are whiffing on the bigger picture.

    Ruby will not save us.

  • This is a tangent, but caught my attention nonetheless..

    "Decide per algorithm and model how to write it into code"

    I'd like to call you out on modeling how to write it into code. At some point, I'd like to hear your concrete strategy for modeling something into code, no BS theoretical bloviating.

    Modeling is a very real practice, but it's also a bit of a buzzword these days. There are a lot of people paying lip service to the term, diluting the actual practice. Your usage feels like dilution to me. I'd like you to show me wrong or stop using the term. Consider this a friendly challenge. ;-)

  • "Modeling is a very real practice, but it's also a bit of a buzzword these days. There are a lot of people paying lip service to the term, diluting the actual practice. Your usage feels like dilution to me. I'd like you to show me wrong or stop using the term. Consider this a friendly challenge. ;-)"
    Why should I stop using a term which has been around for decades, just because some MDA people think it's overused?

    I used 'model' in the sense of an abstract entity model for example. That's a model, so if you don't like it being called a model, I'm sorry, but I can't do otherwise, it IS a model.

  • Great post. LoC should never be the measurement of anything, ever. Good or bad. I still can't believe people try and claim it means anything.

Comments have been disabled for this content.