"Re: I have a question about proving code correctness"

Recently I received an email with the following contents:

I recently started working at a startup. I'm learning lots of cool stuff, including unit testing, but as a math major something kinda nagged at me about them. I don't know if I could write a unit test that I really trust all that much. It's better than manually poking my code to be sure, but I felt there might be a better way. Fast forward to an article posted on ycombinator which gave me the knowledge that there are actually people who prove their code correct. I stumbled upon a blog post you wrote in 2009 and have found a few more from around the same time, but I was wondering if you were familiar with any literature on best practices, application, etc. I have some guesses as to how I personally would start to go about attempting to prove code correctness, but I find it useful to have books written by people who implement this practice in their day to day work lives. Thank you for your time

Proving the correctness of a whole system is hard. It's so hard, in fact, that for a large system it's practically undoable. So should one give up on it altogether? Well... no, not exactly, as there's a different approach: do it in two steps. It's based on the idea that a unit test actually tests at least two things, not one:

  • A unit test tests whether the algorithm being exercised produces the correct outcome for the specified input.
  • A unit test tests whether the implementation of that algorithm is correct.

If a test fails, it can be because there's a bug in the code (e.g. the code uses the wrong index variable), or a bug in the algorithm (e.g. it fails to reach the desired outcome for a given input set), or both (and then you're in real trouble ;)). Because a failing test can mean more than one thing, it's good to weed out all possibilities but one: that way, when a test or verification fails, we know it's always for the same reason and never for another. So if we, for example, eliminate errors in the algorithm, we can be sure that if a test fails, the implementation of the algorithm is broken, not the algorithm itself.
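
To make that distinction concrete, here's a minimal, hypothetical C# sketch (the method, the test and the values are made up for illustration, NUnit-style attributes assumed); the same failing test can point at either cause:

    using NUnit.Framework;

    [TestFixture]
    public class ThresholdSearchTests
    {
        // Hypothetical helper: return the index of the first element >= threshold, or -1.
        public static int FirstIndexAtOrAbove(int[] values, int threshold)
        {
            // Implementation bug example: writing '>' instead of '>=' here would make the
            // test below fail even though the algorithm (a linear scan) is correct.
            for(int i = 0; i < values.Length; i++)
            {
                if(values[i] >= threshold)
                {
                    return i;
                }
            }
            return -1;
        }

        // Algorithm bug example: if the algorithm had been designed as "scan from the end",
        // a flawless implementation of it would still make this test fail.
        [Test]
        public void FirstIndexAtOrAbove_ReturnsFirstMatchingIndex()
        {
            Assert.AreEqual(1, FirstIndexAtOrAbove(new[] { 1, 5, 7 }, 5));
        }
    }

A red test by itself doesn't tell you which of the two went wrong; that's exactly the ambiguity the rest of this post tries to remove.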

The 'easiest' thing to eliminate from your tests is algorithm correctness testing. I say 'easiest', because formulating an algorithm is often easier than writing it down in executable form, as that comes down to translating it into code after you have formally worked it out in your head. Proving an algorithm to be correct is a task which ranges from trivial to extremely difficult, and it's part of what's called Formal Verification. Formal verification is, like its name suggests, rather formal, so if you're not that into math, it can be daunting. So I'll describe a different approach which, in my humble opinion, is rooted in the same ideas as formal verification, albeit without using formal math.

The overall idea is this:

  • For a given feature, a specification is written (be it a user story or a thick, 120-page document written by an over-paid consultant) which describes what the feature should do with what input.
  • To be able to realize the feature as stated in the specification, algorithms and the data-structures on which they operate are designed. No coding is done yet. You'll see that during these first two steps changes happen often, and it's very easy to apply them, much easier than when altering code.
  • Using verification techniques, the algorithms are proven to be correct for the expected input.
  • Using Program derivation, the algorithms are implemented in executable form. As program derivation can be very formal, a developer often falls back to Design by Contract or a less formal way to transform an algorithm into executable form.

The core idea is that if the algorithm is correct, a carefully constructed implementation of it in executable form is therefore correct too. At least for the algorithm part: the implementation of the algorithm can still contain bugs, e.g. wrong statements, wrong variable usage, or boundaries of used constructs that aren't taken into account.
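
As a rough, hypothetical sketch of that last step (Design by Contract as the informal fallback; the code below is made up for this post, not taken from a real code base), the pre-/post-conditions of a proven algorithm are carried into the code as assertions, so a violation points at the implementation rather than the algorithm:

    using System.Diagnostics;

    public static class SearchAlgorithms
    {
        // Binary search over a sorted array. The algorithm itself is a well-known, proven
        // one; the assertions guard the *implementation* of it.
        public static int BinarySearch(int[] sortedValues, int key)
        {
            // Pre-condition of the algorithm: the input array is sorted ascending.
            Debug.Assert(IsSortedAscending(sortedValues), "Pre-condition violated: input must be sorted ascending");

            int low = 0;
            int high = sortedValues.Length - 1;
            while(low <= high)
            {
                // Invariant: if key is present, its index lies within [low, high].
                int mid = low + ((high - low) / 2);
                if(sortedValues[mid] == key)
                {
                    return mid;
                }
                if(sortedValues[mid] < key)
                {
                    low = mid + 1;
                }
                else
                {
                    high = mid - 1;
                }
            }
            // Post-condition of the 'not found' branch: no element equals key.
            return -1;
        }

        private static bool IsSortedAscending(int[] values)
        {
            for(int i = 1; i < values.Length; i++)
            {
                if(values[i - 1] > values[i])
                {
                    return false;
                }
            }
            return true;
        }
    }

If one of these assertions fires, the algorithm is still fine; the search for the bug can be limited to the code that violated the contract.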

One thing that is crucial is that testing and proving are both after the same goal: to make you see whether what you've written does what it should (namely what the spec says!) or not. Using proving techniques is less time consuming in many cases, as you can reason about all possible situations in one go, while with unit tests you test a subset of situations (the ones implemented in the tests), which can result in many, many tests if there are many different situations the code has to handle correctly. On top of that, unit tests without algorithm proving can still fail for two reasons (algorithm error or code error) instead of one (code error).

How I use this in practice

I'm not a mathematician, so I'm not using math to prove whether my algorithms are correct. I use pre-/post-conditions per step and formal algorithm descriptions to write down the algorithms. I also write, per feature, a description of what the feature has to do with what input. After all, if there's no description of what a feature should do with what input, there's no way to say whether the code written does what it should do, as there's no description of what it should do in the first place.

On a whiteboard I use pre-/post-condition proving of the algorithm and change it if necessary. If I'm satisfied, I write down the algorithm steps as comments in the code and implement the data-structures required, or check whether existing data-structures fit the algorithm. After that I implement each step, either in place where the comment sits or in its own method/class.
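
To give an impression of what that can look like (a made-up, hypothetical feature, not taken from our code base), the whiteboard algorithm ends up as comments, each step followed by the code implementing it, guarded by pre-/post-condition assertions:

    using System.Collections.Generic;
    using System.Diagnostics;

    public static class OrderProcessing
    {
        // Hypothetical feature: compute the total discount for an order.
        // Algorithm (proven on the whiteboard):
        //   pre : all order lines have non-negative quantities and prices
        //   1. sum the line totals
        //   2. determine the discount percentage from the order total
        //   3. apply the percentage to the total
        //   post: 0 <= discount <= total
        public static decimal CalculateDiscount(IEnumerable<(int quantity, decimal price)> lines)
        {
            // step 1: sum the line totals
            decimal total = 0m;
            foreach(var (quantity, price) in lines)
            {
                Debug.Assert(quantity >= 0 && price >= 0m, "Pre-condition violated: negative quantity or price");
                total += quantity * price;
            }

            // step 2: determine the discount percentage from the order total
            decimal percentage = total >= 1000m ? 0.10m : (total >= 500m ? 0.05m : 0m);

            // step 3: apply the percentage to the total
            decimal discount = total * percentage;

            // post-condition: the discount never exceeds the total
            Debug.Assert(discount >= 0m && discount <= total, "Post-condition violated");
            return discount;
        }
    }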

When I'm done, and the compiler thinks I'm done too, I go back to what I've written and start reading the code again. This is something that should be taken very seriously: humans suck at reading code, so when you read code it's essential to pay extra attention to what everything does, what the state of things is, etc. It helps to run the code through a debugger if you're not skilled at reading code (developers often 'glance' over code rather than read it) to see what it does at each step.

While reading the code, I match the code of each step with the algorithm step it implements and check whether the pre-/post-conditions are indeed taken into account. In short: did I implement the algorithm step as intended or did I cut corners? Remember, this verification isn't done to check whether the algorithm works or not! It's meant to see whether the implementation of the algorithm is done correctly. We already proved the algorithm to be correct before we typed a single line of code.

It's important not to get cocky here: it's nice to show off your OO skills and how cleverly you can intertwine these silly loops, but doing so should never ruin the connection between the algorithm description and the code itself: it's very important for a person who reads the code to know which algorithm it implements, so changes can be made accordingly if the algorithm changes after a while.

At this point I have:

  • A description of the feature, which is the basis of proving whether the code implementing the feature does what it should
  • A series of algorithm descriptions, which are proven to be correct.
  • Code which implements the algorithms for the feature
  • Data-structures which are needed for the algorithms.

And now I run the code to see whether what I read and thought to be correct actually is correct. Think of this as integration testing, as the code is run in-system, together with all the other code. Running the code can mean running a few unit tests, if the feature is a UI-less subsystem or one which is used later on by other subsystems. The unit tests are focused on what the specification describes as valid inputs (so you also know what the invalid inputs are).
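
A hypothetical sketch of what such spec-driven tests could look like (NUnit-style, reusing the made-up CalculateDiscount method from the earlier sketch; the discount rules play the role of the 'spec' here): the cases come straight from what the specification calls valid input, including the boundaries that separate valid from invalid.

    using NUnit.Framework;

    [TestFixture]
    public class CalculateDiscountTests
    {
        // The (hypothetical) spec: orders of 500 or more get 5%, of 1000 or more get 10%.
        [Test]
        public void OrderBelowFiveHundred_GetsNoDiscount()
        {
            var lines = new[] { (quantity: 1, price: 200m) };
            Assert.AreEqual(0m, OrderProcessing.CalculateDiscount(lines));
        }

        [Test]
        public void OrderAtFiveHundred_GetsFivePercent()
        {
            var lines = new[] { (quantity: 1, price: 500m) };
            Assert.AreEqual(25m, OrderProcessing.CalculateDiscount(lines));
        }

        [Test]
        public void OrderAtOneThousand_GetsTenPercent()
        {
            var lines = new[] { (quantity: 2, price: 500m) };
            Assert.AreEqual(100m, OrderProcessing.CalculateDiscount(lines));
        }
    }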

"So your code is bug-free?"

In short: not always ;). As all these steps take time, it's sometimes not possible to do every step in the time it requires; it has to be done in the time that's available instead. This leads to some corners being cut for things which seem trivial but later turn out to be less trivial. However, what I see in our code base is that, on average, the parts are relatively bug free, and that with a minimum of unit tests. Our runtime test base is only a few thousand unit tests, which is rather small for a code base of about 100K lines. The designer, which is also over 100K lines, has only 50 unit tests for core sub-systems and has had only 25 bug fixes in the past year. We achieved this by using a solid foundation with proven code, namely our Algorithmia library (an open source algorithm and data-structures library).

There are more aspects to making your code bug-free: separation of logic/code, so changing one part doesn't affect everything else; clean code, so it's easy to understand and maintain; and logical placement of functionality, so you can easily find things back and see what affects what. Understanding a code base makes making changes easier and less error prone.

There's always a chance that you break something when you make a change. For this, unit tests can help (but they're not a way to be 100% certain!): write integration tests. You can design the tests specifically for detecting breaking changes. A lot of our runtime unit tests are designed for just that: fail if a sub-system got changed in such a way that it affects the consumers of that sub-system.
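
A hypothetical sketch of such a breaking-change detector (again reusing the invented CalculateDiscount method from earlier): the test pins down an observable property that consumers rely on, so it fails as soon as a change to the sub-system breaks that property.

    using NUnit.Framework;

    [TestFixture]
    public class OrderProcessingContractTests
    {
        // Consumers (e.g. an invoicing sub-system, hypothetical here) rely on the discount
        // never exceeding the order total. If a future change breaks that, this test fails
        // before any consumer does.
        [Test]
        public void Discount_NeverExceedsOrderTotal()
        {
            var lines = new[] { (quantity: 3, price: 499.99m) };
            decimal total = 3 * 499.99m;
            decimal discount = OrderProcessing.CalculateDiscount(lines);
            Assert.That(discount, Is.InRange(0m, total));
        }
    }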

Conclusion

I hope to have given some insight into how to write more solid code, namely by starting with a solid foundation from which the code follows: what should it do? Which algorithms are used, and are these correct? Are the algorithms implemented correctly?

Many algorithms have already been proven, so you can pick those up and implement them right away (if someone hasn't already done that for you as well). After all, if what you have to implement sucks, your program will never work correctly, even if the code doesn't contain a single programmer error. You can't avoid all mistakes, but it's doable to keep the number of bugs small.

6 Comments

  • Frans,
    One problem I have with the agile mantra of 'Red, green, refactor' is that if you have a failing test but you don't know why it failed, then it is useless.
    If you write a method that fails your test because it gives you back a NotImplementedException, then you have wasted your time because you haven't learnt anything.
    But if your test fails for a reason that you weren't expecting then you have learnt something. Maybe your algorithm is incorrect.
    Blindly following the agile method is pointless, IMHO. But using unit tests intelligently can teach you a lot about your code.
    But, as you say, proving correctness is extremely difficult, if not impossible.

  • About "If you write a method that fails your test because it gives you back a NotImplementedException then you have wasted your time because you haven't learnt anything."

    That's not quite true: checking that the unit test fails when the method is not implemented is also a way to make sure that the unit test is well written from the start, even in the simplest of cases.

    Also, it encourages the developer to write the test (an executable specification) first, before the code is written, not the other way around, where he might be tempted to write a test that just reflects what the method is already doing, even if that is not the expected behavior.

    So the general rule would be: never add/modify functionality, before writing at least a test first.

    Unit tests don't prove in a mathematical way that the code is without bugs; they just reduce the probability of bugs.

  • "One problem I have with the agile mantra of 'Red, green, refactor' is that if you have a failing test but you don't know why it failed then it is useless."

    Maybe worse than useless. If a test asserts failure without a message and without a clear test name, it's worse than useless. If a test blows up in an unexpected manner, not just with a failed assert, it can be a good thing because it shows that you have a coverage problem.

    In most of my day to day work, the idea of proving is moot because the specifications are minimal or non-existent :P

  • Frans,

    I thought I had to jump in and leave some comments because your post includes some incorrect information. Unfortunately, this is one of those areas where programmers and testers have a lot of misconceptions, because all they know is software testing.

    You have to distinguish two different fields called Software Testing and Formal Methods. The former is relatively easy and includes all the things you know about unit testing and integration testing; the latter is totally different, and almost no industrial programmers know about it because it's an advanced and complicated field.

    With software testing, what you're trying to achieve is improving the quality of software by reducing the number of bugs, exceptions, and errors that you can predict before a user catches them. With software testing, what you can achieve at most is proving the presence of bugs, not their absence. Even if you write millions of tests, you can't guarantee that there is no other bug or error.

    Software testing works great for most industrial software systems that are not critical: at worst, there is an incorrect output or an exception. But there are critical software systems that must be bug-free, like those used in aviation, healthcare, or automotive.

    Formal methods are a set of mathematical techniques to actually prove that a program runs without any bugs, errors, or exceptions, in order to make sure that it's not going to kill somebody or cause a catastrophic failure. Formal methods require a deep knowledge of mathematics and programming, and they are the subject of research in academia and industry for doctoral students and graduates (that's what I'm doing).

    Formal methods like abstract interpretation and other techniques are used in systems like Astree (developed in France), which proves the absence of bugs in systems used in Airbus planes. There are also similar systems used in BMW cars and in healthcare systems, and software verification tools used by NASA.

    So unlike what you said, proving the correctness of a program doesn't involve those steps. It's a totally formal procedure that is hard to explain in a comment, but you can read a book entitled Principles of Program Analysis on that.

    You also mentioned that proving the correctness of a big software system is very difficult. It is, indeed, but it has been done many times for many big systems, and I mentioned some examples above.

    I hope this helps clarify the issue.

  • @Keyvan Nayyeri

    Thanks for your post :) As I (apparently not entirely successfully) tried to explain in my post, I'm not a specialist in this field, so I tried to describe what I know of software correctness proving and how I apply it in practice.

    So what I tried to do was give developers who have no notion of formal proving (as you said, it's a complicated field) some tools/ways of working that they can apply in their own daily work, so they create more reliable / less buggy software. What you described so well, 'software testing can at most show the presence of bugs, not their absence', is unfortunately not well known among developers: what they most of the time think when they see all-green unit tests is "bugs are gone, ship it!". Some teams do realize this though, and write a tremendous amount of tests, with rules like 'we need 100% code coverage' and other time wasters.

    IMHO that's time not well spent. But as you said, it's very hard (complex, a lot of work) to reach the goal of proving the complete system to be bug-free, so some ways to get near that goal might already be helpful. That's what I tried to describe. Of course my post isn't a way to write software for a plane; it's not formal enough, as you know.

    Great post. I hope this is the beginning of a series of posts :-) The question you are responding to in this post is a great one, one that I've asked myself. Also, while startups are usually very different one from the other, they usually have unique pressures, such as out-of-band deadlines, competing agendas (sometimes by stakeholders), and chaotic resourcing, and having to be a scientist in the middle of all of it can be grueling. It would be interesting to take this discussion to a team, and see how to make this work out in a team atmosphere (i.e. roles and responsibilities, scientist vs. engineer). I appreciated the statement "it's sometimes not possible to do every step in the time it requires"; that's a breath of fresh air when sleep is becoming rare :-). Now I need to learn more about describing and designing algorithms (without code), and "Formal methods". Keyvan, can you write some more about that ;-)
    Cheers.
