Introduction to MSIL – Part 6 – Common Language Constructs

Thursday, September 23, 2004

In parts 1 through 5 of the Introduction to MSIL series we focused on MSIL and the constructs it provides for writing managed code. In the next few sections we are going to examine some common language features that are not intrinsic to instruction-based languages like MSIL. Understanding the intermediate language instructions that are generated for common programming language statements is critical to acquiring an instinct for the performance characteristics of your code, and it makes subtle bugs easier to track down.

In this part we are going to look at some common constructs that are present in many languages. I will be expressing simple examples in C# although the explanations are common to many programming languages that offer the same general constructs. As I am trying to explain a general principle, I may not always present the exact code that a given compiler may produce. The point is to learn more about what compilers generate in general. I encourage you to compare the instructions generated by different compilers using tools like ILDASM.

Programming languages like C and C++ enabled much more rapid development of software compared with classic assembly language programming. A number of rich constructs played a significant part in providing this productivity. Concepts like expressions, selection statements (e.g. if and switch), and iteration statements (e.g. while and for) are what make these portable languages so powerful. Even more productivity was gained by introducing concepts like objects, but before objects were even considered, expressions and statements provided enormous benefit to the assembly language programmer struggling to build large applications and operating systems. Let’s look at some of these statements.

The if-else Statement

Consider the following simple method that checks that its argument is not null before proceeding.

void Send(string message)
{
    if (null == message)
    {
        throw new ArgumentNullException("message");
    }

    /* impl */
}

The if statement looks innocent enough. If the expression is true, execution enters the scope block and an exception is thrown. If the expression is false, execution jumps to the first statement following the if statement. There’s enough going on here that I will leave the else clause as an exercise for the reader.

Even if you have never used C# before, this code should be pretty clear (assuming you’re a programmer). So how does the compiler turn this innocent little if statement into something the runtime will understand? As with most programming tasks, there are a number of ways to solve the problem.

In many cases it is necessary for a compiler to create temporary objects. Expressions are a common source of temporary objects. In practice, compilers avoid creating temporaries as much as possible. Assuming a non-optimizing compiler you could imagine the compiler turning the code above into something like this.

bool isNull = null == message;

if (isNull)
{
    throw new ArgumentNullException("message");
}

This is not to say that an optimizing compiler would not use temporary objects in this manner. Temporary objects can often help to generate more efficient code depending on the circumstance. Here the expression is first evaluated to a boolean value. This value is then passed to the if statement. This is not all that different from the previous example, but after reading the first five parts of this series it should be clear why this is much more appealing to an instruction-based language. It’s all about breaking down statements and expressions into simple instructions that can be executed one by one. Let’s consider one implementation of the Send method.

.method void Send(string message)
{
    .maxstack 2

    ldnull
    ldarg message
    ceq

    ldc.i4.0
    ceq

    brtrue.s _CONTINUE

    ldstr "message"
    newobj instance void [mscorlib]System.ArgumentNullException::.ctor(string)
    throw

    _CONTINUE:

    /* impl */

    ret
}

If you’ve been following along with this series you should be able to understand much of this method declaration. Let’s step through it quickly. The temporary object that the compiler might generate silently is made explicit in MSIL, although it is not named and lives very briefly on the stack. (I guess whether it is actually explicit is debatable.) Can you spot it? First we compare message to null. The ceq instruction pops two values off the stack, compares them, and then pushes 1 onto the stack if they are equal or 0 if they are not (this is the temporary). The code may seem overly complicated. The reason is that MSIL does not have a cneq, or compare-not-equal-to, instruction. So we first compare message to null then compare the result to zero, effectively negating the first comparison. More on this in a moment.

Now that we’ve got a handle on the expression result, the brtrue.s instruction provides the conditional branching you would expect from an if statement. It transfers control to the given target if the value on the stack is non-zero, thereby skipping over the if clause’s logical scope block.

Of course there is more than one way to skin a cat. The example above is very similar to what the C# compiler that I am using generates, although it uses an explicit temporary local variable to store the result of the expression. This implementation seems a bit awkward. Ultimately it does not matter too much as the JIT compiler can probably optimize away any differences. Nevertheless it is a useful exercise to see how we can simplify this implementation. The first thing we can attempt is to reduce the number of comparisons. I mentioned that there is no matching not-equal-to version of the ceq instruction. On the other hand there is a branch-on-false version of the brtrue.s instruction. Quite predictably, it is called brfalse.s. Using the brfalse.s instruction completely removes the need for the second comparison.
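As a minimal sketch of that simplification, the Send method from earlier might be rewritten along the following lines. It assumes the same method declaration and label as the listing above, and it is an illustration rather than the output of any particular compiler.

```il
.method void Send(string message)
{
    .maxstack 2

    // Push 1 if message is null, otherwise 0 (the unnamed temporary)
    ldnull
    ldarg message
    ceq

    // A zero result means message is not null, so skip the throw
    brfalse.s _CONTINUE

    ldstr "message"
    newobj instance void [mscorlib]System.ArgumentNullException::.ctor(string)
    throw

    _CONTINUE:

    /* impl */

    ret
}
```

Because brtrue.s also treats a non-null object reference as true, the comparison could arguably be dropped altogether by loading message and branching with brtrue.s _CONTINUE, although that moves a step further away from the shape of the original C# expression.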

Finally, as a C++ programmer you would expect the compiler to use the equality operator, since one of the operands is a System.String type which has an equality operator defined for it, and this is exactly what the C++ compiler does.

.method void Send(string message)
{
    .maxstack 2

    ldnull
    ldarg message
    call bool string::op_Equality(string, string)

    brfalse.s _CONTINUE

    ldstr "message"
    newobj instance void [mscorlib]System.ArgumentNullException::.ctor(string)
    throw

    _CONTINUE:

    /* impl */

    ret
}

Ultimately the JIT compiler optimizes the code as best it can, including inlining methods where appropriate. You should not be surprised if these two implementations result in the same machine code at the end of the day.

The for Statement

Before we delve into the implementation of the for statement, commonly referred to as a for loop, let’s quickly review how the for statement works. If your background is in Visual Basic or even C# you may not be all that familiar with this extremely useful statement. Even C++ programmers are encouraged to avoid the for statement for common iterations in favor of the safer std::for_each algorithm from the Standard C++ Library. There are, however, many interesting applications of the for loop that have nothing to do with iterating over a container from beginning to end.

The following simple pseudo code illustrates the construction of a for statement.

for ( initialization expression ; condition expression ; loop expression )
{
    statements
}

The for statement consists of three semicolon-delimited expressions followed by a scope block. (Don’t get me started on for statements without scope blocks.) You can use any expressions you wish as long as the condition expression results in a value that can be interpreted to mean either true or false. The initialization expression is executed first and exactly once. It is typically used to initialize loop indices for iteration or any other variables required by the for statement. The condition expression is executed next, and again before each subsequent iteration. If the expression evaluates to true, the scope block is entered. If it evaluates to false, control passes to the first statement following the for statement. The loop expression is executed after every iteration. It can be used to increment loop indices, move cursors, or do anything else that might be appropriate. Following the loop expression, the condition expression is evaluated again, and so on. For a more complete description, please consult your programming language documentation. Here is a simple example that will write the numbers zero through nine to any trace listeners you have set up.

for (int index = 0; 10 != index; ++index)
{
    Debug.WriteLine(index);
}

We have already spoken about how an if statement might be implemented in MSIL. Let’s now make use of that knowledge to deconstruct the for statement into something simpler for a computer to interpret while still using C#.

    int index = 0;
    goto _CONDITION;

_LOOP:

    ++index;

_CONDITION:

    if (10 != index)
    {
        // for statements
        Debug.WriteLine(index);

        goto _LOOP;
    }

That looks nothing like the for statement! Here I use the infamous goto, more generally referred to as a branch instruction. Branching is inevitable in languages that don’t support selection and iteration statements. It serves no purpose in languages like C++ and C# other than to obfuscate the code. (If you’re a major supporter of the goto statement, please don’t feel the need to share that with me. Feel free to talk about its merits on your own blog.) Code generators, however, often use goto statements, since emitting a branch is a lot simpler for a code generator than constructing the expressions for a for statement, for example.

By following the branching in the code example above you should be able to see how we construct the for statement semantics by executing the condition and then jumping up to the loop expression to conditionally loop again. Now let’s consider how we can implement this in MSIL.

    .locals init (int32 index)
    br.s _CONDITION

_LOOP:

    ldc.i4.1
    ldloc index
    add
    stloc index

_CONDITION:

    ldc.i4.s 10
    ldloc index
    beq _CONTINUE

    // for statements
    ldloc index
    box int32
    call void [System]System.Diagnostics.Debug::WriteLine(object)

    br.s _LOOP

_CONTINUE:

After initializing the index local variable we branch to the _CONDITION label. To evaluate the condition I push the value 10 onto the stack followed by the index value. The beq, or branch on equal, instruction pops two values off the stack and compares them. If they are equal it transfers control to the _CONTINUE label, thereby ending the loop; otherwise control continues through the ‘statements’ in the for loop. To write the index to the debug trace listener I push the index onto the stack, box it, and call the static WriteLine method on the System.Diagnostics.Debug reference type from the System assembly. Following this, the br.s instruction transfers control to the loop expression for the next iteration.

Read part 7 now: Casts and Conversions

As always, wonderful work.

Reminds me of the days I coded 8086 assembler in school. Those were the days...

More! Soon! :D

If you want someone to help write more of these chapters, just let me know. I don't even want credit, but it's too interesting and I don't want to step into your territory by doing it on my own. :)

Omer van Kloeten - Thursday, September 23, 2004 10:22:00 PM
