BasicParse, a parser/compiler for a pseudo Bind configuration file format.

Editorial Notes:

    1. Added section 5 on making a small modification to the parser; all later sections have been renumbered accordingly.

Abstract:
This article covers the process of converting tokens from the BasicLex functions into a compiled format.  The source language is a pseudo Bind configuration format that I pulled off a newsgroup.  I say pseudo because I wrote the parser based on a sample config, and I made some assumptions that make my parser incompatible with real Bind configuration files.  I'll remedy this inconsistency in a second parser written in a later article.  We'll cover the process of assigning semantic meaning to the tokens, scanning tokens and converting them to the target language while maintaining some basic ordering rules, and finally outputting the resulting compilation as an XmlDocument.

TOC:

    1. Taking a look at the configuration format
    2. Assigning semantic meaning to tokens
    3. Processing tokens, basic rules, and language conversion
    4. Examining the XML Output
    5. Adding a single feature, flagged fields
    6. Conclusion

1.  Taking a look at the configuration format
Definitely need to give you some background here.  The configuration format came across the MS newsgroups, and at the time I started writing the code to process it, I didn't have the full scope of what it looked like or how it should be processed.  Because of this, my parser handles only a subset of what the real format requires, but that subset is a full format in its own right.

The first feature of the format we are going to parse is that it allows basic key-value pairs or nested complex types.  Nested complex types are built by having fields that are themselves key/value pairs.  The format supports strings as values (but not as keys, also known as IDs), requires statement terminators, and throws out carriage returns, line feeds, and tabs.  Here is the sample file that I'm using as a test.

options {
    foo = "bar";
    bar = -1;
    baz {
        what = "hum";
        hum = "foo"         ;
    }
    nestedType1 {}
    nestedType2 {   }
}

darnit { foo = "bar"; }
lasttry {funky=-430895;}

The above configuration file covers a large number of gotchas, including liberal use of whitespace, value ordering, etc...  I'll stop here and let the sample fully define the complexity and abilities of the language.  One thing that isn't apparent is the nesting ability.  You can define as deep a nesting tree as you like, and the parsing is done in a linear loop (no recursive attacks against the compiler or stack overflows).

2.  Assigning semantic meaning to tokens
First we need to grab some tokens.  You can determine the breaking tokens and toss-out tokens we'll be using from the call to BasicLex covered in the previous section and a quick look at the configuration file.  The following call will get you a bunch of Token objects that have no semantic meaning whatsoever (well, they do, but we'll talk about that in a second).

Token[] tokens = BasicLex.StringToTokens(source, " \n\r\t{}\"=;", "\t\r\n");

In some languages various tokens can have the same meaning depending on context, and in those situations assigning semantic meaning to the tokens can be difficult.  In the case of our parser, the breaking tokens are pretty much assigned 1:1 to a backing token type.  Some tokens will initially be assigned an inappropriate meaning, but the parser is resilient enough to convert incorrectly typed tokens to correctly typed tokens based on scope (you'll see this when we have to reconstruct a string later in the parser).  We'll have 7 meaningful token types, plus an 8th type for undefined tokens.

public enum TokenType {
    Unknown = 0,
    Id = 1,
    Nest = 2,
    Assignment = 3,
    StringDelimiter = 4,
    StatementTerminator = 5,
    NestEnd = 6,
    Whitespace = 7
}

Most of the tokens are obvious, and as pointed out the mapping will be 1:1, but I want to store the enumeration value for each token alongside the Token itself.  We'll create a new class, TypedToken, that stores a Token and its type.

public class TypedToken {
    public Token Token;
    public TokenType TokenType;

    public TypedToken(Token token, TokenType tokenType) {
        this.Token = token;
        this.TokenType = tokenType;
    }
}

Finally, we have to convert Token data into TypedToken data.  A simple switch statement will suffice in our case, and no semantic meaning will be assigned based on previous tokens.  This is one of the simplest parsers you could possibly define because of our lack of rules for assigning semantic meaning.  Don't worry about that; more semantic meaning will be assigned later as we process the typed tokens.

/* The purpose of this method is to build context.  We are assigning meaning
   to tokens that have no meaning.  We could go a step further and compress
   the token stream by combining whitespace tokens, converting string
   delimiter token sequences into strings, etc..., but we won't; I want this
   to be as simple as possible to demonstrate each step.
*/
public static TypedToken[] TokensToTypedTokens(Token[] tokens) {
    TypedToken[] typedTokens = new TypedToken[tokens.Length];
    for(int i = 0; i < tokens.Length; i++) {
        TokenType tokenType = TokenType.Unknown;

        switch(tokens[i].TokenData) {
            case "{":
                tokenType = TokenType.Nest;
                break;
            case "}":
                tokenType = TokenType.NestEnd;
                break;
            case "=":
                tokenType = TokenType.Assignment;
                break;
            case "\"":
                tokenType = TokenType.StringDelimiter;
                break;
            case ";":
                tokenType = TokenType.StatementTerminator;
                break;
            case " ":
                tokenType = TokenType.Whitespace;
                break;
            default:
                tokenType = TokenType.Id;
                break;
        }

        typedTokens[i] = new TypedToken(tokens[i], tokenType);
    }
    return typedTokens;
}

The comments at the top of the method help back up my previous statements.  We simply don't want to complicate the semantic processing step, while still making the next step much easier on the eyes (you can process tokens and assign semantic meaning at the same time; we pre-process our tokens and assign meaning earlier in the process because it saves us some code later).
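
For reference, here is how the lexing and typing passes wire together.  The lexer call is the same one from section 2, and typedTokens is the array that all of the parsing fragments in the next section operate on:

Token[] tokens = BasicLex.StringToTokens(source, " \n\r\t{}\"=;", "\t\r\n");
TypedToken[] typedTokens = TokensToTypedTokens(tokens);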

3.  Processing tokens, basic rules, and language conversion
Since our language is nested, we'll start by defining a context stack.  This stack keeps track of our nested, or recursive, context.  The current parent node is always on top of the stack, and statements that cause us to leave the current parent will pop it off the stack.  Since every program has to have a default or global context, that context will be the first item we push onto the stack.  Since we are creating a config->XML compiler, our parent context will come in the form of XmlNode objects.

Stack contextStack = new Stack();
       
XmlDocument xDoc = new XmlDocument();
xDoc.LoadXml("<configuration></configuration>");
contextStack.Push(xDoc.DocumentElement);
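
One thing the fragments below won't show is the loop itself.  Every fragment lives inside a single linear scan over the typed tokens; a minimal skeleton (assuming the index variable i that the fragments reference, matching the full source download) looks something like this:

// Sketch of the enclosing loop; the state-handling fragments that follow
// form its body.  Each fragment advances i as it eats tokens.
for(int i = 0; i < typedTokens.Length; i++) {
    // Open state: handle a NestEnd, then scan for an Id, then branch into
    // nesting or value processing.
}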

The ordering of our processing loop is very important because each step in the parsing process puts us into a state that is dependent upon the previous step.  By the time we get back to the beginning of the loop we have to be in a state that we'll call Open.  We aren't going to store any variables with the state, so the only place you'll see the word Open is when I'm describing the processing loop.  The Open state means we are prepared to process either a new Id or a NestEnd token.  These are the only two tokens we expect to see in an Open state (we'll be throwing out whitespace in many cases, so that is also allowed, but we won't talk about it much).  We also only expect to see a NestEnd if we are at least 2 deep on the context stack; remember that you can't leave the global scope.  Processing the Open state appears as the following:

if ( contextStack.Count > 1 ) {
    int scanNestEnd = ScanIgnoreWhitespace(TokenType.NestEnd, i, typedTokens);
    if ( scanNestEnd > -1 ) {
        Console.WriteLine("Executing Parser Loop: Scanned Nesting End");
        i = scanNestEnd;
        contextStack.Pop();
        continue;
    }
}

Console.WriteLine("Executing Parser Loop: Scanning ID");
int scanId = ScanId(i, typedTokens);
if ( scanId == -1 ) {
    break;
}

Processing a nest end will pop our current context and continue the loop, which puts us back into the Open state.  We are using a helper function, ScanIgnoreWhitespace, that allows us to scan for a token later in our stream while skipping any whitespace.  If the token type we are scanning for is not the next non-whitespace token in the stream, then the function returns -1 and we know it wasn't found.

/* Scans for a token type while ignoring whitespace.  This saves us a lot of typing
*/
private static int ScanIgnoreWhitespace(TokenType type, int startToken, TypedToken[] typedTokens) {
    return ScanFor(
        type,
        new TokenType[] {
            TokenType.Whitespace },
        startToken,
        typedTokens);
}

If this isn't a nest end, then we need to look for an Id, because we are either processing an id-value pair or an id-nesting pair.  We'll find out which later; for now we just need to find out if the next token is an Id.  We do this using the ScanId function (we really didn't need this function, but I wrote it for later usage in another parser), which ignores whitespace and finds the next Id.  Again, we get -1 if it is not found.  In this case, not finding an Id means we have an error, so we break out of our loop and stop.

/* Helper method for finding an Id for us */
private static int ScanId(int startToken, TypedToken[] typedTokens) {
    return ScanFor(TokenType.Id, new TokenType[] { TokenType.Whitespace }, startToken, typedTokens);
}

Both of our helper methods make use of a ScanFor method, so we should take a look at that before we continue our processing loop.  ScanFor is a way of processing the token stream by eating tokens.  Tokens can be eaten because they are being ignored or because they are the target token type.  The sole purpose of the loop is to return the index of the next token of the target type.

/* This method knows how to get me a token that I'm looking for while skipping
   certain tokens that are Ok.
*/
private static int ScanFor(TokenType type, TokenType[] eat, int startToken, TypedToken[] typedTokens) {
    int currentToken = startToken;
    while(currentToken < typedTokens.Length && typedTokens[currentToken].TokenType != type) {
        bool canEat = false;

        for(int i = 0; i < eat.Length; i++) {
            if ( eat[i] == typedTokens[currentToken].TokenType ) {
                canEat = true;
                break;
            }
        }

        if ( !canEat ) {
            return -1;
        }

        currentToken++;
    }

    if ( currentToken < typedTokens.Length && typedTokens[currentToken].TokenType == type ) {
        return currentToken;
    }

    return -1;
}

I could explain all of the logic in the above, but I think you get the gist of what is going on, so we'll continue the processing loop.  The last thing we processed was an Id, so we are now in the processing state I'll call the Type state.  We call it this because the Id we just processed is our type, and it is going to reference either an id-value mapping type or an id-nesting type.  The requirements for this state are that we create a new XmlNode to store the type and then differentiate between nesting types and value types.

XmlNode idNode = xDoc.CreateElement(typedTokens[scanId].Token.TokenData);
((XmlNode) contextStack.Peek()).AppendChild(idNode);

Console.WriteLine("Executing Parser Loop: Processing Nest Start or Value Start");
int scanNestStart = ScanIgnoreWhitespace(TokenType.Nest, scanId + 1, typedTokens);
if ( scanNestStart > -1 ) {
    Console.WriteLine("Executing Parser Loop: Processing Nest Start");
    contextStack.Push(idNode);
    i = scanNestStart;
} else {

Well, that was easy.  Basically, we create a new element and add it to the current context node.  Then we look for a nesting token.  If we find a nesting token then our type is obvious, and we take the appropriate action for a nesting type, which is to make it the current context node, update our scan location, and go back into the Open state.  You can't see the transition back into the Open state because we haven't issued a continue, but the if...else statement is the end of the processing loop (which you can't see), and so we enter back into the Open state through that logic.  The step of updating the scanning location is something you have to watch out for, because the scan helpers don't do it for you, and any number of tokens may have been eaten as skipped whitespace.

Our next state is the Value state.  The Value state is well structured and requires that several key steps succeed in order for a value to be set.  First we'll eat an Assignment token, followed by the value, and finally a StatementTerminator token.  The first and last steps are easy and quite basic, while the process of obtaining the value, which comes in one of two formats, is actually much more difficult.  We'll start by simply grabbing the Assignment operator and then figuring out what type of value we are going to work with.

Console.WriteLine("Executing Parser Loop: Processing Value Start");
// Not starting a nesting?  We must be a value;
int scanAssignment = ScanIgnoreWhitespace(TokenType.Assignment, scanId + 1, typedTokens);
if ( scanAssignment == -1 ) {
    // Error
    break;
}
Console.WriteLine("Executing Parser Loop: Scanned Assignment");

int idOrString = ScanIdOrString(scanAssignment + 1, typedTokens);
if ( idOrString == -1 ) {
    // Error
    break;
}
Console.WriteLine("Executing Parser Loop: Scanned ID Or String");

Again, not finding any of our expected tokens means an error condition, and so we break.  A good compiler might throw an exception letting you know what went wrong, but our compiler will simply give you what it has parsed so far.  I don't see this as a problem, because I don't feel like writing a bunch of custom exceptions to tell the user exactly what happened.  Writing error support can be complex and extremely in-depth in larger compilers, while it wouldn't be very difficult in ours because of our language's simplicity.
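
If you did want basic error support, one minimal shape for it might be the following.  This is hypothetical; nothing like it exists in the article's source, which simply breaks out of the loop:

// Hypothetical error support, not part of the real BasicParse source.  Each
// "// Error" break site could throw one of these instead of silently stopping.
public class ParseException : ApplicationException {
    public ParseException(TokenType expected, int tokenIndex)
        : base(string.Format("Expected {0} near token {1}.", expected, tokenIndex)) { }
}

// A break site would then read:
//     if ( scanAssignment == -1 ) throw new ParseException(TokenType.Assignment, scanId + 1);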

If we get this far, we either have an Id, which can be processed directly into a value, or we have a StringDelimiter, and we'll need to enter a String state where we process a string.  You'll recall that I mentioned our parser would assign additional semantic meaning to tokens during the parsing loop, and that is what is happening here.  We are converting Id tokens into values, and in the case of a StringDelimiter, we are converting all of the tokens between two StringDelimiter tokens into a value.  This is semantic translation and could have been done in a previous step, so that in place of complex parsing routines here we would just look for Assignment, Value, StatementTerminator sequences.  That doesn't buy us much, so we'll just do it inline.  Note that we have a new ScanString method that basically looks for a StringDelimiter token while eating all other tokens.  Our strings are capable of having any breaking tokens embedded within them, and they can span multiple lines (though we'll lose the carriage returns/line feeds/tabs that we ate in the lexer).
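
Neither ScanIdOrString (used in the fragment above) nor ScanString (used below) appears in the listings here; the real versions ship with the full source download.  Sketches consistent with their described behavior, built on the same ScanFor primitive, might look like this:

/* Sketches of the two helpers not shown in the article listings.
   ScanIdOrString returns the next non-whitespace token if it is an Id or a
   StringDelimiter; ScanString eats everything up to the closing delimiter. */
private static int ScanIdOrString(int startToken, TypedToken[] typedTokens) {
    int id = ScanId(startToken, typedTokens);
    if ( id > -1 ) {
        return id;
    }
    return ScanIgnoreWhitespace(TokenType.StringDelimiter, startToken, typedTokens);
}

private static int ScanString(int startToken, TypedToken[] typedTokens) {
    return ScanFor(
        TokenType.StringDelimiter,
        new TokenType[] {
            TokenType.Unknown, TokenType.Id, TokenType.Nest,
            TokenType.Assignment, TokenType.StatementTerminator,
            TokenType.NestEnd, TokenType.Whitespace },
        startToken,
        typedTokens);
}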

int valueTerminator = -1;
string value = string.Empty;
if ( typedTokens[idOrString].TokenType == TokenType.StringDelimiter ) {
    Console.WriteLine("Executing Parser Loop: Building Value From String");
    // We need to build a string now
    valueTerminator = ScanString(idOrString + 1, typedTokens);
    if ( valueTerminator == -1 ) {
        // Error
        break;
    }

    for(int concat = idOrString + 1; concat < valueTerminator; concat++) {
        value += typedTokens[concat].Token.TokenData;
    }
} else {
    Console.WriteLine("Executing Parser Loop: Building Value From ID");
    valueTerminator = idOrString;
    value = typedTokens[valueTerminator].Token.TokenData;
}
Console.WriteLine("Executing Parser Loop: Our Value {0}", value);

You see why it didn't buy us much to perform this translation earlier in the process?  We only expend a few extra lines of code to build the string versus simply grabbing the value out of the token.  After the value is processed, we are either still in the Value state, or we return to the Value state from the String state.  We now need to finalize our value and exit back into the Open state to continue our loop.  Our parser is almost done.  If we find the StatementTerminator, we'll push the value onto our XmlNode object as an attribute.  I could have made it a nested element, but nested elements in our case are only for nesting or complex types, while attributes are used to store values on types.  This will come in handy later once I write the full Bind configuration parser.

int endStatement = ScanIgnoreWhitespace(TokenType.StatementTerminator, valueTerminator + 1, typedTokens);
if ( endStatement == -1 ) {
    // Error
    break;
}
Console.WriteLine("Executing Parser Loop: Scanned End Statement");

XmlAttribute nodeAttr = xDoc.CreateAttribute("Value");
nodeAttr.Value = value;
idNode.Attributes.SetNamedItem(nodeAttr);
i = endStatement;

The assignment of our attribute value and the setup of the token stream pointer put us at the end of this parser.  You might find this useful in your own applications, or you might not.  Hopefully this was a learning experience, and you can extend this knowledge to write more complex parsers for your own configuration files or possibly your own programming languages.
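
Once the loop exits, the compiled document is sitting in xDoc, so emitting the result (which the next section examines) is trivial:

// After the processing loop: dump or persist the compiled configuration.
Console.WriteLine(xDoc.OuterXml);   // write the XML to the console
xDoc.Save("compiled.xml");          // or save it for later consumption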

4.  Examining the XML output
The XML output is basic.  We store all of our types inside a special document element called <configuration>.  Nested or complex types are realized as XML elements that themselves contain elements; these nested or complex types have no attributes.  Simple types, or name/value pairs, are realized as an element with a single attribute, called Value, where the value of the attribute is the value of the type.  I could have done some better naming to make that less confusing, but hey, this is supposed to be the easy part.  Here is the output of our compiler run over the configuration file from the first section.

<configuration>
  <options>
    <foo Value="bar" />
    <bar Value="-1" />
    <baz>
      <what Value="hum" />
      <hum Value="foo" />
    </baz>
    <nestedType1 />
    <nestedType2 />
  </options>
  <darnit>
    <foo Value="bar" />
  </darnit>
  <lasttry>
    <funky Value="-430895" />
  </lasttry>
</configuration>

5.  Adding a single feature, flagged fields
Per a request on the newsgroup, there was interest in an additional ability, so I figured I'd cover the process of adding a feature to the parser.  While the original article was quite complete and informational, examining how we add a feature can definitely aid in more fully understanding the code and the process we are performing.  The new feature looks like this:

forwarders { 192.168.0.1; 192.168.0.2; }

As you can see, we now have statements that carry no value and no assignment token.  Instead, we want to create elements nested inside of the <forwarders> element.  These elements are complete in themselves: the name gives all of the information needed, and no Value attribute need be assigned.  I call these flagged fields, since another use might be to specify the flagged bit values of an enumeration.

The first step in our process will be to examine how our processing logic changes.  By the time we get to the token "192.168.0.1", we have just processed an Id.  Once the Id is processed, we make a switch between entering the Nested context or the Value context.  The nested context remains the same as before, but as soon as we enter the Value state things have to change.  We now have a conditional switch on the assignment operator.  Before, the lack of an assignment meant failure and our algorithm would tank out.  Now the lack of an assignment operator simply means we skip over the process of obtaining a value.  Let's see how this works.

} else {
    Console.WriteLine("Executing Parser Loop: Processing Value Start");
    // Not starting a nesting?  We must be a value;
    string value = null;
    int valueTerminator = scanId;

    int scanAssignment = ScanIgnoreWhitespace(TokenType.Assignment, scanId + 1, typedTokens);
    if ( scanAssignment > -1 ) {
        Console.WriteLine("Executing Parser Loop: Scanned Assignment");

        int idOrString = ScanIdOrString(scanAssignment + 1, typedTokens);
        if ( idOrString == -1 ) {
            // Error
            break;
        }
        Console.WriteLine("Executing Parser Loop: Scanned ID Or String");

        valueTerminator = -1;
        value = string.Empty;
        if ( typedTokens[idOrString].TokenType == TokenType.StringDelimiter ) {
            Console.WriteLine("Executing Parser Loop: Building Value From String");
            // We need to build a string now
            valueTerminator = ScanString(idOrString + 1, typedTokens);
            if ( valueTerminator == -1 ) {
                // Error
                break;
            }

            for(int concat = idOrString + 1; concat < valueTerminator; concat++) {
                value += typedTokens[concat].Token.TokenData;
            }
        } else {
            Console.WriteLine("Executing Parser Loop: Building Value From ID");
            valueTerminator = idOrString;
            value = typedTokens[valueTerminator].Token.TokenData;
        }
        Console.WriteLine("Executing Parser Loop: Our Value {0}", value);
    }

    int endStatement = ScanIgnoreWhitespace(TokenType.StatementTerminator, valueTerminator + 1, typedTokens);
    if ( endStatement == -1 ) {
        // Error
        break;
    }
    Console.WriteLine("Executing Parser Loop: Scanned End Statement");

    if ( value != null ) {
        XmlAttribute nodeAttr = xDoc.CreateAttribute("Value");
        nodeAttr.Value = value;
        idNode.Attributes.SetNamedItem(nodeAttr);
    }
    i = endStatement;
}

That is a lot of code, but most of it you've already seen.  The important part is that we enter the value processing section only if an Assignment token is present.  If one is not present, then we fall through and look for the StatementTerminator.  This doesn't have a huge impact on our code, because we are simply removing an error condition and skipping a complex portion of our parsing code.  We also special-case the hook-up of the Value attribute on the XML output so that it depends on whether or not we tried to parse a value.  Easy stuff.  A sample XML output of the above change would be:

<forwarders>
  <192.168.0.1 />
  <192.168.0.2 />
</forwarders>

We have a small problem.  That isn't valid XML, because an element name can't start with a digit.  We have to encode our element names in order to be fully XML compliant.  This also means you'll need to decode the element names when you finally work with the XML output in your program.  Encoding the element name isn't hard; the XmlConvert class provides EncodeName and DecodeName methods that will suffice.

XmlNode idNode = xDoc.CreateElement(XmlConvert.EncodeName(typedTokens[scanId].Token.TokenData));
((XmlNode) contextStack.Peek()).AppendChild(idNode);
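
With the encoding in place, the forwarders section comes out with its leading digits escaped (EncodeName replaces each invalid name character with an _xHHHH_ escape sequence), looking something like this:

<forwarders>
  <_x0031_92.168.0.1 />
  <_x0031_92.168.0.2 />
</forwarders>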

Now, when we create our nodes, they will be fully XML compliant.  You've just made your first modification to the parser, and it wasn't that hard.  You can see that any change to the processing states requires changes to the switching/branching logic.  This is normally handled in a parser by conceptualizing the process into states and transitions.  Each state is assigned a processing function, and the result of the function is the next state for the compiler.  You can see that our code already does this form of processing, but in a fully inlined format.  If we want to make anything more complex, we'll need to use a better design.
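
As a teaser for that better design, here is one possible shape for it.  Every name below is my own invention for illustration; the inlined loop above is what this article actually builds:

// A sketch of the states-and-transitions design described above.
enum ParserState { Open, Type, Value, Done, Error }

delegate ParserState StateHandler(ParserContext context);

class ParserContext {
    public int Position;          // index into the typed token stream
    public TypedToken[] Tokens;   // the stream itself
    public Stack ContextStack;    // current XmlNode nesting
    public XmlDocument Document;  // the compilation target
}

// The driver collapses to a lookup-and-dispatch loop; each handler eats
// tokens, mutates the context, and names its successor state.
static void Run(Hashtable handlers, ParserContext context) {
    ParserState state = ParserState.Open;
    while ( state != ParserState.Done && state != ParserState.Error ) {
        state = ((StateHandler) handlers[state])(context);
    }
}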

6.  Conclusion
What did we learn here?  Well, not very much about professional compiler design, I'm afraid, but we did learn quite a bit about hobbyist compiler design.  You see, it isn't that hard to write a basic compiler for converting between various formats or compiling a simple language, and the lexer is surprisingly flexible at tokenizing just about anything.  I want to take this a bit further, so I'll attempt to fully implement the Bind configuration file format in a future article.  It appears to have a number of structures that would increase the number of parsing states, and so we'll have to rethink our design a bit if we want the parser to remain elegant.  I'll cross that bridge when I come to it, so I'm outtie for now.  Enjoy!

Full Source (C#): Code-Only: BasicLex and BasicParser with sample runner program for the pseudo Bind language.

Published Sunday, May 16, 2004 1:44 AM by Justin Rogers
