BasicParse with State/Transition graphing, a much more robust parser.

See Also:
BasicParse, a parser/compiler for a pseudo Bind configuration file format.

Abstract:
The previous parser was a linear parser: all state was handled in a single parsing function, with a few stream-operation helper functions for working with the token stream.  This isn't necessarily the most effective method for generating a parser, though it does lead to relatively short development times.  In the end that parser felt and looked hacked together, so we are going to use a different approach in this article.

A parser can be defined by a State/Transition graph.  States let us select functions that continue parsing the underlying stream in a more meaningful, contextual manner.  Transitions are changes from one state to another; the result of each transition should be the new state the parser will enter.  The end result of the process is a far more modular, and possibly more compact, parser as the project grows in complexity and size.

TOC:

    1. Examining our State/Transition graph
    2. Creating a TokenStream for processing tokens
    3. Defining the compiler method using State/Transition graphing
    4. Defining our ParseState transition methods
    5. Conclusion

1.  Examining our State/Transition graph
To fully define our parser we have to understand the flow of the language we are parsing.  I've discussed this in cursory detail before, but now I'll define the states and their transitions more thoroughly.  There are some additional states that we won't back with methods, since they don't exist long enough within the parser to contain their own parsing logic, but we'll include them in the state enumeration nonetheless.  The states that we will examine are as follows:

State List
    Open
    ParseMember
    Assignment
    ParseValue
    ParseString
    StatementEnd

Our parsing begins in the Open state.  The Open state is a safe state, a basic state that exists to denote we aren't within any parsing context.  During this state we are looking for something to do.  From the Open state we can either re-enter the Open state or move into the ParseMember state.  ParseMember is where we look for nesting statements, assignment operators, and statement terminators; during this process we figure out how to process the member in question.  The transition list up to now looks as follows:

Open            -> Open
Open            -> ParseMember

ParseMember     -> Open
ParseMember     -> Assignment

Note we are already beginning to skip some states.  For instance, the ParseMember -> Open transition occurs when we parse the nesting token.  There is a short period of time where we are in the NestStart state, but our parser skips over this state as there is no additional processing that must occur.  We could be more fine-grained and build this state into the parser, but there really isn't any need, except for purposes of additional debugging.

To continue: depending on the current token, we may also jump into the Assignment state, followed by either StatementEnd or ParseString.  Again, we skip an intermediate state in some cases (ParseValue -> StatementEnd) because we are already on top of the token within the parser and can process it immediately and pass on to the following state.  Here are the final few transitions:

Assignment      -> StatementEnd
Assignment      -> ParseString

ParseString     -> StatementEnd

StatementEnd    -> Open

That ends our State/Transition mapping.  Our final ParseState enumeration is actually quite simple, and I've added in the intermediate states that the parser never enters:

public enum ParseState {
    Error = 0,
    Open = 1,
    NestStart = 2,
    NestEnd = 3,
    ParseMember = 4,
    Assignment = 5,
    ParseValue = 6,
    ParseString = 7,
    StatementEnd = 8
}
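
Since the graph is now written down explicitly, the legal transitions can also be expressed as code.  Purely as an illustrative debugging aid (this isn't part of the parser itself, just a sketch of something you could assert against), the mapping above translates directly into a lookup:

// Hypothetical debugging helper: returns true only for the transitions
// that appear in the mapping above.  Not part of the actual parser.
public static bool IsLegalTransition(ParseState from, ParseState to) {
    switch(from) {
        case ParseState.Open:
            return (to == ParseState.Open || to == ParseState.ParseMember);
        case ParseState.ParseMember:
            return (to == ParseState.Open || to == ParseState.Assignment);
        case ParseState.Assignment:
            return (to == ParseState.StatementEnd || to == ParseState.ParseString);
        case ParseState.ParseString:
            return (to == ParseState.StatementEnd);
        case ParseState.StatementEnd:
            return (to == ParseState.Open);
        default:
            return false;
    }
}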

2.  Creating a TokenStream for processing tokens
The last parser had a very crude token processing algorithm: we enumerated directly over the array of tokens.  Further, we processed token semantics at the very beginning, so we had to pass over our tokens twice.  The new TokenStream saves us from storing pointers to the current token and avoids a large amount of token-offset arithmetic.  We have also collapsed all of our smaller helper functions into a single function that can scan past, or eat, certain types of tokens.
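
Before the stream itself, note that it leans on the Token type from the lexer plus two small supporting types.  Their real definitions ship with the source download; the sketch below is simply consistent with how they are used here:

// Minimal sketches inferred from usage; the real definitions are in
// the downloadable source.  Token itself comes from the lexer article,
// and all we rely on here is its TokenData string.
public enum TokenType {
    Unknown, Nest, NestEnd, Assignment,
    StringDelimiter, StatementTerminator, Whitespace, Id
}

public class TypedToken {
    private Token token;
    private TokenType tokenType;

    public TypedToken(Token token, TokenType tokenType) {
        this.token = token;
        this.tokenType = tokenType;
    }

    public Token Token { get { return this.token; } }
    public TokenType TokenType { get { return this.tokenType; } }
}

With those in place, here is the stream: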

public class TokenStream {
    private Token[] tokens;
    private int streamOffset;
    private TypedToken current;
   
    public TokenStream(Token[] tokens) {
        this.tokens = tokens;
        this.streamOffset = -1;
    }
   
    public void Reset() {
        this.streamOffset = -1;
    }
   
    // Advances the stream; returns false once the end has been reached.
    public bool MoveNext() {
        if ( !EndOfStream() ) {
            this.streamOffset++;
            this.current = null;    // invalidate the cached typed token
        }
       
        return (!EndOfStream());
    }
   
    // Classifies the raw token on first access and caches the result
    // until the stream moves again.
    public TypedToken Current {
        get {
            if ( this.streamOffset == -1 ) {
                throw new Exception("You must call MoveNext() before using the stream");
            }
            if ( this.current == null && !EndOfStream() ) {
                this.current = ConvertTokenToType(this.tokens[this.streamOffset]);
            }
           
            return this.current;
        }
    }
   
    public void ScanPast(TokenType tokenType) {
        ScanPast(new TokenType[] { tokenType });
    }
   
    // Eats tokens for as long as the current token matches one of the
    // given types, leaving the stream on the first non-matching token.
    public void ScanPast(TokenType[] tokenTypes) {
        while(!EndOfStream()) {
            bool matched = false;
            for(int i = 0; i < tokenTypes.Length; i++) {
                if ( Current.TokenType == tokenTypes[i] ) {
                    MoveNext();
                    matched = true;
                    break;
                }
            }
            if ( !matched ) {
                return;
            }
        }
    }
   
    public bool EndOfStream() {
        return (this.streamOffset >= tokens.Length);
    }

    // Maps raw token text onto the types the parser understands;
    // anything unrecognized is treated as an identifier.
    private TypedToken ConvertTokenToType(Token token) {
        TokenType tokenType = TokenType.Unknown;
        switch(token.TokenData) {
            case "{":
                tokenType = TokenType.Nest;
                break;
            case "}":
                tokenType = TokenType.NestEnd;
                break;
            case "=":
                tokenType = TokenType.Assignment;
                break;
            case "\"":
                tokenType = TokenType.StringDelimiter;
                break;
            case ";":
                tokenType = TokenType.StatementTerminator;
                break;
            case " ":
                tokenType = TokenType.Whitespace;
                break;
            default:
                tokenType = TokenType.Id;
                break;
        }
        return new TypedToken(token, tokenType);
    }
}

All of the code is basic stream parsing logic.  A linear, one-way pass through the token stream is all we really need to build a much more effective and compact parser.
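
To make the stream semantics concrete, here is a tiny hypothetical usage snippet.  The hand-built tokens stand in for lexer output, and I'm assuming Token exposes a constructor taking its raw text:

// Hand-built token list standing in for lexer output (the Token
// constructor shown here is an assumption about the lexer's types).
Token[] tokens = new Token[] {
    new Token(" "), new Token("options"), new Token("{"), new Token("}")
};

TokenStream stream = new TokenStream(tokens);
stream.MoveNext();                                  // position on the first token
stream.ScanPast(TokenType.Whitespace);              // eat the leading space
Console.WriteLine(stream.Current.TokenType);        // prints: Id
Console.WriteLine(stream.Current.Token.TokenData);  // prints: options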

3.  Defining the compiler method using State/Transition graphing
The new compiler is highly simplified compared to our last one.  We now store a parse state as we move through the stream, and parse completion is determined solely by that state.  All of this is handled with a single switch statement.

public XmlDocument CompileToXml() {
    this.tokenStream.Reset();
    this.parseState = ParseState.Open;
    this.currentDocument = new XmlDocument();
    this.currentDocument.LoadXml("<configuration></configuration>");
    this.contextStack = new Stack();
    this.contextStack.Push(this.currentDocument.DocumentElement);

    while(tokenStream.MoveNext()) {
        Console.WriteLine("Parsing State: {0}", this.parseState);
        switch(this.parseState) {
            case ParseState.Open:
                this.parseState = TransitionOpenState();
                break;
            case ParseState.ParseMember:
                this.parseState = TransitionParseMemberState();
                break;
            case ParseState.Assignment:
                this.parseState = TransitionAssignmentState();
                break;
            case ParseState.ParseString:
                this.parseState = TransitionParseStringState();
                break;
            case ParseState.StatementEnd:
                this.parseState = TransitionStatementEndState();
                break;
            default:
                throw new Exception("There was a stream error");
        }
    }

    return this.currentDocument;
}

That looks a lot more compact than the last parser.  We set up some initial state and immediately start parsing through our token stream.  Depending on the return result of our transition methods, a new parse state is entered and the parsing continues.  Left to its own devices, the parser would begin at the start of the stream and continue to the end, automatically eating tokens as it goes; the transition methods would simply need to set some valid state, or perhaps do nothing at all.  However, since we want our compiler to populate the XML document with data, we need to implement each of the transition methods in full.
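
As a quick sanity check, a driver might look something like the following.  The Lexer class name, the StringToTokens signature, and the parser's class name and constructor are all assumptions here; the source download has the real shapes:

// Hypothetical driver.  StringToTokens is the lexer entry point from
// the previous article; the names and signatures are assumed.
string source = "options { recursion = yes; }";
Token[] tokens = Lexer.StringToTokens(source);

BasicParse compiler = new BasicParse(tokens);
XmlDocument document = compiler.CompileToXml();
Console.WriteLine(document.OuterXml);
// Expected output:
// <configuration><options><recursion Value="yes" /></options></configuration>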

4.  Defining our ParseState transition methods
The states and transition methods loosely match our token types.  You'll notice that each transition method may do some initial whitespace processing, but eventually each one looks for particular token types; if they aren't found, we return an error state.  Since our first parse state is the Open state, we'll implement its transition method first.  Most of the logic from the original parser can be copied into place with minimal changes.  We use our new TokenStream methods, such as ScanPast and the Current property, for grabbing the current token in the stream.  After processing, we return a new parse state depending on what happened.

public ParseState TransitionOpenState() {
    this.tokenStream.ScanPast(TokenType.Whitespace);

    if ( this.tokenStream.EndOfStream() ) {
        return ParseState.Open;
    }

    if ( this.contextStack.Count > 1 && this.tokenStream.Current.TokenType == TokenType.NestEnd ) {
        this.contextStack.Pop();
        return ParseState.Open;
    } else if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
        this.currentNode = this.currentDocument.CreateElement(XmlConvert.EncodeName(this.tokenStream.Current.Token.TokenData));
        ((XmlNode) this.contextStack.Peek()).AppendChild(this.currentNode);

        return ParseState.ParseMember;
    }

    return ParseState.Error;
}

Note that since this is a compiler as well as a parser, we interact with the XmlDocument as we go along.  In a larger project, we would append an output module that works over top of an abstract parse tree, so there would be an intermediate format we work with during the parse phase and an additional phase where the output is generated.  This isn't such a project, and that would only add a level of complexity we don't really need.
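
For what it's worth, that larger design might take a shape like the following.  This is purely illustrative; none of it exists in this project:

// Hypothetical shape for the larger design: the parser would build
// neutral nodes, and separate output modules would render them.
public class ParseNode {
    public string Name;
    public string Value;
    public ArrayList Children = new ArrayList();    // child ParseNodes
}

public interface IOutputModule {
    void Emit(ParseNode root, TextWriter writer);   // XML, INI, and so on
}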

Continuing through our states, we next implement the ParseMember transition along with the rest of the transition methods.  There really isn't much to talk about here, since all of the logic is the same as in the previous article, only reformatted to read more easily.

public ParseState TransitionStatementEndState() {
    this.tokenStream.ScanPast(TokenType.Whitespace);

    // Guard: running off the end of the stream mid-statement would
    // otherwise leave Current null.
    if ( this.tokenStream.EndOfStream() ) {
        return ParseState.Error;
    }

    if ( this.tokenStream.Current.TokenType == TokenType.StatementTerminator ) {
        if ( this.currentValue != null ) {
            XmlAttribute nodeAttr = this.currentDocument.CreateAttribute("Value");
            nodeAttr.Value = this.currentValue;
            this.currentNode.Attributes.SetNamedItem(nodeAttr);

            this.currentValue = null;
        }

        return ParseState.Open;
    }

    return ParseState.Error;
}

public ParseState TransitionParseStringState() {
    this.currentValue = string.Empty;
    while(!this.tokenStream.EndOfStream() && this.tokenStream.Current.TokenType != TokenType.StringDelimiter) {
        this.currentValue += this.tokenStream.Current.Token.TokenData;
        this.tokenStream.MoveNext();
    }

    // Guard: if the stream ended before a closing delimiter, Current
    // would be null, so check for end of stream first.
    if ( !this.tokenStream.EndOfStream() && this.tokenStream.Current.TokenType == TokenType.StringDelimiter ) {
        return ParseState.StatementEnd;
    }

    return ParseState.Error;
}

public ParseState TransitionAssignmentState() {
    this.tokenStream.ScanPast(TokenType.Whitespace);

    // Guard against an unexpected end of stream.
    if ( this.tokenStream.EndOfStream() ) {
        return ParseState.Error;
    }

    if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
        this.currentValue = this.tokenStream.Current.Token.TokenData;
        return ParseState.StatementEnd;
    } else if ( this.tokenStream.Current.TokenType == TokenType.StringDelimiter ) {
        return ParseState.ParseString;
    }

    return ParseState.Error;
}

public ParseState TransitionParseMemberState() {
    this.tokenStream.ScanPast(TokenType.Whitespace);

    // Guard against an unexpected end of stream.
    if ( this.tokenStream.EndOfStream() ) {
        return ParseState.Error;
    }

    if ( this.tokenStream.Current.TokenType == TokenType.Nest ) {
        contextStack.Push(this.currentNode);
        return ParseState.Open;
    } else if ( this.tokenStream.Current.TokenType == TokenType.Assignment ) {
        return ParseState.Assignment;
    } else if ( this.tokenStream.Current.TokenType == TokenType.StatementTerminator ){
        return ParseState.Open;
    }

    return ParseState.Error;
}
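
To tie the transition methods together, it helps to trace one statement through the graph.  Assuming the lexer hands us the tokens for zone { notify = no; } (whitespace elided), the parser walks the states like this:

Open          : "zone"   -> create <zone> element         -> ParseMember
ParseMember   : "{"      -> push <zone> on context stack  -> Open
Open          : "notify" -> create <notify> element       -> ParseMember
ParseMember   : "="                                       -> Assignment
Assignment    : "no"     -> currentValue = "no"           -> StatementEnd
StatementEnd  : ";"      -> set Value="no" on <notify>    -> Open
Open          : "}"      -> pop <zone>                    -> Open

The compiled result is <configuration><zone><notify Value="no" /></zone></configuration>.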

5.  Conclusion
This type of format is going to allow for much more complex parsers.  Movement through the token stream is quick and efficient, while still letting us easily eat unused tokens.  The transition functions are compact and easy to read, even when the State/Transition graph grows large.  We can add transition logic for new features with minor changes to the underlying parser and without impacting other areas of it.  And since we are state driven rather than recursive, we don't have to worry about recursion attacks and stack overflows.

I'd still call this a hobbyist parser in every sense of the word.  We are getting closer and closer to a real design, though.  At some point maybe we'll build an auto-generated parser for the same language and see how well it performs.  That would definitely give us an idea of how close our hand-rolled parser comes to what you might call a mainstream parser.  Enjoy!

Download Source Project (C#): Lexer, 2 Parsers, and a Test Program

Published Sunday, May 16, 2004 8:08 PM by Justin Rogers

Comments

Monday, May 17, 2004 6:33 AM by Darren Neimke

# re: BasicParse with State/Transition graphing, a much more robust parser.

Very nice Justin - do you have a full code dump of this stuff yet or will you post something when you have it completed?

Also, what language is this parser built for? I presume that it is built for a subset of a C-like language?
Monday, May 17, 2004 1:54 PM by William Stacey

# re: BasicParse with State/Transition graphing, a much more robust parser.

Very good Justin. I am still studying it, but it looks like you're really moving on it. Cheers!
Monday, May 17, 2004 2:49 PM by Ron

# re: BasicParse with State/Transition graphing, a much more robust parser.

Jason,

A comment on the switch in ConvertTokenToType.

Should the switch on token.TokenData evaluate case " ": (space)? Didn't the StringToTokens method toss the spaces? If you're coding for the possibility someone called StringToTokens without tossing anything, shouldn't you also test for tabs, carriage returns, and line feeds to mark them as white space?
Monday, May 17, 2004 2:58 PM by Justin Rogers

# re: BasicParse with State/Transition graphing, a much more robust parser.

[Darren] I have a full code dump for this. I'll get it up shortly.

[Ron] The Lexer and Parser are two different modules. My switch statement is not more complete because my input to the lexer is always explicit. The easy override in the lexer that just takes a source input is only there as a demonstration of how you might use it. The switch statement for that case would be much different.
Tuesday, May 25, 2004 8:31 AM by Darren Neimke

# re: BasicParse with State/Transition graphing, a much more robust parser.

Thanks Justin... A quick tip: I found it useful to expose the this.mStreamOffset from the TokenStream class. I exposed it as a readonly property named "Position".
Tuesday, May 25, 2004 12:57 PM by Justin Rogers

# re: BasicParse with State/Transition graphing, a much more robust parser.

I could see that. Another useful feature is to not eat \n and have the TokenStream count line numbers and line character positions. I'll look at adding this meta-data into the classes.