Adding keywords to BasicParse, keeping it short and sweet.

Darren was curious about the tokenizer and about how to add keywords to the BasicParse language.  He had some thoughts on the idea, so I'll help him out and post a somewhat full example.  His truncated comments:

I was going to ask you how you would add keywords into your BasicParse language. Up until now you have only parsed single char tokens and strings.

This is both true and false.  My lexer is very basic in that it splits on breaking tokens.  To add keywords, you might think of working at the lexer level.  However, since I've combined portions of the lexer and parser, we can update the TokenStream class to assign semantic meaning to identifiers that should be keywords.  We'll start by adding that symbol table.  This is a simple class, backed by a Hashtable where we map strings to the TokenType enumeration.  We could put some additional meta-data on the symbol table (which we'll do later), but for now it is a basic string->TokenType mapping.  We add strings by calling AddString and the SymbolTable automatically checks to see if it has already interned the string, and if so, returns the cached TokenType, else it sets a default TokenType.  We add keywords when we first create the SymbolTable and before our TokenStream class begins to use it by calling AddKeyword and specifying both the string and the TokenType to be used.

public class SymbolTable {
    private Hashtable symbolTable = new Hashtable();
    private TokenType identifierType;
   
    public SymbolTable(TokenType identifierType) {
        this.identifierType = identifierType;
    }
   
    public void AddKeyword(string keyword, TokenType type) {
        this.symbolTable[keyword] = type;
    }
   
    public TokenType AddString(string identifier) {
        if ( !this.symbolTable.ContainsKey(identifier) ) {
            this.symbolTable[identifier] = this.identifierType;
        }
        return (TokenType) this.symbolTable[identifier];
    }
   
    public void ClearIdentifiers() {
        Hashtable newSymbolTable = new Hashtable();
        foreach(string key in symbolTable.Keys) {
            if ( ((TokenType) this.symbolTable[key]) != this.identifierType ) {
                newSymbolTable[key] = this.symbolTable[key];
            }
        }
        this.symbolTable = newSymbolTable;
    }
}

This is going to be a patch of the state graphing sample, since it allows us to easily add this feature.  I don't want to break legacy apps, so our new code-path will allow for key/value pairs to be preceded by var, but they don't have to be.  We'll start with our constructor.  We now create a symbol table, add the single keyword “var” as TokenType.Var, and then create our TokenStream.

public STPWithSymbolTable(Token[] tokens) {
    this.symbols = new SymbolTable(TokenType.Id);
    this.symbols.AddKeyword("var", TokenType.Var);
    this.tokenStream = new TokenStream((Token[]) tokens.Clone(), this.symbols);
}

The TokenStream will store the symbol table, and use it later.  The major change to TokenStream is a change to the default: clause in ConvertTokenToType.  Instead of returning TokenType.Id all the time, we'll return the results of AddString.  Remember, keywords that have already been added to the symbol table will return their stored TokenType, while new strings will get the default TokenType.Id as before.

default:
    tokenType = syms.AddString(token.TokenData);
    break;

This change is fairly superfluous to the rest of the parser, because we are only adding an optional keyword that we can throw out because it has no impact on the final compilation.  In our Open state we now look for the Var token, and enter a ParseVar state if we find it.  Again, I'm logically breaking the parser down into extremely small pieces, and since Var is such a superfluous element, you might find other faster ways than setting state and continuing the loop.  The only sticky item is that we do need a new state (can't just jump back into the Open state, but that would be sweet), since a var standing by itself should probably be invalid.  Jumping back into the Open state would allow us to process endless streams of var keywords without error.

// Open state changes
} else if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
    AllocateNode();
    return ParseState.ParseMember;
} else if ( this.tokenStream.Current.TokenType == TokenType.Var ) {
    return ParseState.ParseVar;
}

// Allocate node method, since two areas now create a node and we don't want to repeat the code.
private void AllocateNode() {
    this.currentNode = this.currentDocument.CreateElement(XmlConvert.EncodeName(this.tokenStream.Current.Token.TokenData));
    ((XmlNode) this.contextStack.Peek()).AppendChild(this.currentNode);
}

// Our ParseVar Transition method
public ParseState TransitionParseVar() {
    this.tokenStream.ScanPast(TokenType.Whitespace);

    if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
        AllocateNode();
        return ParseState.ParseMember;
    }

    return ParseState.Error;
}

// And finally adding the ParseVar state to our state switch in the main loop.
case ParseState.ParseVar:
    this.parseState = TransitionParseVar();
    break;

To finalize things a bit, keywords aren't part of the breaking behavior of the lexer.  Instead, a keyword is nothing more than an identifier with some additional amount of state surrounding it.  If we take away the state of “var” so it becomes an Id, we'll see that it behaves as an Id would.  If we add state for some other identifier, then we can make it a keyword instead.  I mentioned in my previous posting that changing the symbol tables provided to the lexer can drastically change the underlying language.  I alluded that there might be some interesting patching behvior you could add to a language using this.

#patch u = using; f = for; pu = public; pr = private; cl = class;
pu cl STPWithSymbolTable {
    ...
    u(StreamReader sr as new StreamReader()) { f(int i = 0; i < 10; i++) { sr.ReadLine(); } sr.Close(); }

The above is not meant for macro expansion, not in the slightest.  Instead it tells the lexer that when the identifier u is found, it is given the semantic meaning of using.  This is different from the name replacement or aliasing available in C#.  That form of aliasing actually replaces the new name with the old name during the symbol resolve step.  A neat feature of remapping might be to more explicitly define mathematical behavior.

#patch add = +; minus = -; mult = *; div = /;
int result = (10 add 5 minus 3) div 10 mult 4;

Macro expansion is immensely powerful and somewhat dwarfs the use of patching for these tasks within a mainstream language.  This is clearly in the scope of more specialized or higher level languages where it is common for people to want to work with the compiler in their own language.

Published Tuesday, May 18, 2004 1:20 PM by Justin Rogers

Comments

Tuesday, May 18, 2004 7:38 PM by Darren Neimke

# re: Adding keywords to BasicParse, keeping it short and sweet.

Thanks Justin, that's really cool - I think that I'm beginning to *get* symbol tables now. Just one thing, I think that you have an error in your TransitionParseVar() method. Shouldn't the test:

if ( this.tokenStream.Current.TokenType == TokenType.Id )
...be written as:

if ( this.tokenStream.Current.TokenType == TokenType.Var )

Tuesday, May 18, 2004 10:14 PM by Justin Rogers

# re: Adding keywords to BasicParse, keeping it short and sweet.

Nope, we've already Eaten that token. Remeber that every time control is passed back to the state loop, we advance by one token. That is why in the Open state I check for Var, and then return ParseState.ParseVar. By the time I enter the ParseVar transition method, I'm already on to the next token, which by all accounts should be Id.
Wednesday, May 19, 2004 11:48 AM by William Stacey

# re: Adding keywords to BasicParse, keeping it short and sweet.

Very Cool Justin. Is this complete now for the bind language more steps to go? Cheers!
Wednesday, May 19, 2004 5:32 PM by Justin Rogers

# re: Adding keywords to BasicParse, keeping it short and sweet.

William, the bind language has a separate language definition from this language, though they do seem similar. I'll code up a parser for precisely the bind scheme this evening, I've kind of gotten off track a bit in writing about other areas of the compiler that are more interesting to me.

Things that need to be added in order for the bind scheme to work are:

1. TokenStream patching to allow include files (these are really cool and I'll talk about them since it allows processing of include files inline without preprocessing).
2. Changes to the language definition based on the way bind works. Bind has more structure than just a basic key/value system. They allow complex inline types with multiple name/value pairs that appear in a structured manner.

For the full bind configuration compiler, I think that a keyword based language will be important, and the parsing will be much more strict. Arbitrary identifiers won't be allowed, since bind has no use for arbitrary identifiers. On the back-end, I'll map these fixed format types directly to class definitions, so the final compilation format will be a configuration object you can use directly. This compiler will be more an interpreter, since it is going to translate, at run-time, a given configuration into some other more usable format.
Wednesday, May 19, 2004 7:06 PM by William Stacey

# re: Adding keywords to BasicParse, keeping it short and sweet.

Your the man. Are you a Mensa member? You do more stuff in an hour then I can do in a month. Cheers!
Wednesday, May 19, 2004 7:22 PM by William Stacey

# re: Adding keywords to BasicParse, keeping it short and sweet.

BTW. If/when you get a generic french brace parser to parse any config with key words (maybe your are there already), would you care it I "wrap" it in a MSH "cmdlet" and put it up at the beta place for Monad shell? Will provide proper credit.

$obj = get-config -file msomefile.txt
echo $obj.Var1
ecjp $obj.Var2
$obj | set-config file mynewfile.txt

Need to think more on property access, etc. Could also gen a Provider for access like:

>cd ./myconfig ## "CD" to Config Provider.
>ls ## get-childitems in container.
Item Value type
---- -------- -------
var1 "string" string
var2 32 int
var3 "one", "two" string[]
...
TIA
Thursday, May 20, 2004 1:11 AM by Justin Rogers

# re: Adding keywords to BasicParse, keeping it short and sweet.

We need to define semantics for a *generic* french brace parser. The reason for this is that the Bind format that started this entire discussion, has items such as the following:

acl "alias for acl list" {
...
}

How does a generic french brace parser perform on the above? Do we create an acl element, then give it an Id, or do we create an acl element then immediately nest another element based on the string (aka, default nesting of identifiers) and put any values into this second nested element.

What about strings of identifiers that exist in bind, such as:

controls {
inet * port 50 allow { address_list } keys { keys_list }
}

In the bind specification the inet command is a single command, and multiple instances of said command can exist within the controls section. Portions are optional.

Now, if I throw out the assertation that a statement terminator must exist my generic parser can turn the above into:

<controls>
<inet Value="*" />
<port Value="50" />
<allow></allow>
<keys></keys>
</controls>

Multiple instances would simply turn into linearly defined items. Another option is to demand statement terminators, and parse all you can. I could turn the above then into:

<controls>
<inet Value="*">
<port Value="50">
<allow></allow>
<keys></keys>
</inet>
</controls>

Things start to get really sticky when you parse based on arbitrary grammars. You start to lose all of the meaning between Item/Value/Type association. That means there can't truly be a 100% generic french brace parser unles you are willing to accept the results in a very non-determinate form.
Thursday, May 20, 2004 10:33 AM by William Stacey

# re: Adding keywords to BasicParse, keeping it short and sweet.

ahh. Ok, thanks. Guess I will stick with the Bind parser for now. Which would be fine if you can tell it what vars are valid. Not sure on this point. Do you have to "bake" in the grammer, or can you supply some array structure to the parser to parse out the language that is valid and know when something is not valid (i.e. invalid syntax, invalid keyword, etc.)

Leave a Comment

(required) 
(required) 
(optional)
(required)