Adding keywords to BasicParse, keeping it short and sweet.
Darren was curious about the tokenizer and about how to add keywords to the BasicParse language. He had some thoughts on the idea, so I'll help him out and post a somewhat full example. His truncated comments:
I was going to ask you how you would add keywords into your BasicParse language. Up until now you have only parsed single char tokens and strings.
This is both true and false. My lexer is very basic in that it splits on breaking tokens. To add keywords, you might think of working at the lexer level. However, since I've combined portions of the lexer and parser, we can update the TokenStream class to assign semantic meaning to identifiers that should be keywords. We'll start by adding that symbol table. This is a simple class, backed by a Hashtable where we map strings to the TokenType enumeration. We could put some additional meta-data on the symbol table (which we'll do later), but for now it is a basic string->TokenType mapping. We add strings by calling AddString and the SymbolTable automatically checks to see if it has already interned the string, and if so, returns the cached TokenType, else it sets a default TokenType. We add keywords when we first create the SymbolTable and before our TokenStream class begins to use it by calling AddKeyword and specifying both the string and the TokenType to be used.
public class SymbolTable {
private Hashtable symbolTable = new Hashtable();
private TokenType identifierType;
public SymbolTable(TokenType identifierType) {
this.identifierType = identifierType;
}
public void AddKeyword(string keyword, TokenType type) {
this.symbolTable[keyword] = type;
}
public TokenType AddString(string identifier) {
if ( !this.symbolTable.ContainsKey(identifier) ) {
this.symbolTable[identifier] = this.identifierType;
}
return (TokenType) this.symbolTable[identifier];
}
public void ClearIdentifiers() {
Hashtable newSymbolTable = new Hashtable();
foreach(string key in symbolTable.Keys) {
if ( ((TokenType) this.symbolTable[key]) != this.identifierType ) {
newSymbolTable[key] = this.symbolTable[key];
}
}
this.symbolTable = newSymbolTable;
}
}
This is going to be a patch of the state graphing sample, since it allows us to easily add this feature. I don't want to break legacy apps, so our new code-path will allow for key/value pairs to be preceded by var, but they don't have to be. We'll start with our constructor. We now create a symbol table, add the single keyword “var” as TokenType.Var, and then create our TokenStream.
public STPWithSymbolTable(Token[] tokens) {
this.symbols = new SymbolTable(TokenType.Id);
this.symbols.AddKeyword("var", TokenType.Var);
this.tokenStream = new TokenStream((Token[]) tokens.Clone(), this.symbols);
}
The TokenStream will store the symbol table, and use it later. The major change to TokenStream is a change to the default: clause in ConvertTokenToType. Instead of returning TokenType.Id all the time, we'll return the results of AddString. Remember, keywords that have already been added to the symbol table will return their stored TokenType, while new strings will get the default TokenType.Id as before.
default:
tokenType = syms.AddString(token.TokenData);
break;
This change is fairly superfluous to the rest of the parser, because we are only adding an optional keyword that we can throw out because it has no impact on the final compilation. In our Open state we now look for the Var token, and enter a ParseVar state if we find it. Again, I'm logically breaking the parser down into extremely small pieces, and since Var is such a superfluous element, you might find other faster ways than setting state and continuing the loop. The only sticky item is that we do need a new state (can't just jump back into the Open state, but that would be sweet), since a var standing by itself should probably be invalid. Jumping back into the Open state would allow us to process endless streams of var keywords without error.
// Open state changes
} else if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
AllocateNode();
return ParseState.ParseMember;
} else if ( this.tokenStream.Current.TokenType == TokenType.Var ) {
return ParseState.ParseVar;
}
// Allocate node method, since two areas now create a node and we don't want to repeat the code.
private void AllocateNode() {
this.currentNode = this.currentDocument.CreateElement(XmlConvert.EncodeName(this.tokenStream.Current.Token.TokenData));
((XmlNode) this.contextStack.Peek()).AppendChild(this.currentNode);
}
// Our ParseVar Transition method
public ParseState TransitionParseVar() {
this.tokenStream.ScanPast(TokenType.Whitespace);
if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
AllocateNode();
return ParseState.ParseMember;
}
return ParseState.Error;
}
// And finally adding the ParseVar state to our state switch in the main loop.
case ParseState.ParseVar:
this.parseState = TransitionParseVar();
break;
To finalize things a bit, keywords aren't part of the breaking behavior of the lexer. Instead, a keyword is nothing more than an identifier with some additional amount of state surrounding it. If we take away the state of “var” so it becomes an Id, we'll see that it behaves as an Id would. If we add state for some other identifier, then we can make it a keyword instead. I mentioned in my previous posting that changing the symbol tables provided to the lexer can drastically change the underlying language. I alluded that there might be some interesting patching behvior you could add to a language using this.
#patch u = using; f = for; pu = public; pr = private; cl = class;
pu cl STPWithSymbolTable {
...
u(StreamReader sr as new StreamReader()) { f(int i = 0; i < 10; i++) { sr.ReadLine(); } sr.Close(); }
The above is not meant for macro expansion, not in the slightest. Instead it tells the lexer that when the identifier u is found, it is given the semantic meaning of using. This is different from the name replacement or aliasing available in C#. That form of aliasing actually replaces the new name with the old name during the symbol resolve step. A neat feature of remapping might be to more explicitly define mathematical behavior.
#patch add = +; minus = -; mult = *; div = /;
int result = (10 add 5 minus 3) div 10 mult 4;
Macro expansion is immensely powerful and somewhat dwarfs the use of patching for these tasks within a mainstream language. This is clearly in the scope of more specialized or higher level languages where it is common for people to want to work with the compiler in their own language.