Introducing symbol tables with a C# example using 'get' as the point of interest.

Now that I'm on my compiler kick, I'll try to add some features to the compiler I've already implemented to make it more mainstream and useful as a more complete sample.  To start, I want to introduce the concept of the symbol table and it's importance to the compiler, and more importantly, how it changes the meaning of various tokens, and in some cases, turns them into something they are not.

The most important item to note is the get/set keywords you use in a property statement.  In fact they aren't keywords at all or at least not reserved words.  They only have meaning when found inside of a property definition block, and no place else.  This means you can use them as the names for local variables, field names, even the name of a property.  The following is a valid C# program that makes use of the token 'get' multiple times in multiple contexts.

using System;

public class UseGet {
    public int get {
        get {
            int get = 6;
            return get;
        }
    }

    private static void Main(string[] args) {
        int get = 10;
        Console.WriteLine(get);
        Console.WriteLine((new UseGet()).get);
    }
}

We'll count the 'get' tokens starting from the top.  First, we have a property named get, then a get accessor.  In the second case, get becomes a keyword while in the property definition block (or does it?  They could be doing token substitution in the symbol table and renaming get to get_get, but we'll get to that in a minute).  Inside of the get accessor, we have a function local named get, and finally get as the operand to the return statement, counting 4 so far.  We have an additional function local named get in the Main method, followed by get as a parameter, and finally an access of the get property on our class, counting a total of 7.  In each case, the usage of 'get' has a different meaning or changes the meaning of a previously defined 'get'.

Time to introduce the symbol table.  Whenever our tokenizer processes a new token, some heuristics are performed to find out of what type the token is.  These heuristics determine whether a token might be a number, a string, a keyword, or an identifier.  An identifier is a name that we use to identify a variable, type, etc..., and probably accounts for the majority of the tokens that get processed.  Each token that gets pulled, once it is classified as some form of identifier, is looked up in a symbol table.  Those that don't exist, get added.  Scope is extremely important in this process, as we'll see as we build our symbol tables.

The first time we find get, we are in the body of the class or type.  However, the lexer doesn't get to assign semantic meaning, and we don't know that we are in the class body.  Or does it?  We can always swap out the current symbol table depending on our context.  At this time, the symbol table for the class members is the active symbol table, and so looking for get, we are looking for a member named get.  One isn't found, so it gets added.  The second get, I'm not sure how they resolve.  We could probably look at the csharp compiler available through rotor, and I'll probably do that since they might use an interesting method.  I can say there is a symbol table method, using a symbol table patch, where we add the get/set identifiers as keywords so they temporarily resolve when in the property declaration block.

The next get we encounter isn't looked up in the member symbol table, instead it gets looked up in the function local symbol table, again, our parser is probably closely linked with the lexer so we are in the appropriate symbol table.  The third get is probably complicated, since it could either reference the local get or the member get.  Since a local get does exist, then that is used, but in the case it didn't, we would reference our member get (and create a recursive property that would eventually fail).  How do things like that get handled?  Well, we never actually use two symbol tables (at least I don't think so), instead we use modes on the symbol table to change how symbols are searched for and inserted.  This allows us to easily search multiple levels of symbols and control which levels we search through simple flags.  Another method is the symbol table patch which allows us to temporarily add a symbol table on top of a higher level symbol table while within a particular scope.  The nice thing about the patch is that we can throw it away if we are emitting code incrementally.  After all, you don't need local information inside of a function that has already been emitted.

Three get identifiers left.  The first get is again added to the function local symbol table.  Note that in this case we have to distinguish between referencing the member symbol table versus the local symbol table.  After all, there is a property with the same name.  Again, contextual information from the parser is important in this case.  The next two are pretty simple.  Since they are representing variable access, we'll look for the first get in the local and then the member symbol table looking for elements with certain properties (the member symbol table access would only resolve static fields and properties, since we are inside a static method).  The final get statement is resolved for us explicitly because it is preceded by a type reference (context from the parser).

To go through some quick notes and ask/answer some questions you might note that you can use a symbol that isn't yet defined.  This is called an unresolved symbol.  Depending on your language there are different times that you can resolve symbols.  You can either explicitly enter symbols into the table as being defined versus being accessed (contextually when an ID is found you know the difference) then search for symbols that have been used, but not defined later by a quick search of the symbol table.  Another method would be to add our symbols, then worry about resolving them in a later step.  Some portions of this method would be dependent on how you parse your tree.

Symbol tables are actually pretty important to compiler development.  I tried to make this a simple examination of how they are used to provide specific resolution features, but it turns out there are many ways other than symbol tables to perform the same operations.  The feature that I want you to walk away with is the possibilities in symbol table patching for changing grammars.  Grammars are based on abstract things like the enumeration value TokenType.ForLoop defines a for loop node in the tree, while source input is based on things like “for” maps to the lexan TokenType.ForLoop.  If the mapping is slightly changed then you can substitute any word or words and change the source language.  The C# using statement can do this through aliasing, but they stop short of the full possibilities.  They don't allow you to change keywords or make typedefs, turn keywords off, lots of stuff is possble.

Time to go add a reason for symbol tables into the BasicParse language.

Published Tuesday, May 18, 2004 12:27 AM by Justin Rogers

Comments

Tuesday, May 18, 2004 5:01 AM by Justin Rogers

# re: Introducing symbol tables with a C# example using 'get' as the point of interest.

To shed some light on the method C# uses to process the get accessor of a property I offer up the following:

1. It appears all of C#'s symbol resolution happens after the parse tree has been created. They do have a name manager in their lexer/parser, but I think it is more to save space and resolve keywords than any other purpose.

2. For the get accessor they use a special entry in this name manager, but when the lexical token for get is returned it is returned as an identifier no matter what the scope. So they don't use a symbol table for conditional processing during the lexer/parser phase.
Tuesday, May 18, 2004 8:24 AM by Darren Neimke

# re: Introducing symbol tables with a C# example using 'get' as the point of interest.

Most of this stuff went whooooshing over my head because I haven't really done anything on symbol tables before - they are a wierd beast!

I was going to ask you how you would add keywords into your BasicParse language. Up until now you have only parsed single char tokens and strings.

I was wondering what would happen to the lexer if you add a keyword such as "var" to your grammar. Would you simply add the chars 'v', 'a', 'r' to your terminal lookup table such that it would now become:

public static Token[] StringToTokens(string tokenString) {
return StringToTokens(tokenString, " \n\r\t{}\"=;.()[],var", " \t\r\n");
}
Tuesday, March 10, 2009 7:12 PM by ...

# re: Introducing symbol tables with a C# example using 'get' as the point of interest.

Dies ist ein gro�er Ort. Ich m�chte hier noch einmal.

Saturday, May 16, 2009 10:28 PM by nick_linoou

# re: Introducing symbol tables with a C# example using 'get' as the point of interest.

Sunday, November 20, 2011 2:12 AM by gradz

# re: Introducing symbol tables with a C# example using 'get' as the point of interest.

nore igrace <a href=www.drustvo-nasi.si/>Otroska igrala za prireditve</a> za sprostitev.

Leave a Comment

(required) 
(required) 
(optional)
(required)