May 2004 - Posts
I've posted quite a few items covering the process of decomposition. Most of the examinations are textual only and cover the process of generating decompositions by hand. I've since issued a challenge for the creation of an algorithm that would automatically handle a given range and return the decomposition in regular expression format that could be used to validate input in the range. I talked through a sample algorithm that had some issued, then covered these issues, and finally posted my own algorithm in C# that creates a matching alternation group that isn't minimally sized, but works for the give range. Have a look.
The Beginning: Value range parsing using Regular Expressions. Breaking down the dotted decimal IP byte processing.
Follow-up on ranges: 0 through N ranges are interesting, but what about M through N ranges.
Some Oddities that are hard to handle: That range algorithm was more difficult than I originally pointed out due to some strange oddities.
The post reference the article and algorithm: An Algorithm has been prepared, I'm linking it into an article so you don't have to look at it if you don't want. It is a spoiler.
Let's detail a problem set. You have a nice TreeView with a backing XML data store. Now, the user can update this data store using node changes, drag drops, or methods that create new nodes. Keeping up with the TreeView used to be a real pain in the neck, but the XPathDocument comes to the rescue. The code for each of these problems now becomes quite light. Let's check out some of the features.
We start by assuming that all tree nodes that we create on the Windows Forms side contain some information linking them back into the XML document. I do this by creating what I call the NavigatorTreeNode. This node basically props all of it's information based on the underlying XPathNavigator pointer. Now, I have no clue about the performance of creating hundreds of these little XPathNavigator objects, but I'm hoping they are lightweight. Here would be a sample tree node.
public class NavigatorTreeNode : TreeNode
{
private XPathNavigator navigator = null;
public XPathNavigator Navigator
{
get
{
return this.navigator;
}
set
{
this.navigator = value;
}
}
}
The first operation we'll define is an insert. Now that Whidbey has node-clicks you can store the currently clicked node whenever a context-menu is shown. With the current node, we have enough information to edit the underlying document and insert a new node. This can easily be done using:
XPathEditableNavigator editor = xPathDocument.CreateEditor();
editor.MoveTo(parentNode.Navigator); // Move to the current mark
editor = editor.AppendChild(“<Element />“);
editor.CreateAttribute(““, “NewAttribute“, ““, “Name of new Element“);
myNewNode.Navigator = editor.CreateNavigator();
The explanation for the above is to create a new editable navigator, move it to the point where we recorded reading our node, and finally begin to insert a new node and value representing our tree node that will be added to parent node. The next step is working with AfterLabelEdit so we can update our nodes whenever they change. AfterLabelEdit returns the node changed, and the label so not much work has to be done.
NavigatorTreeNode tn = e.Node as NavigatorTreeNode;
if (tn != null)
{
XPathEditableNavigator editor = xPathDocument.CreateEditor();
editor.MoveTo(tn.Navigator);
if (editor.MoveToAttribute("NewAttribute", ""))
{
editor.SetValue(e.Label);
}
}
Moving nodes around starts to get a bit more complicated, but not by much. You can use some overrides to make things really easy. Basically, you use the XPathNavigator of your node as a mark, then create an editor for the location you need the node inserted. Using these you can easily insert your old node path right into the new location, then delete you old node.
XPathEditableNavigator editor = xPathDocument.CreateEditor();
editor.MoveTo(insertBeforeMe.Navigator);
editor = editor.InsertBefore(nodeToMove);
XPathEditableNavigator deleteEditor = xPathDocument.CreateEditor();
editor.MoveTo(nodeToMove.Navigator);
editor.DeleteCurrent();
nodeToMove.Navigator = editor.CreateNavigator();
Think that covers a couple of options for sync'ing your tree nodes with your XPathDocument. I'm thinking there are probably some easier methods for doing a couple of these things. For instance, I'm surprised that I have to create an editor every time and move it to the navigator location. I could jus store editor's instead of navigators I guess, but I don't know about the overhead of using editor's in place of navigators everywhere. It's very odd that it takes two steps and I'd think XPathDocument should have some overrides that take XPathNavigator marks. That would cut out at least a single line of code.
Moving elements within the tree should be easier as well. A copy followed by a delete doesn't seem very cool and doesn't appear very transactional. I'm hoping they add the ability to move all elements from some XPathNavigator to somewhere else in the tree, immediately updating the XPathNavigator to the new location. That would take care of some of the work involved for sure.
A programming challenge? Maybe a bit of one. If nothing else it is a neat algorithm and examination of generating capture groups that will handle numbers within ranges. The first posting I did on this I examined the generation of 0 through N ranges and ended with the examination of parsing 0-255 or a byte value. In this posting I examine using a lower bound that is not 0, how many decompositions are required to validate such a range, and a general purpose algorithm for doing so.
0 through N ranges are interesting, but what about M through N ranges.
Just code today my friends. The design is simple, create a framework that allows running an asynchronous delegate on the thread pool in a manner where it can be cancelled and any exceptions can be handled. We use the same BeginInvoke/EndInvoke model as the rest of the asynchronous programming models under the framework. Our generic classes allow for the creation of a new generic asynchronous result object for each piece of work we want to do. To fully define a generic async operation we need to specify some from of input, the type of the output, and we can even strongly type the async state object.
ar = GenericCancellableOperation<RegexInput, int, Match>.BeginInvoke(
new RegexInput(regex, "555"),
new GenericWorkDelegate<Match, RegexInput>(RegexWorkDelegate),
new AsyncCallback(MatchComplete),
counter++
);
The above is a sample usage that does the same thing as my previously defined asynchronous regular expression runner. We can no longer specify entire strings of input parameters, so we have to wrap them all in a single object. Here I've used RegexInput to define the Regex and input to match against. We still use the old AsyncCallback model, even though we could strongly type that as well. In fact, BeginInvoke and EndInvoke still use IAsyncResult instead of more strongly typed equivalents. The GenericWorkDelegate will always match the input and return types for the operation generic, basically allowing us to delegate whatever work needs to occur back to your code. It gets wrapped with all of the protective stuff automatically. You still can't run arbitrary code, because any old fool could wrap a try...catch in their delegate and handle the ThreadAbortException that would normally terminate the operation.
using System;
using System.Threading;
public delegate ReturnType GenericWorkDelegate<ReturnType, InputType>(InputType input);
public class GenericCancellableOperation<InputType, AsyncStateType, ReturnType> {
public class GenericAsynchronousResult : IAsyncResult {
// Generic State Parameters
internal AsyncStateType asyncState = AsyncStateType.default;
internal InputType input = InputType.default;
internal ReturnType output = ReturnType.default;
internal GenericWorkDelegate<ReturnType, InputType> workCallback = GenericWorkDelegate<ReturnType, InputType>.default;
// Non generic state
internal bool complete = false;
internal bool completedSynchronously = false;
internal bool cancelled = false;
internal ManualResetEvent waitHandle = null;
internal AsyncCallback callback = null;
internal Thread currentThread = null;
internal Exception innerException = null;
internal Object lockObject = new Object();
internal GenericAsynchronousResult(InputType input, GenericWorkDelegate<ReturnType, InputType> workCallback, AsyncCallback callback, AsyncStateType state) {
this.asyncState = state;
this.input = input;
this.callback = callback;
this.workCallback = workCallback;
this.waitHandle = new ManualResetEvent(false);
}
public object AsyncState { get { return this.asyncState; } }
public bool CompletedSynchronously { get { return this.completedSynchronously; } }
public bool IsCompleted { get { return this.complete; } }
public WaitHandle AsyncWaitHandle { get { return this.waitHandle; } }
public AsyncStateType TrueAsyncState { get { return this.asyncState; } }
public InputType Input { get { return this.input; } }
public ReturnType Output { get { return this.output; } }
public bool Cancel() {
lock(lockObject) {
if ( this.cancelled || this.complete ) {
return false;
}
this.cancelled = true;
if ( this.currentThread != null ) {
this.currentThread.Abort();
}
}
this.waitHandle.Set();
return true;
}
internal void Complete() {
lock(lockObject) {
if ( this.complete || this.cancelled ) {
return;
}
this.complete = true;
}
this.waitHandle.Set();
if ( callback != null ) {
callback(this);
}
}
}
public static GenericAsynchronousResult BeginInvoke(
InputType input,
GenericWorkDelegate<ReturnType, InputType> workCallback,
AsyncCallback callback,
AsyncStateType state)
{
GenericAsynchronousResult result = new GenericAsynchronousResult(input, workCallback, callback, state);
if ( !result.complete ) {
ThreadPool.QueueUserWorkItem(new WaitCallback(ThreadPool_WaitCallback), result);
}
return result;
}
private static void ThreadPool_WaitCallback(object state) {
GenericAsynchronousResult result = state as GenericAsynchronousResult;
if ( result != null && !result.cancelled ) {
result.currentThread = Thread.CurrentThread;
try {
result.output = result.workCallback(result.input);
} catch (ThreadAbortException) {
Thread.ResetAbort();
} catch (Exception exc) {
result.innerException = exc;
} finally {
result.currentThread = null;
result.Complete();
}
}
}
public static ReturnType EndInvoke(IAsyncResult ar) {
GenericAsynchronousResult result = ar as GenericAsynchronousResult;
if ( result != null ) {
result.AsyncWaitHandle.WaitOne();
if ( result.innerException != null ) {
throw result.innerException;
}
return result.output;
} else {
throw new Exception("EndInvoke was called with the wrong async result");
}
}
}
I still don't find this as useful as the regex only version. I think adapting the code to the particular asynchronous operation you are working on is most sufficient. In the case of regular expressions, you could even embed the ability to abort the problem within the lexical scanner, thus removing the necessity of thread aborts and exceptions. In fact for something like an interpreter aborting as an embedded instruction is the best way to go about things. For now, we don't have the ability to modify the construction of regular expressions under .NET and we can't embed such abort instructions, so we'll have to live with little hacks.
There are probably numerous ways to shoot holes in this class, but this is a 30 minute attempt at providing an asynchronous framework for running regular expressions. The crux of the model is to allow for cancellation of long-running expressions that might otherwise have a negative impact on your application or web server.
You need to start with an IAsyncResult implementation that supports the general asynchronous infrastructure. We'll also have a bunch of extra information for the regular expression work, namely an expression and input string. What I've come up with so far supports all of the features I need including cancellation, full reporting of passed in procedures on the async object for later processing of the work item, and fairly decent protection against the internal members controlling the asynchronous operation.
public class AsynchronousRegexResult : IAsyncResult {
internal object asyncState = null;
internal Regex expression = null;
internal Match match = null;
internal string inputString = null;
internal bool complete = false;
internal bool completedSynchronously = false;
internal bool cancelled = false;
internal ManualResetEvent waitHandle = null;
internal AsyncCallback callback = null;
internal Thread currentThread = null;
internal AsynchronousRegexResult(Regex innerExpression, string inputString, AsyncCallback callback, object state) {
this.asyncState = state;
this.expression = innerExpression;
this.inputString = inputString;
this.callback = callback;
if ( this.expression == null || this.inputString == null ) {
this.complete = true;
this.completedSynchronously = true;
}
this.waitHandle = new ManualResetEvent(this.completedSynchronously);
}
public object AsyncState { get { return this.asyncState; } }
public bool CompletedSynchronously { get { return this.completedSynchronously; } }
public bool IsCompleted { get { return this.complete; } }
public WaitHandle AsyncWaitHandle { get { return this.waitHandle; } }
public Regex Expression { get { return this.expression; } }
public string Input { get { return this.inputString; } }
public Match Result { get { return this.match; } }
public bool Cancel() {
if ( this.cancelled || this.complete ) {
return false;
}
this.cancelled = true;
if ( this.currentThread != null ) {
this.currentThread.Abort();
}
return true;
}
internal void Complete() {
this.complete = true;
this.waitHandle.Set();
if ( callback != null ) {
callback(this);
}
}
}
Three additional methods are needed to make this work. First, we need a BeginInvoke to launch our operation into the thread pool and an EndInvoke to retrieve the results. We don't really need the EndInvoke, but it builds in semantics for waiting on results if they aren't already available. Finally, the ThreadPool_WaitCallback will be the protected method that enables the true cancellation semantics and the ability to run our expressions.
using System;
using System.Threading;
using System.Text.RegularExpressions;
public sealed class AsynchronousRegex {
// Removed result class for brevity, but it is a public nested class.
public static IAsyncResult BeginInvoke(Regex innerExpression, string inputString, AsyncCallback callback, object state) {
#if DEBUG
Console.WriteLine("AsynchronousRegex.BeginInvoke::Enter");
#endif
AsynchronousRegexResult result = new AsynchronousRegexResult(innerExpression, inputString, callback, state);
if ( !result.complete ) {
#if DEBUG
Console.WriteLine("AsynchronousRegex.BeginInvoke::ThreadPool Queue");
#endif
ThreadPool.QueueUserWorkItem(new WaitCallback(ThreadPool_WaitCallback), result);
}
#if DEBUG
Console.WriteLine("AsynchronousRegex.BeginInvoke::Leave");
#endif
return result;
}
private static void ThreadPool_WaitCallback(object state) {
#if DEBUG
Console.WriteLine("AsynchronousRegex.ThreadPool_WaitCallback::Enter");
#endif
AsynchronousRegexResult result = state as AsynchronousRegexResult;
if ( result != null && !result.cancelled ) {
#if DEBUG
Console.WriteLine("AsynchronousRegex.ThreadPool_WaitCallback::Setting Current Thread");
#endif
result.currentThread = Thread.CurrentThread;
#if DEBUG
Console.WriteLine("AsynchronousRegex.ThreadPool_WaitCallback::Gathering Match");
#endif
try {
result.match = result.expression.Match(result.inputString);
} catch (ThreadAbortException) {
#if DEBUG
Console.WriteLine("AsynchronousRegex.ThreadPool_WaitCallback::Thread Aborted, Resetting");
#endif
Thread.ResetAbort();
} finally {
result.currentThread = null;
}
#if DEBUG
Console.WriteLine("AsynchronousRegex.ThreadPool_WaitCallback::Completing Asynchronous Result");
#endif
result.Complete();
}
#if DEBUG
Console.WriteLine("AsynchronousRegex.ThreadPool_WaitCallback::Leave");
#endif
}
public static Match EndInvoke(IAsyncResult ar) {
AsynchronousRegexResult result = ar as AsynchronousRegexResult;
if ( result != null ) {
result.AsyncWaitHandle.WaitOne();
return result.match;
} else {
throw new Exception("EndInvoke was called with the wrong async result");
}
}
}
As mentioned this class is not 100% thread-safe, but will work for the most part. I ran it for a couple of hours and didn't run into any troubles. Not a multi-threaded system though so concurrent access issues wouldn't surface easily (context switching might, but in most cases no).
I think that you need the correct site for each type of content in many cases. I've posted a few articles now on language processing and regular expressions, but my category list for weblogs is getting quite long. I'm definitely in search of a set of places to host much of my content separately. I think WebLogs is still a great place for most of my Windows Forms, Terrarium, CLR stuff, but I want a site for my Whidbey postings, and now RegexAdvice.com is my place for regular expressions. Enjoy!
My new Regular Expressions Host: http://blogs.regexadvice.com/
My first posting on decomposition of numeric value ranges in regular expressions: http://blogs.regexadvice.com/justin_rogers/archive/2004/05/21/1139.aspx
Oh, I'll still cross post I guess, but I'll try to keep it to a minimum and keep it short using linkies. So follow the linkies friends!
HTML always seems a popular format for a processor/compiler of some type or another. HTML really is only useful in either it's abstract form, as the HTML, or in a more well-formed syntax. XML is always nice because you can easily process it with the DOM and get any information you'd want. A while back, Chris Lovett wrote an SGML parser that is really good. For a more complete sample I recommend going there. He parses using various HTML DTD definitions, and so you can enforce any HTML rules you'd like. He has both a loose and a strict version of the DTD so you can parse well formed for very poorly formed HTML.
My sample revolves around the concepts of recursive descent parsing. Darren was telling me that determining the code-path to take when you approach a LeftAngleBracket wasn't at all easy. I've taken the steps to demonstrate this process using my framework, as well as using the SymbolTable in order to define custom tag entries in TokenType. These extra entries allow you to easily provide conditional processing for different tag types. For instance the following HTML is valid and parsed in a specific way. Each <li> self terminates when it reaches the next. This is different than if we had used say <span> tags which would have recursively nested.
<ul><li>foo<li>bar<li>baz</ul> <!-- Non Nesting -->
<td><span>foo <span>bar <span> baz</td> <!-- Nesting -->
To make the HTML pretty, well-formed, and do tag completion, you have to use various rules based on the element type which is derived from the name of the element. IE does a lot of this on the back-end and I can imagine the parsing code to make documents render correctly is quite complex and rightfully so. If you need something a bit simpler then a basic recursive descent parser might be just what you need. Code-Only: A super basic HTML style parser using BasicLex and a SymbolTable
Darren was curious about the tokenizer and about how to add keywords to the BasicParse language. He had some thoughts on the idea, so I'll help him out and post a somewhat full example. His truncated comments:
I was going to ask you how you would add keywords into your BasicParse language. Up until now you have only parsed single char tokens and strings.
This is both true and false. My lexer is very basic in that it splits on breaking tokens. To add keywords, you might think of working at the lexer level. However, since I've combined portions of the lexer and parser, we can update the TokenStream class to assign semantic meaning to identifiers that should be keywords. We'll start by adding that symbol table. This is a simple class, backed by a Hashtable where we map strings to the TokenType enumeration. We could put some additional meta-data on the symbol table (which we'll do later), but for now it is a basic string->TokenType mapping. We add strings by calling AddString and the SymbolTable automatically checks to see if it has already interned the string, and if so, returns the cached TokenType, else it sets a default TokenType. We add keywords when we first create the SymbolTable and before our TokenStream class begins to use it by calling AddKeyword and specifying both the string and the TokenType to be used.
public class SymbolTable {
private Hashtable symbolTable = new Hashtable();
private TokenType identifierType;
public SymbolTable(TokenType identifierType) {
this.identifierType = identifierType;
}
public void AddKeyword(string keyword, TokenType type) {
this.symbolTable[keyword] = type;
}
public TokenType AddString(string identifier) {
if ( !this.symbolTable.ContainsKey(identifier) ) {
this.symbolTable[identifier] = this.identifierType;
}
return (TokenType) this.symbolTable[identifier];
}
public void ClearIdentifiers() {
Hashtable newSymbolTable = new Hashtable();
foreach(string key in symbolTable.Keys) {
if ( ((TokenType) this.symbolTable[key]) != this.identifierType ) {
newSymbolTable[key] = this.symbolTable[key];
}
}
this.symbolTable = newSymbolTable;
}
}
This is going to be a patch of the state graphing sample, since it allows us to easily add this feature. I don't want to break legacy apps, so our new code-path will allow for key/value pairs to be preceded by var, but they don't have to be. We'll start with our constructor. We now create a symbol table, add the single keyword “var” as TokenType.Var, and then create our TokenStream.
public STPWithSymbolTable(Token[] tokens) {
this.symbols = new SymbolTable(TokenType.Id);
this.symbols.AddKeyword("var", TokenType.Var);
this.tokenStream = new TokenStream((Token[]) tokens.Clone(), this.symbols);
}
The TokenStream will store the symbol table, and use it later. The major change to TokenStream is a change to the default: clause in ConvertTokenToType. Instead of returning TokenType.Id all the time, we'll return the results of AddString. Remember, keywords that have already been added to the symbol table will return their stored TokenType, while new strings will get the default TokenType.Id as before.
default:
tokenType = syms.AddString(token.TokenData);
break;
This change is fairly superfluous to the rest of the parser, because we are only adding an optional keyword that we can throw out because it has no impact on the final compilation. In our Open state we now look for the Var token, and enter a ParseVar state if we find it. Again, I'm logically breaking the parser down into extremely small pieces, and since Var is such a superfluous element, you might find other faster ways than setting state and continuing the loop. The only sticky item is that we do need a new state (can't just jump back into the Open state, but that would be sweet), since a var standing by itself should probably be invalid. Jumping back into the Open state would allow us to process endless streams of var keywords without error.
// Open state changes
} else if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
AllocateNode();
return ParseState.ParseMember;
} else if ( this.tokenStream.Current.TokenType == TokenType.Var ) {
return ParseState.ParseVar;
}
// Allocate node method, since two areas now create a node and we don't want to repeat the code.
private void AllocateNode() {
this.currentNode = this.currentDocument.CreateElement(XmlConvert.EncodeName(this.tokenStream.Current.Token.TokenData));
((XmlNode) this.contextStack.Peek()).AppendChild(this.currentNode);
}
// Our ParseVar Transition method
public ParseState TransitionParseVar() {
this.tokenStream.ScanPast(TokenType.Whitespace);
if ( this.tokenStream.Current.TokenType == TokenType.Id ) {
AllocateNode();
return ParseState.ParseMember;
}
return ParseState.Error;
}
// And finally adding the ParseVar state to our state switch in the main loop.
case ParseState.ParseVar:
this.parseState = TransitionParseVar();
break;
To finalize things a bit, keywords aren't part of the breaking behavior of the lexer. Instead, a keyword is nothing more than an identifier with some additional amount of state surrounding it. If we take away the state of “var” so it becomes an Id, we'll see that it behaves as an Id would. If we add state for some other identifier, then we can make it a keyword instead. I mentioned in my previous posting that changing the symbol tables provided to the lexer can drastically change the underlying language. I alluded that there might be some interesting patching behvior you could add to a language using this.
#patch u = using; f = for; pu = public; pr = private; cl = class;
pu cl STPWithSymbolTable {
...
u(StreamReader sr as new StreamReader()) { f(int i = 0; i < 10; i++) { sr.ReadLine(); } sr.Close(); }
The above is not meant for macro expansion, not in the slightest. Instead it tells the lexer that when the identifier u is found, it is given the semantic meaning of using. This is different from the name replacement or aliasing available in C#. That form of aliasing actually replaces the new name with the old name during the symbol resolve step. A neat feature of remapping might be to more explicitly define mathematical behavior.
#patch add = +; minus = -; mult = *; div = /;
int result = (10 add 5 minus 3) div 10 mult 4;
Macro expansion is immensely powerful and somewhat dwarfs the use of patching for these tasks within a mainstream language. This is clearly in the scope of more specialized or higher level languages where it is common for people to want to work with the compiler in their own language.
Now that I'm on my compiler kick, I'll try to add some features to the compiler I've already implemented to make it more mainstream and useful as a more complete sample. To start, I want to introduce the concept of the symbol table and it's importance to the compiler, and more importantly, how it changes the meaning of various tokens, and in some cases, turns them into something they are not.
The most important item to note is the get/set keywords you use in a property statement. In fact they aren't keywords at all or at least not reserved words. They only have meaning when found inside of a property definition block, and no place else. This means you can use them as the names for local variables, field names, even the name of a property. The following is a valid C# program that makes use of the token 'get' multiple times in multiple contexts.
using System;
public class UseGet {
public int get {
get {
int get = 6;
return get;
}
}
private static void Main(string[] args) {
int get = 10;
Console.WriteLine(get);
Console.WriteLine((new UseGet()).get);
}
}
We'll count the 'get' tokens starting from the top. First, we have a property named get, then a get accessor. In the second case, get becomes a keyword while in the property definition block (or does it? They could be doing token substitution in the symbol table and renaming get to get_get, but we'll get to that in a minute). Inside of the get accessor, we have a function local named get, and finally get as the operand to the return statement, counting 4 so far. We have an additional function local named get in the Main method, followed by get as a parameter, and finally an access of the get property on our class, counting a total of 7. In each case, the usage of 'get' has a different meaning or changes the meaning of a previously defined 'get'.
Time to introduce the symbol table. Whenever our tokenizer processes a new token, some heuristics are performed to find out of what type the token is. These heuristics determine whether a token might be a number, a string, a keyword, or an identifier. An identifier is a name that we use to identify a variable, type, etc..., and probably accounts for the majority of the tokens that get processed. Each token that gets pulled, once it is classified as some form of identifier, is looked up in a symbol table. Those that don't exist, get added. Scope is extremely important in this process, as we'll see as we build our symbol tables.
The first time we find get, we are in the body of the class or type. However, the lexer doesn't get to assign semantic meaning, and we don't know that we are in the class body. Or does it? We can always swap out the current symbol table depending on our context. At this time, the symbol table for the class members is the active symbol table, and so looking for get, we are looking for a member named get. One isn't found, so it gets added. The second get, I'm not sure how they resolve. We could probably look at the csharp compiler available through rotor, and I'll probably do that since they might use an interesting method. I can say there is a symbol table method, using a symbol table patch, where we add the get/set identifiers as keywords so they temporarily resolve when in the property declaration block.
The next get we encounter isn't looked up in the member symbol table, instead it gets looked up in the function local symbol table, again, our parser is probably closely linked with the lexer so we are in the appropriate symbol table. The third get is probably complicated, since it could either reference the local get or the member get. Since a local get does exist, then that is used, but in the case it didn't, we would reference our member get (and create a recursive property that would eventually fail). How do things like that get handled? Well, we never actually use two symbol tables (at least I don't think so), instead we use modes on the symbol table to change how symbols are searched for and inserted. This allows us to easily search multiple levels of symbols and control which levels we search through simple flags. Another method is the symbol table patch which allows us to temporarily add a symbol table on top of a higher level symbol table while within a particular scope. The nice thing about the patch is that we can throw it away if we are emitting code incrementally. After all, you don't need local information inside of a function that has already been emitted.
Three get identifiers left. The first get is again added to the function local symbol table. Note that in this case we have to distinguish between referencing the member symbol table versus the local symbol table. After all, there is a property with the same name. Again, contextual information from the parser is important in this case. The next two are pretty simple. Since they are representing variable access, we'll look for the first get in the local and then the member symbol table looking for elements with certain properties (the member symbol table access would only resolve static fields and properties, since we are inside a static method). The final get statement is resolved for us explicitly because it is preceded by a type reference (context from the parser).
To go through some quick notes and ask/answer some questions you might note that you can use a symbol that isn't yet defined. This is called an unresolved symbol. Depending on your language there are different times that you can resolve symbols. You can either explicitly enter symbols into the table as being defined versus being accessed (contextually when an ID is found you know the difference) then search for symbols that have been used, but not defined later by a quick search of the symbol table. Another method would be to add our symbols, then worry about resolving them in a later step. Some portions of this method would be dependent on how you parse your tree.
Symbol tables are actually pretty important to compiler development. I tried to make this a simple examination of how they are used to provide specific resolution features, but it turns out there are many ways other than symbol tables to perform the same operations. The feature that I want you to walk away with is the possibilities in symbol table patching for changing grammars. Grammars are based on abstract things like the enumeration value TokenType.ForLoop defines a for loop node in the tree, while source input is based on things like “for” maps to the lexan TokenType.ForLoop. If the mapping is slightly changed then you can substitute any word or words and change the source language. The C# using statement can do this through aliasing, but they stop short of the full possibilities. They don't allow you to change keywords or make typedefs, turn keywords off, lots of stuff is possble.
Time to go add a reason for symbol tables into the BasicParse language.
Though most of the code was in the article, you couldn't easily build a complete sample from for the state/transition compiler. I've uploaded a project that contains a build process for the lexer, both parsers that have been produced up to now, and a test harness with sample configuration file for processing.
Lexer, 2 Parsers, and a Test Program
More Posts
Next page »