A 39 line generic lexer. Lexing is always the easy part, but this guy is pretty sweet for quick and dirty token parsing.

[Edit] There is some redundant code in this revision. I realized I had posted the wrong version of the lexer as I was driving down the road on my way to a dinner engagement. This one does work, but I'll post a more compact follow-up lexer as soon as I get home.

You're probably wondering what the purpose is behind this little guy. A member of one of the MS newsgroups was curious how hard it would be to process a configuration file in a non-XML format. The format is actually somewhat popular: it uses French brace (curly brace) nesting with key=value; statements. That gives us plenty of well-formedness to work with, since we have statement terminators for key=value pairs and nesting for complex data types (I wouldn't really call them complex data types; they're more like named configuration sections).
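
To make the shape concrete, here is a made-up sample of such a file. The section and key names are purely illustrative, and the exact rules (for example, whether values may be quoted) are assumptions rather than anything the newsgroup question spelled out.

```text
server {
    host=localhost;
    port=8080;
    logging {
        level=verbose;
    }
}
```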

Below is the lexer we'll be using. It's really simple and doesn't do much. It splits on single-character token delimiters (the breakers), and it lets us toss certain breakers out instead of emitting them as tokens. For instance, the one-argument overload I've created tosses out spaces, tabs, carriage returns, and linefeeds, while braces, quotes, =, ;, and the other punctuation breakers come back as tokens of their own. When I actually create the parser I'll be using a slightly different lexer, since I need to keep whitespace, but we'll get to that later. For now, enjoy!

using System.Collections;  // ArrayList

public class Token {
    public string TokenData;
    public Token(string tokenData) { TokenData = tokenData; }
}

public class BasicLex {
    public static Token[] StringToTokens(string tokenString) {
        return StringToTokens(tokenString, " \n\r\t{}\"=;.()[],", " \t\r\n");
    }
   
    public static Token[] StringToTokens(string tokenString, string breakers, string toss) {
        ArrayList tokens = new ArrayList();
   
        int tokenStart = 0, tokenPointer = 0;
       
        TOKENLOOP: while(tokenPointer < tokenString.Length) {
            for(int i = 0; i < breakers.Length; i++) {   // i is never used: IndexOf below already checks every breaker
                if ( breakers.IndexOf(tokenString[tokenPointer]) > -1 ) {
                    if ( tokenStart != tokenPointer ) {   // flush the text collected since the last breaker
                        tokens.Add(new Token(tokenString.Substring(tokenStart, tokenPointer - tokenStart)));
                    }
                    if ( toss.IndexOf(tokenString[tokenPointer]) == -1 ) {   // keep the breaker as a token unless it is tossed
                        tokens.Add(new Token(tokenString.Substring(tokenPointer, 1)));
                    }
                   
                    tokenStart = ++tokenPointer;   // the next token starts right after the breaker
                    goto TOKENLOOP;
                }
            }
           
            tokenPointer++;   // ordinary character: keep scanning
        }
        if ( tokenStart != tokenPointer ) {   // trailing text after the last breaker
            tokens.Add(new Token(tokenString.Substring(tokenStart, tokenPointer - tokenStart)));
        }
       
        return (Token[]) tokens.ToArray(typeof(Token));
    }
}
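
For a quick sanity check of the splitting rules, here is a minimal, self-contained sketch that applies the same breaker/toss logic without the inner for loop and the goto. It is only one possible way to tighten the code. The names CompactLexDemo and Tokenize are hypothetical, it returns plain strings instead of Token objects, it uses List<string> rather than ArrayList, and the sample input is made up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names: CompactLexDemo and Tokenize are illustrative only.
public static class CompactLexDemo {
    // Same splitting rules as BasicLex.StringToTokens, minus the inner for loop and goto.
    public static string[] Tokenize(string input, string breakers, string toss) {
        List<string> tokens = new List<string>();
        int tokenStart = 0;

        for (int pointer = 0; pointer < input.Length; pointer++) {
            char c = input[pointer];
            if (breakers.IndexOf(c) == -1) continue;   // ordinary character: keep scanning

            // Flush the text collected since the previous breaker, if any.
            if (tokenStart != pointer) {
                tokens.Add(input.Substring(tokenStart, pointer - tokenStart));
            }
            // A breaker becomes its own token unless it is on the toss list.
            if (toss.IndexOf(c) == -1) {
                tokens.Add(c.ToString());
            }
            tokenStart = pointer + 1;
        }

        // Trailing text after the last breaker.
        if (tokenStart != input.Length) {
            tokens.Add(input.Substring(tokenStart));
        }
        return tokens.ToArray();
    }

    public static void Main() {
        string config = "server { host=localhost; port=8080; }";   // made-up sample input
        string[] tokens = Tokenize(config, " \n\r\t{}\"=;.()[],", " \t\r\n");
        Console.WriteLine(string.Join("|", tokens));
        // prints: server|{|host|=|localhost|;|port|=|8080|;|}
    }
}
```

Passing an empty toss string keeps every breaker, whitespace included, so Tokenize("a = b;", " =;", "") comes back as six tokens: a, a space, =, a space, b, and ;. Note that '.' is in the default breaker list, so a value like 127.0.0.1 would be split into seven tokens.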

Published Saturday, May 15, 2004 5:43 PM by Justin Rogers
