
Justin's language processor series: Reusable Lexer and TokenEnumerator

Yesterday I mentioned the series of articles which Justin has been writing on language processing.  This series has really opened my eyes to a few techniques that will improve several of my parsing approaches and let me produce high-performance language parsers in less time.

Firstly, Justin introduced a basic lexer which tokenizes a source input into a fairly generic array of tokens.  He then implemented several enumerator classes to read through a token stream and return application-specific, typed tokens.  Moving forward, these two classes will be staple items in all my future language processors; the lexer probably won't change much, and the enumerator would only need a minor alteration to the kinds of typed tokens it returns.
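To make the split concrete, here's a minimal sketch of the two classes (rendered in Java rather than the C# of the series; Token, TokenKind, Lexer and TokenEnumerator are my own placeholder names, not Justin's actual API):

    import java.util.ArrayList;
    import java.util.List;

    // A generic token: just a kind tag plus the matched text.
    enum TokenKind { WORD, NUMBER, SYMBOL, WHITESPACE }

    record Token(TokenKind kind, String text) { }

    class Lexer {
        // Tokenize the entire source input into a generic token list.
        static List<Token> tokenize(String input) {
            List<Token> tokens = new ArrayList<>();
            int pos = 0;
            while (pos < input.length()) {
                int start = pos;
                char c = input.charAt(pos);
                TokenKind kind;
                if (Character.isWhitespace(c)) {
                    while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
                    kind = TokenKind.WHITESPACE;
                } else if (Character.isDigit(c)) {
                    while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
                    kind = TokenKind.NUMBER;
                } else if (Character.isLetter(c)) {
                    while (pos < input.length() && Character.isLetter(input.charAt(pos))) pos++;
                    kind = TokenKind.WORD;
                } else {
                    pos++; // treat any other character as a one-char symbol
                    kind = TokenKind.SYMBOL;
                }
                tokens.add(new Token(kind, input.substring(start, pos)));
            }
            return tokens;
        }
    }

    // Walks the generic token list and hands the parser only the
    // application-specific tokens it cares about.
    class TokenEnumerator {
        private final List<Token> tokens;
        private int index = 0;

        TokenEnumerator(List<Token> tokens) { this.tokens = tokens; }

        // Next non-whitespace token, or null at end of stream.
        Token next() {
            while (index < tokens.size()) {
                Token t = tokens.get(index++);
                if (t.kind() != TokenKind.WHITESPACE) return t;
            }
            return null;
        }
    }

A parser then just calls next() in a loop and never touches raw characters.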

By the time you get to writing a parser, you already have all of the common functions for reading and moving through input stored away in these two classes, letting you focus on implementing the grammar parsing and state transition algorithms without having to bother with the plumbing.  The added bonus is that while the common plumbing functions - such as ReadNextChar(), IsSpecialChar() and IsEndOfStream() - are monotonous to write, they are undoubtedly a potential source of errors; carrying the lexer and enumerator classes forward means that future parsers are built atop tested code, eliminating silly mistakes such as off-by-one errors.
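For illustration, here's roughly what that plumbing layer might look like when wrapped in a reusable class. The helper names come from the article (written camelCase here for Java); the bodies, and the CharReader class itself, are my own guesses at their shape:

    // A sketch of the reusable plumbing, assuming the helpers named above.
    class CharReader {
        private final String input;
        private int pos = 0;

        CharReader(String input) { this.input = input; }

        // The classic off-by-one trap lives here, so it's written
        // (and tested) exactly once.
        boolean isEndOfStream() {
            return pos >= input.length();
        }

        // Consume and return the next character.
        char readNextChar() {
            return input.charAt(pos++);
        }

        // Is this one of the characters the lexer treats as a delimiter?
        // The particular set here is just an example.
        boolean isSpecialChar(char c) {
            return "+-*/(){}[];,".indexOf(c) >= 0;
        }
    }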

My key learnings to this point have been:

  •      Lexing into tokens reduces the amount of parsing logic required
  •      Implementing the enumerator pattern is an elegant way of reading through an input stream
  •      Separating the functions of language processing can improve performance, re-use and maintainability

I'm still working through some of the more recent articles, which deal with state transition graphs, state machine algorithms and recursive-descent (top-down) parsing.  I'll post another summary of the learnings and cool features of that material once I've covered it well enough.

1 Comment

  • For key point 1, I think that lexing is an integral part of any language processing. I've seen numerous parser implementations where the parser made heavy use of character scanning, but these truly aren't scalable. Recognizing this is hard in a little language, because the project's vision, or the ability to quickly and easily see a full code path, is enticing. When your lexer fails it can be a trek to find the bug; when a character-scanning method that sits one step from your parser fails, the failure is more easily seen. As Darren points out, the error is also easier to make.



    Enumerators are extremely important in my opinion. A large number of languages support the semantics of scanning forward and backward before eating tokens, and this is much more easily done in a well-written enumerator. In particular, enumerators working over a token array perform the backward/forward scanning logic much more easily because they allow random access into the array. When you start thinking about a good model for input chunking (partial reads from files), input buffer scanning, access to prior buffers, and other features, the stream concept gets more complex. Thank god for little languages and the fact that we don't have to worry about these issues.
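    To illustrate what that random access buys you, here's a sketch (reusing the Token record from the earlier sketch; the class and method names are mine): peeking and rewinding become index arithmetic rather than stream gymnastics.

        import java.util.List;

        // Enumerator over an in-memory token list; random access makes
        // lookahead and backtracking cheap.
        class ScanningTokenEnumerator {
            private final List<Token> tokens;
            private int index = 0;

            ScanningTokenEnumerator(List<Token> tokens) { this.tokens = tokens; }

            // Look ahead (offset > 0) or behind (offset < 0) without consuming.
            Token peek(int offset) {
                int i = index + offset;
                return (i >= 0 && i < tokens.size()) ? tokens.get(i) : null;
            }

            // Remember the current position before a speculative parse...
            int mark() { return index; }

            // ...and jump straight back to it if that parse fails.
            void reset(int mark) { index = mark; }

            // Consume the current token and advance.
            Token next() {
                return index < tokens.size() ? tokens.get(index++) : null;
            }
        }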



    On point 3 I can ONLY agree. Logical separation of processing stages is important; more important still is to think through how you are going to break up the problem before starting. Any compiler/parser book will talk about the various portions of a compiler (upwards of 11 modules in some designs) and how, often, many of these modules can be combined. The goal is to make the proper trade-off between compactness, readability, speed, scalability, and finally flexibility.

