The lost art of using regular expressions for parsing?

Regular expressions are really powerful and very cool. Most people think of them as just a validation mechanism. They are missing a big scenario enabled by regexes: parsing.

Some other people think that if you're doing any parsing, you **have** to use parser generator tools (e.g. yacc/lex, ANTLR, Coco/R), build a formal grammar of your language, and so on. But do you really **need** to get into that? Do you want proof that you can achieve the same goal with regular expressions? The ASP.NET page parser is built with regular expressions, and not only in v1.x, but in the Whidbey version too.
Wanna confirm? Fire up Reflector, search for the TemplateParser class in the System.Web.UI namespace, and look at the ParseStringInternal method. There you will see how it parses the page source using the BaseParser class, which contains all the regular expressions for the several pieces of a page.

I've built a number of parsers with regexes, from simple expression parsers (e.g. a more flexible and powerful expression format than DataBinder.Eval) to full template file parsing (e.g. templates with ASP-like syntax for codegen, in the spirit of CodeSmith, NVelocity, etc.). And it works very well. And your code using very complex regular expressions doesn't have to be a cryptic, impossible-to-read, never-ending line of near-garbage that only you can understand.
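To make the readability point concrete, here is a minimal sketch of a template tokenizer driven by a single commented regex. It is written in Python rather than .NET so it can be run as-is (.NET offers the same idea through RegexOptions.IgnorePatternWhitespace). The template syntax, the `TOKEN` pattern, and the `tokenize` function are all made up for illustration; this is not how any of the tools mentioned above actually work.

```python
import re

# Hypothetical ASP-like template syntax, assumed for illustration only:
#   <%= expr %>   expression block
#   <% code %>    code block
#   anything else is literal text
TOKEN = re.compile(
    r"""
    <%= \s* (?P<expr> .*? ) \s* %>    # expression block: <%= ... %>
  | <%  \s* (?P<code> .*? ) \s* %>    # code block:       <% ... %>
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(template):
    """Split a template into a list of (kind, text) tuples."""
    tokens = []
    pos = 0
    for match in TOKEN.finditer(template):
        # Everything between the previous block and this one is literal text.
        if match.start() > pos:
            tokens.append(("text", template[pos:match.start()]))
        kind = match.lastgroup  # "expr" or "code", whichever branch matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    # Trailing literal text after the last block, if any.
    if pos < len(template):
        tokens.append(("text", template[pos:]))
    return tokens


if __name__ == "__main__":
    for token in tokenize("Hello <%= name %>!<% for x in xs: %>"):
        print(token)
```

The named groups and inline comments do most of the work here: each alternative in the pattern reads as one line of a small grammar, and the calling code never has to count group indexes.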

Bottom line: learn regular expressions. There are a lot of very real problems that you can solve SO easily with them...


  • I agree only in part. It's true that you can avoid using parser generators in some cases, but you cannot use regular expressions to parse context-free grammars (programming languages).

    Regular expressions are only useful for defining regular grammars. For context-free grammars you need to describe the language in BNF and use pushdown automata as the parsing mechanism, as parser generator tools do.

    The ASP.NET example you give only separates code blocks from HTML ones.

  • Sure thing. A full-blown programming language cannot be parsed with regular expressions. But given the limited number of programming languages (successful ones, let's say), how big do you think the niche is for getting proficient with those tools/techniques?

    There is an immensely bigger number of common problems and small parsing needs that are very cost-effectively solved with regular expressions. For example, you don't need much more than that to parse XML, XPath, XPointer, DataBinder.Eval-like .NET expressions, templates, MSBuild property references, post-build commands, etc. So becoming proficient with regexes is much more important and relevant for solving day-to-day problems than mastering BNF, lex/yacc, or any other full-blown parsing techniques/tools, IMO.

  • As much as I like regular expressions, using them for parsing has some shortcomings, e.g. the inability to easily handle nested expressions. Hence the infamous ASP.NET parsing bugs, such as the one you hit when trying to parse:

    <asp:textbox runat="server" text="<%# Container.DataItem("toto") %>" />
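The nested-quote problem in that last comment is easy to reproduce. The sketch below, in Python, runs two hypothetical attribute patterns over the tag quoted above: a naive one that stops at the first double quote, and a variant that treats a whole `<% ... %>` block as a single unit inside the value. Both patterns (`NAIVE`, `BLOCK_AWARE`) are illustrations written for this example, not the expressions ASP.NET actually uses, and the second one only handles this single level of nesting.

```python
import re

TAG = '<asp:textbox runat="server" text="<%# Container.DataItem("toto") %>" />'

# Naive pattern: an attribute value is anything up to the next double quote.
NAIVE = re.compile(r'(\w+)="([^"]*)"')

# Block-aware pattern: inside a value, a complete <% ... %> block is consumed
# as one unit, so the quotes it contains no longer end the attribute.
BLOCK_AWARE = re.compile(
    r"""
    (\w+) = "            # attribute name, equals sign, opening quote
    (                    # attribute value:
      (?: <%.*?%>        #   a whole <% ... %> block, quotes included
        | [^"]           #   or any single character that is not a quote
      )*
    )
    "                    # closing quote
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    # The naive pattern cuts the text attribute off at the inner quote.
    print(NAIVE.findall(TAG))
    # The block-aware pattern keeps the data-binding expression intact.
    print(BLOCK_AWARE.findall(TAG))
```

This patches one specific case rather than solving nesting in general: arbitrarily deep nesting still needs a stack, which is the point made in the first comment.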
