Why not an Expression Query Language?

Regular Expressions are extremely powerful and ugly as all hell.[1] Even with comments and a good RegEx IDE like the Regulator, they're total gibberish. Why not a RegEx 2006 with a more readable syntax?

For instance, take a look at this code snippet from Eric Gunnerson's recent RegEx 101 article[2]:

^         # beginning of string
\d{3}     # three digits
-         # literal '-'
\d{2}     # two digits
-         # literal '-'
\d{4}     # four digits
$         # end of string

The comments are helpful, but why couldn't those comments be the regular expression? They describe exactly the pattern we're matching, so there's no real reason the parser couldn't compile those comments, or at least convert them to the regex behind the scenes.
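Some engines already get partway there by letting the comments travel with the pattern. As a minimal sketch, assuming Python's `re` module for illustration (the article's snippet targets .NET, which offers a comparable `RegexOptions.IgnorePatternWhitespace` option), the commented block above can be compiled unchanged. The names `PATTERN` and `pattern` are made up for this example:

```python
import re

# The commented pattern from the article, used as-is as the pattern source.
PATTERN = r"""
^         # beginning of string
\d{3}     # three digits
-         # literal '-'
\d{2}     # two digits
-         # literal '-'
\d{4}     # four digits
$         # end of string
"""

# re.VERBOSE makes the engine skip unescaped whitespace and '#' comments,
# so only the symbols on the left of each line are actually compiled.
pattern = re.compile(PATTERN, re.VERBOSE)

print(bool(pattern.match("123-45-6789")))  # digits grouped 3-2-4
print(bool(pattern.match("123-456-789")))  # digits grouped 3-3-3
```

The comments still aren't the expression, though: the engine throws them away and compiles only the symbols, so a comment that drifts out of date is silently ignored.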

The Regulator has a cool Regex Analyzer feature that does something similar; here's what it does with "^\d{3}-\d{2}-\d{4}$":

^ (anchor to start of string)
Any digit 
Exactly 3 times
-
Any digit 
Exactly 2 times
-
Any digit 
Exactly 4 times
$ (anchor to end of string)

This, again, shows exactly what we want to match, but in a more human-readable form. There's no reason this couldn't be the expression itself. Now, of course, it's easier to include a one-line regex inline with your code, but I don't think that's worth the tradeoff. A more verbose Expression Query Language could be included inline, and would be much more readable. If needed, it could be a separate file - we've got piles of xml, xsd, config, resx, etc. files now, and a regex file or two that was actually readable would be much simpler than including cryptic strings in our code. Why don't we treat these things like small stored procedures?
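To show how little machinery that would take, here is a toy sketch in Python that treats the analyzer-style text above as the source and translates it into an ordinary regex. Everything in it is hypothetical: the phrase table, the `compile_readable` name, and the rule that any unrecognized line is literal text are assumptions made for this example, not how the Regulator works.

```python
import re

# Hypothetical phrase table: each readable line maps to a regex fragment.
PHRASES = {
    "^ (anchor to start of string)": "^",
    "$ (anchor to end of string)": "$",
    "Any digit": r"\d",
}

# Recognizes quantifier lines such as "Exactly 3 times".
_EXACTLY = re.compile(r"Exactly (\d+) times")


def compile_readable(text):
    """Translate the one-phrase-per-line form into a terse regex string."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        count = _EXACTLY.fullmatch(line)
        if line in PHRASES:
            parts.append(PHRASES[line])
        elif count:
            # The quantifier applies to whatever fragment came just before it.
            parts.append("{" + count.group(1) + "}")
        else:
            # Assumption: anything unrecognized is literal text.
            parts.append(re.escape(line))
    return "".join(parts)


READABLE = """
^ (anchor to start of string)
Any digit
Exactly 3 times
-
Any digit
Exactly 2 times
-
Any digit
Exactly 4 times
$ (anchor to end of string)
"""

terse = compile_readable(READABLE)
print(terse)
print(bool(re.fullmatch(terse, "123-45-6789")))
```

Literal lines go through `re.escape`, so the hyphens come out escaped in the generated pattern; that is equivalent to the hand-written version. A real language would need a way to write a literal that happens to look like a phrase, which this toy ignores.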

I found a thread on the Python newsgroups discussing an improved RegEx syntax. One interesting idea is RegEx Builder (RXB) - it lets you build RegEx's using verbose language:
digit + some(whitespace) + exactly('example'), which would generate \d\s+example.
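The builder idea is easy to sketch with operator overloading. The following Python is a hypothetical reimplementation written for this post, not RXB's actual code or API; it borrows only the names `digit`, `whitespace`, `some`, and `exactly` from the expression above, and the `Rx` class and its `atomic` flag are invented here.

```python
import re


class Rx:
    """A regex fragment that composes with '+'. Hypothetical, not RXB's API."""

    def __init__(self, source, atomic=False):
        self.source = source
        # True when a quantifier can be appended without grouping first.
        self.atomic = atomic

    def __add__(self, other):
        # Concatenating two fragments yields a longer, non-atomic fragment.
        return Rx(self.source + other.source)


digit = Rx(r"\d", atomic=True)
whitespace = Rx(r"\s", atomic=True)


def some(fragment):
    """One or more repetitions of the fragment."""
    if fragment.atomic:
        return Rx(fragment.source + "+")
    # Group multi-part fragments so '+' applies to the whole thing.
    return Rx("(?:" + fragment.source + ")+")


def exactly(text):
    """Match the given text literally, escaping any regex metacharacters."""
    return Rx(re.escape(text), atomic=len(text) == 1)


expr = digit + some(whitespace) + exactly('example')
print(expr.source)
pattern = re.compile(expr.source)
print(bool(pattern.search("7   example")))
```

Because the result is just a pattern string, this style can sit on top of an existing engine: the verbose form is what you read and maintain, and the terse form is what gets compiled.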

Wrappers, utility classes, and copious comments are a step in the right direction, but magic strings like "\w?<\s?\/?[^\s>]+(\s+[^"'=]+(=("[^"]*")|('[^\']*')|([^\s"'>]*))?)*\s*\/?>" shouldn't be anywhere near professional development languages circa 2005, especially when compilers are capable of doing things like LINQ. We need an Expression Query Language. How about Language Integrated Expressions (LINE)?

[1] Yes, Jeff, that's an intentional GoogleBomb.
[2] That's a simple RegEx for the point of illustration. Read Jeff's post on RegEx Abuse if you don't see the problem. I've written my share of complex regex's and I bet you have, too, if you've read this far. Sure, we can write code in assembly language, but it's not productive or maintainable.

6 Comments

  • > The devs simply hand off a request and get a query back. We could start having the "regex guys". You hand them a request and they hand you back an assembly with a precompiled regex. ;)

    In my experience, this is a recipe for development pain. Instead of fixing problems with the app, it becomes "another guy's problem" because "we have no idea how to do that."

    Bad, bad, BAD idea on so many levels.

    Also, if you "need" to write massively complicated SQL or regex, you're probably doing something wrong.

  • > and in this case readable code requires a better language.

    And that language is VB.NET.

    While I'm no fan of ultra complicated regex, I think once you learn the basic syntax, it's no harder to understand than C#'s crazy-ass | and & and { and } and ;

    Given a choice, I'll take Or, And, Begin If, End If, BitConverter.* and ENTER over all those any day ;)

  • hey jon, you know what? for that human readable form to work you'll need a regular expression to parse your regular expression :)

  • cosmin -

    Sure, I thought about that. Parsers run on BNF which is pretty much a bunch of regex statements.

    But that doesn't really prove anything. C++ compilers are written in C++, right? Dogfood, yum!

    Plus, there could be two flavors - verbose and terse, or verbose could compile to terse if needed.

  • > it's hard to figure out what and why it's set up that way.

    That's because it's an unnecessarily complex regex. I see these kind of regexes all the time, and unless I know the author's *intent*, I can't reformulate it to something simpler. Intent is far more difficult to figure out, particularly if you're really proposing regex comments like "now match the number 0-9 followed by a period". That's nice, but it doesn't tell me a damn thing about WHY you're matching that.

    > since you can read "SELECT blah FROM blurg WHERE zathura" a lot easier than "\s{1,3}[\d]*<?:([a-z])>..."

    And VB.NET code is a lot more readable than C# code for the exact same reason. It's more verbose!

    Does that mean it's better? Not really, but I guess it depends on your perspective. Once you learn that "\s" means "a whitespace character", is that really any different than learning "}" means "End If"?

  • Jeff -

    Point 1 (logic vs. intent):

    True, although it's easier to guess at intent with a more transparent language. I think I'd have an easier time scanning "match any number (one or more times) followed by a period" than the equivalent regex fragment when included in a lengthy regex.

    Point 2 (c# vs. vb)

    Okay, I see where you're going with that, but I think the scale's quite a bit different. C# is maybe 20% less verbose / less readable than VB.NET. RegEx is probably 500%+ less verbose / less readable than SQL, which I consider to be a similar type of language.

    A better comparison would be C# and IL.
