The Dark Art of Regular Expressions

I've been avoiding regular expressions for a long time. Between the cryptic nature of the expressions and the differences in implementations among languages, I've always managed to limp along by reading a short tutorial here or there to simply "get the job done". That's not an option anymore.

My latest assignment at work requires us to work through a significant number of text files for search and replace as well as large-scale text transformation. Therefore, I am diving headfirst into Mastering Regular Expressions in an effort to truly grasp the art and science of regular expressions.

After spending a few days with the book, there are a few items that I think are worth mentioning.

Differences in Flavors Are a Big Deal

As stated before, one of the tedious aspects of regular expressions is that the rules change a bit depending on the context in which you run the expression. For instance, the super-amazing website regex101 gives you the option of executing any expression you write in PHP, JavaScript or Python.

Note: The regex101 website does much more than just give you a chance to test out your regular expressions. My favorite aspect of this site is the natural language feedback it gives based on expressions you provide. If you're doing anything with regular expressions this tool is invaluable.

One of the first differences I encountered today was between the word boundary metacharacters in JavaScript and those in egrep (which Friedl uses extensively in the book's examples).

Consider an expression that is meant to match a word. Just to keep things simple, I'll create an expression that only matches the word "a". The book's example (which uses egrep) uses \< and \> as the start-of-word and end-of-word metacharacters. For instance, in egrep the expression is written as \<a\>, but in JavaScript the same expression is written as \ba\b, where \b is the word boundary metacharacter used at both ends.
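To make the difference concrete, here is a minimal sketch in JavaScript. The sample sentence is made up for illustration and is not from the book. It shows \b restricting the match to the standalone word, and it shows that JavaScript (without the u flag) treats \< and \> as literal angle brackets rather than word boundaries.

```javascript
// Made-up sample text (an assumption for illustration, not from the book).
const text = "a cat and a dog";

// \ba\b matches "a" only when it stands alone as a word.
const standalone = text.match(/\ba\b/g);
console.log(standalone); // [ 'a', 'a' ]

// Without boundaries, every "a" matches, including those inside "cat" and "and".
const anywhere = text.match(/a/g);
console.log(anywhere.length); // 4

// The egrep-style pattern is accepted by JavaScript, but \< and \> are
// treated as the literal characters < and >, so it does not find the word.
const egrepStyle = /\<a\>/;
console.log(egrepStyle.test(text));  // false
console.log(egrepStyle.test("<a>")); // true
```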

Word boundary metacharacters are just one place where the expression syntax can vary, so it is very important that you know a bit about your environment and language as you work with expressions in your applications.

Build Robust Test Strings

Perhaps just as important as writing the right expression is coming up with a test string that represents what you are likely to encounter in the real world. The book demonstrates how to build an expression to find repeated words in a block of text. The expression looks like this:

// JavaScript expression
/\b([a-zA-Z]+) +?\1\b/gm

...which will match repeated words in this test string:

This is a test of the the emergency broadcast system system.
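As a quick check, the expression can be run against that test string in JavaScript. This is only a sketch, and the variable names are my own.

```javascript
// The repeated-word expression from above.
const repeatedWord = /\b([a-zA-Z]+) +?\1\b/gm;

// The single-line test string.
const sample = "This is a test of the the emergency broadcast system system.";

// With the g flag, match() returns every matched substring.
const matches = sample.match(repeatedWord);
console.log(matches); // [ 'the the', 'system system' ]
```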

While this works great, there is one flaw with this expression. Should the repeated word wrap to a new line, the expression no longer matches the repeated word. For instance, if you tested with this string:

This is a test of the 
the emergency broadcast system 
system.

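The previously noted expression would not find repeated words that wrap onto the next line. The fix for this expression is relatively easy. All you need to do is introduce an optional line break into the expression like this (note that it still expects at least one space before the line break, as in the test string above):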

// JavaScript expression
/\b([a-zA-Z]+) +(\n)?\1\b/gm
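The difference between the two expressions can be sketched by running both against the wrapped test string. I am assuming here that each wrapped line ends with a trailing space before the line break, as in the sample above. The variable names are my own.

```javascript
// Wrapped test string; each line ends with a space before the line break
// (an assumption based on the sample text above).
const wrapped = "This is a test of the \nthe emergency broadcast system \nsystem.";

// The first expression: spaces only between the two words.
const spacesOnly = /\b([a-zA-Z]+) +?\1\b/gm;

// The revised expression: spaces followed by an optional line break.
const withLineBreak = /\b([a-zA-Z]+) +(\n)?\1\b/gm;

// match() returns null when a global expression finds nothing.
console.log(wrapped.match(spacesOnly)); // null

// The revised expression finds both wrapped repeats.
const found = wrapped.match(withLineBreak);
console.log(found); // [ 'the \nthe', 'system \nsystem' ]
```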

The point is that while writing regular expressions can be difficult enough, you really have to think through what kind of test strings you need in order to ensure you end up with the right expression for your application.
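One way to act on this is to keep a small table of test strings next to the expression and check each one. The sketch below is hypothetical: the cases and the expected counts are my own, not from the book. It deliberately includes a wrapped line with no trailing space, which the revised expression above misses.

```javascript
// The revised expression from above.
const repeatedWord = /\b([a-zA-Z]+) +(\n)?\1\b/gm;

// Hypothetical test strings, each paired with the number of repeats
// a reader would expect to be reported.
const cases = [
  { name: "same line", input: "of the the emergency", expected: 1 },
  { name: "wrapped, trailing space", input: "of the \nthe emergency", expected: 1 },
  { name: "wrapped, no trailing space", input: "of the\nthe emergency", expected: 1 },
  { name: "no repeat", input: "of the emergency", expected: 0 },
];

// Count the matches for each case and compare against the expectation.
const results = cases.map(({ name, input, expected }) => {
  const found = (input.match(repeatedWord) || []).length;
  return { name, expected, found, ok: found === expected };
});

for (const r of results) {
  const status = r.ok ? "ok  " : "MISS";
  console.log(`${status} ${r.name}: expected ${r.expected}, found ${r.found}`);
}
```

In this sketch only the "wrapped, no trailing space" case is reported as a miss. One possible remedy (an assumption here, not something taken from the book) is to replace the space-and-line-break portion with \s+, which accepts any run of whitespace, giving /\b([a-zA-Z]+)\s+\1\b/gm.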

What About You?

What sort of tips, tricks or gotchas have you encountered while writing (and maintaining) regular expressions?
