Making RegEx more readable


Compare the following code statements defining the same regular expression in .NET:

static readonly Regex ParameterReference = new Regex(
    @"(?<empty>\<\>)|\<(?<parameter>[^\<\>]+)\>|(?<open>\<[^\<\>]*(?!\>))",
    RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
static readonly Regex ParameterReference = new Regex(@"
    # Matches invalid empty brackets #
    (?<empty>\<\>)|
    # Matches a valid parameter reference #
    \<(?<parameter>[^\<\>]+)\>|
    # Matches opened brackets that are not properly closed #
    (?<open>\<[^\<\>]*(?!\>))",
    RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

While the first statement is still understandable to a fairly regex-aware developer, the second is far more explicit about the purpose of each part of the expression. Placing comments inside the expression is enabled by RegexOptions.IgnorePatternWhitespace, an option that developers do not use often enough: it ignores unescaped whitespace in the pattern and treats everything from an unescaped # to the end of the line as a comment. For this pretty simple expression that may seem unnecessary, but imagine a regex-based parser that processes (CodeSmith-like) template files:

static Regex CodeExpression = new Regex(@"
    # First match the full directives #
    <\#\s*@\s+(?<directive>\w*)(?<attributes>.*?)\#\/>(?:\W*\n)?|
    # Match open tag #
    (?<open><\#)|
    # Match close tag #
    (?<close>\#\/>)|
    # This is a simple expression that is output as-is to output.Write(<output>); #
    (?:=)(?<output>.*?)(?<badmultiple>;.*?)?(?=\#\/>)|
    # Anything previous or after a code tag #
    (?<code>.*?)(?=<\#|\#\/>)|
    # Finally, match everything else that is written as-is #
    (?<snippet>.*[\r\n]*)",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.Singleline);
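Note that every literal # in the expression above is written as \#. Under RegexOptions.IgnorePatternWhitespace an unescaped # starts a comment and unescaped whitespace is dropped, so both have to be escaped when they are meant literally. Here is a minimal sketch of that behavior; the class and method names are made up for illustration and are not part of the parser:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; all names here are illustrative only.
public static class PatternWhitespaceDemo
{
    const RegexOptions Extended = RegexOptions.IgnorePatternWhitespace;

    // Unescaped space is dropped, so the pattern is effectively "ab".
    public static bool UnescapedSpace(string input)
    {
        return Regex.IsMatch(input, @"a b", Extended);
    }

    // Escaped space is kept as a literal space.
    public static bool EscapedSpace(string input)
    {
        return Regex.IsMatch(input, @"a\ b", Extended);
    }

    // Unescaped # starts a comment, so the pattern is effectively "<".
    public static bool UnescapedHash(string input)
    {
        return Regex.IsMatch(input, @"<#", Extended);
    }

    // Escaped # is a literal character, so the pattern requires "<#".
    public static bool EscapedHash(string input)
    {
        return Regex.IsMatch(input, @"<\#", Extended);
    }
}
```

With these helpers, UnescapedSpace("a b") is false while EscapedSpace("a b") is true, and UnescapedHash("<") is true because the # and everything after it was swallowed as a comment, whereas EscapedHash only accepts input that really contains "<#".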

It's pretty obvious that leaving such complex expressions uncommented makes them almost unreadable to anyone but the person who wrote them (and, after some time, even to that person!). Bottom line: ALWAYS comment your expressions in-line!
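As a quick sanity check that the comments cost nothing in behavior, here is a small sketch that runs the compact and the commented ParameterReference expressions side by side. The wrapper class and the Describe, Classify and FormsAgree helpers are hypothetical names introduced only for this example:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the two patterns come from the post.
public static class ParameterReferenceDemo
{
    const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace;

    // Compact form, without comments.
    static readonly Regex Compact = new Regex(
        @"(?<empty>\<\>)|\<(?<parameter>[^\<\>]+)\>|(?<open>\<[^\<\>]*(?!\>))",
        Options);

    // Commented form of the same expression.
    static readonly Regex Commented = new Regex(@"
        # Matches invalid empty brackets
        (?<empty>\<\>)|
        # Matches a valid parameter reference
        \<(?<parameter>[^\<\>]+)\>|
        # Matches opened brackets that are not properly closed
        (?<open>\<[^\<\>]*(?!\>))",
        Options);

    // Reports which named group produced the first match, if any.
    static string Describe(Regex regex, string input)
    {
        Match m = regex.Match(input);
        if (!m.Success) return "none";
        if (m.Groups["empty"].Success) return "empty";
        if (m.Groups["parameter"].Success)
            return "parameter:" + m.Groups["parameter"].Value;
        return "open:" + m.Groups["open"].Value;
    }

    public static string Classify(string input)
    {
        return Describe(Commented, input);
    }

    // True when both forms classify the input identically.
    public static bool FormsAgree(string input)
    {
        return Describe(Compact, input) == Describe(Commented, input);
    }
}
```

For example, Classify("Hello <name>!") yields "parameter:name", Classify("<>") yields "empty", and Classify("Hello <name") yields "open:<name"; FormsAgree returns true for each of these inputs, since the comments and line breaks are discarded when the pattern is parsed.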
