Matching Balanced Constructs with .NET Regular Expressions
Brief Computer Science Theory Background
In computer science a formal language is a set of finite character strings that are created by some finite alphabet. There exist four major formal language classes as defined by the Chomsky hierarchy.
- Recursively enumerable - All languages that are recognized by a Turing machine with infinite tape.
- Context-sensitive - All languages that are recognized by a Turing machine with finite tape.
- Context-free - All languages that are recognized by a Pushdown automaton (PDA).
- Regular - All languages that are recognized by a Finite state automaton (FSA).
Most programming language syntax can be described by a context-free language and can be recognized by a PDA. A PDA can be thought of as a FSA or FSM that can use a stack to store data. Regular languages are used to describe simple string patterns such as program identifiers. Regular expressions are strings that describe a particular regular language. Regular languages cannot recognize any string that requires any sort of counting. One classic language anbn where n > 0 is not a regular language because it cannot be recognized by a FSA. It is a context-free language and can be recognized by a PDA. Whenever an a is read then an a is pushed onto the stack and then whenever a b is read an a is popped off the stack and if the stack is empty after reading all the characters of the string then the string is accepted. Similarly properly balanced constructs such as balanced parentheses need a PDA to be recognized and thus cannot be represented by a regular expression.
.NET Regular Expression Engine
As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.
- (?<group>) - pushes the captured result on the capture stack with the name group.
- (?<-group>) - pops the top most capture with the name group off the capture stack.
- (?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.
These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.
.NET Regular Expression Examples
The classic anbn example.
Regex re = new Regex(@"^ (?<N>a)+ # For every a push N on capture stack (?<-N>b)+ # For every b pop N from capture stack (?(N)(?!)) # If N exists on stack then fail (?!) $", RegexOptions.IgnorePatternWhitespace);
This regular expression recognizes any number of a's followed by the same number of b's. Essentially for every a it adds a named group N to the capture stack and then for every b it removes a named group N from the capture stack. Once it gets past the last b it checks to see if the named group N exists on the capture stack and if it does then there were more a's then b's and so it forces a failure by matching (?!) (this is a negative lookahead with no expression which is a guaranteed failure). It is worth mentioning that if no named group N exists when trying to pop (<-N>) then it will fail and thus this prevents accepting strings where there are more b's then a's.
Balanced Parentheses.
Jeffrey Friedl provides the following example in his excellent book Mastering Regular Expressions, 2nd Edition.
Dim R As Regex = New Regex(" \( " & _ " (?> " & _ " [^()]+ " & _ " | " & _ " \( (?<DEPTH>) " & _ " | " & _ " \) (?<-DEPTH>) " & _ " )* " & _ " (?(DEPTH)(?!)) " & _ " \) ", _ RegexOptions.IgnorePatternWhitespace)
Now this expression works just fine for matching properly-nested parentheses but its layout doesn't work for matching nested constructs which are more then an single character such as XML tags for example. Here is another regular expression for matching parentheses that can be expanded easily for other multi-character delimiters.
Regex re = new Regex(@"^ (?> \( (?<LEVEL>) # On opening paren push level | \) (?<-LEVEL>) # On closing paren pop level | (?! \( | \) ) . # Match any char except ( or ) )+ (?(LEVEL)(?!)) # If level exists then fail $", RegexOptions.IgnorePatternWhitespace);
This expression also matches properly-nested parentheses. The biggest difference here is that instead of matching a character class of [^()]+ it uses a negative lookahead to ensure that the character is not a paren. It also only captures one character instead of one or more. For a single character delimiter like the parentheses a lookahead may be more than what is needed but it is needed in the next example.
Balanced XML tags.
Regex re = new Regex(@"^ (?> <tag> (?<LEVEL>) # On opening <tag> push level | </tag> (?<-LEVEL>) # On closing </tag> pop level | (?! <tag> | </tag> ) . # Match any char unless the strings )+ # <tag> or </tag> in the lookahead string (?(LEVEL)(?!)) # If level exists then fail $", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
This expression matches the properly-nested XML tags <tag> and </tag>. The only change from the parentheses expression is to replace ( with <tag> and ) with </tag>. This can be generalized such that all that is needed is the regular expressions for the opening and closing delimiters. The next example shows how one could use this expression in a more general fashion.
General version of Balanced constructs (HTML Anchor tags used in example).
Regex re = new Regex(string.Format(@"^ (?> {0} (?<LEVEL>) # On opening delimiter push level | {1} (?<-LEVEL>) # On closing delimiter pop level | (?! {0} | {1} ) . # Match any char unless the opening )+ # or closing delimiters are in the lookahead string (?(LEVEL)(?!)) # If level exists then fail $", "<a[^>]*>", "</a>"), RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
Now this expression uses a simple string format to replace the opening and closing delimiters in the expression string. In this case a simplistic version of the opening and closing HTML anchor tags are used. In general any opening and closing delimiters can be provided to this expression to create a .NET regular expression to match properly balanced constructs.
Retrieving data between delimiters where there are possible nested delimiters
One application commonly needed is the ability to retrieve the text between a set of tags when there is the possibility of the nesting. If there were no nested tags then this regular expression would be rather simple but since there are one essentially needs to wrap the expression from above with the set of outer tags and then capture the inner text.
Regex re = new Regex(string.Format(@"^ {0} # Match first opeing delimiter (?<inner> (?> {0} (?<LEVEL>) # On opening delimiter push level | {1} (?<-LEVEL>) # On closing delimiter pop level | (?! {0} | {1} ) . # Match any char unless the opening )+ # or closing delimiters are in the lookahead string (?(LEVEL)(?!)) # If level exists then fail ) {1} # Match last closing delimiter $", "<quote>", "</quote>"), RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
re.Match("<quote>inner text</quote>").Groups["inner"].Value == "inner text" re.Match("<quote>a<quote>b</quote>c</quote>").Groups["inner"].Value == "a<quote>b</quote>c"
This example strips off the outer most <quote> tags and stores the inner text result in the named-capture group inner.
Matching multiple balanced constructs
The original intent of this example was to show how to match multiple properly balanced tags with a single expression. However, after creating the expression and testing it an interesting problem cropped up. For example to make sure () and [] are properly nested individually is easy as shown above but to make sure they are properly nested together is not possible with .NET regular expressions. To better understand the problem consider the following improperly nested examples ([)] or [(()]). They are individually properly-nested but improperly-nested when considering them together. Here is an expression that could potentially recognize this:
Regex re = new Regex(@"^ (?> (?<LEVEL> \() # On opening paren capture ( on stack | (?(\k<LEVEL>=\() # Make sure the top of stack is ( (?<-LEVEL> \) )) # On closing paren pop ( off stack | (?<LEVEL> \[ ) # On opening bracket capture [ on stack | (?(\k<LEVEL>=\]) # Make sure the top of stack is [ (?<-LEVEL> \] )) # On closing bracket pop [ off stack | (?! \( | \) | \[ | \] ) . # Match any char except (, ), [ or ] )+ (?(LEVEL)(?!)) # If level exists then fail $", RegexOptions.IgnorePatternWhitespace);
THIS REGULAR EXPRESSION DOES NOT WORK IT IS ONLY USED AS A DEMONSTRATION
The captured value on the top of the stack can be retrieved by using a backreference \k<LEVEL> but there is no way to test the value. The above expression doesn't work because of (?(\k<LEVEL>=\() and (?(\k<LEVEL>=\]) they try to match the string literals "(=)" or "[=]". What really needs to happen is the value on the top of stack needs to be compared to ( or [ however this is not possible with .NET regular expressions. This is an example of a context-free language that cannot be recognized by a simple counter.
Conclusion
Hopefully this article has provided a better understanding of how and why the .NET regular expression engine can recognize individually properly balanced constructs.