Matching Balanced Constructs with .NET Regular Expressions
Brief Computer Science Theory Background
In computer science a formal language is
a set of finite character strings that are created by some finite
alphabet. There exist four major formal language classes
as defined by the Chomsky hierarchy.
Most programming language syntax can be described by a context-free
language and can be recognized by a PDA. A PDA can be thought of
as a FSA or FSM that can use a
stack to store data. Regular languages are
used to describe simple string patterns such as program identifiers. Regular
expressions are strings that describe a particular regular language. Regular languages
cannot recognize any string that requires any sort of counting. One classic
language anbn where n > 0 is not a regular language
because it cannot be recognized by a FSA. It is a context-free
language and can be recognized by a PDA. Whenever an
a is read then an a is pushed onto the stack and then whenever
a b is read an a is popped off the stack and if the stack is
empty after reading all the characters of the string then the string is
accepted. Similarly properly balanced constructs such as
balanced parentheses need a PDA to be recognized
and thus cannot be represented by a regular expression.
.NET Regular Expression Engine
As described above properly balanced constructs cannot be described by a
regular expression. However, the .NET regular expression engine provides a few
constructs that allow balanced constructs to be recognized.
- (?<group>)
- pushes the captured result on the capture stack with the name
group.
- (?<-group>)
- pops the top most capture with the name group off the
capture stack.
- (?(group)yes|no)
- matches the yes part if there exists a group with the name group
otherwise matches no part.
These constructs allow for a .NET regular expression to emulate a restricted
PDA by essentially
allowing simple versions of the stack operations: push, pop and empty. The
simple operations are pretty much equivalent to increment, decrement and compare
to zero respectively. This allows for the .NET regular expression engine to
recognize a subset of the context-free
languages, in particular the ones that only require a simple counter. This in
turn allows for the non-traditional .NET regular expressions to recognize
individual properly balanced constructs.
.NET Regular Expression Examples
The classic anbn example.
Regex re = new Regex(@"^
(?<N>a)+ # For every a push N on capture stack
(?<-N>b)+ # For every b pop N from capture stack
(?(N)(?!)) # If N exists on stack then fail (?!)
$", RegexOptions.IgnorePatternWhitespace);
This regular expression recognizes any number of a's followed by the
same number of b's. Essentially for every a it adds a named
group N to the capture stack and then for every b it removes a
named group N from the capture stack. Once it gets past the
last b it checks to see if the named group N exists on
the capture stack and if it does then there were more a's then
b's and so it forces a failure by matching (?!)
(this is a negative lookahead with no expression which is a guaranteed
failure). It is worth mentioning that if no named group N exists when
trying to pop (<-N>) then it will fail and
thus this prevents accepting strings where there are more b's then
a's.
Balanced Parentheses.
Jeffrey Friedl provides the following example in his excellent book Mastering Regular
Expressions, 2nd Edition.
Dim R As Regex = New Regex(" \( " & _
" (?> " & _
" [^()]+ " & _
" | " & _
" \( (?<DEPTH>) " & _
" | " & _
" \) (?<-DEPTH>) " & _
" )* " & _
" (?(DEPTH)(?!)) " & _
" \) ", _
RegexOptions.IgnorePatternWhitespace)
Now this expression works just fine for matching properly-nested parentheses
but its layout doesn't work for matching nested constructs which are more then
an single character such as XML tags for example. Here
is another regular expression for matching parentheses that can be
expanded easily for other multi-character delimiters.
Regex re = new Regex(@"^
(?>
\( (?<LEVEL>) # On opening paren push level
|
\) (?<-LEVEL>) # On closing paren pop level
|
(?! \( | \) ) . # Match any char except ( or )
)+
(?(LEVEL)(?!)) # If level exists then fail
$", RegexOptions.IgnorePatternWhitespace);
This expression also matches properly-nested parentheses. The biggest
difference here is that instead of matching a character class of [^()]+ it uses a negative lookahead to
ensure that the character is not a paren. It also only captures one character
instead of one or more. For a single character delimiter like
the parentheses a lookahead may be more than what is needed but it is needed in
the next example.
Balanced XML tags.
Regex re = new Regex(@"^
(?>
<tag> (?<LEVEL>) # On opening <tag> push level
|
</tag> (?<-LEVEL>) # On closing </tag> pop level
|
(?! <tag> | </tag> ) . # Match any char unless the strings
)+ # <tag> or </tag> in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
$", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
This expression matches the properly-nested XML tags <tag> and </tag>. The only change from the
parentheses expression is to replace ( with
<tag> and ) with </tag>. This can be generalized such that
all that is needed is the regular expressions for the opening and closing
delimiters. The next example shows how one could use this expression in a more
general fashion.
General version of Balanced constructs (HTML Anchor tags used in
example).
Regex re = new Regex(string.Format(@"^
(?>
{0} (?<LEVEL>) # On opening delimiter push level
|
{1} (?<-LEVEL>) # On closing delimiter pop level
|
(?! {0} | {1} ) . # Match any char unless the opening
)+ # or closing delimiters are in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
$", "<a[^>]*>", "</a>"),
RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
Now this expression uses a simple string format to replace the opening and
closing delimiters in the expression string. In this case a simplistic version
of the opening and closing HTML anchor tags are used. In general any opening and
closing delimiters can be provided to this expression to create a .NET regular
expression to match properly balanced constructs.
Retrieving data between delimiters where
there are possible nested delimiters
One application commonly needed is the ability to retrieve the text
between a set of tags when there is the possibility of the nesting. If
there were no nested tags then this regular expression would be rather
simple but since there are one essentially needs to wrap
the expression from above with the set of outer tags and then capture
the inner text.
Regex re = new Regex(string.Format(@"^
{0} # Match first opeing delimiter
(?<inner>
(?>
{0} (?<LEVEL>) # On opening delimiter push level
|
{1} (?<-LEVEL>) # On closing delimiter pop level
|
(?! {0} | {1} ) . # Match any char unless the opening
)+ # or closing delimiters are in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
)
{1} # Match last closing delimiter
$", "<quote>", "</quote>"),
RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
re.Match("<quote>inner text</quote>").Groups["inner"].Value == "inner text"
re.Match("<quote>a<quote>b</quote>c</quote>").Groups["inner"].Value == "a<quote>b</quote>c"
This example strips off the outer most <quote> tags and stores the inner text result in
the named-capture group inner.
Matching multiple balanced constructs
The original intent of this example was to show how to match multiple
properly balanced tags with a single expression. However, after creating
the expression and testing it an interesting problem cropped up. For example to
make sure () and [] are properly nested individually is easy as shown
above but to make sure they are properly nested together is not possible with
.NET regular expressions. To better understand the problem consider the
following improperly nested examples ([)]
or [(()]). They are individually
properly-nested but improperly-nested when considering them together. Here is an
expression that could potentially recognize this:
Regex re = new Regex(@"^
(?>
(?<LEVEL> \() # On opening paren capture ( on stack
|
(?(\k<LEVEL>=\() # Make sure the top of stack is (
(?<-LEVEL> \) )) # On closing paren pop ( off stack
|
(?<LEVEL> \[ ) # On opening bracket capture [ on stack
|
(?(\k<LEVEL>=\]) # Make sure the top of stack is [
(?<-LEVEL> \] )) # On closing bracket pop [ off stack
|
(?! \( | \) | \[ | \] ) . # Match any char except (, ), [ or ]
)+
(?(LEVEL)(?!)) # If level exists then fail
$", RegexOptions.IgnorePatternWhitespace);
THIS REGULAR EXPRESSION DOES NOT WORK IT IS ONLY USED AS
A DEMONSTRATION
The captured value on the top of the stack can be retrieved by using a backreference
\k<LEVEL> but there is no way to test the
value. The above expression doesn't work because of (?(\k<LEVEL>=\() and (?(\k<LEVEL>=\]) they try to match the string
literals "(=)" or "[=]". What really needs to happen is the value on the
top of stack needs to be compared to ( or [ however this is not possible with .NET regular
expressions. This is an example of a context-free
language that cannot be recognized by a simple counter.
Conclusion
Hopefully this article has provided a better understanding of how and why the
.NET regular expression engine can recognize individually properly balanced
constructs.