Matching Balanced Constructs with .NET Regular Expressions

Brief Computer Science Theory Background

In computer science a formal language is a set of finite character strings that are created by some finite alphabet. There exist four major formal language classes as defined by the Chomsky hierarchy.

Most programming language syntax can be described by a context-free language and can be recognized by a PDA. A PDA can be thought of as a FSA or FSM that can use a stack to store data. Regular languages are used to describe simple string patterns such as program identifiers. Regular expressions are strings that describe a particular regular language. Regular languages cannot recognize any string that requires any sort of counting. One classic language anbn where n > 0 is not a regular language because it cannot be recognized by a FSA. It is a context-free language and can be recognized by a PDA. Whenever an a is read then an a is pushed onto the stack and then whenever a b is read an a is popped off the stack and if the stack is empty after reading all the characters of the string then the string is accepted. Similarly properly balanced constructs such as balanced parentheses need a PDA to be recognized and thus cannot be represented by a regular expression.

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized. 

  • (?<group>) - pushes the captured result on the capture stack with the name group.
  • (?<-group>) - pops the top most capture with the name group off the capture stack.
  • (?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.

.NET Regular Expression Examples

The classic anbn example.

Regex re = new Regex(@"^
  (?<N>a)+    # For every a push N on capture stack
  (?<-N>b)+   # For every b pop N from capture stack
  (?(N)(?!))  # If N exists on stack then fail (?!)
  $", RegexOptions.IgnorePatternWhitespace);

This regular expression recognizes any number of a's followed by the same number of b's. Essentially for every a it adds a named group N to the capture stack and then for every b it removes a named group N from the capture stack. Once it gets past the last b it checks to see if the named group N exists on the capture stack and if it does then there were more a's then b's and so it forces a failure by matching (?!) (this is a negative lookahead with no expression which is a guaranteed failure). It is worth mentioning that if no named group N exists when trying to pop (<-N>) then it will fail and thus this prevents accepting strings where there are more b's then a's.

Balanced Parentheses.

Jeffrey Friedl provides the following example in his excellent book Mastering Regular Expressions, 2nd Edition.

Dim R As Regex = New Regex(" \(                   " & _
                           "   (?>                " & _
                           "       [^()]+         " & _
                           "     |                " & _
                           "       \( (?<DEPTH>)  " & _
                           "     |                " & _
                           "       \) (?<-DEPTH>) " & _
                           "   )*                 " & _
                           "   (?(DEPTH)(?!))     " & _
                           " \)                   ", _
       RegexOptions.IgnorePatternWhitespace)

Now this expression works just fine for matching properly-nested parentheses but its layout doesn't work for matching nested constructs which are more then an single character such as XML tags for example. Here is another regular expression for matching parentheses that can be expanded easily for other multi-character delimiters.

Regex re = new Regex(@"^
  (?>
      \( (?<LEVEL>)   # On opening paren push level
    |    
      \) (?<-LEVEL>)  # On closing paren pop level
    |
      (?! \( | \) ) . # Match any char except ( or ) 
  )+
  (?(LEVEL)(?!))      # If level exists then fail
  $", RegexOptions.IgnorePatternWhitespace);

This expression also matches properly-nested parentheses. The biggest difference here is that instead of matching a character class of  [^()]+ it uses a negative lookahead to ensure that the character is not a paren. It also only captures one character instead of one or more. For a single character delimiter like the parentheses a lookahead may be more than what is needed but it is needed in the next example.

Balanced XML tags.

Regex re = new Regex(@"^
  (?>
      <tag>  (?<LEVEL>)      # On opening <tag> push level
    | 
      </tag> (?<-LEVEL>)     # On closing </tag> pop level
    |
      (?! <tag> | </tag> ) . # Match any char unless the strings   
  )+                         # <tag> or </tag> in the lookahead string
  (?(LEVEL)(?!))             # If level exists then fail
  $", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

This expression matches the properly-nested XML tags <tag> and </tag>. The only change from the parentheses expression is to replace ( with <tag> and ) with </tag>. This can be generalized such that all that is needed is the regular expressions for the opening and closing delimiters. The next example shows how one could use this expression in a more general fashion.

General version of Balanced constructs (HTML Anchor tags used in example).

Regex re = new Regex(string.Format(@"^
  (?>
      {0} (?<LEVEL>)      # On opening delimiter push level
    | 
      {1} (?<-LEVEL>)     # On closing delimiter pop level
    |
      (?! {0} | {1} ) .   # Match any char unless the opening   
  )+                      # or closing delimiters are in the lookahead string
  (?(LEVEL)(?!))          # If level exists then fail
  $", "<a[^>]*>", "</a>"), 
  RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

Now this expression uses a simple string format to replace the opening and closing delimiters in the expression string. In this case a simplistic version of the opening and closing HTML anchor tags are used. In general any opening and closing delimiters can be provided to this expression to create a .NET regular expression to match properly balanced constructs.

Retrieving data between delimiters where there are possible nested delimiters

One application commonly needed is the ability to retrieve the text between a set of tags when there is the possibility of the nesting. If there were no nested tags then this regular expression would be rather simple but since there are one essentially needs to wrap the expression from above with the set of outer tags and then capture the inner text. 

Regex re = new Regex(string.Format(@"^
  {0}                       # Match first opeing delimiter
  (?<inner>
    (?>
        {0} (?<LEVEL>)      # On opening delimiter push level
      | 
        {1} (?<-LEVEL>)     # On closing delimiter pop level
      |
        (?! {0} | {1} ) .   # Match any char unless the opening   
    )+                      # or closing delimiters are in the lookahead string
    (?(LEVEL)(?!))          # If level exists then fail
  )
  {1}                       # Match last closing delimiter
  $", "<quote>", "</quote>"), 
  RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
re.Match("<quote>inner text</quote>").Groups["inner"].Value == "inner text" re.Match("<quote>a<quote>b</quote>c</quote>").Groups["inner"].Value == "a<quote>b</quote>c"

This example strips off the outer most <quote> tags and stores the inner text result in the named-capture group inner.  

Matching multiple balanced constructs

The original intent of this example was to show how to match multiple properly balanced tags with a single expression. However, after creating the expression and testing it an interesting problem cropped up. For example to make sure () and [] are properly nested individually is easy as shown above but to make sure they are properly nested together is not possible with .NET regular expressions. To better understand the problem consider the following improperly nested examples ([)] or [(()]). They are individually properly-nested but improperly-nested when considering them together. Here is an expression that could potentially recognize this:

Regex re = new Regex(@"^
  (?>
      (?<LEVEL> \()                 # On opening paren capture ( on stack
    | 
      (?(\k<LEVEL>=\()              # Make sure the top of stack is (
      (?<-LEVEL> \) ))              # On closing paren pop ( off stack
    |
      (?<LEVEL> \[ )                # On opening bracket capture [ on stack
    |
      (?(\k<LEVEL>=\])              # Make sure the top of stack is [
      (?<-LEVEL> \] ))              # On closing bracket pop [ off stack
    |
      (?! \( | \) | \[ | \] ) .     # Match any char except (, ), [ or ]
  )+
  (?(LEVEL)(?!))                    # If level exists then fail
  $", RegexOptions.IgnorePatternWhitespace);

THIS REGULAR EXPRESSION DOES NOT WORK IT IS ONLY USED AS A DEMONSTRATION

The captured value on the top of the stack can be retrieved by using a backreference \k<LEVEL> but there is no way to test the value. The above expression doesn't work because of (?(\k<LEVEL>=\() and (?(\k<LEVEL>=\]) they try to match the string literals "(=)" or "[=]". What really needs to happen is the value on the top of stack needs to be compared to ( or [ however this is not possible with .NET regular expressions. This is an example of a context-free language that cannot be recognized by a simple counter.

Conclusion

Hopefully this article has provided a better understanding of how and why the .NET regular expression engine can recognize individually properly balanced constructs.

Published Sunday, February 20, 2005 3:48 PM by puzzlehacker

Comments

# re: Matching Balanced Constructs with .NET Regular Expressions

> Most programming languages are context-free languages and can be
> recognized by PDA.

No, most programming languages have approximations (where "approximation" is not a formally defined term ^_^) for which the approximations are CFLs and can be recognized by PDAs. Most actual programming languages are really CSLs or harder, because of semantic issues such as verifying that variables have been declared and that their types are compatible.

I didn't investigate which class C fell into after "typedef" was added to the language, but maybe that doesn't matter because I think the preprocessing step requires a full Turing machine.

Sunday, February 20, 2005 10:58 PM by Norman Diamond

# re: Matching Balanced Constructs with .NET Regular Expressions

Norman, Thanks for the correction. What I meant to say was the syntax of most programming languages can be recognized by a PDA (I reworded this in the text). The semantics are a different story.

Sunday, February 20, 2005 11:13 PM by Wes Haggard

# re: Matching Parenthesis Pairs Using Regular Expressions

Thursday, February 24, 2005 1:31 PM by TrackBack

# re: Balanced matching using .NET Regex

Sunday, June 19, 2005 2:36 PM by TrackBack

# re: Matching Balanced Constructs with .NET Regular Expressions

Nice and very informative post ....

Thursday, November 09, 2006 7:04 AM by Ankit Mehta

# Simon&#8217;s Blog &raquo; ???????????? &raquo; ???????????????????????????-30???????????????????????????

# re: Matching Balanced Constructs with .NET Regular Expressions

I think, the example after:

'can be expanded easily for other multi-character delimiters.'

is missing the \( and \) at the end of the expression

Thursday, December 21, 2006 9:44 AM by h

# [转]正则表达式30分钟入门教程

正则表达式30分钟入门教程。当然,如果你看完了这篇教程之后,发现自己明白了很多,却又几乎什么都记不得,那也是很正常的——我认为,没接触过正则表达式的人在看完这篇教程后,能把提到过的语法记住80%以上的可能性为零。除了作为入门教程之外,本文还试图成为可以在日常工作中使用的正则表达式语法参考手册。

Tuesday, May 29, 2007 2:00 AM by wuhran

# 正则表达式(详)

正则表达式30分钟入门教程。版本:v2.2 (2007-5-28) 作者:deerchao 来源:unibetter大学生社区 转载请注明来源目录本文目标 如何使用本教程 正则表达式到底是什么?当然,如果你看完了这篇教程之后,发现自己明白了很多,却又几乎什么都记不得,那也是很正常的——我认为,没接触过正则表达式的人在看完这篇教程后,能把提到过的语法记住80%以上的可能性为零。

Tuesday, June 05, 2007 4:08 AM by hanpoyang

# 正则表达式30分钟入门教程

Sunday, June 10, 2007 9:54 PM by 傲衣华少

# 正则表达式30分钟入门教程

目录 本文目标 如何使用本教程 正则表达式到底是什么? 入门 测试正则表达式 元字符 字符转义 重复 字符类 反义 替换 分组 后向引用 零宽断言 负向零宽断言 注释 贪婪与懒惰 处理选项 平衡组/递归匹配

Monday, August 13, 2007 7:35 AM by code about asp.net (c#)

# 正则表达式30分钟入门教程

转自: www.unibetter.com/.../zhengzhe-biaodashi-jiaocheng-se.htm 本文目标 30分钟内让你明白正则表达式是什么,并对它有一些基本的了解

Monday, August 13, 2007 10:05 AM by wsliu

# re: Matching Balanced Constructs with .NET Regular Expressions

Can .NET Regular Expressions execute code in a regular expression like perl?

Friday, November 02, 2007 3:57 AM by hanmin wang

# ??????????????? &raquo; Blog Archive &raquo; ???????????????30??????????????????

Pingback from  ???????????????  &raquo; Blog Archive   &raquo;  ???????????????30??????????????????

# 正则表达式30分钟入门教程

30分钟内让你明白正则表达式是什么,并对它有一些基本的了解,让你可以在自己的程序或网页里使用它。

Thursday, January 24, 2008 8:26 AM by Jason

# ??????????????????????????? | Richie's Blog

Pingback from  ??????????????????????????? | Richie's Blog

Sunday, January 27, 2008 9:13 PM by ??????????????????????????? | Richie's Blog

# 转:正则表达式30分钟入门教程

作者:deerchao来源:unibetter大学生社区转载请注明来源

目录 本文目标

如何使用本教程

什么是正则表达式?

入门

测试正则表...

Monday, January 28, 2008 9:25 PM by oiea

# 正则表达式30分钟入门教程

目录 本文目标

如何使用本教程

正则表达式到底是什么?

入门

测试正则表达式

元字符

字符转义

重复

字符...

Monday, February 18, 2008 4:04 AM by Eric.Chai

# ????????????????????????????????????????????? | | ??????????????????????????????

Pingback from  ????????????????????????????????????????????? |  | ??????????????????????????????

# 转载:正则表达式30分钟入门教程

正则表达式30分钟入门教程

版本:v2.21(2007-8-3)作者:deerchao来源:unibetter大学生社区转载请注明来源

目录 本文目标

如何使用本教...

Sunday, March 09, 2008 12:00 PM by Samita

# 30分钟学习正则表达式的使用

这是一篇很好的教学指南,从基础上认识正则表达式,精通正则表达式需要多练习使用,总结经验。正则在文章查找,字符匹配,输入验正有很好的作用。

Saturday, March 15, 2008 9:43 PM by c_spark

# 正则表达式30分钟入门教程

版本:v2.21(2007-8-3)作者:deerchao来源:unibetter大学生社区转载请注明来源目录本文目标如何使用本教程正则表达式到底是什么?入门测试正则表达式元字符字符...

Monday, March 17, 2008 2:04 AM by looping

# 正则表达式30分钟入门教程

www.unibetter.com/.../zhengzhe-biaodashi-jiaocheng-se.htm 正则表达式30分钟入门教...

Thursday, March 27, 2008 10:35 PM by 程序缘

# 正则表达式30分钟入门教程

regular expression

Friday, March 28, 2008 4:52 AM by henry_dx

# 正则表达式30分钟入门教程(第二版)

正则表达式30分钟入门教程(第二版)

Friday, April 04, 2008 8:04 PM by 编程浪子涛

# regex

正则表达式30分钟入门教程

版本:v2.21(2007-8-3)作者:deerchao来源:unibetter大学生社区转载请注明来源

目录 本文目标

如何使用本教...

Monday, April 07, 2008 9:22 PM by hq2008

# 正则表达式30分钟入门教程 v2.1

正则表达式30分钟入门教程 v2.1

Friday, April 18, 2008 3:34 AM by lisong58420

# ???????????????30??????????????????

Pingback from  ???????????????30??????????????????

Friday, May 02, 2008 8:48 AM by ???????????????30??????????????????

# re: Matching Balanced Constructs with .NET Regular Expressions

I want split a multiline string on \r\n if they are not within quotation marks

bla;"bla\r\nbla";blablabla\r\nbla;"bla\r\nbla\r\nblabla";blablabla\r\n

should result to:

bla;"bla\r\nbla";blablabla

bla;"bla\r\nbla\r\nblabla";blablabla

Sunday, May 04, 2008 5:16 AM by Günter Zöchbauer

# ??????????????? | C's Blog

Pingback from  ??????????????? | C's Blog

Wednesday, May 28, 2008 9:48 AM by ??????????????? | C's Blog

# ???????????????30??????????????????[???]

Pingback from  ???????????????30??????????????????[???]

Friday, August 29, 2008 4:16 PM by ???????????????30??????????????????[???]

# .net 正则表达式 30分钟入门教程

BODY{

FONT-SIZE:100%

}

H1{

TEXT-ALIGN:center

}

H2{

CLEAR:both;BORDER-RIGHT:gray1pxs...

Friday, September 19, 2008 9:51 AM by 偶然微笑

# 正则表达式30分钟入门教程(转)

快速学习正则表达式,看了以后发现写的很好。。。特贴出来 。

Thursday, October 09, 2008 9:58 PM by 小菜猪

# ???????????????30?????????????????? | ???????????????

Pingback from  ???????????????30?????????????????? | ???????????????

Thursday, November 06, 2008 2:44 PM by ???????????????30?????????????????? | ???????????????

# 正则表达式30分钟入门教程

最重要的是——请给我30分钟,如果你没有使用正则表达式的经验,请不要试图在30秒内入门——除非你是超人 :)

Thursday, March 12, 2009 5:13 AM by 查小广

# re: Matching Balanced Constructs with .NET Regular Expressions

Hi,

very good articles, nice explanation

thanks

Thursday, March 26, 2009 9:10 AM by Moumen

# 正则表达式如何处理嵌套结构

1,正则表达式如何处理嵌套结构

a.

Saturday, May 09, 2009 11:25 AM by 周骏

# - Vea&#8217;s Galaxy

Pingback from  - Vea&#8217;s Galaxy

Monday, June 15, 2009 10:12 AM by - Vea’s Galaxy

# [转]正则表达式30分钟入门教程(第二版)

本文目标30分钟内让你明白正则表达式是什么,并对它有一些基本的了解,让你可以在自己的程序或网页里使用它。如何使用本教程别被下面那些复杂的表达式吓倒,只要跟着我一步一步来,你会发现正则表达式其实并不像...

Sunday, June 21, 2009 8:21 AM by 天歌

# ???????????????30?????????????????? - ??? ??????????????? ????????? - ?????? - ?????? BLOG

Pingback from  ???????????????30?????????????????? - ??? ??????????????? ????????? - ?????? - ?????? BLOG

Leave a Comment

(required) 
(required) 
(optional)
(required)