Jeff and .NET

The .NET musings of Jeff Putz

Sponsors

News

My Sites

Archives

Revenge of the greedy RegEx

Naturally, when I get my text parsing right for the forums and release it, I realize that my test coverage wasn't as good as I had hoped. I've got a spot of code that looks for pseudo-code quote tags for conversion to an HTML blockquote. The expression looks like this:

(\[quote\])(.*)(\[/quote\])

Works like a champ... sort of. The middle group in the Match object (match.Groups[2]) then gets parsed to look for nested quotes. This works well to.

The problem is that it counts this as one entire match:

[quote]quote 1[/quote][quote]quote 2[/quote]

Where I'm troubled is that I'm not sure which group I should be altering. The middle "meat" of the quote I need to access via the Groups collection so I can parse for nested quotes, and every attempt I've made to alter this breaks the proper nesting.

I hate it that this one area causes me so much grief. I'm not brilliant, but I would think I'd be smart enough to keep this from eluding me! Help is appreciated.
Posted: Oct 25 2004, 09:43 PM by Jeff | with 12 comment(s)
Filed under: ,

Comments

Dean Harding said:

Unfortunately, there's nothing you can do. Regular expressions are not able to match balanced expressions (e.g. things like matching braces, or matching HTML tags). You can do what you've got there, which is make it so that it supports only one set of nested elements, or multiple sets of non-nested elements (by using (.*?) as your "meat" - i.e. a non-greedy search) but without some wacky extension to the regular expression grammer itself, you can't have the best of both worlds.

I think the best thing you can do is structure your forums so that you can simply replace all [quote] text with a single construct, and all [/quote] with a single construct - i.e. don't rely on them being matched together. Then you can perhaps put some custom logic in to make sure there's an equal number of [quote] and [/quote] tags.

For example, if you simply replace all instances of [quote] with <blockquote> and all instances of [/quote] with </blockquote>, then (apart from validating the input) you don't really need to know that it's all matched properly.
# October 25, 2004 10:41 PM

Wes said:

Try:
(\[quote\])(.*?)(\[/quote\])

From msdn:
*? - Specifies the first match that consumes as few repeats as possible (equivalent to lazy *).
# October 25, 2004 10:44 PM

Jeff said:

I do that for all of the non-block elements, but it's a lot harder to do it with block elements like this one, and p tags (which come from line breaks).
# October 25, 2004 10:45 PM

Jeff said:

Wes: That breaks the nesting. It grabs the inner-most [/quote] instead of the outside one.
# October 25, 2004 10:49 PM

Dean Harding said:

Holy crap!

I did a bit more research, and it seems Microsoft did add methods for matching nested constructs! Go Microsoft!!

There's a chapter in "Mastering Regular Expressions" (http://www.amazon.co.uk/exec/obidos/ASIN/0596002890/026-9448710-1434859) on how to do it. Luckily, O'Reily have made the chapter in question available for download from: http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf - see the "Matching Nested Constructs" section.

There are some proposed extensions to Perl to allow for similar constructors (see, for example: http://www.puffinry.freeserve.co.uk/regex-extension.html) but they're not in perl per-se yet.
# October 25, 2004 11:00 PM

Wes said:

I have been looking at this expression from Ch9 of the book Dean mentioned above. With this expression syntax you can match nested items but I think it is still greedy. I'm not sure we can do what you want with .Net regular expressions.
# October 25, 2004 11:15 PM

Dean Harding said:

It should work because it only matches *balanced* expressions. So for example, if you have (assuming we're matching parenthases here):

(The (quick) brown fox) jumped (over the (lazy dog))

You'll get two matches:

(The (quick) brown fox)
and
(over the (lazy dog))

Which is what Jeff wants...
# October 28, 2004 1:30 AM

Wes said:

Dean you are correct. Matching ()'s is a little different then matching [quote][/quote] but not much harder, just takes a little different angle. After examining more closely I was able to figure out how to do what Jeff wants.

Jeff if you are still interested try:
(?:\[quote\])(?<quote>(?>(?:\[quote\])(?<DEPTH>)|(?:\[/quote\])(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/quote\])

Test string:
[quote]quote 1 [quote]quote 2[/quote] after quote 2[/quote][quote]quote 2[/quote]

Matches
1) [quote]quote 1 [quote]quote 2[/quote] after quote 2[/quote]
"quote" 1) quote 1 [quote]quote 2[/quote] after quote 2

2)[quote]quote 2[/quote]
"quote" 2) quote 2

I believe that is what you were after Jeff.
# November 1, 2004 12:23 AM

TrackBack said:

# November 22, 2004 11:18 AM

TrackBack said:

# November 22, 2004 11:18 AM

TrackBack said:

# May 16, 2005 8:45 PM

TrackBack said:

# May 16, 2005 8:45 PM
Leave a Comment

(required) 

(required) 

(optional)

(required)