Revenge of the greedy RegEx

Tuesday, October 26, 2004

Naturally, when I get my text parsing right for the forums and release it, I realize that my test coverage wasn't as good as I had hoped. I've got a spot of code that looks for pseudo-code quote tags for conversion to an HTML blockquote. The expression looks like this:

(\[quote\])(.*)(\[/quote\])

Works like a champ... sort of. The middle group in the Match object (match.Groups[2]) then gets parsed to look for nested quotes. This works well to.

The problem is that it counts this as one entire match:

[quote]quote 1[/quote][quote]quote 2[/quote]

Where I'm troubled is that I'm not sure which group I should be altering. The middle "meat" of the quote I need to access via the Groups collection so I can parse for nested quotes, and every attempt I've made to alter this breaks the proper nesting.

I hate it that this one area causes me so much grief. I'm not brilliant, but I would think I'd be smart enough to keep this from eluding me! Help is appreciated.

7 Comments

Unfortunately, there's nothing you can do. Regular expressions are not able to match balanced expressions (e.g. things like matching braces, or matching HTML tags). You can do what you've got there, which is make it so that it supports only one set of nested elements, or multiple sets of non-nested elements (by using (.*?) as your "meat" - i.e. a non-greedy search) but without some wacky extension to the regular expression grammer itself, you can't have the best of both worlds.

I think the best thing you can do is structure your forums so that you can simply replace all [quote] text with a single construct, and all [/quote] with a single construct - i.e. don't rely on them being matched together. Then you can perhaps put some custom logic in to make sure there's an equal number of [quote] and [/quote] tags.

For example, if you simply replace all instances of [quote] with <blockquote> and all instances of [/quote] with </blockquote>, then (apart from validating the input) you don't really need to know that it's all matched properly.

Dean Harding - Monday, October 25, 2004 10:41:00 PM

Try:

(\[quote\])(.*?)(\[/quote\])

From msdn:

*? - Specifies the first match that consumes as few repeats as possible (equivalent to lazy *).

Wes - Monday, October 25, 2004 10:44:00 PM

I do that for all of the non-block elements, but it's a lot harder to do it with block elements like this one, and p tags (which come from line breaks).

Jeff - Monday, October 25, 2004 10:45:00 PM

Wes: That breaks the nesting. It grabs the inner-most [/quote] instead of the outside one.

Jeff - Monday, October 25, 2004 10:49:00 PM

I have been looking at this expression from Ch9 of the book Dean mentioned above. With this expression syntax you can match nested items but I think it is still greedy. I'm not sure we can do what you want with .Net regular expressions.

Wes - Monday, October 25, 2004 11:15:00 PM

It should work because it only matches *balanced* expressions. So for example, if you have (assuming we're matching parenthases here):

(The (quick) brown fox) jumped (over the (lazy dog))

You'll get two matches:

(The (quick) brown fox)

and

(over the (lazy dog))

Which is what Jeff wants...

Dean Harding - Thursday, October 28, 2004 1:30:00 AM

Dean you are correct. Matching ()'s is a little different then matching [quote][/quote] but not much harder, just takes a little different angle. After examining more closely I was able to figure out how to do what Jeff wants.

Jeff if you are still interested try:

(?:\[quote\])(?<quote>(?>(?:\[quote\])(?<DEPTH>)|(?:\[/quote\])(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/quote\])

Test string:

[quote]quote 1 [quote]quote 2[/quote] after quote 2[/quote][quote]quote 2[/quote]

Matches

1) [quote]quote 1 [quote]quote 2[/quote] after quote 2[/quote]

"quote" 1) quote 1 [quote]quote 2[/quote] after quote 2

2)[quote]quote 2[/quote]

"quote" 2) quote 2

I believe that is what you were after Jeff.

Wes - Sunday, October 31, 2004 11:23:00 PM

Comments have been disabled for this content.