Darren's "Killer Reason for LookAround" posting has been haunting me
Darren's Post: http://blogs.regexadvice.com/dneimke/archive/2004/08/03/1432.aspx
Every time he comes up with a posting like the above I get an instant message of some sort or another with a quizzical question, “Hehe, you want to be unproductive today!” or something similarly intriguing. That is how I was introduced to the killer reason for a lookaround challenge. I have to say, I can't find one. At this point it is apparent that look-around expressions might just be a shortcut for a more verbose expression. I've managed to convert nearly all of my library and many other's libraries of look-around expressions to something either simpler/more complex that was not a look-around expression, but instead a normal expression not containing assertions.
Based on these concepts we kind of changed the rules a bit. The new rules stated that we needed to find a place where assertions would provide better performance. And so I started to extrapolate some conditions that would help increase our chances of finding a great case for assertions.
- Create a simple bounding match with a start marker and an end marker where marker stands for some pattern or literal.
- Between the start and end markers create a relatively complex matching pattern that in many cases would take a long time to execute.
- Now assure that the most probably input scenario will match the start marker, and most or all of the complex expression, but not the end marker.
Hopefully you see where I'm going here. Rather than having a huge inline expression to match the conditional rules in the middle in a scenario where the end tag is more likely to cause the pattern to fail, instead apply the complex conditional rules only after matching the start and end tags. We still can't think of anything really stellar, but at least now we have some patterns that do provide better performance using assertions over a normal expression.
I also have another pattern that I don't think can really be done without assertions. However, they can't be done because an assertion is the only way to introduce the (NOT) operator in regular expressions. Well, not the only way, because you can have negative character classes that have a kind of (NOT) operator but they don't work at the same level. They work on characters and can't work on patterns of characters. Here are the two input strings the embody my pattern:
[foo][bar][/bar][/foo]
[foo][bar][/foo][/bar]
Basically, I want to look for start and end groups that cross one another. This pattern could be used to find errors in some proprietary sound mixing format where each sound sample should start and end, but never overlap in time sequence with another. Maybe this is a valdiator tool that gets run at the end of the entire process but allowing the mixing of sounds early on is something that has to be accounted for. Anyway, I want to be able to match one of the patterns, the second one, and ensure that I'm not matching an extended case of the first.
How can I say match [foo][bar][/foo], but ensure that [/bar] doesn't occur preceeding [/foo]? Well that is an assertion. A look-behind assertion to be precise. Then how do I say match [/bar] following [/foo]. Well that is easy, no assertions are needed there, just a normal match literal. Note: This is a simplification of the pattern. There may be any number of whitespace or other characters between some of the tags which is why the look-behinds are necessary.
"\\[(\\w+?)\\]\\[(\\w+?)\\](?<!\\[/\\2\\])\\[/\\1\\]\\[/\\2\\]"
Now, the astute user might point out with the two input strings, becuase we have the following [/\2] pattern the first string won't match. Remember we might be running this over a string of the form [foo][bar][/bar][/foo][/bar] for all we know. In other words we want the pattern to be more resilient to failure. Even with this type of pattern, there is always something else you can use in place of the assertions. Conditional matches for instance could be used. If I match the second end tag before the first end tag then I could conditionally choose to search for the beginning of the string as the next character versus the second end tag. This basically forces a failure.
"\\[(\\w+?)\\]\\[(\\w+?)\\](\\[/\\2\\])?\\[/\\1\\](?(3)^|\\[/\\2\\])"
If you think you might have some patterns that are impossible without assertions then toss them into my comments. I'd be glad to play around and see about converting them to not use assertions. So far everything I've come up with is bust. Some other popular regex bloggers are coming to the same conclusions as well. Wayne King had an <img> matcher that relied on look-behind but then realized a simple expression would do. Darren has been wracking his brains for days as have I.