Regex: What everyone should know about named and numbered groups.

There is a cool new tool over at http://www.regexlib.com/RETester.aspx. It is already more functional than the previous tool there, which was getting quite a bit of traffic and attention for quickly testing expressions. The new tool really focuses on debug and diagnostic information that can help you tune an expression. More will certainly be added, but just in the past day a more detailed group capture view went in, and it presented a few problems.

Explicitly Numbered Groups
This one isn't strange at all; it is only because explicitly numbered groups are an extension of the .NET expression engine that they might be confusing. Numbered groups allow you to change the index of a given capture. If you are working with the GroupCollection, you might write the following code in order to enumerate all of the groups. Given the expression "(?<10>111)" and the input "111", it will print Group[0], which is always present, and then it will try to print Group[1]. That lookup succeeds but prints nothing, and the loop ends without ever reaching group 10.

for(int i = 0; i < m.Groups.Count; i++) { Console.WriteLine(m.Groups[i]); }

What you need to do is get the list of group numbers from the Regex.GetGroupNumbers method. It returns an array of the group numbers that are actually defined, which you can use instead of simply dropping blind into the group collection. Another common mistake is a foreach enumeration of the group collection. If you do this, you will get all captures, but a Group on its own does not tell you its number (newer framework versions do expose a Name property on Group, which this post predates).

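// BAD: a Group on its own does not tell you which number it belongs to
foreach(Group g in m.Groups) { }
// GOOD: walk only the group numbers that are actually defined
int[] groups = regex.GetGroupNumbers();
for(int i = 0; i < groups.Length; i++) {
    Group g = m.Groups[groups[i]];
}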
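
To see the difference end to end, here is a minimal sketch that runs both loops against the "(?<10>111)" expression and "111" input from above. The class name and the output formatting are placeholders of mine, not anything from the tester.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SparseGroupsDemo
{
    public static void Main()
    {
        Regex regex = new Regex("(?<10>111)");
        Match m = regex.Match("111");

        // Count is 2 (groups 0 and 10), so a 0..Count-1 loop asks for
        // Groups[1]. That number is not defined, and the lookup returns
        // an empty, unsuccessful group instead of throwing.
        for (int i = 0; i < m.Groups.Count; i++)
        {
            Console.WriteLine("by position {0}: '{1}' (Success={2})",
                i, m.Groups[i].Value, m.Groups[i].Success);
        }

        // GetGroupNumbers returns the numbers that are actually defined,
        // here 0 and 10, so every lookup lands on a real group.
        foreach (int number in regex.GetGroupNumbers())
        {
            Console.WriteLine("by number {0}: '{1}'",
                number, m.Groups[number].Value);
        }
    }
}
```

The first loop should report an empty value for position 1 and never mention group 10; the second loop should report "111" for both 0 and 10.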

Named and Numbered Reordering
All named captures are also given a number, but not one in line with their position within the expression. They are assigned numbers only after all of the unnamed and explicitly numbered groups have been allocated. That means the expression "(?<name>99)(9)" returns the 9 as group 1 and the 99 as group 2. If you name a group, be careful when you mix and match with unnamed groups, because the indexes are not going to be what you might initially think.
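
A short sketch of that reordering, using the same "(?<name>99)(9)" expression. The input "999" and the class name are my own choices for illustration. Rather than counting parentheses by hand, it asks the Regex for the mapping through GroupNumberFromName and GroupNameFromNumber.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupOrderDemo
{
    public static void Main()
    {
        Regex regex = new Regex("(?<name>99)(9)");
        Match m = regex.Match("999");

        // The unnamed (9) is numbered first, so it becomes group 1.
        Console.WriteLine(m.Groups[1].Value);                 // 9
        // The named group is numbered afterwards and becomes group 2.
        Console.WriteLine(m.Groups[2].Value);                 // 99
        Console.WriteLine(m.Groups["name"].Value);            // 99

        // Let the Regex report the name/number mapping.
        Console.WriteLine(regex.GroupNumberFromName("name")); // 2
        Console.WriteLine(regex.GroupNameFromNumber(1));      // 1
    }
}
```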

Numbering Bug
Explicitly numbered groups may get overwritten by groups whose indexes are automatically assigned. For instance, if you start an expression by numbering a group, say "(?<10>99)", and then go on to add more than 9 anonymous groups, you'll find that group 10 gets overwritten, or rather shared. The Value of group 0 is still the full match, but the Value of group 10 will be that of whichever of the colliding groups captured last. This is better examined than talked about.

private static Regex regex2 = new Regex("(?<10>99)(9)(9)(9)(9)(9)(9)(9)(9)(9)(9a)(9)(9)(9)(9)(9)(9)(?<20>99)");
private static string data1 = "999999999999a9999999999999999";

For the results: how many groups print the value "99"? How many groups print the value "9a"? Further, count the number of 9's in group 0 and the number of 9's in the rest of the groups. Do they even match? Have fun with this and be careful. Regular expressions are supposed to make your life easier, but there are some gotchas.
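
One way to explore those questions is to dump every defined group together with its Captures collection. The sketch below does that, first for a deliberately tiny expression of my own, "(?<2>a)(b)(c)", where the second unnamed group collides with the explicit number 2, and then for the expression and input above. The Dump helper and class name are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SharedNumberDemo
{
    // Prints every defined group along with all of its captures.
    public static void Dump(Regex regex, string input)
    {
        Match m = regex.Match(input);
        if (!m.Success)
        {
            Console.WriteLine("no match");
            return;
        }
        foreach (int number in regex.GetGroupNumbers())
        {
            Group g = m.Groups[number];
            Console.Write("group {0} = '{1}', captures:", number, g.Value);
            foreach (Capture c in g.Captures)
            {
                Console.Write(" '{0}'@{1}", c.Value, c.Index);
            }
            Console.WriteLine();
        }
    }

    public static void Main()
    {
        // Small case: (c) is auto-numbered 2 and shares that number
        // with the explicitly numbered (?<2>a).
        Dump(new Regex("(?<2>a)(b)(c)"), "abc");

        // The expression and input from the post.
        Dump(new Regex("(?<10>99)(9)(9)(9)(9)(9)(9)(9)(9)(9)(9a)(9)(9)(9)(9)(9)(9)(?<20>99)"),
             "999999999999a9999999999999999");
    }
}
```

In the small case, group 2 should report the value 'c' while still listing both 'a' and 'c' among its captures, which is the same sharing that affects group 10 in the larger expression.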

Published Monday, September 06, 2004 4:31 PM by Justin Rogers
