Practical Parsing Using Groups in Regular Expressions - ISerializable - Roy Osherove's Blog

Practical Parsing Using Groups in Regular Expressions

Note: This article is part 2 in a series of articles. Here are the rest:

What we’ll cover:

 

What you’ll need:

  • If you don’t know what regular expressions are, then read this article.
  • eXpresso (if you are not familiar with eXpresso, see this article)
  • .NET Framework SDK (Visual Studio .NET preferred)

 

What are regular expression groups and why do I need them?

In order to explain that, let’s take a look at a simple example.

Open eXpresso. Clear both the data pane, and the expression pane.

Next, put the following string in the data pane:

 

My birthday is 17/05/1975. Thank you.

 

From this string, we would like to find and extract any dates that appear in it.

Try coming up with an expression that matches this by yourself.

If you want it the easy way, here’s an expression that’ll work:

\d{2}/\d{2}/\d{4}

 

This expression expects 2 digits, then a slash, then 2 more digits followed by a slash followed by 4 digits.
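If you'd rather see this in code than in eXpresso, here is a minimal sketch of the same match. The class and method names (SimpleDateMatch, TryFindDate) are made up for this illustration; only Regex.Match, Match.Success and Match.Value come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleDateMatch
{
    // Hypothetical helper: looks for the first dd/dd/dddd-shaped substring.
    // Returns true when one is found; 'date' then holds the matched text.
    public static bool TryFindDate(string input, out string date)
    {
        // 2 digits, a slash, 2 digits, a slash, 4 digits
        Match m = Regex.Match(input, @"\d{2}/\d{2}/\d{4}");

        // A failed match still returns a Match object; its Value is an empty string.
        date = m.Value;
        return m.Success;
    }
}
```

Calling TryFindDate with the sample sentence from the data pane should hand back just the date portion of the text.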

Granted, you could make this expression a bit more flexible and efficient but let’s keep it simple for the purpose of the article.

 

Now, when you press the “Find Matches” button, you’ll get the string that matches the specified date. Good.

However, in the real world, we would use this date inside our code. Let’s say we have a function that receives this date and wants to get the day, month and year as separate values in order to do various tasks with them.

One solution would obviously be to use the DateTime Class found in the framework to parse this string. However, let’s try doing it another way.
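For reference, the DateTime route could look something like the sketch below. It assumes the input is already the isolated date text in day/month/year order; the class and method names are invented for the example.

```csharp
using System;
using System.Globalization;

public static class DateTimeParsing
{
    // Hypothetical helper: parses an isolated "dd/MM/yyyy" string.
    // ParseExact throws a FormatException when the text does not fit the format.
    public static DateTime ParseDayMonthYear(string text)
    {
        // InvariantCulture keeps "/" as the literal date separator.
        return DateTime.ParseExact(text, "dd/MM/yyyy", CultureInfo.InvariantCulture);
    }
}
```

Note that this only handles a string that contains nothing but the date, so it does not help with pulling the date out of a longer sentence.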

 

Assuming we don’t have the ability to use regular expressions, we would have to use standard text parsing functionality in order to determine the location of the first and second slashes, and then retrieve the strings that exist between them.
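To make the point concrete, here is roughly what that manual approach might look like. This is only a sketch: it assumes the date has already been isolated from the surrounding text, and the names (ManualDateSplit, SplitDate) are hypothetical.

```csharp
using System;

public static class ManualDateSplit
{
    // Hypothetical helper: splits an isolated "dd/MM/yyyy" string into
    // { day, month, year } by locating the two slashes by hand.
    public static string[] SplitDate(string date)
    {
        int first = date.IndexOf('/');
        int second = date.IndexOf('/', first + 1);
        if (first < 0 || second < 0)
        {
            throw new FormatException("Expected two '/' separators.");
        }

        string day = date.Substring(0, first);
        string month = date.Substring(first + 1, second - first - 1);
        string year = date.Substring(second + 1);
        return new[] { day, month, year };
    }
}
```

Even in this simple form there is index arithmetic to get wrong, and finding the date inside a longer sentence would take more code still.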

This mundane task can be avoided easily. The solution is simple, and very powerful.

The Regex object model allows us to define named groups within a regular expression.

These groups are exposed using the “Match.Groups” property.

We can get each group by name, and get its value using the Group.Value property.

 

Let’s take a look at a modified expression, which divides the parsed date into sub-groups:

To add a group to an expression, simply enclose the part of the expression you would like to capture in parentheses.

For example, in order to capture the day section of the date, we would do this:

(\d{2})/\d{2}/\d{4}

 

Let’s take a look at the expression after we have divided all 3 desired sub-groups:

(\d{2})/(\d{2})/(\d{4})

 

Paste the last expression into eXpresso and press “Find Matches”.

You’ll see that you get the same output as before, but something is different: there’s a little “+” mark on the left!

Click on the “+” to expand the result. You’ll see that there are 3 sub groups below this global result.

These sub groups each represent a “Group” object, which is part of the received “Match” Object.

This tree view represents perfectly the hierarchical relationship of the object model.

 

Notice, though, that the groups are not named yet; by default they are indexed by numbers starting from 1 (index 0 holds the whole match).

This means that you can call each group in code by specifying its index. Let’s name those groups to make them easier to call in code.
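Before naming anything, here is a small sketch of what index-based access might look like in code. The class and method names are invented for the example; Match.Groups and its integer indexer are the framework's.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberedGroups
{
    // Hypothetical helper: returns { whole match, day, month, year }
    // or an empty array when no date is found.
    public static string[] GetParts(string input)
    {
        Match m = Regex.Match(input, @"(\d{2})/(\d{2})/(\d{4})");
        if (!m.Success)
        {
            return new string[0];
        }

        return new[]
        {
            m.Groups[0].Value, // index 0 is always the whole match
            m.Groups[1].Value, // first pair of parentheses: day
            m.Groups[2].Value, // second pair: month
            m.Groups[3].Value  // third pair: year
        };
    }
}
```

This works, but the numbers say nothing about what each group means, and they shift if you add another pair of parentheses earlier in the expression.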

The syntax to name a group is simply to add the following after the opening parenthesis of the sub group:

?<GroupName>

 

Note: The name you provide is Case-sensitive!

Let’s see how the final expression looks after naming the groups:

(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})

 

Pretty easy, right?

Paste this expression into eXpresso and click “Find Matches”.

You’ll get an expandable result again, but this time you’ll have names instead of index numbers.

Now you’ll be able to call each group’s value by just using the group’s name.
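The case-sensitivity note is easy to trip over, so here is a small sketch of what happens when the name's casing is wrong. The helper names are hypothetical; the behavior relied on is that the Groups indexer returns an unsuccessful Group (with an empty Value) for a name that is not defined in the pattern, rather than throwing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupLookup
{
    private const string Pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

    // Hypothetical helper: returns the value of the named group,
    // or an empty string when the name is unknown or nothing matched.
    public static string GetGroup(string input, string groupName)
    {
        Match m = Regex.Match(input, Pattern);
        Group g = m.Groups[groupName];
        return g.Success ? g.Value : "";
    }
}
```

Asking for "Day" on the sample sentence gives the day digits, while asking for "day" quietly gives an empty string, so a casing mistake fails silently instead of raising an error.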

 

Simple Code Demo

    // Requires: using System.Text.RegularExpressions; and using System.Windows.Forms;
    // This function receives a string containing a date. It parses the date
    // and prints the day, month and year of that date.
    private void ParseDate(string date)
    {
        // This is the pattern we'll use to match the date
        // and then divide it into sub-groups.
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

        // Retrieve the parsed Match object using the Regex class.
        Match DateMatch = Regex.Match(date, pattern);

        // Make sure there's actually a date in the string.
        // We get a Match object either way, so we have to test its Success property.
        if (!DateMatch.Success)
        {
            MessageBox.Show("Could not find a date inside the string");
            return;
        }

        // Print the value of the whole match.
        listBox1.Items.Add("The Whole Date Value Is: " + DateMatch.Value);

        // Get each sub-group by name and print its value.
        // Notice that each group is a sub-member of the match we received.
        // Notice that the names are case-sensitive!
        listBox1.Items.Add("Day : " + DateMatch.Groups["Day"].Value);
        listBox1.Items.Add("Month : " + DateMatch.Groups["Month"].Value);
        listBox1.Items.Add("Year: " + DateMatch.Groups["Year"].Value);
    }

 

 

As you can see, the functionality is pretty straightforward.

In order to get a sub group of the match result, we simply call “Match.Groups[GroupName].Value” to get its value.

 

Using Multiple Matches from a given string

To let you get to all the matches found in a string, the Match object has a “NextMatch()” function, which returns a new Match object for the next match.

You’ll need to test it for success again. All you need to do is keep going until the Match.Success value is false.

 

The Hard Way

Here’s the same method from before, implemented to go through all the matches:

 

    // Requires: using System.Text.RegularExpressions; and using System.Windows.Forms;
    // This function receives a string containing dates. It parses every date
    // inside it and prints the day, month and year of each one.
    private void ParseDate(string date)
    {
        // The pattern we'll use to match the date
        // and divide it into sub-groups.
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

        // Retrieve the first parsed Match object using the Regex class.
        Match DateMatch = Regex.Match(date, pattern);

        // Make sure there's actually a date in the string.
        // We get a Match object either way, so we have to test its Success property.
        if (!DateMatch.Success)
        {
            MessageBox.Show("Could not find a date inside the string");
            return;
        }

        // Iterate through all the matches and print them.
        while (DateMatch.Success)
        {
            // Print the value of the whole match.
            listBox1.Items.Add("The Whole Date Value Is: " + DateMatch.Value);

            // Get each sub-group by name and print its value.
            // Notice that each group is a sub-member of the match we received.
            // Notice that the names are case-sensitive!
            listBox1.Items.Add("Day : " + DateMatch.Groups["Day"].Value);
            listBox1.Items.Add("Month : " + DateMatch.Groups["Month"].Value);
            listBox1.Items.Add("Year: " + DateMatch.Groups["Year"].Value);

            // Move on to the next match; Success is false once there are no more.
            DateMatch = DateMatch.NextMatch();
        }
    }

 

Handling multiple matches – the simple way

The last example was a bit cumbersome. There’s another way to go through all the matches:

Simply use the Regex.Matches() function instead of the Regex.Match() function.

Then iterate over each match. (You don’t even have to check for success, since you’ll only receive successful matches.)

 

    // Requires: using System.Text.RegularExpressions;
    // This function receives a string containing dates. It parses every date
    // inside it and prints the day, month and year of each one.
    private void ParseDate(string date)
    {
        // The pattern we'll use to match the date
        // and divide it into sub-groups.
        string pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

        // Retrieve all the matches at once using the Regex class.
        MatchCollection DateMatches = Regex.Matches(date, pattern);

        // Notice there is no need to check for success here.
        // We only get successful matches from this function.

        // Iterate through all the matches and print them.
        foreach (Match DateMatch in DateMatches)
        {
            // Print the value of the current match.
            listBox1.Items.Add("The Whole Date Value Is: " + DateMatch.Value);

            // Get each sub-group by name and print its value.
            // Notice that each group is a sub-member of the match we received.
            // Notice that the names are case-sensitive!
            listBox1.Items.Add("Day : " + DateMatch.Groups["Day"].Value);
            listBox1.Items.Add("Month : " + DateMatch.Groups["Month"].Value);
            listBox1.Items.Add("Year: " + DateMatch.Groups["Year"].Value);
        }
    }
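If you want to try this outside of a Windows Forms project, here is a sketch of the same idea that returns the values instead of writing to a list box. It is one possible shape under my own assumptions, not the only way to do it: the class name, the method name and the choice of a tuple list (which needs a compiler far newer than the one this article was written for) are all invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateExtractor
{
    private const string Pattern = @"(?<Day>\d{2})/(?<Month>\d{2})/(?<Year>\d{4})";

    // Hypothetical helper: collects the Day, Month and Year group values
    // of every date found in the text, in the order they appear.
    public static List<(string Day, string Month, string Year)> ExtractAll(string text)
    {
        var results = new List<(string Day, string Month, string Year)>();

        // Regex.Matches only yields successful matches, so no Success check is needed.
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            results.Add((m.Groups["Day"].Value,
                         m.Groups["Month"].Value,
                         m.Groups["Year"].Value));
        }

        return results;
    }
}
```

The values are kept as strings on purpose; converting them to numbers or validating that they form a real calendar date is a separate step this sketch leaves out.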

 

Conclusion

Using groups with regular expressions is a powerful tool to add to your parsing arsenal. Regex has many other abilities, but this one is probably the most important, since most of its more advanced features rely on it.

In (perhaps) future articles, I will explain more in-depth possibilities of regular expression in the .Net Framework.

Published Tuesday, May 13, 2003 2:40 AM by RoyOsherove

Comments

Monday, May 12, 2003 7:40 PM by TrackBack

# ISerializable

ISerializable
Wednesday, May 14, 2003 5:28 AM by Anonymous

# re: Practical Parsing Using Groups in Regular Expressions

This won't work.

First, anyone viewing this on the web can't see your named groups, because they're enclosed in less-than and greater-than signs. All people see is the question-mark if they're looking at it in a browser. You need to use the ampersand-representations, &lt; and &gt; .

Secondly, you say the group name is case-sensitive, but in your expression you use "DAY" and in your group index you use "Day". If they are indeed case-sensitive, this example won't work.
Wednesday, May 14, 2003 8:32 AM by Royo

# re: Practical Parsing Using Groups in Regular Expressions

Thanks for the comment! I fixed it (Doh,How did I not notice??)
Hopefully it makes more sense now :)
Sorry if this confused anyone...
Wednesday, May 28, 2003 6:34 PM by Shawn A. Van Ness

# How to enumerate the named capturing groups in a regex?

Hi Roy,

First off, if you were using the regex-based C# colorizer that Wes H and I have been working on, that anglebracket bug was my fault. Sorry! ;)

But that's not why I'm writing... or actually yes, it is. I was trying to add a feature to that same product, one which would ultimately require "reflecting" on a user-entered regex to get a list of named capturing groups.

Short of scanning a regex pattern with a regex (ugh!) I can't seem to find a way to do that.

IOW, it looks like although GroupCollection allows us to index by string, it does not allow us to enumerate by string.

You may have more experience than me, in this area -- am I missing something?

Wednesday, May 28, 2003 6:59 PM by Royo

# re: Practical Parsing Using Groups in Regular Expressions

Hey Shawn.
Actually, I copy-pasted from VS.NET into the editor.
Second, as for your question, it seems a pretty complicated case, and off the top of my head, I can't think of a better way than parsing Regex patterns with Regex patterns :)

However, if I come up with something, I'll let you know about it :)
I suggest, if you haven't yet, trying to post on the Regex mailing list from ASPAlliance.com
The folks there might help...
They have more experience than me on this subject.
Thursday, March 11, 2004 2:53 PM by Humberto Oliveira

# re: Practical Parsing Using Groups in Regular Expressions

Hi,

I really enjoyed your articles. They were very elucidative to me. I am facing a situation similar to your article at MSDN (Turn Your Log Files into Searchable Data Using Regex and the XML Classes). Your solution seems to suit my needs very well, except for the fact that the fields in my log files are not separated by tabs; they have fixed lengths and space characters between them. Here is a sample:

FIRST PRES. PURCHASE ORIG 61 8 1,126.89 DR 1,126.89 DR 840-USD 21.25 CR 840-USD

I have the schema for the file and I know the exact location of the fields in the line (starting column and length). Can I use a regular expression to separate all the fields based on their positions? If not, do you suggest a different approach?

Thanks,

Humberto Oliveira
Tuesday, March 16, 2004 4:47 PM by mahmood khalid

# re: Practical Parsing Using Groups in Regular Expressions

i need my match.
Tuesday, June 15, 2004 7:36 PM by codevigilante

# re: Practical Parsing Using Groups in Regular Expressions

A quick question, how would one format the regular expression if the months were stored as Jan, Feb, Mar, etc. ? I have been trying to learn regular expressions and cannot figure it out.
Saturday, July 10, 2004 10:42 AM by Robert

# re: Practical Parsing Using Groups in Regular Expressions

Hello!
You use the format dd/MM/yyyy for date.
why don't use yyyy-MM-dd ?
i've seen many and many examples that use dd/MM/yyyy format.
But i think that format make confusion.
It's my personal opinion and problem?
Monday, July 12, 2004 7:49 PM by TrackBack

# Nice articles about groups in regex

Monday, January 10, 2005 5:04 PM by TrackBack

# Regular Expressions

Thursday, May 25, 2006 11:36 AM by Writing abs() function using Regular Expression

# re: Practical Parsing Using Groups in Regular Expressions

HI,
Is it possible to write abs() function in regular expression? If so pls explain. Thanks in advance.
Friday, July 14, 2006 1:22 PM by AL

# re: Practical Parsing Using Groups in Regular Expressions

I was wondering how to use regular expressions to parse columns of a line?