Regex fun

Saturday, February 19, 2005

Update: Oliver Sturm made a few great suggestions to improve the expression and also fixed a bug with greedy matching. See his comment below, which explains what greedy matching is and how to fix it if you run into the same issue!

Today I was working on a new feature of the LLBLGen Pro code generator engines: user code region preservation. This feature allows template designers to specify areas in the generated code where developers can add their own code, which is then preserved when the code is regenerated. An example is a custom property in a Customer class which returns the full name based on the existing FirstName and LastName properties. Using this technique avoids having to subclass generated classes to add functionality.

The obvious way to do this is to insert a start marker and an end marker around the region which should be preserved. To be able to define regions in different scopes, each region gets a name. When the template parser runs into a region statement in the template, placed there by the template author, it can look up the region by name in the current version of the generated code and copy its contents over to the new version of the generated code.

For the start marker I had __LLBLGENPRO_USER_CODE_REGION_START in mind, and for the end marker __LLBLGENPRO_USER_CODE_REGION_END. Pretty basic. Placed inside comments, these are easy to find and unlikely to collide with existing code, which is always the issue with markers in code. As the output is text (C# or VB.NET code, or a supporting file like a .config file or any other output file the developer had in mind), it should be fairly easy to find the markers and the regions with some string search voodoo, right?

So I opened my parser source code and started working on the region finder code. As the current generated code isn't parsed by this parser, there is no token-to-nonterminal parser logic available for the generated code, and because I was raised on C, I thought "what the heck, just some string search routines will do fine." However, that's easier said than done. As the markers will be placed in C# or VB.NET code, the comment operator is unknown to the parser. Also, the full line on which the marker is placed has to be copied, so the search routine has to scan back to the first CRLF it runs into. When it finds the start marker, it has to scan further for the region name. This got out of hand pretty quickly.

As the parser itself is built with regular expressions, I knew what they could do. Looking at my string searcher code, I realized I had to do something drastic: try to do it with regexes. A feeling inside me said that it might even be possible to do it with a single regex. Well, let's see!

Consider this code snippet from the generated code which has a user code region and which should be preserved. It's from an OrderEntity class, which has an extra property for the customer name (also pay attention to the whitespace):

		// __LLBLGENPRO_USER_CODE_REGION_START customProperties
		/// <summary>
		/// Gets the company name of the related customer entity.
		/// </summary>
		public string CustomerCompanyName
		{
			get
			{
				if(this.Customer==null)
				{
					return string.Empty;
				}
				else
				{
					return this.Customer.CompanyName;
				}
			}
		}
		// __LLBLGENPRO_USER_CODE_REGION_END

How do you find such regions in the code with a single regex? Well, with this one (wrapped over multiple lines for readability):

"^[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_START 
(?<regionName>\w+)\r\n(.*\r\n)*?[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_END"

It covers both the VB.NET and C# comment operators, and uses a named group to capture the region name. It can handle empty regions and empty lines.
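To see the expression at work, here is a minimal sketch that joins the two wrapped lines into one pattern and runs it over a small sample. The class name, the method name and the sample text are made up for illustration, and RegexOptions.Multiline is an assumption on my part: without it the leading ^ only matches at the very start of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UserCodeRegionDemo
{
    // The expression from the post, joined into a single pattern.
    // Multiline (an assumption) lets ^ match at the start of every line.
    public static readonly Regex RegionRegex = new Regex(
        @"^[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_START (?<regionName>\w+)\r\n" +
        @"(.*\r\n)*?[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_END",
        RegexOptions.Multiline | RegexOptions.Compiled);

    // Hypothetical sample: a C# region with a body, followed by an empty VB.NET region.
    public const string Sample =
        "class A\r\n" +
        "{\r\n" +
        "\t// __LLBLGENPRO_USER_CODE_REGION_START customProperties\r\n" +
        "\tpublic int Answer { get { return 42; } }\r\n" +
        "\t// __LLBLGENPRO_USER_CODE_REGION_END\r\n" +
        "}\r\n" +
        "\t' __LLBLGENPRO_USER_CODE_REGION_START vbRegion\r\n" +
        "\t' __LLBLGENPRO_USER_CODE_REGION_END\r\n";

    // Returns the region names in the order in which they appear in the text.
    public static List<string> FindRegionNames(string text)
    {
        var names = new List<string>();
        foreach (Match m in RegionRegex.Matches(text))
        {
            names.Add(m.Groups["regionName"].Value);
        }
        return names;
    }
}
```

Calling UserCodeRegionDemo.FindRegionNames(UserCodeRegionDemo.Sample) should give back both region names as separate matches. That only works because of the lazy (.*\r\n)*? in the middle: it stops at the first end marker it can reach instead of running on to the last one in the file.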

So what does my scanner look like now?

private void FindUserCodeRegions()
{
	// use the compiled regex to find all regions.
	MatchCollection matchesFound = _userCodeRegionRegExp.Matches(_originalFileContents);
	foreach(Match matchFound in matchesFound)
	{
		// a region was found. get the name of the region
		string regionName = matchFound.Groups["regionName"].Value;
		if(_userCodeRegions.ContainsKey(regionName))
		{
			// already there, skip.
			continue;
		}

		_userCodeRegions.Add(regionName, matchFound.Value);
	}		
}

That's it! It finds all regions and stores them by name in a hashtable, prior to the execution of the template.
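What happens with the stored regions afterwards isn't shown here, so the following is only a rough sketch of how the copy-over step could work, not how the LLBLGen Pro engine actually does it. It assumes the freshly generated text already contains the (possibly empty) marker pairs, and it swaps each of them for the preserved region with the same name. RegionMerger, Collect and Merge are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegionMerger
{
    // Same expression as in the post; Multiline is assumed so ^ matches per line.
    private static readonly Regex RegionRegex = new Regex(
        @"^[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_START (?<regionName>\w+)\r\n" +
        @"(.*\r\n)*?[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_END",
        RegexOptions.Multiline);

    // Collects the regions of the previous output; the first occurrence of a name wins.
    public static Dictionary<string, string> Collect(string oldCode)
    {
        var regions = new Dictionary<string, string>();
        foreach (Match m in RegionRegex.Matches(oldCode))
        {
            string name = m.Groups["regionName"].Value;
            if (!regions.ContainsKey(name))
            {
                regions.Add(name, m.Value);
            }
        }
        return regions;
    }

    // Replaces every region in the fresh output with the preserved version, if there is one.
    public static string Merge(string oldCode, string newCode)
    {
        Dictionary<string, string> preserved = Collect(oldCode);
        return RegionRegex.Replace(newCode, m =>
        {
            string saved;
            return preserved.TryGetValue(m.Groups["regionName"].Value, out saved)
                ? saved
                : m.Value; // unknown region: keep what the generator produced
        });
    }
}
```

Because the whole match, marker lines included, is stored and written back, the indentation and the comment operator of the preserved region survive unchanged.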

Moral of the story: if you have to do string searches, be sure to check out regular expressions and the .NET classes for regular expressions in the System.Text.RegularExpressions namespace. It's a little sad that the Group object doesn't have a 'Name' property, as you can give groups names in the expression itself, but that's minor.
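If you do need the group names, the Regex object itself knows them: Regex.GetGroupNames() returns the names of all groups in the expression, unnamed groups included (they are named after their number). A small sketch, with a hypothetical helper class, which pairs every group name with the value captured in a given match:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupNameLister
{
    // Pairs each group name known to the Regex with the value captured in one match.
    public static Dictionary<string, string> NamedValues(Regex regex, Match match)
    {
        var result = new Dictionary<string, string>();
        foreach (string name in regex.GetGroupNames())
        {
            // Groups can be indexed by name; numbered groups use their number as name.
            result[name] = match.Groups[name].Value;
        }
        return result;
    }
}
```

It's a workaround rather than a replacement for a Name property: you go from the Regex to the groups, not from a Group back to its name.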

Oh, and before I forget: the hard part is often writing the expressions themselves. Use one of the various online regex tester sites or The Regulator, or fire up Snippetcompiler and write a few lines to see if your expression does what it should do.

5 Comments

Elegant code Frans, no surprises there though...

Alex James - Saturday, February 19, 2005 8:32:00 PM

I've used "Regex fun" as a blog title before, but I meant it in jest. Not surprisingly, you were serious! :)

Jeff - Saturday, February 19, 2005 9:07:00 PM

Do you plan to use partial classes in your code generator when Whidbey comes out? They would remove the need to have special sections like you have implemented above.

ben - Sunday, February 20, 2005 5:34:00 AM

Some comments on this. Right when I read your regex, I was wondering about greedy matches. I don't know LLBLGen, so I can't say if that's important to you, but a quick test showed me that your regex doesn't work correctly if there's more than one such region in the same file (while with that loop and everything, it certainly looks like you were trying to support that).

The problem in that case is the so-called greedy match, which regular expressions perform by default. This means that the quantifiers * and + always try to match as much text as possible in the context (i.e. they are greedy). In your case, the result is that you get only one match, which stretches from the first start marker to the last end marker.

This can be changed easily by switching the greedy match off for the correct quantifier, that's the one that matches all the "content" lines in your regex. So instead of (.*\r\n)* you should use (.*\r\n)*? , the greedy matching being switched off by the trailing ?.

Two other things I'd change:

1) The comment operators could be better matched using ('+|/{2,}) instead of the ['/]+

2) For compatibility (with Mono, for instance), you shouldn't assume \r\n to be the line terminator, instead use Environment.NewLine. So you could construct your complete regex like this:

// {{ and }} are escaped braces, so String.Format emits the regex quantifier {2,}
string regex = String.Format(CultureInfo.InvariantCulture,
	@"[ \t]*('+|/{{2,}}) __LLBLGENPRO_USER_CODE_REGION_START (?<regionName>\w+){0}(.*{0})*?[ \t]*('+|/{{2,}}) __LLBLGENPRO_USER_CODE_REGION_END",
	Environment.NewLine);

Have fun!

Oliver Sturm - Sunday, February 20, 2005 12:39:00 PM

Oliver: thanks a million for that fix and suggestion! I only tried it on a test case with 1 region; indeed it needs less greedy matches, thanks for that! The newline issue is not that important, but as it is easily changed I'll include that too. Thanks! :)

Ben: I'll still keep this, as partial classes solve the problem of adding new methods/properties but don't solve the 'add code to existing method' problem. Take for example an entity initialization routine which adds a custom validator object or a custom concurrency producer object to the entity: you can easily do that if there is a region for user code in the initialization method :) With partial classes I can't add that code.

Jeff: "Not surprisingly, you were serious! " haha :) I felt such a complete nerd after reading that ;)

Frans Bouma - Sunday, February 20, 2005 1:13:00 PM
