Regex: MatchCollection uses delay matching for performance

There are several ways to get all of the matches of an expression over the source input, but the most popular probably has to be the Matches method which returns a MatchCollection. What most people don't know, and I didn't until recently, is that the actual call to Matches doesn't do any work. It returns a shell that only does work when you actually request the results.

Count
This property is responsible for why most people probably can't tell the difference. Once you get a MatchCollection the first thing you do is check the Count property. All this does is force the MatchCollection to run the expression and fill the internal structures. When you do this, all of the delay loading potential is removed so the underlying collection knows how many matches are present.

This definitely posed some problems for the Asynchronous Regular Expressions using the ThreadPool and a cancellation model. We rewrote the code there to just return a MatchCollection but for some reason it was completing extremely fast all the time and we couldn't figure out why in the hell that was happening. Well, we went ahead and made a call to Count within the runner thread and now everything works as it should.

this[int i]
The default indexed property for getting access to Match classes is delay loaded, assuming you don't call Count first. You can pass any index you want to this bad boy, and it'll load all of the matches in between (if they aren't already loaded), and then service the match index you just passed in. Here is where I consider the bad code to start in. If the match doesn't exist, they internally return a null, so if you ask for match 20 and there are only 8, null will get returned internally but they'll throw a ArgumentOutOfRangeException externally.

// Normally this would be possible
Match m = null;
for(int i = 0; (m = matches[i]) != null; i++) {
}

// You could optimize for a specific count
Match m = null;
for(int i = 0; i < 5 && (m = matches[i]) != null; i++) {
}

Unfortunately, the above two constructs aren't possible because of the exception, however, they would be the ideal way to process matches without loading all of the work into a single call.

Enumerator
They do have an enumerator that I guess you really have to use if you want true delay loading syntax without that lame exception. This works out well because it balances the matching work between working with the match and we are no longer front-loaded. This allows us to make a number of syntaxes available, including processing pattern matches for a specific match, or processing only a specific number of matches.

// Process until we find the match we want
foreach(Match m in matches) {
    if ( m.Groups[4].Value == "choice X" ) { /* Do Work */ break; }
}

// Process the first n matches
int count = 0;
foreach(Match m in matches) {
    // Do work
    if ( ++count == 5 ) { return; }
}

We don't stop there. You can also enable RightToLeft matching in an attempt to get the last n matches instead of the first n. You start to see where you get some interesting options for processing a string for multiple matches while still have great control over your performance. Considering various patterns can hang the regular expression engine knowing how and where to use multi-threaded abortable operations can really save you from hanging your application.

Published Tuesday, September 07, 2004 2:27 AM by Justin Rogers

Comments

Monday, August 13, 2007 9:50 AM by contactkx

# re: Regex: MatchCollection uses delay matching for performance

Dim matches As MatchCollection

    =Regex.Matches(html)

i= matches.Count // This line takes a very long time to execute and Consuming 100% of CPU Usage time

Can you please guide me for the same

tkumari@adccindia.com

Leave a Comment

(required) 
(required) 
(optional)
(required)