Want faster regular expressions? Maybe you should think about that IgnoreCase option...

Let me start with a disclaimer.  RegexOptions.IgnoreCase is a very powerful option that allows you to match ignore character casing during matches irregardless of the currently running culture.  In other words, this will allow you to take into account other languages if you happen to be running on another system and will match something like the character 'a' to the character 'a' with an diacritical or accent mark depending on the culture.

Now, I'm a US expression user.  Case insensitive to me means either a or A, nothing more and nothing less.  And if I throw in an IgnoreCase that is all that something like a matches.  It doesn't match any of the other accented characters, at least not any of the characters present in the extended ASCII character set.  So what are the various ways I can do case insensitive matches?

    private static Regex ignoreCase = new Regex("a", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    private static Regex explicitCase = new Regex("a|A", RegexOptions.Compiled);
    private static Regex charClass = new Regex("[aA]", RegexOptions.Compiled);

Note, these are the Regular Expression ways to do it.  I can go even further by specifying a regular expression of say “a” without RegexOptions.IgnoreCase, and then make ToLower calls on all my strings.  I find this kind of bogus since it imposes more work on me as the programmer.  I'll be testing the above three cases against each other to see which one more quickly matches or doesn't match against a group of characters, all of which are “a”, but only two of which will actually get matched under a system with US English settings.

What I found is that IgnoreCase is pretty slow.  Something along the lines of 6.5 seconds in my testing.  Why?  Well, because behind the scenes are some method calls to various culture comparers that are doing the work of making sure the comparison is case insensitive.  I'd rather alert the regular expression of this before hand.  So I'll also be using an alternation group and a character class.  These two perform about equally.  Here are the results when running the above regular expressions against an array of different a's, only two of which should match based on the US English culture settings.

C:\Projects\CSharp\Utilities\PoliCheck>SingleCharIgnoreCase.exe
Ignore Case Sanity Check
True
True
False
False
False
False
False
End Ignore Case Sanity Check

Explicit Case Sanity Check
True
True
False
False
False
False
False
End Explicit Case Sanity Check

Character Class Sanity Check
True
True
False
False
False
False
False
End Character Class Sanity Check

Ignore Case, 00:00:06.8598640
Explicit Case, 00:00:05.9886112
Character Class, 00:00:05.8484096

Character classes only edge ahead by a little bit.  I'm not sure why, but it is probably that the character class JITs maybe a switch statement while the alternation group maybe spits an if statement?  Not sure but I'm betting I'll be digging through the rotor source to find out now ;-)

So what happens when you expand an entire word?  Well the similar performance is maintained.  Here are the three new expressions:

    private static Regex ignoreCase = new Regex("hello", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    private static Regex explicitCase = new Regex("(?:h|H)(?:e|E)(?:l|L)(?:l|L)(?:o|O)", RegexOptions.Compiled);
    private static Regex charClass = new Regex("[hH][eE][lL][lL][oO]", RegexOptions.Compiled);

IgnoreCase is by far the simplest regular expression.  The others start to get complicated.  I'm assuming the most optimal alternation group match I can make is what I've shown.  However, there might just be something a little simpler than that, that maybe performs a bit better.  The character class is simple enough, being much easier to read than the alternation group pattern, and you can simplify multiple character runs by specifying match counts [lL]{2} for instance.  The alternation group starts to perform quite close to the character class in this setup.  Here are the results.

C:\Projects\CSharp\Utilities\PoliCheck>WordIgnoreCase.exe
Ignore Case Sanity Check
True
End Ignore Case Sanity Check

Explicit Case Sanity Check
True
End Explicit Case Sanity Check

Character Class Sanity Check
True
End Character Class Sanity Check

Ignore Case, 00:00:01.6423616
Explicit Case, 00:00:01.1316272
Character Class, 00:00:01.1115984

Why the hell would I even bother with this type of performance testing?  Well, the end result is a naughty word matching utility.  Basically, given an input, find the naughty words and flag them somehow.  As the naughty word array gets larger and larger it takes more and more time to match using IgnoreCase so I need a faster option.  Right now it looks like tokenizing the bad words and expanding them into an alternation group or character class is going to be the easiest option.  The next step is going to be tuning the best way to match something like 5000 words against the input text.

Some background here is that I've already discounted putting all of the words into a single expression, because at some point the engine actually pukes with too many characters in the string.  So you have to limit the size of your pattern.  That's fine, so I tuned it down and found 200 words would almost always fit.  I left it at that for a long time and never really thought about whether or not I could get more performance out of the system.  It turns out setting the maximum number of alternations (words) to 15 yields the best results.  Now I have to stick the knowledge from this article with the word count maximum and find out the fastest way to do this check.  It should be fun, and I'll be blogging about it soon, so stick around.

Published Saturday, March 20, 2004 9:36 PM by Justin Rogers

Comments

Sunday, March 21, 2004 4:46 AM by Darren Neimke

# re: Want faster regular expressions? Maybe you should think about that IgnoreCase option...

Justin, just for laughs can you show the metrics for a test where you *do* the ToLower() on the input string. In other words:

private static Regex ignoreCase = new Regex("hello".ToLower(new CultureInfo("en-US", false)), RegexOptions.Compiled);

One of the things that interested me while peeking at Rotor source is that, one of the overloads of AddConcatenate takes a third boolean parameter which stops the ToLower() call from happening when doing a replacement - even if using IgnoreCase. I couldn't really work that one out.
Sunday, March 21, 2004 5:39 AM by Darren Neimke

# re: Want faster regular expressions? Maybe you should think about that IgnoreCase option...

BTW... good luck digging through the source to see what IL is produced; have you ever seen the stuff which is Emit'ted from the RegexCompiler module ? :-)
Sunday, March 21, 2004 5:59 AM by Justin Rogers

# re: Want faster regular expressions? Maybe you should think about that IgnoreCase option...

I know why the one is faster versus the other now... I'm pretty adept at going through Rotor ;-)

A character class uses a specialized binary search algorithm across a specialized string of characters that exists as a result of RegexCharClass. This becomes a single call within the compiled binary into a helper function.

With an alternation construct, a butt-load of instructions are emitted that trace each path in the alternation and backtrack if necessary to start on the next item. These are called groups and use something called the grouping stack.

At least this is what I'm gathering right now. I'll probably keep doing work on the source, however, running simulations seems to be turning out better numerical results, even if those results aren't being firmly rooted.
Sunday, March 21, 2004 6:31 AM by Justin Rogers

# re: Want faster regular expressions? Maybe you should think about that IgnoreCase option...

Here are the results of doing a ToLower before calling the expression:

C:\Projects\CSharp\Utilities\PoliCheck>WordIgnoreCase.exe
Ignore Case By ToLower Sanity Check
True
End Ignore Case By ToLower Sanity Check

Ignore Case Sanity Check
True
End Ignore Case Sanity Check

Explicit Case Sanity Check
True
End Explicit Case Sanity Check

Character Class Sanity Check
True
End Character Class Sanity Check

Ignore Case By ToLower, 00:00:02.8440896
Ignore Case, 00:00:01.7625344
Explicit Case, 00:00:01.2518000
Character Class, 00:00:01.1716848
Friday, December 5, 2008 8:19 PM by Semil

# re: Want faster regular expressions? Maybe you should think about that IgnoreCase option...

<a href= spiritez.com ></a>