Making the Regex engine choke... Ever wanted to? I'll show you how...

Okay, so back in the pre-V1 days I made a humongous regular expression alternation group and wanted to precompile that guy for maximum speed.  What exactly is a humongous regular expression?  Well, imagine 12000 or so words, each averaging 10 characters.  I built these into the alternation group using code similar to the following, where text is an array of strings with 12000 elements.

Regex regex = new Regex("(?:(" + string.Join("|", text) + "))+", RegexOptions.Compiled | RegexOptions.IgnoreCase);

It doesn't take anything nearly this complex, though, to get the engine to barf.  My sample program uses 10000 strings of length 8 and compiles the regex with RegexOptions.Compiled.  If you are faint of heart, don't have a fast machine, or don't care to have your machine hung for 15 minutes, then don't run this program.  If you'd like to see a big exception getting thrown and the reason why I had to change my algorithms for finding bad words, then just run this stuff.

using System;
using System.Text.RegularExpressions;

public class TooLargeRegex {
    private static void Main(string[] args) {
        Random rand = new Random();
        string[] text = new string[10000];
        for(int i = 0; i < text.Length; i++) {
            text[i] =
                String.Concat(
                    ((char) rand.Next(97, 123)),
                    ((char) rand.Next(97, 123)),
                    ((char) rand.Next(97, 123)),
                    ((char) rand.Next(97, 123)),
                    ((char) rand.Next(97, 123)),
                    ((char) rand.Next(97, 123)),
                    ((char) rand.Next(97, 123)),
                    ((char) rand.Next(97, 123))
                );
        }
       
        Regex regex = new Regex("(?:(" + string.Join("|", text) + "))+", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        Console.WriteLine(regex.IsMatch("foobarze"));
    }
}

Well, that'll explode on you with an InvalidProgramException.  This supposedly happens when invalid IL is detected.  Supposedly it means the compiler that generated the IL messed up.  Does this mean the .NET Regular Expression compiler is messed up?  Well, maybe.  We can take this one step further and compile the assembly out to disk and look at it using ILDasm.  Perhaps we can find out where the invalid stuff is coming from.  The following code should write out the compiled assembly so we can inspect the actual IL (note the assembly is about 3 megs):

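// Needs "using System.Reflection;" for AssemblyName; text is the same string array as above.
RegexCompilationInfo rgi = new RegexCompilationInfo("(?:(" + string.Join("|", text) + "))+", RegexOptions.IgnoreCase, "myRegex", "fooBarBaz", true);
AssemblyName asmn = new AssemblyName();
asmn.Name = "fooBarBaz";
// Writes fooBarBaz.dll to the current directory.
Regex.CompileToAssembly(new RegexCompilationInfo[] { rgi }, asmn);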

Once you are done and you end up with fooBarBaz.dll, whatever you do, don't open it in ILDasm.  Or if you do, don't try to view the IL for the Go method on myRegexRunner0.  This will take an inordinate amount of time.  I guess 3 megabytes worth of op-codes is an exceptional amount of stuff to load.  However, the /out:fooBarBaz.il option will get you done in no time.  So do that instead, and use a text editor that doesn't blink when you throw a 60 megabyte text file at it.

Nothing appears to be wrong with the IL at all, so maybe that is a complete bust, but something is still going wrong when you run this puppy and get your exception.  Perhaps there are too many instructions?  Time to round-trip this dern assembly using ILAsm.  Again, don't do this if you are faint of heart, because it will take quite some time.  I spent my time playing some Champions of Norrath and eating some cheese sticks.

Okay, I'm posting, because ILAsm never actually finished compiling the code after two hours.  I'll check back again later and maybe post something more in comments.  I'm pretty sure either something is wrong in the CLR in terms of JIT'ing the extra large method or somehow the regular expression classes are emitting invalid IL as the InvalidProgramException would indicate.  Anyway, have fun!
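If you'd rather not hang your machine on the first try, here is a rough sketch of a probe that builds the same pattern shape at a few smaller sizes and reports whether an InvalidProgramException shows up.  The class name RegexSizeProbe, the helper names, and the word counts are all made up for illustration; whether and at what size the exception appears will likely depend on your runtime version, so treat this as an experiment and not a guaranteed repro.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical probe class; not part of the original sample program.
public static class RegexSizeProbe {
    // Builds the same pattern shape used in the post: (?:(w1|w2|...))+
    public static string BuildPattern(string[] words) {
        return "(?:(" + string.Join("|", words) + "))+";
    }

    // Generates random lowercase words, like the sample program does.
    public static string[] RandomWords(Random rand, int count, int length) {
        string[] words = new string[count];
        char[] buffer = new char[length];
        for (int i = 0; i < count; i++) {
            for (int j = 0; j < length; j++) {
                buffer[j] = (char) rand.Next(97, 123); // 'a' through 'z'
            }
            words[i] = new string(buffer);
        }
        return words;
    }

    // Returns true if the compiled regex could be built and used,
    // false if the runtime threw InvalidProgramException along the way.
    public static bool TryCompile(string[] words) {
        try {
            Regex regex = new Regex(BuildPattern(words), RegexOptions.Compiled | RegexOptions.IgnoreCase);
            regex.IsMatch("foobarze"); // run a match so the generated code actually executes
            return true;
        } catch (InvalidProgramException) {
            return false;
        }
    }

    public static void Main(string[] args) {
        Random rand = new Random();
        // Assumed sizes; raise them gradually if you want to hunt for the breaking point.
        foreach (int count in new int[] { 10, 100, 1000 }) {
            bool ok = TryCompile(RandomWords(rand, count, 8));
            Console.WriteLine(count + " words: " + (ok ? "ok" : "InvalidProgramException"));
        }
    }
}
```

Stepping the count up gradually keeps each run short and narrows down roughly where things start to go wrong, assuming the failure is tied to the size of the generated method.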

Published Sunday, March 21, 2004 6:13 AM by Justin Rogers

Comments

Sunday, March 21, 2004 10:23 AM by Duncan Godwin

# re: Making the Regex engine choke... Ever wanted to? I'll show you how...

Hi,

I've been reading Advanced .NET recently and it mentions you can use peverify to tell how the IL is invalid. Does this give anything of interest?

Sunday, March 21, 2004 6:07 PM by Justin Rogers

# re: Making the Regex engine choke... Ever wanted to? I'll show you how...

PEVerify reports the assembly as valid. The tool must check only a loose subset of all the rules that could be implemented, and at least in the version I'm running there aren't any flags to specify how strict PEVerify should be, even though there are four levels of IL safety.

I'm not all that surprised that PEVerify doesn't find anything wrong. However, I am surprised to wake up after nearly 12 hours and find the ILAsm compilation still running.

Sunday, March 21, 2004 8:19 PM by TrackBack

# Monolithic patterns can cause exceptions

Sunday, March 21, 2004 8:21 PM by TrackBack

# Monolithic regex patterns can cause InvalidProgramException to occur

Sunday, March 21, 2004 10:43 PM by TrackBack

# The root of an InvalidProgramException and a possible JIT bug?

Tuesday, March 23, 2004 4:44 PM by Kit George [Microsoft]

# re: Making the Regex engine choke... Ever wanted to? I'll show you how...

Justin, this is interesting. We'll be taking a look at this to see what the problem is, since this does look problematic.

Thanks!
Kit