.NET Regex bug causes application to hang!!!

Note: this entry has moved.

Today I reported a bug on the new Microsoft Connect site (replacement for Product Feedback), which I had in the pipeling for quite a while.

Turns out that if you have a fairly complex regex (like the ones typically used for parsing small custom languages), you can easily kill your application, because the regex engine will hang completely evaluating even fairly small strings. Here's my repro regex:

    static readonly Regex ReferenceExpression = new Regex(@"

            # Matches invalid empty brackets #

            (?<empty>\$\(\)) |

            # Matches a valid argument reference with potencial method calls and indexer accesses #

            (?<reference>\$\(([^\(]+([\(\[][^\)\]]*[\)\]])?)+\)) |

            # Matches opened brackes that are not properly closed #

            (?<opened>\$\([^\)\$\(]*(?!\)))",

        RegexOptions.Compiled | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

and this is the string I'm parsing (part of the Mobile Client Software Factory guidance package, which uses these kind of pseudo-MSBuild syntax):

    static void Main(string[] args)

    {

        string hangString = @"DisconnectedAgents\$(CurrentItem.Name)\$(ProxyType.Name)AgentCallback.cs\$(ProxyType.Name)AgentCallbackBase.cs";

 

        Console.WriteLine(ReferenceExpression.IsMatch(hangString));

    }

If a site allows evaluations of arbitrary regex patterns using the .NET engine, they should be careful as this can easily bring the site down.

Please vote the bug if you also think it's critical.

10 Comments

  • Yes, that's frustrating in 2.0...

  • I believe it was there in v1.x too, but I'm not sure.

  • It is known that the execution time of some kinds of regex goes up exponentially with the length of the string being tested.

    I can write an infinite loop in C#, but that doesn't mean C# is broken. Regex engines aren't guaranteed to process "any" input.

    Get a copy of "Mastering Regular expressions" by Jeffrey E. F. Friedl, and read chapter 6.

    It's an interesting read, so don't just read chapter 6, but it's heavy stuff, and far too long since I read it, so I'm afraid I'm not even going to attempt analysing your regex to see if it falls into this category.

  • Ok, I'll add that book to my pending list.
    I still find it hard to believe that evaluating against a 50 characters string is instantaneous, and taking it to 100 chars freezes the engine for *minutes*... (I actually didn't wait more and assumed it was never going to come back).

  • To be fair though Dan, this is certainly not a bug. This is very much the expected behavior for a state machine like this.

  • Well, that's unfortunate. It means that I have to write a full scanner/parser stack just to parse even trivial mini-languages, instead of leveraging regexes :((((

  • No, you just need to protect yourself against inefficient patterns by running them in a sandbox as I indicated.

    Remember, it's not a buggy pattern, it's just an inefficient pattern.

  • You might not need a sandbox. The problematic regexes are IIRC those which cause the engine to backtrack exponentially - the symptoms are exactly that a 50 char string would work while one of 100 wouldn't. The canonical example is where you have alternation at the beginning, followed by a match of arbitrary length. For each additional character it picks up, it has to go back to the beginning again, which rapidly gets painful for longer strings.

    The point is that very often you can write a different regex to match the same thing, but without this problem. Other alternatives include two-step processing. e.g could you remove your illegal empty brackets first.

  • Yup, that might work... instead of alternating on the invalid syntaxes, I could just have a separate regex for that, and check it at the beginning....


    Thanks!!

  • Regex doesn't need long expressions to fail, this one also hangs the application:

    new Regex("()", RegexOptions.IgnoreCase);

Comments have been disabled for this content.