Splitting Camel Case with RegEx

Tuesday, September 27, 2005

Phil posted some code to Split Pascal/Camel Cased Strings a few days ago. We had an offline discussion on doing this via RegEx.

I like the RegEx approach since it's only one line of code:

output = System.Text.RegularExpressions.Regex.Replace(
    input,
    "([A-Z])",
    " $1",
    System.Text.RegularExpressions.RegexOptions.Compiled).Trim();

This matches all capital letters, replaces them with a space and the letter we found ($1), then trims the result to remove the initial space if there was a capital letter at the beginning.
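
To make the two steps concrete, here is a minimal sketch that prints the intermediate string before Trim() is applied. The names TrimDemo and InsertSpaces are made up for illustration and are not part of Phil's code or mine above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimDemo
{
    // Inserts a space before every capital letter A-Z; deliberately does not trim.
    public static string InsertSpaces(string input)
    {
        return Regex.Replace(input, "([A-Z])", " $1");
    }

    public static void Main()
    {
        string raw = InsertSpaces("SampleSplitText");
        Console.WriteLine("[" + raw + "]");        // [ Sample Split Text]
        Console.WriteLine("[" + raw.Trim() + "]"); // [Sample Split Text]
    }
}
```

A camelCase input such as "sampleText" has no leading capital, so no leading space is added and Trim() leaves it unchanged.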

So, which would you use?

Arguments for Phil's C# approach:

Easier for other programmers to read - not everyone knows RegEx
Faster (see comparison below)
All compiled code, so errors are more likely to be caught in development

Arguments for my RegEx approach:

Simpler (in my opinion)
RegEx is a string, so it can be put in a configuration file

So, let's compare performance. Now, this is mostly academic since this kind of function would likely be called fewer than 25 times, but it's still worth a look. Here are the results for the sample string "SampleSplitText":

Approach	Repetitions	Time (seconds)
RegEx Replace	1000	0.0312500
RegEx Replace	100000	0.3125000
RegEx Replace	10000000	29.1562500
Code Approach	1000	0 (not measurable)
Code Approach	100000	0.0156250
Code Approach	10000000	1.6562500
RegEx Delegate Replace	1000	0 (not measurable)
RegEx Delegate Replace	100000	0.0937500
RegEx Delegate Replace	10000000	7.5000000

The only reason for calling this out is to show the exceptionally slow performance of the RegEx replace method at high iteration counts. For under a thousand iterations, I'd definitely go with the RegEx replace. For high repetitions, I'd consider using a RegEx replace with a MatchEvaluator delegate (see the code below). In my very simple test, it was just about as fast for anything under 100000 repetitions.

(updated - fixed a code error with delegate method)

using System;
using System.Collections;
using System.Collections.Specialized;
using System.Text.RegularExpressions;

public class SplitTest
{
    public static void Main()
    {
        string input;
        int iterations;

        for(;;)
        {
            Console.WriteLine("Enter CamelCase text to split (defaults to SampleSplitText):");
            input = Console.ReadLine();
            if(input==string.Empty)
                input="SampleSplitText";

            iterations = 0;
            Console.WriteLine("Enter number of operations ( enter 0 to quit):");
            try
            {
                iterations = int.Parse(Console.ReadLine());
            }
            catch
            {
                Console.WriteLine("Exiting");
                break;
            }

            if(iterations==0)
                break;

            // DateTime.Now is a coarse timer, so very short runs can report 0.
            // TotalSeconds is used so the printed value really is in seconds.
            System.DateTime start;
            start = System.DateTime.Now;
            Console.WriteLine(string.Format("Output from Inline RegEx approach: {0}", InlineRegExTest(input, iterations)));
            Console.WriteLine(string.Format("Inline RegEx approach took {0} seconds for {1} iterations.",(System.DateTime.Now-start).TotalSeconds,iterations));

            start = System.DateTime.Now;
            Console.WriteLine(string.Format("Output from RegEx / MatchEvaluator approach: {0}", DelegateRegExTest(input, iterations)));
            Console.WriteLine(string.Format("RegEx / MatchEvaluator approach took {0} seconds for {1} iterations.",(System.DateTime.Now-start).TotalSeconds,iterations));

            start = System.DateTime.Now;
            Console.WriteLine(string.Format("Output from Code approach: {0}", CodeTest(input, iterations)));
            Console.WriteLine(string.Format("Code approach took {0} seconds for {1} iterations.",(System.DateTime.Now-start).TotalSeconds,iterations));
            Console.ReadLine();
        }
    }

    private static string InlineRegExTest(string input, int iterations)
    {
        string output = "Failed";

        for(int i=0;i<iterations;i++)
        {
            output = System.Text.RegularExpressions.Regex.Replace(input,"([A-Z])"," $1",System.Text.RegularExpressions.RegexOptions.Compiled).Trim();
        }
        return output;
    }

    private static string DelegateRegExTest(string input, int iterations)
    {
        System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.Compiled;
        Regex reg = new Regex("(?<Word>[A-Z])",options);
        string output = "Failed";

        for(int i=0;i<iterations;i++)
        {
            // Trim so the result matches the inline version (no leading space).
            output = reg.Replace( input, new MatchEvaluator( FormatWord ) ).Trim() ;
        }
        return output;
    }

    private static string FormatWord(Match m)
    {
        if( m.Groups["Word"].Success )
        {
            string word = m.Groups["Word"].Value ;
            return " " + word;
        }
        else
            return m.Value ;
    }

    private static string CodeTest(string input, int iterations)
    {
        string output = "Failed";

        for(int i=0;i<iterations;i++)
        {
            output = SplitUpperCaseToString(input);
        }
        return output;
    }

    /// <summary>
    /// Parses a camel cased or pascal cased string and returns a new
    /// string with spaces between the words in the string.
    /// </summary>
    /// <example>
    /// The string "PascalCasing" will return the string
    /// "Pascal Casing".
    /// </example>
    /// <param name="source">The camel cased or pascal cased string to split.</param>
    /// <returns>The words of the source string joined by single spaces.</returns>
    public static string SplitUpperCaseToString(string source)
    {
        return string.Join(" ", SplitUpperCase(source));
    }

    /// <summary>
    /// Parses a camel cased or pascal cased string and returns an array
    /// of the words within the string.
    /// </summary>
    /// <example>
    /// The string "PascalCasing" will return an array with two
    /// elements, "Pascal" and "Casing".
    /// </example>
    /// <param name="source">The camel cased or pascal cased string to split.</param>
    /// <returns>An array of the words found in the source string.</returns>
    public static string[] SplitUpperCase(string source)
    {
        if(source == null)
            return new string[] {}; //Return empty array.

        if(source.Length == 0)
            return new string[] {""};

        StringCollection words = new StringCollection();
        int wordStartIndex = 0;

        char[] letters = source.ToCharArray();
        // Skip the first letter. we don't care what case it is.
        for(int i = 1; i < letters.Length; i++)
        {
            if(char.IsUpper(letters[i]))
            {
                //Grab everything before the current index.
                words.Add(new String(letters, wordStartIndex, i - wordStartIndex));
                wordStartIndex = i;
            }
        }

        //We need to have the last word.
        words.Add(new String(letters, wordStartIndex, letters.Length - wordStartIndex));

        //Copy to a string array.
        string[] wordArray = new string[words.Count];
        words.CopyTo(wordArray, 0);
        return wordArray;
    }
}

Interesting. But, Delegate Replace doesn't work in your example. ;)

Chris Martin - Tuesday, September 27, 2005 10:53:00 AM

Oops! Good catch, Chris.

I updated the code with the fix. It should be matching on "(?<Word>[A-Z])" rather than just "([A-Z])" since I'm referring to the match by name.

Jon Galloway - Tuesday, September 27, 2005 12:01:00 PM

I'd definitely go for the RegEx. Simpler, but above all, much more elegant.

Wim Hollebrandse - Wednesday, September 28, 2005 8:26:00 AM

When would a function like this be needed? I can imagine problems with Irish/Scottish/Italian names such as McDonalds, MaCarthy, DeSando and also upper case trademarks/abbreviations like IBM, ASP, etc. Plus, if it is a title, then words such as "the", "a", or "an" would remain lower case.

Haacked: Obviously it won't separate the numbers. This method won't even separate words correctly.

This is a quick way to do this (for whatever reason), but it would be best not to lose those spaces between the words in the first place. Otherwise you need more advanced logic to split up the words than a one-liner can handle.

Collin Yeadon - Friday, September 30, 2005 6:41:00 AM

Haacked -

Funny. My code isn't as good as the code Leon posted, and his was an example of lame code. Well, mine is faster and has published performance numbers.

Jon Galloway - Friday, September 30, 2005 10:06:00 AM

Can you do it the other way around? From "these words want camel" to "TheseWordsWantCamel"?

Silly - Thursday, January 4, 2007 3:58:45 AM
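
For the reverse direction asked about above, a minimal sketch could look like the following. It is not from the original post: ToPascalCase and PascalCaseDemo are hypothetical names, and the sketch assumes words are separated by spaces only and that just the first letter of each word should be upper-cased.

```csharp
using System;
using System.Text;

public static class PascalCaseDemo
{
    // Joins space-separated words, upper-casing the first letter of each one.
    public static string ToPascalCase(string phrase)
    {
        if (string.IsNullOrEmpty(phrase))
            return string.Empty;

        StringBuilder result = new StringBuilder(phrase.Length);
        string[] words = phrase.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            result.Append(char.ToUpperInvariant(word[0]));
            result.Append(word.Substring(1));
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToPascalCase("these words want camel")); // TheseWordsWantCamel
    }
}
```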

I'd recommend the following update to your regular expression:

"([A-Z][A-Z]*)"

which turns stuff like

SomeTypeOfID into:
'Some Type Of ID'
instead of
'Some Type Of I D'

Jake Heidt - Monday, April 2, 2007 6:59:02 AM
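
A minimal sketch of the pattern suggested above, assuming it is dropped into the same one-line Regex.Replace call from the post. The names AcronymRunDemo and Split are made up for illustration. Note that a run of capitals also swallows the first letter of a following word, as the second call shows.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AcronymRunDemo
{
    // Treats each run of consecutive capitals as one unit and puts a space before it.
    public static string Split(string input)
    {
        return Regex.Replace(input, "([A-Z][A-Z]*)", " $1").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Split("SomeTypeOfID")); // Some Type Of ID
        Console.WriteLine(Split("IDNumber"));     // IDNumber (the run "IDN" is kept together)
    }
}
```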

If you change the "InlineRegExTest" function as follows, you will find it's faster than "DelegateRegExTest":
----------------------------------------------
private static string InlineRegExTest(string input, int iterations)
{
    string output = "Failed";

    // Create the Regex once, outside the loop, and reuse it for every iteration.
    Regex regex = new Regex("([A-Z])",System.Text.RegularExpressions.RegexOptions.Compiled);
    for (int i = 0; i < iterations; i++)
    {
        output = regex.Replace(input, " $1");
    }
    return output;
}

---------------------------------------------

kingthy - Thursday, December 27, 2007 6:19:26 AM

Brilliant! Thanks!

Rich - Monday, April 14, 2008 11:21:01 PM

This code is great. I have also used similar code with
Replace(String, String, MatchEvaluator, RegexOptions)
but in the production environment the replacement works only some of the time, while the same code works in the local and dev environments.

Help needed to solve this.

-Dhanaji

dhanaji - Friday, October 10, 2008 12:52:30 PM

How about ([A-Z][^A-Z]) to deal with any acronyms?

Chris - Monday, February 2, 2009 10:16:26 AM

str.replace(/([A-Z])/g," $1").replace(str.substring(0,1),str.substring(0,1).toUpperCase())

this works best for me for splitting

e.g.
helloIamHere
gives: Hello Iam Here

Nelson - Tuesday, September 1, 2009 1:01:10 AM

To anyone who finds it useful or helpful as I did, James' link for the Pascal Camel Case method works perfectly in .Net:

Regex:
(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])

Mike Davis - Wednesday, February 15, 2012 9:23:59 AM
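
As a rough sketch of how the pattern above could be used from C#, assuming it is paired with the same " $1" replacement string as in the post. The names LookbehindDemo and Split are hypothetical. Because the pattern never matches at the start of the string, no Trim() call is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Matches a capital that starts a new word: either a capital followed by a
    // lowercase letter, or a capital preceded by a lowercase letter. The leading
    // (?<!^) prevents a match at the very first character.
    private const string Pattern = "(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])";

    public static string Split(string input)
    {
        return Regex.Replace(input, Pattern, " $1");
    }

    public static void Main()
    {
        Console.WriteLine(Split("SampleSplitText")); // Sample Split Text
        Console.WriteLine(Split("SomeTypeOfID"));    // Some Type Of ID
        Console.WriteLine(Split("XMLParser"));       // XML Parser
        Console.WriteLine(Split("IDNumber"));        // ID Number
    }
}
```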

Pretty nice!

"Easier for other programmers to read - not everyone knows RegEx"

If they're a C# programmer, shouldn't they be able to learn a (very common) class in the .NET standard library? If not, how far do you take this? Not everyone knows System.Collections, so should I write my own collections classes?

"All compiled code, so errors are more likely to be caught in development"

The C# compiler will only stop on syntax errors and type errors, not logic bugs (like off-by-one errors, which are common with arrays). The non-regex one is 10 times longer so it probably has 10 times more bugs. How confident are you that the (additional) bugs in the non-regex version would *all* be syntax errors and type errors? That is a bet I would not want to take!

You're right that your regex approach is simpler. But simpler *means* it's easier to read, and errors are more likely to be caught in development.

It's true the non-regex version is faster, but I have a hard time imagining a case where you need to do camelCase splitting in under 2.9 microseconds. That's well into "RAM is too slow, this needs to run from L2 cache" territory, so I hope your program isn't doing any I/O. :-)

Pat - Wednesday, February 22, 2012 7:58:20 AM

14 Comments