Splitting Camel Case with RegEx - Jon Galloway

Splitting Camel Case with RegEx

Phil posted some code to Split Pascal/Camel Cased Strings a few days ago. We had an offline discussion on doing this via RegEx.

I like the RegEx approach since it's only one line of code:

output = System.Text.RegularExpressions.Regex.Replace(
    input,
    "([A-Z])",
    " $1",
    System.Text.RegularExpressions.RegexOptions.Compiled).Trim();

This matches all capital letters, replaces them with a space and the letter we found ($1), then trims the result to remove the initial space if there was a capital letter at the beginning.
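To see the effect concretely, here is a minimal sketch that runs the same replacement over a few inputs. The class name SplitDemo, the helper name SplitCamelCase, and every sample string other than "SampleSplitText" are illustrative assumptions, not part of the test program further down.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper that wraps the one-liner shown above.
    public static string SplitCamelCase(string input)
    {
        // Put a space in front of every capital letter A-Z, then trim the
        // leading space that appears when the input starts with a capital.
        return Regex.Replace(input, "([A-Z])", " $1", RegexOptions.Compiled).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(SplitCamelCase("SampleSplitText")); // Sample Split Text
        Console.WriteLine(SplitCamelCase("camelCase"));       // camel Case
        Console.WriteLine(SplitCamelCase("SomeTypeOfID"));    // Some Type Of I D
    }
}
```

Note that the pattern only looks at the ASCII capitals A-Z, and that a run of capitals is split letter by letter, as the last line shows.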

So, which would you use?

Arguments for Phil's C# approach:

  1. Easier for other programmers to read - not everyone knows RegEx
  2. Faster (see comparison below)
  3. All compiled code, so errors are more likely to be caught in development

Arguments for my RegEx approach:

  1. Simpler (in my opinion)
  2. RegEx is a string, so it can be put in a configuration file

So, let's compare performance. Now, this is mostly academic since this kind of function would likely be called fewer than 25 times, but it's still worth a look. Here are the results for the sample string "SampleSplitText":

Approach                 Repetitions   Time (seconds)
RegEx Replace                   1000   0.0312500
RegEx Replace                 100000   0.3125000
RegEx Replace               10000000   29.1562500
Code Approach                   1000   0 (not measurable)
Code Approach                 100000   0.0156250
Code Approach               10000000   1.6562500
RegEx Delegate Replace          1000   0 (not measurable)
RegEx Delegate Replace        100000   0.0937500
RegEx Delegate Replace      10000000   7.5000000

The only reason for calling this out is to show the exceptionally slow performance of the RegEx replace method at high iteration counts. For under a thousand iterations, I'd definitely go with the RegEx replace. For high repetitions, I'd consider using a RegEx replace with a MatchEvaluator delegate (see the code below). For my very simple test, it was just about as fast for anything under 100000 repetitions.
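Every time in the table is a whole multiple of 0.015625 seconds, and the smallest runs read as zero, which suggests the measurements are limited by the resolution of DateTime.Now. A finer-grained measurement could be sketched with System.Diagnostics.Stopwatch (available from .NET 2.0). The class name TimingSketch and the fixed iteration count are assumptions for illustration. The sketch also builds the Regex once outside the loop, which is how the delegate test in the code below is written, whereas the inline test calls the static Regex.Replace on every pass.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class TimingSketch
{
    public static void Main()
    {
        const string input = "SampleSplitText";
        const int iterations = 100000; // assumed value, pick any count

        // Build the Regex once and reuse it for every iteration.
        Regex splitter = new Regex("([A-Z])", RegexOptions.Compiled);

        string output = null;

        // Stopwatch uses a high-resolution timer when the platform has one.
        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            output = splitter.Replace(input, " $1").Trim();
        }
        timer.Stop();

        Console.WriteLine("Output: {0}", output);
        Console.WriteLine("{0} iterations took {1} seconds.",
            iterations, timer.Elapsed.TotalSeconds);
    }
}
```

The numbers this prints will differ from the table above, since they depend on the machine and the runtime version.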

(updated - fixed a code error with delegate method)

using System;
using System.Collections;
using System.Collections.Specialized;
using System.Text.RegularExpressions;

public class SplitTest
{
    public static void Main()
    {
        string input;
        int iterations;

        for (;;)
        {
            Console.WriteLine("Enter CamelCase text to split (defaults to SampleSplitText):");
            input = Console.ReadLine();
            // ReadLine returns null at end of input, so check for that as well.
            if (string.IsNullOrEmpty(input))
                input = "SampleSplitText";

            iterations = 0;
            Console.WriteLine("Enter number of operations (enter 0 to quit):");
            try
            {
                iterations = int.Parse(Console.ReadLine());
            }
            catch
            {
                // Anything that is not a valid integer ends the program.
                Console.WriteLine("Exiting");
                break;
            }

            if (iterations == 0)
                break;

            // TotalSeconds is used so the printed value really is in seconds;
            // formatting the raw TimeSpan would print hh:mm:ss.fffffff instead.
            System.DateTime start;
            start = System.DateTime.Now;
            Console.WriteLine(string.Format("Output from Inline RegEx approach: {0}", InlineRegExTest(input, iterations)));
            Console.WriteLine(string.Format("Inline RegEx approach took {0} seconds for {1} iterations.", (System.DateTime.Now - start).TotalSeconds, iterations));

            start = System.DateTime.Now;
            Console.WriteLine(string.Format("Output from RegEx / MatchEvaluator approach: {0}", DelegateRegExTest(input, iterations)));
            Console.WriteLine(string.Format("RegEx / MatchEvaluator approach took {0} seconds for {1} iterations.", (System.DateTime.Now - start).TotalSeconds, iterations));

            start = System.DateTime.Now;
            Console.WriteLine(string.Format("Output from Code approach: {0}", CodeTest(input, iterations)));
            Console.WriteLine(string.Format("Code approach took {0} seconds for {1} iterations.", (System.DateTime.Now - start).TotalSeconds, iterations));
            Console.ReadLine();
        }
    }

    private static string InlineRegExTest(string input, int iterations)
    {
        string output = "Failed";

        for (int i = 0; i < iterations; i++)
        {
            output = Regex.Replace(input, "([A-Z])", " $1", RegexOptions.Compiled).Trim();
        }
        return output;
    }

    private static string DelegateRegExTest(string input, int iterations)
    {
        RegexOptions options = RegexOptions.Compiled;
        Regex reg = new Regex("(?<Word>[A-Z])", options);
        string output = "Failed";

        for (int i = 0; i < iterations; i++)
        {
            output = reg.Replace(input, new MatchEvaluator(FormatWord));
        }
        return output;
    }

    private static string FormatWord(Match m)
    {
        if (m.Groups["Word"].Success)
        {
            string word = m.Groups["Word"].Value;
            return " " + word;
        }
        else
            return m.Value;
    }

    private static string CodeTest(string input, int iterations)
    {
        string output = "Failed";

        for (int i = 0; i < iterations; i++)
        {
            output = SplitUpperCaseToString(input);
        }
        return output;
    }

    /// <summary>
    /// Parses a camel cased or pascal cased string and returns a new
    /// string with spaces between the words in the string.
    /// </summary>
    /// <example>
    /// The string "PascalCasing" will return the string "Pascal Casing".
    /// </example>
    /// <param name="source">The camel cased or pascal cased string to split.</param>
    /// <returns>The words of the source string joined by single spaces.</returns>
    public static string SplitUpperCaseToString(string source)
    {
        return string.Join(" ", SplitUpperCase(source));
    }

    /// <summary>
    /// Parses a camel cased or pascal cased string and returns an array
    /// of the words within the string.
    /// </summary>
    /// <example>
    /// The string "PascalCasing" will return an array with two
    /// elements, "Pascal" and "Casing".
    /// </example>
    /// <param name="source">The camel cased or pascal cased string to split.</param>
    /// <returns>An array containing the words found in the source string.</returns>
    public static string[] SplitUpperCase(string source)
    {
        if (source == null)
            return new string[] {}; // Return empty array.

        if (source.Length == 0)
            return new string[] {""};

        StringCollection words = new StringCollection();
        int wordStartIndex = 0;

        char[] letters = source.ToCharArray();
        // Skip the first letter. We don't care what case it is.
        for (int i = 1; i < letters.Length; i++)
        {
            if (char.IsUpper(letters[i]))
            {
                // Grab everything before the current index.
                words.Add(new String(letters, wordStartIndex, i - wordStartIndex));
                wordStartIndex = i;
            }
        }

        // We need to have the last word.
        words.Add(new String(letters, wordStartIndex, letters.Length - wordStartIndex));

        // Copy to a string array.
        string[] wordArray = new string[words.Count];
        words.CopyTo(wordArray, 0);
        return wordArray;
    }
}
Published Tuesday, September 27, 2005 4:36 PM by Jon Galloway

Comments

# re: Splitting Camel Case with RegEx

I'd pick your one-liner over mine any day. A lot cheaper to write.

Tuesday, September 27, 2005 1:47 PM by Haacked

# re: Splitting Camel Case with RegEx

Interesting. But, Delegate Replace doesn't work in your example. ;)

Tuesday, September 27, 2005 1:53 PM by Chris Martin

# re: Splitting Camel Case with RegEx

Oops! Good catch, Chris.

I updated the code with the fix. Should be matching on "(?<Word>[A-Z])" rather than just "([A-Z])" since I'm referring to the match by name.

Tuesday, September 27, 2005 3:01 PM by Jon Galloway

# re: Splitting Camel Case with RegEx

I'd definitely go for the RegEx. Simpler, but above all, much more elegant.

Wednesday, September 28, 2005 11:26 AM by Wim Hollebrandse

# re: Splitting Camel Case with RegEx

When would a function like this be needed? I can imagine problems with Irish/Scottish/Italian names such as McDonalds, MaCarthy, DeSando and also upper case trademarks/abbreviations like IBM, ASP, etc. Plus if it is a title then words such as "the", "a", or "an" would remain lower case.

Haacked: Obviously it won't separate the numbers. This method won't even separate words correctly.

This is a quick way to do this (for whatever reason), but it would be best to not lose those spaces between the words in the first place. Otherwise you need more advanced logic to split up the words than a one-liner can handle.

Friday, September 30, 2005 9:41 AM by Collin Yeadon

# re: Splitting Camel Case with RegEx

Haacked -
Funny. My code isn't as good as the code Leon posted, and his was an example of lame code. Well, mine is faster and has published performance numbers.

Friday, September 30, 2005 1:06 PM by Jon Galloway

# re: Splitting Camel Case with RegEx

Can you do it the other way around? From "these words want camel" to "TheseWordsWantCamel"?

Thursday, January 4, 2007 6:58 AM by Silly

# re: Splitting Camel Case with RegEx

I'd recommend the following update to your regular expression:

"([A-Z][A-Z]*)"

which turns stuff like

SomeTypeOfID into:

'Some Type Of ID'

instead of

'Some Type Of I D'

Monday, April 2, 2007 9:59 AM by Jake Heidt

# re: Splitting Camel Case with RegEx

If you change the function "InlineRegExTest" as follows, you will find it's faster than "DelegateRegExTest":

----------------------------------------------

    private static string InlineRegExTest(string input, int iterations)
    {
        string output = "Failed";
        Regex regex = new Regex("([A-Z])", System.Text.RegularExpressions.RegexOptions.Compiled);
        for (int i = 0; i < iterations; i++)
        {
            output = regex.Replace(input, " $1");
        }
        return output;
    }

---------------------------------------------

Thursday, December 27, 2007 9:19 AM by kingthy

# re: Splitting Camel Case with RegEx

Brilliant! Thanks!

Tuesday, April 15, 2008 2:21 AM by Rich

# re: Splitting Camel Case with RegEx

This code is great. I have also used this kind of code, with

Replace(String, String, MatchEvaluator, RegexOptions)

But the replacement sometimes works and sometimes doesn't in the production environment, while the same code works in the local and dev environments.

Help needed to solve this.

-Dhanaji

Friday, October 10, 2008 3:52 PM by dhanaji

# re: Splitting Camel Case with RegEx

How about ([A-Z][^A-Z]) to deal with any acronyms?

Monday, February 2, 2009 1:16 PM by Chris

# re: Splitting Camel Case with RegEx

str.replace(/([A-Z])/g," $1").replace(str.substring(0,1),str.substring(0,1).toUpperCase())

This works best for me for splitting,

e.g.

helloIamHere

gives: Hello Iam Here

Tuesday, September 1, 2009 4:01 AM by Nelson

# re: Splitting Camel Case with RegEx

To anyone who finds it useful or helpful as I did, James' link for the Pascal Camel Case method works perfectly in .Net:

Regex:

(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])

Wednesday, February 15, 2012 12:23 PM by Mike Davis

# re: Splitting Camel Case with RegEx

Pretty nice!

"Easier for other programmers to read - not everyone knows RegEx"

If they're a C# programmer, shouldn't they be able to learn a (very common) class in the .NET standard library?  If not, how far do you take this?  Not everyone knows System.Collections, so should I write my own collections classes?

"All compiled code, so errors are more likely to be caught in development"

The C# compiler will only stop on syntax errors and type errors, not logic bugs (like off-by-one errors, which are common with arrays).  The non-regex one is 10 times longer so it probably has 10 times more bugs.  How confident are you that the (additional) bugs in the non-regex version would *all* be syntax errors and type errors?  That is a bet I would not want to take!

You're right that your regex approach is simpler.  But simpler *means* it's easier to read, and errors are more likely to be caught in development.

It's true the non-regex version is faster, but I have a hard time imagining a case where you need to do camelCase splitting in under 2.9 microseconds.  That's well into "RAM is too slow, this needs to run from L2 cache" territory, so I hope your program isn't doing any I/O.  :-)

Wednesday, February 22, 2012 10:58 AM by Pat