Cleaning invalid characters from SharePoint

Friday, April 25, 2008

I stumbled onto one of those "gotchas" you get with SharePoint. We were creating new document libraries based on user names in a domain. A change came in and we had to support multiple domains, so a document library name would need a domain identifier (since the same user name can exist in two different domains). During acceptance testing we found that for document libraries created with dashes in the names (we were using a [domain]-[username] pattern), SharePoint would strip the dash out of the URL (without telling you, of course). This caused a bit of a headache with the email we send out with a link, since the URL was invalid.

I remember this from a million years ago (I'm replacing a few SharePoint brain cells with Ruby ones lately), so after a bit of Googling I found a great article by Eric Legault on the matter.

Here's a small method with a unit test class to handle this cleansing of names.

public static string CleanInvalidCharacters(string name)
{
    string cleanName = name;

    // remove invalid characters
    cleanName = cleanName.Replace(@"#", string.Empty);
    cleanName = cleanName.Replace(@"%", string.Empty);
    cleanName = cleanName.Replace(@"&", string.Empty);
    cleanName = cleanName.Replace(@"*", string.Empty);
    cleanName = cleanName.Replace(@":", string.Empty);
    cleanName = cleanName.Replace(@"<", string.Empty);
    cleanName = cleanName.Replace(@">", string.Empty);
    cleanName = cleanName.Replace(@"?", string.Empty);
    cleanName = cleanName.Replace(@"\", string.Empty);
    cleanName = cleanName.Replace(@"/", string.Empty);
    cleanName = cleanName.Replace(@"{", string.Empty);
    cleanName = cleanName.Replace(@"}", string.Empty);
    cleanName = cleanName.Replace(@"|", string.Empty);
    cleanName = cleanName.Replace(@"~", string.Empty);
    cleanName = cleanName.Replace(@"+", string.Empty);
    cleanName = cleanName.Replace(@"-", string.Empty);
    cleanName = cleanName.Replace(@",", string.Empty);
    cleanName = cleanName.Replace(@"(", string.Empty);
    cleanName = cleanName.Replace(@")", string.Empty);

    // remove periods
    while (cleanName.Contains("."))
        cleanName = cleanName.Remove(cleanName.IndexOf("."), 1);

    // remove invalid start character
    if (cleanName.StartsWith("_"))
        cleanName = cleanName.Substring(1);

    // trim length
    if (cleanName.Length > 50)
        cleanName = cleanName.Substring(1, 50);

    // Remove leading and trailing spaces
    cleanName = cleanName.Trim();

    // Replace spaces with %20
    cleanName = cleanName.Replace(" ", "%20");

    return cleanName;
}

[TestFixture]
public class When_composing_a_document_library_name
{
    [Test]
    public void Spaces_should_be_converted_to_a_canonicalized_string()
    {
        string invalidName = "Cookie Monster";
        Assert.AreEqual("Cookie%20Monster", SharePointHelper.CleanInvalidCharacters(invalidName));
    }

    [Test]
    public void Remove_invalid_characters()
    {
        string invalidName = @"#%&*:<>?\/{|}~+-,().";
        Assert.AreEqual(string.Empty, SharePointHelper.CleanInvalidCharacters(invalidName));
    }

    [Test]
    public void Remove_invalid_underscore_start_character()
    {
        string invalidName = "_CookieMonster";
        Assert.AreEqual("CookieMonster", SharePointHelper.CleanInvalidCharacters(invalidName));
    }

    [Test]
    public void Remove_any_number_of_periods()
    {
        string invalidName = ".Co..okie...Mon....st.er.";
        Assert.AreEqual("CookieMonster", SharePointHelper.CleanInvalidCharacters(invalidName));
    }

    [Test]
    public void Names_cannot_be_longer_than_50_characters()
    {
        string invalidName = "CookieMonster".PadRight(51, 'C');
        Assert.AreEqual(50, SharePointHelper.CleanInvalidCharacters(invalidName).Length);
    }

    [Test]
    public void Leading_and_trailing_spaces_should_be_removed()
    {
        string invalidName = " CookieMonster ";
        Assert.AreEqual("CookieMonster", SharePointHelper.CleanInvalidCharacters(invalidName));
    }
}

I'm not 100% happy with the method, as that whole "remove invalid characters" block is repetitive and I know it's creating a new string object with each call. I started to look at how to do this in a regular expression, but frankly RegEx just frightens me. I cannot for the life of me figure out the gobbledygook syntax, and if I do need it, I'll Google for an example and then cry and curl up into a fetal position. I even tried firing up Roy's Regulazy but that didn't help me. I'm just stumbling in the dark on this. If some kind soul wants to convert this into a regular expression for me, I'll buy you a beer or small marsupial for your effort.
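For what it's worth, the whole "remove invalid characters" block (periods included) can probably be collapsed into a single regex character class. This is only a minimal sketch: the NameCleaner class and its member names are made up for illustration, and it deliberately leaves out the underscore, length, and space handling from the method above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original SharePointHelper.
public static class NameCleaner
{
    // One character class listing the same characters as the Replace chain, plus the period.
    // Inside [...] the backslash is escaped as \\ and the hyphen as \- so both are literal.
    private static readonly Regex InvalidCharacters =
        new Regex(@"[#%&*:<>?\\/{}|~+,().\-]");

    // Removes every occurrence of any listed character in a single pass.
    public static string RemoveInvalidCharacters(string name)
    {
        if (name == null)
            throw new ArgumentNullException("name");

        return InvalidCharacters.Replace(name, string.Empty);
    }
}
```

Because the class matches one character at a time anywhere in the input, there is no need for a separate loop over the periods.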

BTW, this would make for a nice 3.5 string extension method (string.ToSharePointName), but alas I'm stuck in 2.0 land for this project.
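A rough idea of what that extension method might look like on C# 3.0 or later. The class name and the reduced cleaning body are assumptions made so the sketch compiles on its own; in the real project the body would presumably just call the CleanInvalidCharacters method above.

```csharp
using System;

// Hypothetical extension class; requires C# 3.0 or later.
public static class SharePointStringExtensions
{
    // Lets callers write someName.ToSharePointName().
    public static string ToSharePointName(this string name)
    {
        if (name == null)
            throw new ArgumentNullException("name");

        // Stand-in for SharePointHelper.CleanInvalidCharacters(name):
        // only the dash, trim, and space rules are reproduced here.
        return name.Replace("-", string.Empty)
                   .Trim()
                   .Replace(" ", "%20");
    }
}
```

With this in scope, "ACME-jsmith".ToSharePointName() reads a little more naturally at the call site than a static helper call.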

Enjoy!

11 Comments

__CookieMonster fails :)

use: name.TrimStart('_');

I don't understand why the removing of periods is different than the other character replacements that are above it.

Surely, a regex to remove those 'unwanted' chars would be far better, even if you can't do the entire thing.

Regex.Replace(name, Regex.Escape(@"#%&*:?\/{|}~+-,()."), string.Empty);

With a little more elbow grease, the whole thing apart from the space replacement and the length trimming could be put into a regex.

Dave Transom - Friday, April 25, 2008 9:48:25 AM
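The double-underscore case raised above comes down to the difference between removing one leading underscore and removing all of them. A small sketch of the two behaviours side by side (the class and method names are hypothetical, chosen only for this comparison):

```csharp
using System;

// Hypothetical comparison helper, not part of the original code.
public static class UnderscoreTrimming
{
    // Removes exactly one leading underscore, as the StartsWith/Substring pair in the post does.
    public static string RemoveOneLeadingUnderscore(string name)
    {
        return name.StartsWith("_") ? name.Substring(1) : name;
    }

    // Removes every leading underscore, as suggested with TrimStart.
    public static string RemoveAllLeadingUnderscores(string name)
    {
        return name.TrimStart('_');
    }
}
```

Both leave underscores in the middle of the name alone; they only differ when the name starts with more than one.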

This should work for you

([$#%&*:?\/{}|~+-,().-\])

Marc - Friday, April 25, 2008 9:48:26 AM

@Marc: Thanks, I knew there was an easier way.

Bil Simser - Friday, April 25, 2008 9:51:44 AM

@hexy: Yes, it could be, and your pattern does replace all those calls to .Replace. However, if you remove the while(cleanName.Contains(".")) which strips out all periods, it fails. There probably needs to be a "match all '.'" regex thingy here. As for the underline, I see the double underscore fails, but the rule is to only check it at the beginning of the name and not throughout. Again, a regex here could probably work. It's just beyond my meager skills.

Bil Simser - Friday, April 25, 2008 10:12:28 AM

I'm also a fan of System.Web.HttpUtility.UrlEncode() and/or HtmlEncode() - use at the end to catch any odd characters you've missed (e.g. ñ)?

Anyway, neat.

Peter Seale - Friday, April 25, 2008 11:56:05 AM

There are a couple of problems with your code.

1. You should trim before the length check, not after. This way, you can fit more characters. Try the following input for your test.

string name = " testInput".PadRight(60, 'C');
string expected = "testInput".PadRight(50, 'C');
Assert.AreEqual(expected, Helper.RemoveInvalidCharacters(name));

By trimming first, you can fit in more non-space characters in your name.

2. The length limit should be enforced at the end. Because you are converting spaces to %20 after checking the length, your function may return a string longer than 50 characters.
Try the following input for your test.

string invalidName = "Cookie Monster".PadRight(50, 'C');
Assert.AreEqual(50, Helper.RemoveInvalidCharacters(invalidName).Length);

3. Why the special treatment to remove the period (.)?

4. The following code removes a valid character from the index 0.
if(cleanName.Length > 50)
cleanName = cleanName.Substring(1, 50);

You should do
cleanName = cleanName.Substring(0, 50);

I have re-written the code to use Regex and covered the edge cases.
Hope that helps,
Jd

using System;
using System.Text.RegularExpressions;

namespace Misc
{
    public class Helper
    {
        public static string RemoveInvalidCharacters(string name)
        {
            if (name == null)
                throw new ArgumentNullException("name");

            // We should trim the input before we do the length check.
            // This way, we can fit more characters in our output
            // if there are spaces at the beginning.
            name = name.Trim();

            string[] invalidCharacters =
                new string[]
                {
                    "#", "%", "&", "*", ":", "<", ">", "?", "\\", "/", "{", "}", "~", "+", "-", ",", "(", ")", "|",
                    "."
                };

            Regex cleanUpRegex = GetCharacterRemovalRegex(invalidCharacters);

            string cleanName = cleanUpRegex.Replace(name, string.Empty);

            cleanName = cleanName.Replace(" ", "%20");

            if (cleanName.StartsWith("_"))
                cleanName = cleanName.Substring(1);

            if (cleanName.Length > 50)
                cleanName = cleanName.Substring(0, 50);

            return cleanName;
        }

        private static Regex GetCharacterRemovalRegex(string[] invalidCharacters)
        {
            if (invalidCharacters == null)
                throw new ArgumentNullException("invalidCharacters");

            if (invalidCharacters.Length == 0)
                throw new ArgumentException("invalidCharacters can not be empty.", "invalidCharacters");

            string[] escapedCharacters = new string[invalidCharacters.Length];

            int index = 0;
            foreach (string input in invalidCharacters)
            {
                escapedCharacters[index] = Regex.Escape(input);
                index++;
            }

            // Build an alternation such as \#|%|&|... that matches any one of the characters.
            return new Regex(string.Join("|", escapedCharacters));
        }
    }
}

using MbUnit.Framework;
using Misc;

namespace TestMisc
{
    [TestFixture]
    public class HelperTest
    {
        [Test]
        [RowTest]
        [Row("_Monster", "Monster")]
        [Row("...period...removal", "periodremoval")]
        [Row(" test ", "test")]
        [Row("Cookie Monster", "Cookie%20Monster")]
        [Row("...", "")]
        public void CharacterRemovalTest(string name, string expected)
        {
            Assert.AreEqual(expected, Helper.RemoveInvalidCharacters(name));
        }

        [Test]
        public void SpecialCharacterRemovalTest()
        {
            string invalidName = @"#%&*:<>?\/{|}~+-,().";
            Assert.AreEqual(string.Empty, Helper.RemoveInvalidCharacters(invalidName));
        }

        [Test]
        public void NameLengthTest1()
        {
            string invalidName = "CookieMonster".PadRight(51, 'C');
            Assert.AreEqual(50, Helper.RemoveInvalidCharacters(invalidName).Length);
        }

        [Test]
        public void NameLengthTest2()
        {
            string invalidName = "Cookie Monster".PadRight(50, 'C');
            Assert.AreEqual(50, Helper.RemoveInvalidCharacters(invalidName).Length);
        }

        [Test]
        public void NameLengthTest3()
        {
            string name = " testInput".PadRight(60, 'C');
            string expected = "testInput".PadRight(50, 'C');
            Assert.AreEqual(expected, Helper.RemoveInvalidCharacters(name));
        }
    }
}

JD - Friday, April 25, 2008 12:03:13 PM

@Peter: The url encoding isn't the issue. For example create a document library in SharePoint using the API, Web Service, or from the UI called "my-library" and the created library will have a display name of "my-library" but a url addressable name of "mylibrary".

@JD: Thanks for the RegEx and edge cases. The 50 character limit before the "%20" replacement is done because of a SharePoint quirk, but it makes more sense to trim it afterwards in any case.

Bil Simser - Friday, April 25, 2008 1:01:22 PM

SPEncode.IsLegalCharInUrl(c)

red - Sunday, April 27, 2008 1:34:00 PM

@Red: Thanks, and that could be useful; however, it ties me to the SharePoint library and frankly, SPEncode and all of its static methods are highly untestable IMHO. I'll look into this though.

Bil Simser - Monday, April 28, 2008 8:26:26 AM
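One way to get the benefit of a per-character check without tying the unit tests to the SharePoint assembly might be to pass the check in as a delegate. This is only a sketch: UrlNameFilter is a hypothetical class, and it assumes SPEncode.IsLegalCharInUrl takes a char and returns a bool, so that it could be supplied as the predicate in production code.

```csharp
using System;
using System.Text;

// Hypothetical wrapper; the legality rule is injected rather than hard-coded.
public class UrlNameFilter
{
    private readonly Predicate<char> isLegal;

    public UrlNameFilter(Predicate<char> isLegal)
    {
        if (isLegal == null)
            throw new ArgumentNullException("isLegal");

        this.isLegal = isLegal;
    }

    // Keeps only the characters the injected predicate accepts.
    public string KeepLegalCharacters(string name)
    {
        if (name == null)
            throw new ArgumentNullException("name");

        StringBuilder result = new StringBuilder(name.Length);
        foreach (char c in name)
        {
            if (isLegal(c))
                result.Append(c);
        }
        return result.ToString();
    }
}
```

A test can then construct the filter with a simple stand-in rule such as char.IsLetterOrDigit, while the real application would pass the SharePoint check instead.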

The problem with trimming after ' '->%20 replacement is that you can end up with an invalid string.

Suppose the pre-replacement string was 48 Xs, a space, and an X. This would after replacement convert to a string where the 48th-52nd characters are "X%20X". Truncating this would end in "X%2", causing problems.

Darn those boundary conditions.

Jon - Wednesday, April 30, 2008 8:15:53 AM
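The boundary condition described above could be handled by truncating first and then dropping any partial escape left dangling at the end. A minimal sketch, assuming the only '%' characters in the string come from "%20" escapes (which holds if '%' was stripped earlier); the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper for truncating a name after spaces were replaced with %20.
public static class EncodedNameTruncation
{
    public static string TruncateEncoded(string encodedName, int maxLength)
    {
        if (encodedName == null)
            throw new ArgumentNullException("encodedName");
        if (maxLength < 0)
            throw new ArgumentOutOfRangeException("maxLength");

        if (encodedName.Length <= maxLength)
            return encodedName;

        string truncated = encodedName.Substring(0, maxLength);

        // A complete escape needs three characters. If the last '%' sits in
        // one of the final two positions, the cut split an escape ("%" or "%2"),
        // so drop the fragment.
        int percent = truncated.LastIndexOf('%');
        if (percent >= 0 && percent > truncated.Length - 3)
            truncated = truncated.Substring(0, percent);

        return truncated;
    }
}
```

For the example above (48 Xs, a space, and an X), the result after encoding and truncating to 50 would be just the 48 Xs, which is shorter than the limit but still a valid name.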

Just wanted to drop in to say Great Post! and surprisingly good work by the other commentators : )

Robert - Wednesday, November 3, 2010 2:50:33 PM
