XmlSerializer and invalid XML

A user pasted some text from PowerPoint into a textarea in one of our web apps, and that text was eventually serialized into XML. When the web server later tried to deserialize the XML, it threw the exception below:

System.InvalidOperationException: There is an error in XML document (1, 50). ---> System.Xml.XmlException: '♂', hexadecimal value 0x0B, is an invalid character. Line 1, position 50.

The XML that was about to be deserialized looked something like this:

<?xml version="1.0" encoding="utf-16"?>
<string>Quick&#xB;Brown
Fox
</string>

The code used to deserialize the XML is shown below:

public static T DeSerializeObject<T>(string xml)
{
    using (System.IO.StringReader sr = new System.IO.StringReader(xml))
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        return (T)serializer.Deserialize(sr);
    }
}


As the exception shows, the deserializer was complaining about an invalid character: &#xB;.

The PowerPoint slide that text was being pasted from looked something like this:

[Image: PowerPoint slide]

The user had used the Shift-Enter key combination to force a line break within a bullet, so “Brown” starts on a new line but still belongs to the same bullet as “Quick”.

Selecting all the text on the slide and pasting it into a hex editor shows the standard carriage return and line feed bytes, 0x0D 0x0A, between “Brown” and “Fox”. Between “Quick” and “Brown”, however, PowerPoint generated 0x0B. In ASCII, 0x0B is the vertical tab (VT) control character, which is not a legal character in XML 1.0.

[Image: hex editor view of the pasted text]
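
To make the "invalid character" point concrete, here is a minimal sketch of how text could be checked before it ever reaches the serializer. The class and method names (XmlCharCheck, IndexOfInvalidXmlChar) are hypothetical and are not part of our app; the sketch relies only on XmlConvert.IsXmlChar (available since .NET Framework 4.0) and char.IsSurrogatePair.

```csharp
using System;
using System.Xml;

public static class XmlCharCheck
{
    // Hypothetical helper: returns the index of the first character that is
    // not allowed in XML 1.0, or -1 if every character is allowed.
    public static int IndexOfInvalidXmlChar(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (XmlConvert.IsXmlChar(text[i]))
            {
                continue;
            }

            // A well-formed surrogate pair encodes a character above U+FFFF,
            // which XML allows, so skip both halves of the pair.
            if (char.IsSurrogatePair(text, i))
            {
                i++;
                continue;
            }

            return i;
        }

        return -1;
    }
}
```

For the pasted text, IndexOfInvalidXmlChar("Quick\u000BBrown") should point at the vertical tab between the two words, while the ordinary carriage return and line feed between “Brown” and “Fox” pass the check.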

The code used to serialize the string is shown below:

private static string SerializeObject<T>(T source)
{
    var serializer = new XmlSerializer(typeof(T));
 
    using (var sw = new System.IO.StringWriter())
    using (var writer = new XmlTextWriter(sw))
    {
        serializer.Serialize(writer, source);
        return sw.ToString();
    }
}

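The problem with this code is that XmlTextWriter, which inherits from XmlWriter, does not on its own check that each character is legal in XML before writing it. It simply escaped the 0x0B character as &#xB;, and the result was the invalid document shown earlier.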

The recommended way to create a writer is the static XmlWriter.Create method. It has overloads that accept an XmlWriterSettings object; if you do not pass one, the default settings are used. One of those settings is the XmlWriterSettings.CheckCharacters property, which is true by default and makes the XmlWriter instance returned by Create perform character checking. By using the Create* method in our code, we ensure that the serializer throws an exception when it encounters a character that is invalid in XML, instead of producing a document that cannot be read back.

We can therefore rewrite our serialization method like so:

private static string SerializeObject<T>(T source)
{
    var serializer = new XmlSerializer(typeof(T));
    using (var sw = new System.IO.StringWriter())
    using (var writer = XmlWriter.Create(sw))
    {
        serializer.Serialize(writer, source);
        return sw.ToString();
    }
}

* The writer created in this case is of type System.Xml.XmlWellFormedWriter. It will vary based on the .Create overload used.
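
As a rough illustration of the difference, the sketch below wraps the same XmlWriter.Create approach in a hypothetical TrySerialize helper (the SerializerCheck class and the helper are assumptions for this example, not code from our app). As far as I can tell, XmlSerializer wraps the writer's ArgumentException in an InvalidOperationException, so the sketch catches both to be safe.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class SerializerCheck
{
    // Same approach as the rewritten SerializeObject: XmlWriter.Create with
    // default settings, so CheckCharacters is true.
    public static string SerializeObject<T>(T source)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var sw = new StringWriter())
        using (var writer = XmlWriter.Create(sw))
        {
            serializer.Serialize(writer, source);
            writer.Flush(); // make sure buffered output has reached the StringWriter
            return sw.ToString();
        }
    }

    // Hypothetical helper: returns false instead of throwing when the source
    // contains a character that the checking writer rejects.
    public static bool TrySerialize<T>(T source, out string xml)
    {
        try
        {
            xml = SerializeObject(source);
            return true;
        }
        catch (InvalidOperationException)
        {
            // Assumption: XmlSerializer reports the writer's error wrapped
            // in an InvalidOperationException.
            xml = null;
            return false;
        }
        catch (ArgumentException)
        {
            // The unwrapped form, in case the writer's exception surfaces directly.
            xml = null;
            return false;
        }
    }
}
```

With this in place, TrySerialize("Quick\u000BBrown", out xml) should return false at serialization time, which is the point where we want the problem to surface, while ordinary text serializes as before.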

4 Comments

  • Alternatively, you could use an XmlReader with CheckCharacters set to false to deserialize:

    private static T DeSerializeObject<T>(string xml)
    {
        // Turn off character checking so references such as &#xB; are accepted.
        var settings = new XmlReaderSettings { CheckCharacters = false };

        using (var sr = new System.IO.StringReader(xml))
        using (var reader = XmlReader.Create(sr, settings))
        {
            var serializer = new XmlSerializer(typeof(T));
            return (T)serializer.Deserialize(reader);
        }
    }


    http://msdn.microsoft.com/en-us/library/aa302290.aspx

  • Be careful when you accept and process XML submitted by a user. It is possible to define nested entity references in the DTD (the "billion laughs" attack) that expand to consume all the memory on the server. For example, the Opera browser does not check for malicious DTDs in SVG image files.

  • Robert,
    We are not accepting XML submitted by the user. We are serializing *text* entered by the user into XML.

  • Richard,

    Nice tip!

    Unfortunately, in our case we can't use that (for simplicity's sake, I did not give the entire picture).

    The idea is to prevent bad XML from being generated in the first place, because it is also being saved in a SQL XML column. SQL will throw an exception if it encounters bad XML.
