XmlSerializer and invalid XML
A user had pasted some text from PowerPoint into a textarea in one of our web apps, and that text was eventually serialized into XML. When the web server later tried to de-serialize the XML, it threw the exception below:
System.InvalidOperationException: There is an error in XML document (1, 50). ---> System.Xml.XmlException: '♂', hexadecimal value 0x0B, is an invalid character. Line 1, position 50.
The XML that was about to be de-serialized looked something like this:
<?xml version="1.0" encoding="utf-16"?>
<string>QuickBrown
Fox
</string>
The code used to de-serialize the XML is shown below:
public static T DeSerializeObject<T>(string xml)
{
    using (System.IO.StringReader sr = new System.IO.StringReader(xml))
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        return (T)serializer.Deserialize(sr);
    }
}
As seen from the exception above, the de-serializer was complaining about the invalid character 0x0B.
On the PowerPoint slide that the text was pasted from, the user had used the Shift-Enter key combination to force a line break within a bullet. So “Brown” starts on a new line in the same bullet as “Quick”.
By selecting all the text on the slide and pasting it into a hex editor, we see the standard hex values for carriage return and line feed, 0x0D 0x0A, between “Brown” and “Fox”. But 0x0B was generated for the new line between “Quick” and “Brown”. In ASCII, 0x0B is the vertical tab, which is an invalid character in XML 1.0: of the control characters below 0x20, only tab (0x09), line feed (0x0A), and carriage return (0x0D) are allowed.
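A quick way to confirm which character is the culprit is to scan the input with XmlConvert.IsXmlChar. The following is only a minimal sketch; the class and method names (XmlCharCheck, IndexOfInvalidXmlChar) are hypothetical and not part of our application code.

```csharp
using System.Xml;

public static class XmlCharCheck
{
    // Hypothetical helper: returns the index of the first character
    // that is not allowed in XML 1.0, or -1 if every character is valid.
    public static int IndexOfInvalidXmlChar(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            // A valid surrogate pair is allowed; skip both halves.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                i++;
                continue;
            }

            // Anything else (including an unpaired surrogate) is
            // decided by IsXmlChar.
            if (!XmlConvert.IsXmlChar(c))
            {
                return i;
            }
        }
        return -1;
    }
}
```

For the pasted text, XmlCharCheck.IndexOfInvalidXmlChar("Quick\vBrown\r\nFox") returns 5, the position of the vertical tab, while the carriage return and line feed pass the check.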
The code used to serialize the string is shown below:
private static string SerializeObject<T>(T source)
{
    var serializer = new XmlSerializer(typeof(T));
    using (var sw = new System.IO.StringWriter())
    using (var writer = new XmlTextWriter(sw))
    {
        serializer.Serialize(writer, source);
        return sw.ToString();
    }
}
The problem with this code is that the XmlTextWriter class, which inherits from XmlWriter, does not, on its own, validate each character before writing it. The invalid character therefore goes unnoticed until the XML is read back.
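The behaviour can be reproduced in isolation. The sketch below is illustrative only: LegacyWriterDemo and its methods are hypothetical names, and it assumes a plain string is being serialized, as in the example XML above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class LegacyWriterDemo
{
    // Serializes the same way as SerializeObject above, using XmlTextWriter.
    public static string SerializeWithTextWriter(string value)
    {
        var serializer = new XmlSerializer(typeof(string));
        using (var sw = new StringWriter())
        using (var writer = new XmlTextWriter(sw))
        {
            serializer.Serialize(writer, value);
            writer.Flush();
            return sw.ToString();
        }
    }

    // Returns true if the serialized XML can be de-serialized again.
    public static bool RoundTrips(string value)
    {
        // No exception is expected here, even for a vertical tab.
        string xml = SerializeWithTextWriter(value);
        try
        {
            using (var sr = new StringReader(xml))
            {
                new XmlSerializer(typeof(string)).Deserialize(sr);
            }
            return true;
        }
        catch (InvalidOperationException)
        {
            // Same exception type the web server reported.
            return false;
        }
    }
}
```

With this sketch, LegacyWriterDemo.RoundTrips("Quick\vBrown") should come back false: serialization succeeds, and the failure only shows up on the read.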
The recommended way to serialize is to use the static XmlWriter.Create method. The method has overloads that accept an XmlWriterSettings instance. If you do not specify one, the default values of the XmlWriterSettings class are used. One of the properties of this class is XmlWriterSettings.CheckCharacters, which is set to true by default.
This property ensures that the XmlWriter instance created by the Create method* will perform character checking. By using Create in our code, we ensure that the serializer will throw an exception if it encounters a character that is invalid in XML, so the problem surfaces when the data is written rather than later, when it is read back.
We can therefore rewrite our serialization method like so:
private static string SerializeObject<T>(T source)
{
    var serializer = new XmlSerializer(typeof(T));
    using (var sw = new System.IO.StringWriter())
    using (var writer = XmlWriter.Create(sw))
    {
        serializer.Serialize(writer, source);
        return sw.ToString();
    }
}
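To see the character check in action, the settings can also be passed explicitly. This is a sketch under the same assumptions as before (a plain string payload); CheckedWriterDemo and TrySerialize are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class CheckedWriterDemo
{
    // Hypothetical helper: returns null when serialization succeeds,
    // otherwise the exception that was thrown while writing.
    public static Exception TrySerialize(string value)
    {
        // CheckCharacters is already true by default; it is set here
        // only to make the setting visible.
        var settings = new XmlWriterSettings { CheckCharacters = true };
        var serializer = new XmlSerializer(typeof(string));
        try
        {
            using (var sw = new StringWriter())
            using (var writer = XmlWriter.Create(sw, settings))
            {
                serializer.Serialize(writer, value);
            }
            return null;
        }
        catch (Exception ex)
        {
            return ex;
        }
    }
}
```

CheckedWriterDemo.TrySerialize("Quick\vBrown") should return an exception instead of null, which means the bad input is rejected before any XML is stored.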
* The writer created in this case is of type System.Xml.XmlWellFormedWriter. It will vary based on the .Create overload used.