XmlTextWriter + StringWriter = Headache

Thursday, July 31, 2003

I've come to the conclusion that .NET doesn't really make coding easier (yet), because most Framework classes are incomplete, and use Inheritance as an excuse to leave them that way. Case in point: XmlTextWriter.

I'm changing GenX.NET's XML formatter to use the XmlTextWriter instead of building XML manually. It's a bit cleaner this way, and I can use formatting to overcome this really weird issue I've been having with the StringBuilder.ToString method putting in breaks every 1024 characters. More on that later. Anyways, so the XmlTextWriter constructor takes an instance of the StringWriter class, which is where the problems begin. The XmlTextWriter's constructor looks like this:

Sub New(w As System.IO.Stream, encoding As System.Text.Encoding)
Sub New(filename As String, encoding As System.Text.Encoding)
Sub New(w As System.IO.TextWriter)

The lameness begins. So you can't set the encoding of the XML document if you pass in a TextWriter, which is the only overload that gets you output into a StringBuilder. Sucks to be me. So I whip open the Object Browser, navigate to the XmlTextWriter, and I get the following pearl of wisdom:

Public Sub New(ByVal w As System.IO.TextWriter)
Member of: System.Xml.XmlTextWriter

Summary:
Creates an instance of the XmlTextWriter class using the specified System.IO.TextWriter .

Parameters:
w: The TextWriter to write to. It is assumed that the TextWriter is already set to the correct encoding.

Well, this would be a fabulous assumption to make, save for one thing... The TextWriter's encoding property is READ ONLY. La-de-frickin-da. Time to add bloat to my codebase again.

So I do a GoogleSearch on “XmlTextWriter StringBuilder Encoding”, and I get Roy Osherove talking about the subject. The dude knows XML & .NET, so I'm thinking “Great”.... but no dice. The examples in the comments don't work. The 2nd sample freaks out IE because the IE XSLT parser can't hack it if there are spaces at the end of the file. For some reason, converting a MemoryStream's buffer to a string kicks out extra data at the end. This is very bad. 45 minutes wasted.

The 1st example does not exactly work, because it doesn't allow for a StringBuilder to be passed in. This one is simple enough to correct, I just hate adding unnecessary code to my object model. The solution looks like this:

StringWriterWithEncoding Class:

Imports System.IO
Imports System.Text

Friend Class StringWriterWithEncoding
    Inherits StringWriter

    ' The encoding this writer reports to its consumers.
    Private m_encoding As Encoding

    Public Sub New(ByVal sb As StringBuilder, ByVal encoding As Encoding)
        MyBase.New(sb)
        m_encoding = encoding
    End Sub

    ' StringWriter always reports UTF-16; report the requested encoding instead.
    Public Overrides ReadOnly Property Encoding() As Encoding
        Get
            Return m_encoding
        End Get
    End Property
End Class

XML Parser Class:

Protected Friend Overridable Function DataReader(ByRef FromDataReader As SheetBuilder.FromDataReader) As String Implements IFormatProvider.DataReader
    ' dr (the data reader) and the loop over rows and columns that drives i
    ' are not shown in this excerpt.
    Dim i As Integer
    Dim sb As New StringBuilder
    ' Wrap the StringBuilder so the XML declaration reports UTF-8.
    Dim writer As New XmlTextWriter(New StringWriterWithEncoding(sb, Encoding.UTF8))
    writer.Formatting = Formatting.Indented
    writer.WriteStartDocument()
    writer.WriteStartElement("document")
    writer.WriteElementString(dr.GetName(i), HtmlEncode(dr.GetValue(i).ToString))
    writer.WriteEndElement()
    writer.Flush()
    writer.Close()
    Return sb.ToString
End Function

There you have it. Now you can declare whatever encoding you want, and the XmlTextWriter will pick it up from the StringWriter and put it in the XML declaration. Notice also that I HtmlEncode the values on the way in. I decided I'd take the burden of Ampersands (&) and so forth off of the end user, and sacrifice a little performance, rather than risk a document breaking and having to deal with a support issue. Be aware, though, that WriteElementString already escapes &, < and > on its own, so a value that is HtmlEncoded first comes out double-escaped (&amp;amp;).

Hopefully, MS will fix that stupid ReadOnly property and make it a two-way street, like they did with the SelectedValue property in the DropDownList. For now I'll have to use mine.

14 Comments

It's worth pointing out that a string in .NET is *always* UTF-16, whether or not you assert a different encoding in your XML declaration. We had a lot of problems with this fact in classic ASP/MSXML, since ASP strings are also always utf-16, and people would assert a different encoding in their XML decl, then get confused when things broke downstream (assert utf-8, but dump as utf-16, then get confused when the browser tries to read utf-8 for example). There MAY be some cases where it is correct for you to create XML in a string that is (always) encoded as utf-16 and assert that it is utf-8, but more often than not it is a bug and will lead to undesirable behavior. The XML declaration should normally match the actual encoding of the document instance, and we introduced a lot of bugs prior to .NET by making it too *easy* to use encodings other than utf-16 in a string of XML. I know that is little consolation if you are doing some blackbelt thing and really *need* to use an incorrect encoding decl temporarily, but at least it should explain why .NET is designed this way. I honestly believe that it has eliminated some very frequent user bugs that were common before .NET.

Regards,

Joshua Allen

XmlWriter PM

Joshua Allen - Saturday, August 2, 2003 12:05:00 AM
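
The point about strings always being UTF-16 can be seen directly. What follows is a minimal sketch, not code from the post: the class and method names (DeclarationDemo, WriteSample) are made up for illustration. It writes a tiny document through a plain StringWriter, shows that the declaration picks up the writer's reported encoding, and compares how many bytes the same string takes when it is finally encoded as UTF-16 versus UTF-8.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical demo class; the names are illustrative only.
public static class DeclarationDemo
{
    // Writes a one-element document through the given StringWriter
    // and returns the text that was produced.
    public static string WriteSample(StringWriter target)
    {
        XmlTextWriter writer = new XmlTextWriter(target);
        writer.WriteStartDocument();
        writer.WriteStartElement("document");
        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Flush();
        return target.ToString();
    }

    public static void Main()
    {
        StringWriter plain = new StringWriter();

        // A plain StringWriter always reports utf-16.
        Console.WriteLine(plain.Encoding.WebName);

        // XmlTextWriter copies that name into the XML declaration.
        string xml = WriteSample(plain);
        Console.WriteLine(xml);

        // The string itself has no byte encoding until it is converted.
        // The declaration is only truthful for the first of these.
        Console.WriteLine(Encoding.Unicode.GetByteCount(xml));
        Console.WriteLine(Encoding.UTF8.GetByteCount(xml));
    }
}
```

The encoding that matters is the one used at the moment the string becomes bytes (a file, a socket, an HTTP response). If that step uses UTF-8 while the declaration says utf-16, or the other way around, a downstream parser is being told the wrong thing.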

Wow. Thanks Joshua for an amazingly clear explanation. Here is my question then, and forgive me for being dumb, but.... why isn't there a method to change the string encoding to UTF8 for XML?

Are you saying that it's better to write directly to the file system? If you use the constructor that allows you to select the encoding, it writes to a file. But, if this is the case, it's still a string when it gets written to the file....

*still confused*

Robert McLaws - Saturday, August 2, 2003 12:35:00 AM

Hi Rob,

Have you tried trimming the MemoryStream buffer? The buffer is usually longer than the size reported by the Length property. I usually create an array of length stream.Length and then copy the bytes from the buffer into it. This should remove the dodgy spaces at the end?

Cheers,

Matthew

Matthew Reynolds - Monday, August 4, 2003 9:18:00 AM

I tried that without success. That was part of the wasted 45 minutes.

Robert McLaws - Monday, August 4, 2003 2:33:00 PM

What a fabulous explanation. Joshua, I thank you for taking the time to provide such a detailed explanation.

Now, having said that. I have the following comment:

WOW. That really sucks.

It should not be that complicated. I'm gonna have to take a serious look at it and see if I can't come up with a cleaner solution. More than likely, I'll have to show you my specific situation to prove why. Are you at the Redmond campus? If so I'll show you next week in person.

Robert McLaws - Thursday, August 7, 2003 8:25:00 PM

Per our conversation last night, I would recommend using MemoryStream to store in a particular encoding. I attach a code sample below (Test3 is the one with MemoryStream and a utf8). Also, all three of the examples result in XML that is converted to and from string while staying in the proper encoding, and they all seem to be working OK for me with no padding of bytes, etc. If you want me to check out any padding issues, you can send a repro code that I can compile and run to see if I get the same behavior. Thanks!

using System;
using System.IO;
using System.Text;
using System.Xml;

namespace foo {
    public class bar {
        public static void Main() {
            Test1();
            Test2();
            Test3();
        }

        public static void Test1() {
            string strInput = "<?xml version='1.0' encoding='utf-16'?><foo><bar /></foo>";
            XmlTextReader r = new XmlTextReader(new StringReader(strInput));
            StringBuilder sb = new StringBuilder();
            XmlTextWriter w = new XmlTextWriter(new StringWriter(sb));
            w.WriteNode(r, false);
            w.Flush();
            string strOutput = sb.ToString();
            Console.WriteLine("Input = {0}, Output = {1}", strInput.Length, strOutput.Length);
        }

        public static void Test2() {
            string strInput = "<?xml version='1.0' encoding='utf-16'?><foo><bar /></foo>";
            XmlTextReader r = new XmlTextReader(new StringReader(strInput));
            MemoryStream ms = new MemoryStream();
            XmlTextWriter w = new XmlTextWriter(ms, Encoding.Unicode);
            w.WriteNode(r, false);
            w.Flush();
            ms.Position = 0;
            StreamReader sr = new StreamReader(ms);
            string strOutput = sr.ReadToEnd();
            Console.WriteLine("Input = {0}, Output = {1}", strInput.Length, strOutput.Length);
        }

        public static void Test3() {
            string strInput = "<?xml version='1.0' encoding='utf-16'?><foo><bar /></foo>";
            XmlTextReader r = new XmlTextReader(new StringReader(strInput));
            MemoryStream ms = new MemoryStream();
            XmlTextWriter w = new XmlTextWriter(ms, Encoding.UTF8);
            w.WriteNode(r, false);
            w.Flush();
            ms.Position = 0;
            StreamReader sr = new StreamReader(ms);
            string strOutput = sr.ReadToEnd();
            Console.WriteLine("Input = {0}, Output = {1}", strInput.Length, strOutput.Length);
        }
    }
}

Joshua Allen - Thursday, August 14, 2003 10:13:00 PM

Before I even add my .02 I want to say thanks to Robert McLaws, this "thread" put me down a path that eventually ended in success.

You are all 100% more advanced than I am and have likely moved on, but I had the exact same problem with the memory stream. I decided to stick it out. The key I found was to use

ToArray() rather than GetBuffer().

Here are parts of the code I used, I apologize for the sloppiness.

// creating the XmlTextWriter
ms = new MemoryStream();
xmlw = new XmlTextWriter(ms, new System.Text.UTF8Encoding());
xmlw.Formatting = Formatting.Indented;

// turning the MemoryStream into a byte[] array so I can pass it to my socket later
byte[] _message = ms.ToArray();

// This is pivotal: using ms.GetBuffer() results in the extra padding that exists in the MemoryStream.

// now how I wrote it to file; my test was that if it opened in IE (which uses the MSXML parser), then the Java server on the other end using SAX would be happy
Stream outputStream = File.OpenWrite(@"c:\zplease.xml");
outputStream.Write(_message, 0, _message.Length);
outputStream.Flush();
outputStream.Close();

I hope this helps someone, because it has driven me up the wall for about 4 hours.

My Guess is that Joshua Allen's solution works because he used the Stream Reader on the MemoryStream. I did not try his solution though.

Toby J Boyd - Saturday, September 6, 2003 3:28:00 PM
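
The difference between ToArray() and GetBuffer() that Toby describes can be shown in a few lines. This is a rough sketch under assumptions: BufferDemo and WriteSample are made-up names, the document content is arbitrary, and a UTF8Encoding constructed with false is used so no byte order mark is written. GetBuffer() hands back the stream's whole internal array, including unused capacity, while ToArray() copies only the bytes actually written.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical demo class; the names are illustrative only.
public static class BufferDemo
{
    // Writes a small UTF-8 document into a fresh MemoryStream.
    public static MemoryStream WriteSample()
    {
        MemoryStream ms = new MemoryStream();
        XmlTextWriter writer = new XmlTextWriter(ms, new UTF8Encoding(false));
        writer.WriteStartDocument();
        writer.WriteElementString("document", "hello");
        writer.WriteEndDocument();
        writer.Flush(); // push everything into the stream before reading it
        return ms;
    }

    public static void Main()
    {
        MemoryStream ms = WriteSample();

        byte[] exact = ms.ToArray();   // copy of the written bytes only
        byte[] raw = ms.GetBuffer();   // internal array, may have unused bytes at the end

        Console.WriteLine("Length = {0}, ToArray = {1}, GetBuffer = {2}",
            ms.Length, exact.Length, raw.Length);

        // Decoding the raw buffer drags the unused zero bytes into the string.
        string padded = Encoding.UTF8.GetString(raw);
        // Limiting the decode to ms.Length gives the same text as ToArray().
        string trimmed = Encoding.UTF8.GetString(raw, 0, (int)ms.Length);

        Console.WriteLine("padded length = {0}, trimmed length = {1}",
            padded.Length, trimmed.Length);
    }
}
```

Any trailing bytes in the raw buffer are zeros, not spaces, which is probably why a parser chokes on them. Reading the stream back through a StreamReader, as in Joshua's tests, avoids the problem for the same reason ToArray() does: only the bytes up to Length are consumed.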

Thanks, Joshua.

That was just what I was looking for!

And thank everyone else for the efforts.

Good thread!

Xaphod - Thursday, November 20, 2003 11:50:00 PM

Thanks for posting this thread guys. I used an XmlTextWriter with a StringWriter, and I got output that said [encoding="utf-16"]. For some reason, this would sometimes turn into binary garbage within Visual Studio .NET. I changed it manually to utf-8, and VS.NET seemed to be a lot happier. The trick now was how to get the default output to say "utf-8" and not "utf-16". I used the StringWriterWithEncoding class and that seems to work nicely (but I only found this post after trying to set the read-only property and seeing no good workarounds).

Anyone else notice VS.NET acting strangely with good XML?

Juraj

Juraj Pivovarov - Friday, December 5, 2003 3:32:00 AM

Looks like I've come late to this conversation, but I'm glad I stumbled onto it. I just ran into the same problem. I needed to create an XML string to be passed to a SQL Server stored procedure. SQL Server insisted that the XML be in UTF-8. I assume the string is marshalled behind the scenes, but I was left with the problem of correctly setting the encoding attribute.

I used the StringWriterWithEncoding solution. Simple, but annoying.

I like Toby's MemoryStream approach as it seems to be internally consistent (the XML is actually created in UTF-8), but it doesn't seem possible to (easily) convert the byte array into something I can pass to SQLServer.

RogerW - Friday, February 27, 2004 8:29:00 PM

Hi RogerW

>>>just ran into the same problem. I needed to create an XML string to be passed to a SQL Server stored procedure

>>>SQL Server insisted that the XML be in UTF-8

>>>

Use ntext/nvarchar instead of varchar to pass XML string

Jai Bharat Patel - Wednesday, March 10, 2004 3:46:00 PM

Question on XML with arrays. May I ask how an array is represented in XML, and how to create it with XmlTextWriter?

eg i have the following text to convert.

abc="test"

xyz="my"

def={"one" "two" "three"}

the first two statement in xml would be

<abc>test</abc>

<xyz>my</xyz>

Any idea how the 3rd statement would be represented, and how I can use XmlTextWriter to create it? Appreciate your help.

ben - Friday, April 23, 2004 2:29:00 AM

UTF-8 files are supposed to include those 3 characters. How are you trying to open it?

Eric - Thursday, May 13, 2004 10:54:00 PM

Thanks for the awesome chunk of code, Robert. We ran into this problem using XslTransform with StringWriter. In the template, we were doing <xsl:output method="html" encoding="utf-8" /> and the XslTransform insisted on inserting <meta encoding="utf-16"...> into the <head> of the resulting HTML output regardless. This was seriously disrupting output of certain special characters in IE that were saved under utf-8.

Using this overridden StringWriter, I was finally able to get it to output meta encoding="utf-8", since that value is taken from the encoding of the StringWriter.

Here's a c# version of that code if anyone's interested.

using System;
using System.IO;
using System.Text;

namespace MyAwesomeNamespace
{
    public class StringWriterWithEncoding : StringWriter
    {
        private Encoding _enc;

        public StringWriterWithEncoding(Encoding NewEncoding) : base()
        {
            _enc = NewEncoding;
        }

        public override System.Text.Encoding Encoding
        {
            get
            {
                return _enc;
            }
        }
    }
}

Alex Beynenson - Wednesday, July 28, 2004 2:36:00 PM
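
For anyone wiring this up, here is a small usage sketch of the StringWriterWithEncoding idea from the post. It is an illustration under assumptions, not the GenX.NET code: the UsageDemo class, the element names, and the sample value are all made up, and the class is repeated with the StringBuilder constructor from the VB version so the block compiles on its own. The sample value contains an ampersand so the output also shows what WriteElementString does with it.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Same idea as the class in the post: report a chosen encoding
// instead of StringWriter's fixed utf-16.
public class StringWriterWithEncoding : StringWriter
{
    private readonly Encoding _enc;

    public StringWriterWithEncoding(StringBuilder sb, Encoding enc) : base(sb)
    {
        _enc = enc;
    }

    public override Encoding Encoding
    {
        get { return _enc; }
    }
}

// Hypothetical demo class; the names are illustrative only.
public static class UsageDemo
{
    public static string BuildXml()
    {
        StringBuilder sb = new StringBuilder();
        XmlTextWriter writer = new XmlTextWriter(
            new StringWriterWithEncoding(sb, Encoding.UTF8));
        writer.Formatting = Formatting.Indented;
        writer.WriteStartDocument();
        writer.WriteStartElement("document");
        writer.WriteElementString("name", "Tom & Jerry"); // raw ampersand on purpose
        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Close();
        return sb.ToString();
    }

    public static void Main()
    {
        string xml = BuildXml();
        Console.WriteLine(xml);

        // The string is still UTF-16 in memory. To make the declaration
        // true, encode it as UTF-8 when it leaves the process.
        byte[] bytes = Encoding.UTF8.GetBytes(xml);
        Console.WriteLine(bytes.Length);
    }
}
```

The declaration now reads utf-8, but that is only a label. It becomes accurate at the point where the string is converted with Encoding.UTF8, which ties back to Joshua's warning earlier in the thread.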
