StringWriter Encoding Hack

A few months back I blogged about the problems I had with System.IO.StringWriter when dealing with XML. This whole business of dealing with XML is not very intuitive IMO; there are way too many steps involved in making it work. My other problem with it is kind of complicated to explain, so I will try my best.

I was trying to create an XML string to pass up the various layers of my application, to be written later to a file or some other destination. Well, when you start to march into the XML world, you have to deal with text encoding. XML is very finicky about that kind of thing, and if you don't do it just right, you're screwed. Now, you may not have known this, but when you deal with strings in .NET, they are ALWAYS encoded as UTF-16. It doesn't matter what you do to a string or how you work with it: String = UTF-16. XML, on the other hand, is almost always expected to be UTF-8, and in my experience IE and just about every other application totally craps out on anything else. Has the problem begun to become apparent to you yet?
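To make the encoding point concrete, here is a minimal sketch that counts the bytes the same text needs in each encoding. The module and function names (EncodingDemo, EncodedLength) and the sample text are made up for illustration; only Encoding.GetBytes comes from the Framework.

```vbnet
Imports System
Imports System.Text

Module EncodingDemo

    ' Returns the number of bytes the text occupies in the given encoding.
    ' (Hypothetical helper, not part of the Interscape library.)
    Function EncodedLength(ByVal text As String, ByVal enc As Encoding) As Integer
        Return enc.GetBytes(text).Length
    End Function

    Sub Main()
        ' 17 characters; ChrW(233) is the accented e in "cafe"
        Dim text As String = "<item>caf" & ChrW(233) & "</item>"

        ' UTF-8: one byte per ASCII character, two for the accented e
        Console.WriteLine(EncodedLength(text, Encoding.UTF8))

        ' UTF-16 (Encoding.Unicode): two bytes for each of these characters
        Console.WriteLine(EncodedLength(text, Encoding.Unicode))
    End Sub

End Module
```

The String itself never changes; an encoding only comes into play at the moment the characters are turned into bytes.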

In the System.Xml namespace, you can specify the text encoding when dealing with Stream objects, and you can specify the encoding when automatically writing to a file, but you CANNOT change the encoding when you deal with strings. NEVER. You're stuck building UTF16 XML strings even if XML is almost always in UTF8. I don't know about you guys, but this is extremely confusing. It doesn't make one bit of sense to me.
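For comparison, the Stream route looks roughly like the sketch below: XmlTextWriter has a constructor that takes a Stream plus an Encoding, so writing into a MemoryStream gives you real UTF-8 bytes. The module name, function name, and element names are assumptions for illustration only.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module StreamXmlDemo

    ' Builds a tiny XML document as UTF-8 bytes.
    ' (Hypothetical helper; the element names are placeholders.)
    Function BuildUtf8Xml() As Byte()
        Dim ms As New MemoryStream()

        ' UTF8Encoding(False) leaves out the byte order mark
        Dim writer As New XmlTextWriter(ms, New UTF8Encoding(False))

        writer.WriteStartDocument()  'The declaration names the stream's encoding
        writer.WriteStartElement("document")
        writer.WriteElementString("item", "hello")
        writer.WriteEndElement()
        writer.WriteEndDocument()
        writer.Flush()

        ' Copy the bytes out before closing the writer (which closes the stream)
        Dim bytes As Byte() = ms.ToArray()
        writer.Close()
        Return bytes
    End Function

    Sub Main()
        Dim bytes As Byte() = BuildUtf8Xml()
        ' Decode only to display the result; the bytes are what you would store or send
        Console.WriteLine(Encoding.UTF8.GetString(bytes))
    End Sub

End Module
```

The catch is that you end up holding a byte array rather than a String, which is exactly what I was trying to avoid when passing the XML up through the layers.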

I talked to some of the MS XML gurus (Dare and Joshua) and they told me that my scenario was the first legitimate one they had ever heard for needing to add encoding to the string. That surprised me, because, having never dealt with building XML using the namespace before, I would have expected it would work this way. My other beef is, if a string is always UTF16 and XML is always UTF8, shouldn't it automatically convert internally? Even if it means the Framework has to take the StringWriter, dump it into a StreamReader, change the encoding, and dump the encoded string back into a new StringWriter and pass it back... that's how I would think it would work. I'm hoping this is possible for System.Xml 2.0.

At any rate, I came up with a hack to at least make the XML document header show the right encoding. Now, I'm pretty sure that this code does not change the encoding of the document, but it is effective in that you can now set the encoding yourself, and the doc header will be emitted properly. USE AT YOUR OWN RISK, because the text may still not be encoded properly, and may still break the app. I haven't had any problems so far with IE reading the output.
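To see what the hack is working around, here is a small sketch of what happens without it. As far as I can tell, XmlTextWriter takes the encoding name for the header from the TextWriter it wraps, and a plain StringWriter reports UTF-16. The module and function names are placeholders.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module DeclarationDemo

    ' Returns the XML produced when XmlTextWriter wraps a plain StringWriter.
    ' (Hypothetical helper for illustration.)
    Function BuildWithPlainStringWriter() As String
        Dim sw As New StringWriter()
        Dim writer As New XmlTextWriter(sw)

        writer.WriteStartDocument()
        writer.WriteStartElement("document")
        writer.WriteEndElement()
        writer.WriteEndDocument()
        writer.Close()

        ' The text is still available from the StringWriter after closing
        Return sw.ToString()
    End Function

    Sub Main()
        ' The encoding a plain StringWriter reports
        Console.WriteLine(New StringWriter().Encoding.WebName)

        ' The declaration in this output carries that same encoding name
        Console.WriteLine(BuildWithPlainStringWriter())
    End Sub

End Module
```

So the only thing standing between me and a UTF-8 header is that one read-only property.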

I'm gonna display the source code here. I broke out stuff like this into my own base library, so that I can use it anywhere, not just in GenX.NET. They reside in the “Interscape” base namespace. I also put in the Data Access source code that I use for all my samples. I got tired of dealing with that over and over again.... but more on that later. I will make the base library source available as soon as it has a few more classes.


Imports System.IO
Imports System.Text

Namespace Text

    '''<summary>Implements a TextWriter for writing information to a string. The information is stored in an underlying StringBuilder.</summary>
    Public Class EncodedStringWriter
        Inherits StringWriter

        'Backing field for the overridden Encoding property
        Private _Encoding As Encoding

        '''<summary>Creates an EncodedStringWriter that reports the given encoding.</summary>
        '''<param name="sb">The StringBuilder that receives the output.</param>
        '''<param name="Encoding">A member of the System.Text.Encoding class.</param>
        Public Sub New(ByVal sb As StringBuilder, ByVal Encoding As Encoding)
            MyBase.New(sb)
            _Encoding = Encoding
        End Sub

        '''<summary>Gets the Encoding in which the output is written.</summary>
        '''<value>The Encoding in which the output is written.</value>
        '''<remarks>This property is necessary for some XML scenarios where a header must be written containing the encoding used by the StringWriter. This allows the XML code to consume an arbitrary StringWriter and generate the correct XML header.</remarks>
        Public Overrides ReadOnly Property Encoding() As Encoding
            Get
                Return _Encoding
            End Get
        End Property

    End Class

End Namespace


Basically what is happening is, I'm creating a new class called EncodedStringWriter that has the same good stuff that the regular StringWriter has. I create a private placeholder variable, and I allow that placeholder to be set in the new constructor I created. Then I override the Encoding property (which is ReadOnly for some damned reason) and return the private placeholder that was set on instantiation. Bingo, I've now made my read-only property not so read-only after all. Now, to use this new class to build an XML document (the way I thought I could in the first place), you do this (inside a function that I do not define here):


Dim i As Integer
Dim sb As New StringBuilder
Dim writer As New XmlTextWriter(New EncodedStringWriter(sb, Encoding.UTF8))
Dim dr As IDataReader = YourDataAccessFunctionHere

writer.Formatting = Formatting.Indented
writer.WriteStartDocument()  'Now the proper header will be rendered
writer.WriteStartElement("document")

'Cycle through the rows of the DataReader
While dr.Read()
    writer.WriteStartElement("item")
    For i = 0 To dr.FieldCount - 1
        'WriteElementString escapes XML special characters itself, so HtmlEncode escapes them a second time
        writer.WriteElementString(dr.GetName(i), HtmlEncode(dr.GetValue(i).ToString))
    Next
    writer.WriteEndElement()
End While

writer.WriteEndElement()
writer.WriteEndDocument()
writer.Flush()
writer.Close()

Return sb.ToString
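One thing to watch once you have that string: the header only tells the truth if whatever finally writes the text out uses the same encoding. A minimal sketch, assuming the string declares UTF-8 and ends up in a file; SaveXml, the sample document, and the temp file name are all made up for illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module SaveXmlDemo

    ' Writes the XML text to disk in the given encoding, which should match
    ' the encoding named in the document header. (Hypothetical helper.)
    Sub SaveXml(ByVal fileName As String, ByVal xml As String, ByVal enc As Encoding)
        Dim sw As New StreamWriter(fileName, False, enc)
        Try
            sw.Write(xml)
        Finally
            sw.Close()
        End Try
    End Sub

    Sub Main()
        ' Stand-in for the string returned by the function above
        Dim xml As String = "<?xml version=""1.0"" encoding=""utf-8""?><document />"
        Dim target As String = Path.Combine(Path.GetTempPath(), "encoded-demo.xml")

        ' UTF8Encoding(False) writes UTF-8 without a byte order mark
        SaveXml(target, xml, New UTF8Encoding(False))

        ' For this ASCII-only sample, one byte per character
        Console.WriteLine(New FileInfo(target).Length)
    End Sub

End Module
```

If the text were saved as UTF-16 instead, the header and the bytes would disagree, which is the risk I warned about earlier.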


Yes, I know this is a hack. I absolutely hate it. At the same time, I love it because I came up with it on my own, without asking anyone for help, and it gets the job done. Hopefully it will be fixed in Whidbey.

OK, my XML rant is over for the day. Later on I'll talk about my Universal Demo DAL, dealing with Access DBs, and why Northwind.mdb is once again the demo app's best friend.

9 Comments

  • >Hopefully it will be fixed in Whidbey.

    I'm not sure what exactly is broken that you expect to be fixed.

  • Huh? All XML parsers are required to support at least UTF-8 & UTF-16 according to the XML spec.

  • Simon: IE wigs out on any and all UTF16 docs I've tried to feed it.

    Dare: I would expect that you would be able to change the encoding on ANY XML document, not just on those that get written to the file system using the System.Xml black box "write to file" setup.

    It's not that it is broken, but it is not complete IMO because it is missing that key functionality. Maybe I just don't understand the system very well but I shouldn't have to write that kind of hack to get it to work.

  • Never heard of any problems with UTF-16. Of course IE supports it as any conformant XML parser.

    "XML is always UTF8" - oh boy :)

  • IE sure as hell didn't support it when I tried it.

    Don't believe me? Grab your Northwind database, load some sample data up into a DataSet, parse the DataSet into XML (using a StringBuilder and a StringWriter, not the Black Box file option), manually write it to a file, and open it in IE. Watch what happens. Caused 3 days of frustration. I may be right, I may be wrong. All I know is, it works now, and before it did not.

  • Strings are Unicode and thus, by definition, can never have a different encoding. Unicode is one of the encodings, so String is always Unicode (UTF-16) encoded.

    If you want to manage it as a UTF-8 stream, write it to a MemoryStream instead of a StringWriter. You will then get a byte stream in your desired encoding. A byte stream can have its associated encoding.

    What you get with your hack is a UTF-8 byte stream, but each byte was converted to UTF-16. So, each byte is represented as 2 bytes in memory. If you write it to a file using UTF-16, you will get a UTF-8 file, because UTF-8 was once converted to UTF-16 but then converted back again, and because UTF-16 does not break any bytes during the conversion. But from a spec perspective, it's not guaranteed, and thus you have the risk.

    This is very easy to confuse, because in the days before Unicode, a byte stream was a string. However, in the Unicode world, a byte stream itself cannot represent a string; a byte stream + encoding = string.

    And .NET's String class uses UTF-16 encoding, so for XML to write UTF-16 when you write to a String is correct behavior.

  • UTF-xx: Pain in the arse. Excellent article by the way! Did the trick in getting SharpReader to read the aspx file. Now I won't have to generate the XML file on my computer then do a "Save As".

  • protected class StringWriterWithEncoding : StringWriter
    {
        private Encoding _Encoding;

        public StringWriterWithEncoding(StringBuilder stringbuilder, Encoding encoding) : base(stringbuilder)
        {
            _Encoding = encoding;
        }

        public override Encoding Encoding
        {
            get { return _Encoding; }
        }
    }

  • I am so sick of the way .NET handles XML.

    I decided to upgrade to .NET rather than Java because I have a lot of legacy systems using ASP.

    To me it made sense to use .NET with XML to remove my legacy problems but guess what? MSXML 4 just flakes out trying to read anything from .NET because it doesn't understand UTF-16.

    But to make it worse I can't find any way to work around this problem. Are MS just stupid or are we the programmers the guinea pigs yet again?

    How angry, fed up and sick of .NET's handling of XML am I? Java here I come.
