Omer van Kloeten's .NET Zen

Programming is life, the rest is mere details

News

Note: This blog has moved to omervk.wordpress.com.


Omer has been developing applications professionally for the past 8 years, both at the IDF’s IT corps and later at the Sela Technology Center, but has had the programming bug for as long as he can remember.
As a senior developer at NuConomy, a leading web analytics and advertising startup, he leads work across a wide range of technologies for its flagship products.


Your Mouth Says Windows-1255, But Your Eyes Say ISO-8859-1

I recently wrote an engine that fetches XML files stored on our clients’ servers using HTTP requests. One of our clients decided to serve the XML file with one encoding stated in the HTTP headers while encoding the file itself with another. This posed a problem for XDocument.

The client encoded their XML using the Windows-1255 encoding (Hebrew) and noted that encoding correctly in the XML’s declaration, but served the file with HTTP headers stating the ISO-8859-1 (Latin-1) encoding. This meant that I couldn’t just use XDocument’s normal Load method to load directly from the stream, because XDocument looks at the HTTP headers and takes the document’s encoding from them.

Here’s a snippet of the code I used to get around that:

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    // Start from the charset stated in the response's headers,
    // defaulting to ISO-8859-1 when none is given.
    var encoding = Encoding.GetEncoding("ISO-8859-1");

    if (!string.IsNullOrEmpty(response.CharacterSet))
        encoding = Encoding.GetEncoding(response.CharacterSet);

    // ReadStream (not shown) reads the whole response stream into a byte array.
    byte[] bytes = ReadStream(response.GetResponseStream());

    // Decode the XML with the response's charset.
    string xml = encoding.GetString(bytes);
    int endOfDeclaration = xml.IndexOf("?>", StringComparison.Ordinal);

    if (endOfDeclaration != -1)
    {
        // Try to find out the encoding from the declaration.
        string decl = xml.Substring(0, endOfDeclaration + 2) + "<duperoot />";
        XDocument declDoc = XDocument.Parse(decl);

        // No declaration, or no encoding in it: keep the response's charset.
        if (declDoc.Declaration == null || string.IsNullOrEmpty(declDoc.Declaration.Encoding))
            return xml;

        var docEncoding = Encoding.GetEncoding(declDoc.Declaration.Encoding);

        // Compare code pages rather than object references.
        if (docEncoding.CodePage == encoding.CodePage)
            return xml;
        else
            return docEncoding.GetString(bytes);
    }
    else
    {
        // Not XML or something... Send up.
    }
}

What I did here was build a small document from the original XML’s declaration plus a dummy root element, and parse that to get the name of the encoding the document declares. This works because the declaration consists only of ASCII characters, which occupy the same bytes in both encodings, so it survives being decoded with the wrong one. I then use the declared encoding to decode the original bytes correctly.

Note that I’m using ISO-8859-1 as the default encoding for the response, since that is the default the HTTP/1.1 specification (RFC 2616) defines for text media types when no charset is given.
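
For anyone who wants to try the idea outside of an HTTP request, here is a minimal, self-contained sketch of the same approach. The class name XmlCharsetSniffer and the methods ReadAllBytes and Decode are hypothetical names made up for this illustration, and ReadAllBytes is only a guess at what a helper like ReadStream might look like. It also assumes the declaration, if present, sits at the very start of the document, and that the declared encoding is one the runtime knows about (on .NET Core and later, code pages such as Windows-1255 need the code pages encoding provider to be registered first).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper class, written for illustration only.
static class XmlCharsetSniffer
{
    // Reads an entire stream into a byte array (a stand-in for ReadStream).
    public static byte[] ReadAllBytes(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Decodes raw bytes, preferring the encoding named in the XML declaration
    // over the charset taken from the HTTP headers.
    public static string Decode(byte[] bytes, string headerCharset)
    {
        // Fall back to ISO-8859-1 when the headers state no charset.
        Encoding headerEncoding = Encoding.GetEncoding(
            string.IsNullOrEmpty(headerCharset) ? "ISO-8859-1" : headerCharset);

        string xml = headerEncoding.GetString(bytes);

        // Assumption: only look for a declaration at the very start of the text.
        if (!xml.StartsWith("<?xml", StringComparison.Ordinal))
            return xml;

        int end = xml.IndexOf("?>", StringComparison.Ordinal);
        if (end == -1)
            return xml;

        // Parse just the declaration plus a dummy root element.
        XDeclaration declaration =
            XDocument.Parse(xml.Substring(0, end + 2) + "<root />").Declaration;

        if (declaration == null || string.IsNullOrEmpty(declaration.Encoding))
            return xml;

        Encoding declared = Encoding.GetEncoding(declaration.Encoding);

        // Re-decode the original bytes only if the two encodings differ.
        return declared.CodePage == headerEncoding.CodePage
            ? xml
            : declared.GetString(bytes);
    }
}
```

Calling XmlCharsetSniffer.Decode(XmlCharsetSniffer.ReadAllBytes(stream), response.CharacterSet) would then stand in for the body of the snippet above. Like the original, it throws if the declaration names an encoding the runtime does not recognize.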
