SQL + XML(?) + Unicode = Issue

The title of this post may not be entirely accurate, but it is pretty close.  I recently encountered an issue with a 3rd party component that was reading unicode information from a MSSQL 2005 database and then sending an email using the value.  For some reason, what appeared to be random unicode characters from SQL were displayed as the flattened '?'.  The odd thing was the they were right next to other unicode values that came through just fine.  Very odd...

After profiling SQL, the request to, and the response from SQL looked like everything was fine, but this blackbox process was garbling data nonetheless.  Since the final destination for the information was an HTML formatted message, I took a shot in the dark, and tried encoding the unicode value.  This worked like a champ.  I could put 計 into the column and it would read out and display without a hitch.

The solution is now clear, I need the ability to return a unicode field encoded for HTML.  Hello SQL CLR :-)  Wait, a moment, what about this whole XML thing?  Well, after I found the solution, I tried to find the problem.  The thrid party control is totally blackbox, so I have no way of knowing what they're doing, but I did stumble into this little gem in BOL:

Caution:
EVENTDATA returns XML data. This data is sent to the client as Unicode that uses 2 bytes for each character. The following Unicode code points can be represented in the XML that is returned by EVENTDATA:

0x0009, 0x000A, 0x000D, >= 0x0020 && <= 0xD7FF, >= 0xE000 && <= 0xFFFD

Some characters that can appear in Transact-SQL identifiers and data are not expressible or permissible in XML. Characters or data that have code points not shown in the previous list are mapped to a question mark (?).

This somehow makes me think that the 3rd party component is serializing the data from SQL and the "random" unicode characters that are being flattened are actually outside XML's rendering capabilities...  This probably isn't the case, but it is the most plausable thing I found and with our current production schedule, it was a fix and move on.

On to the code.  The following is a SQL CLR procedure that I created using an encoding function I found on the net and tweaked a little.  My apologies to the author for not being able to reference you at posting time.

  1. [Microsoft.SqlServer.Server.SqlFunction]
  2. public static SqlString DoubleByteToHtmlEncoding(SqlString input)
  3. {
  4.   // check for null
  5.   if (input.IsNull == true) return input;
  6.  
  7.   // convert the sql string to a .net string
  8.   char[] chars = input.ToString().ToCharArray();
  9.  
  10.   // create a string builder to hold the converted output, slightly larger than
  11.   // the source string
  12.   StringBuilder Result = new StringBuilder(chars.Length + (int)(chars.Length * 0.1));
  13.  
  14.   foreach (char c in chars)
  15.   {
  16.     int Value = Convert.ToInt32(c);
  17.  
  18.     // if the value is larger than an ascii value, escape it
  19.     if (Value > 127)
  20.       Result.AppendFormat("&#{0};", Value);
  21.     // otherwise add it as is
  22.     else
  23.       Result.Append(c);
  24.   }
  25.  
  26.   // return the processed result to sql
  27.   return new SqlString(Result.ToString());
  28. }

I know everyone loves that name, I have one small concern and it is simply that someone might think it will html encode all data, where it only encodes double byte values.

Now that I have this function, I just tell the 3rd party control that instead of reading from a table, read from a view, where the view implements the function on the necessary columns, leaving the source data untouched!

Consuming the function can be as easy as:

SELECT dbo.DoubleByteToHtmlEncoding(N'計');
or
SELECT dbo.DoubleByteToHtmlEncoding(t.Col)
FROM TableName AS t;

Implementation of the clr function is outside the scope of this article, but here are a couple links that might get you started if you're interested:
Using CLR Integration in SQL Server 2005

How to: Create and Run a CLR SQL Server User-Defined Function

No Comments