Visual C++ in Short: Converting between Unicode and UTF-8

The Windows SDK provides the WideCharToMultiByte function to convert a Unicode, or UTF-16, string (WCHAR*) to a character string (CHAR*) using a particular code page. Windows also provides the MultiByteToWideChar function to convert a character string from a particular code page to a Unicode string. These functions can be a bit daunting at first but unless you have a lot of legacy code or APIs to deal with you can just specify CP_UTF8 as the code page and these functions will convert between Unicode UTF-16 and UTF-8 formats. UTF-8 isn’t really a code page in the original sense but the API functions lives on and now provide support for UTF conversions.

ATL provides a set of class templates that wrap these functions to simplify conversions even further. It takes a fairly efficient and elegant approach to memory management (compared to previous versions of ATL) that should serve you well in most cases. CW2A is a typedef for the CW2AEX class template that wraps the WideCharToMultiByte function. Similarly, CA2W is a typedef for the CA2WEX class template that wraps the MultiByteToWideChar function.

In the example below I start with a Unicode string that includes the Greek capital letters for Alpha and Omega. The string is converted to UTF-8 with CW2A and then back to Unicode with CA2W. Be sure to specify CP_UTF8 as the second parameter in both cases otherwise ATL will use the current ANSI code page.

Keep in mind that although UTF-8 strings look like characters strings, you cannot rely on pointer arithmetic to subscript them as the characters may actually consume anywhere from one to four bytes. It’s also possible that Unicode characters may require more than two bytes should they fall in a range above U+FFFF. In general you should treat user input as opaque buffers.

#include <atlconv.h>
#include <atlstr.h>

#define ASSERT ATLASSERT

int main()
{
    const CStringW unicode1 = L"\x0391 and \x03A9"; // 'Alpha' and 'Omega'

    const CStringA utf8 = CW2A(unicode1, CP_UTF8);

    ASSERT(utf8.GetLength() > unicode1.GetLength());

    const CStringW unicode2 = CA2W(utf8, CP_UTF8);

    ASSERT(unicode1 == unicode2);
}

 

If you’re looking for one of my previous articles here is a complete list of them for you to browse through.

Produce the highest quality screenshots with the least amount of effort! Use Window Clippings.

Published Thursday, July 24, 2008 6:31 AM by KennyKerr
Filed under:

Comments

# re: Visual C++ in Short: Converting between Unicode and UTF-8

Thursday, July 24, 2008 3:49 AM by asdf

its not like UTF16 is always 2bytes either

# re: Visual C++ in Short: Converting between Unicode and UTF-8

Thursday, July 24, 2008 4:10 AM by KennyKerr

asdf: Thanks I forgot to mention that. I updated the text to point that out.