C++/VB - DB2 UTF-8 ODBC double conversion

Asked By Mihajlo Cvetanović
20-Nov-09 11:02 AM
I inherited a non-Unicode MFC project that connects to a DB2 database
through ODBC. The application uses UTF-8 strings for database
transactions because the strings are multilingual. In DB2 Control
Center I see those strings as "raw", but when calling SELECT
LENGTH(<some_column>) FROM <some_table> I see that each UTF-8 byte
occupies two bytes in the database, as if an extra conversion to UTF-8
were taking place between ODBC and the database. It seems that DB2 does not
understand that I am working with UTF-8 strings and thinks that the
strings coming from ODBC are in the system code page.

Is there a way to disable this extra conversion to UTF-8, and make DB2
aware that strings from ODBC are already UTF-8?

I realize this is more a question for a DB2 newsgroup, but it
involves MFC as well.
  Tom Serface replied to Mihajlo Cvetanović
20-Nov-09 12:55 PM
Is the data in the DB2 database Unicode?  If you are getting two bytes for
each then maybe there is a conversion being done somewhere.  I do not think
this is necessarily a bad thing since it might save you some conversion
headaches if you just convert your MFC app to Unicode and use that all the
time.

My guess is the conversion is happening somewhere before it gets to the
database.  Are you sure there is nothing like MultiByteToWideChar() being
called in your code somewhere?

Of course, doing ANSI and counting on codepage would be the worst way to
handle strings in a database that could be accessed from any language
install of the OS.  In my experience going to Unicode has not really affected
the performance much.

Tom
  Giovanni Dicanio replied to Tom Serface
20-Nov-09 02:31 PM
Actually, speaking of performance, my guess is that ANSI might be "slower"
(of course, it might not be noticeable to the end user), because modern
versions of Windows like XP or 7 implement the ANSI versions of the APIs
(DoSomethingA) by converting from ANSI to Unicode and then calling the
Unicode versions (DoSomethingW).
So ANSI has a conversion overhead.

G
  Giovanni Dicanio replied to Tom Serface
20-Nov-09 02:34 PM
I read here:

http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0004821.htm

that:


So if the OP is getting two bytes for each character, the DB is probably
using UCS-2?

G
  Tom Serface replied to Giovanni Dicanio
21-Nov-09 01:38 AM
Or the conversion is happening before it gets to the database.

Tom
  Tom Serface replied to Giovanni Dicanio
21-Nov-09 01:39 AM
I think you are right.  It does use more memory, of course, but there is far
less conversion to be done and I think, ultimately, the data is handled
faster.

Tom
  Mihajlo Cvetanović replied to Giovanni Dicanio
21-Nov-09 04:08 AM
I was not accurate enough. It is not true that all bytes I send to database
occupy two bytes there. That is true only for bytes outside the basic ASCII
range. If the string contains only numbers and English letters then the DB2
length of the string in bytes is the count of characters +1 for NUL (as
expected).

The thing that happened is that DB2 could not execute one operation because
one string was too long. Now, when I catch a CDBException within debugger
and in Watch window dig into bound parameters of UPDATE query I see strings
in memory as they are sent to ODBC driver. One of the characters in one
string is the UTF-8-encoded character U+201A (SINGLE LOW-9 QUOTATION MARK), which MFC
sends to DB2 as 3 bytes, but this actually takes 6 bytes in database. I was
expecting it would take 3 bytes in database as well, because the database is
set to UTF-8.

I am 99% sure that there is conversion from system code page to UTF-8
between DB2 ODBC driver and DB2 database. Therefore I am technically in the
wrong group, but was kinda hoping someone knows about this already.
microsoft.public.vc.mfc is more responsive group than most. My question is
how to avoid this conversion?
  Giovanni Dicanio replied to Mihajlo Cvetanović
21-Nov-09 05:35 AM
I assume: "that all bytes" --> "that all characters"


So probably DB2 is actually using UTF-8.



If you go there:

http://www.fileformat.info/info/unicode/char/201a/index.htm

you can see that 0x201A is actually the Unicode code point (and also its
UTF-16 encoding), and the corresponding UTF-8 is 0xE2 0x80 0x9A.
So probably MFC is sending the UTF-8 encoding to the DB2.
Can you see the actual bytes that MFC is sending?
What is the actual value of the 6 bytes?

Giovanni
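
To make the byte values above concrete, here is a minimal portable sketch of how a code point maps to UTF-8 bytes. The function name encode_utf8 is made up for illustration and is not an MFC or Windows API; in a real Windows application the system conversion routines would normally be used instead.

```cpp
#include <cstdint>
#include <string>

// Encode a single Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;         // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8(0x201A) yields the three bytes 0xE2 0x80 0x9A quoted above.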
  Mihajlo Cvetanovic replied to Giovanni Dicanio
21-Nov-09 06:58 AM
My code is using UTF-8 CStrings. My application is non-Unicode, but I must
have Unicode strings, so CStrings are UTF-8 strings when dealing with pretty
much everything except GUI. But when sending them through ODBC they are
technically strings of bytes, and my guess is that ODBC is interpreting them
as strings in the system code page instead of as UTF-8 strings.


Yes.


Yes.


Yes, they are UTF-8 strings in memory. The U+201A character is represented by the 3
bytes you gave here (0xE2 0x80 0x9A). There is no internal conversion. Those
strings in memory go to SQL SDK functions within MFC.


I don't know at this time (I could find out on Monday), but I am 99% certain
that they are those 3 bytes you gave here, but converted to UTF-8 again. All
I know for now is that they occupy 7 bytes in database (6 + NUL).
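
The suspected double conversion can be sketched as follows. This is only an illustration under a loud assumption: every byte is treated as one ISO-8859-1 (Latin-1) character, so byte value equals code point. A real system code page may map some bytes to other code points, so the exact count can differ. The function name is hypothetical and does not describe what the DB2 driver actually does internally.

```cpp
#include <string>

// Re-encode a byte string as UTF-8 as if every byte were one Latin-1
// character (assumption: byte value == code point). This mimics a layer
// that mistakes already-encoded UTF-8 for single-byte code page text.
std::string reencode_latin1_as_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out += static_cast<char>(b);                 // ASCII passes through unchanged
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));   // high bytes become 2-byte sequences
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Under that assumption the 3 bytes 0xE2 0x80 0x9A grow to 6 bytes (0xC3 0xA2 0xC2 0x80 0xC2 0x9A), while pure ASCII text keeps its length, which matches the symptoms described in this thread.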
  Giovanni Dicanio replied to Mihajlo Cvetanovic
21-Nov-09 07:17 AM
You are storing UTF-8 data in CString's in a non-Unicode MFC app, so CString
is actually CStringA.
OK.
(Note that you may want to use CStringA explicitly or std::string, so even
if you switch your app to Unicode, your strings will continue storing UTF-8
data.)


OK.


So, let us wait until Monday.

Giovanni
  Mihai N. replied to Mihajlo Cvetanović
23-Nov-09 02:57 AM
First gut feeling: there are no components in Windows (except for the code page
conversion APIs) that take UTF-8 strings as input/output.
So I strongly doubt ODBC is an exception to that.
If you do not use wchar_t (or WCHAR, same thing), meaning UTF-16, then
you probably use char*, which to ODBC means the ANSI code page, so it does
(the extra) conversion.

You might get away with double conversion if you run on an English system,
but I anticipate troubles for anything non-1252 (think adding stuff to DB
from English OS and reading it on a Japanese one).

I would say that you should try to bite the bullet and "talk" UTF-16
with ODBC.


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
  Mihai N. replied to Mihajlo Cvetanovic
23-Nov-09 03:03 AM
There is no such thing as UTF-8 CStrings.
You might store UTF-8 bytes in there, but pretty much anybody outside your
application will interpret it as ANSI code page.
So "pretty much everything except GUI" should change to:
Pretty much everything except
- GUI
- ODBC
- Registry access
- File system
- Clipboard
- network communication (unless you do your own low-level
thing and do not use the Win APIs/MFC classes)
- who knows what else

Basically using UTF-8 is asking for trouble.
In the long run you should really consider going utf-16.
This is the only Unicode encoding in the Windows world.



Correct. This is expected.


  Joseph M. Newcomer replied to Mihai N.
23-Nov-09 09:26 AM
UTF-8 is used "at the boundaries" when you bring data in or write data out, if you use it
at all.  It should never be used internally.  Use Unicode internally.
joe

Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
  Mihajlo Cvetanović replied to Giovanni Dicanio
23-Nov-09 09:54 AM
Problem solved. All I had to do is call one line in command prompt:

db2set DB2CODEPAGE=1208

Now it works, UTF-8 strings in non-Unicode application go to DB2 as they
should. Thank you all.
  Tom Serface replied to Mihajlo Cvetanović
23-Nov-09 11:12 AM
Glad you got it figured out.  That was an interesting one even if it was
not specifically MFC related.

I use UTF-8 to store data outside of my program (files), but I always
convert to Unicode when I come back in.

You may still want to go to Unicode in your application at some point
because being dependent on the code page for the conversion is eventually
going to become more trouble for you than it is worth IMO.  But for now
sounds like it works for you.

Tom
  Tom Serface replied to Joseph M. Newcomer
23-Nov-09 11:14 AM
I wish MSFT would get better at working with UTF-8.  It is more trouble to
work with in memory, of course, but for storing files or database data (like
OP is doing) it makes a great deal of sense.  In cases like ours where most
of the time users only need one byte characters, but we still need to
support MBCS it means our external files are only bigger if they need to be.
The conversion from UTF-8 to Unicode is very fast in memory (from what I have
seen).

Tom
  David Wilkinson replied to Joseph M. Newcomer
23-Nov-09 12:23 PM
UTF-8 *is* Unicode. "Windows Unicode" is UTF-16, stored in wchar_t strings (with
surrogate pairs where necessary).

But I think we agree that the thing to avoid, at all costs, is having anything
to do with the local code page. It is a pity, IMHO, that CString (and other
parts of MFC) has automatic conversions that assume the local code page for
8-bit strings. These features can be very dangerous if you want an application
that uses UTF-8 "at the boundaries", especially if you only test your code with
ASCII strings (sloppy, I agree, but it happens...).

--
David Wilkinson
Visual C++ MVP
  Giovanni Dicanio replied to Mihajlo Cvetanović
23-Nov-09 12:43 PM
Great.


OK, so you set the UTF8 code page identifier in DB2.


Glad you solved that.

BTW: I second Tom's (and others) suggestion to move your app to Unicode.

Giovanni
  Joseph M. Newcomer replied to David Wilkinson
23-Nov-09 12:44 PM
Well, let me rephrase that: use UTF-16.  UTF-8 is an encoding of Unicode, and now that
Unicode is 32-bit, UTF-16 is also an encoding, but it is the native encoding of Windows,
whereas UTF-8 is not.

I always do my own encode/decode when I use UTF-8.
joe


  Tom Serface replied to David Wilkinson
23-Nov-09 02:34 PM
We Windows programmers often fall into the trap of considering Unicode ==
UTF-16 and often UTF-16 LE even though the Windows Unicode is not exactly
any of them that I can see.  You're right about CString doing auto
conversions.  It can get someone into trouble that they did not even know
they had.

Tom
  Tim Slattery replied to David Wilkinson
23-Nov-09 04:07 PM
I do not think so, not exactly. A file encoded in UTF-8 uses a single
byte to store characters in the 7-bit ASCII code. For other characters
it uses a multi-byte sequence to store the appropriate Unicode value:
sometimes 2 bytes, sometimes 3, sometimes 4 bytes for a single Unicode
character.

See the Wikipedia article: http://en.wikipedia.org/wiki/UTF8

--
Tim Slattery
Slattery_T@bls.gov
http://members.cox.net/slatteryt
  Joseph M. Newcomer replied to Tim Slattery
23-Nov-09 06:14 PM
See below...

****
I think the issue is whether or not UTF-8 is functionally equivalent to Unicode.
Essentially, from an *information* content perspective, UTF-8 == UTF-16[LB]E == UTF32[LB]E

But UTF-8 is independent of the endianness of the sender and receiver, and as such means
that you can transparently read a UTF-8 stream into ANY endian machine, converting it to
UTF-16 or UTF-32 internally using the native endianness of the machine, manipulate it
using the standard Unicode capabilities (OK, we will still have to deal with surrogates with
UTF-16, and that remains a problem, but the problem is independent of the endianness).
Then you can write the data back out in UTF-8 (native endian to UTF-8), and have an
endian-independent representation again.

Now Windows uses UTF-16LE as its native encoding.  If I give a UTF-8 string to CreateFile,
I will be calling CreateFileA, and get a weird-looking filename if any character in that
string has a numeric value > 127.  If I am interfacing to a database system, I need to know
what *its* native representation of a string is.  If it stores 8-bit strings, I have to
store a UTF-8 encoding if I want to retain the information content of the original string.
Of course, this means that everyone who is using that database has to understand that the
strings are being stored in UTF-8.  The values can be handed around as uninterpreted
character sequences; they can be compared for equality (but not collating sequence),
written back into other string fields of the database, and transmitted out to other
places, all of which have to understand they are dealing with UTF-8.  Because it is a
sequence of endian-independent characters with no embedded 0 bytes (which, if a native
UTF-16LE or UTF-16BE string is sent to a context that expects an 8-bit string will cause it to
malfunction), it can be handed around as a "black box" value that has very little that can
be done with it; for example, it cannot be displayed, it cannot be compared for collating
order, it cannot be used in any context in which the characters have significance (such as
a file name).  Because this can become confusing, and at various points you end up having
to convert to or from UTF-16[BL]E so the appropriate API has a well-formed character
string, it is usually simpler to keep the representation in the program entirely in
UTF-16[BL]E (that is, in a Windows app, UTF-16LE), and convert it only at the boundaries
where it gets transmitted, or in the case of a database limited to 8-bit characters,
stored.  Life is simpler if only the external forms are UTF-8.

Which leads to the interesting question about database sizes.  Suppose I want to store an
80-character string.  In ANSI, I need 80 bytes to store that string.  But in UTF-16
encodings, I would need 160 bytes.  Sounds bad.  But realize that to hold 80 *characters*
if I have UTF-8, I could potentially need ***320*** bytes, since each of the UTF-8
encodings could be 4 8-bit bytes.  If I must guarantee 80 characters, I have to allow for
80 UTF-8-encoded characters.  So, you argue, the number of 4-byte encodings is small, and
applies only to "foreign" languages; because of the richness of the ideographic languages,
while I might need 4 bytes to encode the information, the languages are so rich that I
might need only 20 glyphs in these languages, so my 80-byte field can hold 20 Chinese, or
Sumerian, or Cuneiform glyphs.  Well, now try to explain this to someone using one of the
UTF-32 glyphs whose value is > 65535.  Now in UTF-16, I would need surrogates, so to
represent 80 glyphs I would need 80 surrogate+value pairs, which is 160 UTF-16 values, or
320 bytes.  So whether I use UTF-8 or UTF-16 with surrogates, my field has to be 320 bytes
in length.  So the file sizes are the same!  Note that it is not just dead languages like
Cuneiform that are in that region, so are you going to try to explain to someone whose
native language is expressed in the range > 65535 that they cannot use your database to
represent as many characters as someone who is using, say, American English.  Note that
you have to now think about new kinds of abstraction; if you want 80 characters, the limit
has to be computed purely on the actual characters, independent of the encoding.  So while
the database field might be 320 bytes in length, I have to artificially impose on it a
limit of 80 *characters*.  If you give me a "UTF-8 encoded" string that only has the 7-bit
subset (space thru ~, to simplify the explanation), then you would have to limit the field
to 80 bytes (losing 160 bytes in each record); if you get something that has character
sequences in the range 128..65535, encoded in UTF-16, you would limit it to 80 characters,
by counting the number of characters, or 160 bytes, losing, well, 160 bytes in each
record.  In UTF-8, you would have to count the number of UTF-16 characters, and truncate
at 80.
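
The arithmetic in the paragraph above can be sketched in a few lines. The function names are invented for illustration, the inputs are assumed to be valid Unicode scalar values, and "character" here means a code point, not a user-perceived glyph.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to store one code point in UTF-8 (valid scalar value assumed).
constexpr std::size_t utf8_bytes(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP,
// a surrogate pair (two units) above U+FFFF.
constexpr std::size_t utf16_bytes(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// Worst-case field size for a guaranteed number of code points.
// Both encodings top out at 4 bytes per code point.
constexpr std::size_t worst_case_field_bytes(std::size_t characters) {
    return characters * 4;
}
```

For 80 characters the worst case is 320 bytes in either encoding, which is the figure used above. Note that a BMP character such as U+201A costs 3 bytes in UTF-8 but only 2 in UTF-16.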

For codes that require surrogates, you would essentially have to encode in UTF32, and
count characters.  Your lossage would vary between 316 and 0 bytes (1-character UTF-32
to 80 UTF-32 characters in length).

If using UTF-8 or UTF-16, you would have to convert to UTF-32, limit the length to 80
UTF-32 characters, then convert back to UTF-16 or UTF-8 to the string you want, which
could be somewhere between 1 and 320 bytes in length.
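
As a sketch of imposing such a character limit, the following swaps in a different technique from the UTF-32 round trip described above: it counts UTF-8 lead bytes directly and cuts only at a character boundary. It assumes the input is already well-formed UTF-8 and counts code points, not grapheme clusters; the helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of code points in s, assuming s is well-formed UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;   // every non-continuation byte starts a character
    }
    return n;
}

// Keep at most max_chars code points, never splitting a multi-byte sequence.
std::string truncate_code_points(const std::string& s, std::size_t max_chars) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == max_chars) return s.substr(0, i);  // cut before the next lead byte
            ++seen;
        }
    }
    return s;   // already within the limit
}
```

With an 80-character limit, truncate_code_points(text, 80) returns a string whose byte length lies anywhere between 0 and 320, depending on which characters it holds.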

So it is not at all clear that if a database supports only 8-bit characters that you would
actually have "smaller" files if you really want to maintain identical functionality
across all possible customer character sets.

Of course, the ultimate answer to making smaller files is to store your database with
variable-length strings.  This has the advantage that it saves space, but now you have the
time performance to consider.  You cannot locate record N by computing a file offset of
base_offset + N * (record_length), updates can be more expensive (for those of us who grew
up with IBM's ISAM variable-length databases, the whole issue of overflow pages is a
natural paradigm, but it is slow, expensive in time and disk space, and in those days it
was essential when BIG disk drives held tens of megabytes).  Today, it is not clear that
disk space has any meaning whatsoever until you get into the gigarecord-sized databases; I
just added 4TB of RAID storage for $320 plus the cost of the iSCSI rack to hold them.
joe
  David Wilkinson replied to Tim Slattery
23-Nov-09 09:02 PM
For some code points UTF-16 uses two code units. There is really no conceptual
difference between UTF-8 and UTF-16. Both are full encodings of the Unicode
standard.
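
For illustration, here is a minimal sketch of the "two code units" case. The function name encode_utf16 is made up, the input is assumed to be a valid Unicode scalar value, and the result is a sequence of 16-bit code units with no byte order applied.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as UTF-16 code units (valid scalar value assumed).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    if (cp < 0x10000) {
        return { static_cast<std::uint16_t>(cp) };           // BMP: a single code unit
    }
    cp -= 0x10000;                                           // 20 bits remain
    return { static_cast<std::uint16_t>(0xD800 | (cp >> 10)),     // high surrogate: top 10 bits
             static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)) }; // low surrogate: bottom 10 bits
}
```

U+201A stays a single unit (0x201A), while U+12000 (the first code point of the Cuneiform block) becomes the pair 0xD808 0xDC00.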

--
David Wilkinson
Visual C++ MVP
  Mihai N. replied to Joseph M. Newcomer
24-Nov-09 03:28 AM
In general I recommend against that.
It is quite easy to get wrong and not validate it properly.
I have seen even RFCs giving wrong algorithms for that.
Not saying that yours is bad, I have not seen it.

But getting it right is something the average Joe
(in contrast to the Newcomer Joe :-) will probably
fail to do.

For people who did not want to use the Windows stuff
I used to recommend the conversion routines at
ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF
But they are now gone!
(on 10/31/2009 they were still there)

I have to ask why that happened.


  Mihai N. replied to Joseph M. Newcomer
24-Nov-09 03:49 AM
Unicode considers the various UTFs flavors completely equivalent.
Just various encoding forms for the same thing.
Unicode itself is not 32 bit, or 8, or anything.
It is just a mapping from characters to numbers plus a collection of
character properties.



I would argue that I should not have to care about the internal encoding
of the database.
The correct types used should be NCHAR, NVARCHAR and NTEXT.
The public API should take UTF-16 or UTF-32 or UTF-8 and document it.
Any conversion between the public API text representation and the internal
format should be transparent.

Also the database should be aware that text stored is Unicode, and not
just a bunch of bytes.
Because otherwise things like sorting (and functions like between),
case-insensitive searching, functions like substring, replace, like,
% (one or more characters), _ (one character), will not do the right thing.


Stuff can be moved around without awareness of what is in there, but one has to
be very careful which operations are safe and which ones are not
(pretty much like storing UTF-8 in CString).



100% agree.



To make it more real: characters beyond BMP (Basic Multilingual Plane)
are required in order to support the GB-18030 Chinese National standard.
And the standard is enforced. If you want to sell your software in China,
you have to get a GB-18030 certification, or you do not sell it.

Also, national standards in Japan and Hong Kong require support for
characters above U+FFFF. Although these standards are not enforced like the
Chinese one, supporting them might give you an extra edge in a competitive
market.

So beyond BMP it is not about some extinct languages that only few
archeologists care about.


  Joseph M. Newcomer replied to Mihai N.
24-Nov-09 10:11 AM
When I said "My own encode/decode" I was referring to using

MultiByteToWideChar(CP_UTF8,...)
or
WideCharToMultiByte(CP_UTF8,...)

and not that I was actually writing my own algorithms to do it.
joe


  Mihai N. replied to Joseph M. Newcomer
25-Nov-09 01:31 AM
I agree that using the win api is the safe way.
Although I am sure you can do your own stuff, if you want :-)

