
See below...
****
I think the issue is whether or not UTF-8 is functionally equivalent to Unicode.
Essentially, from an *information* content perspective, UTF-8 == UTF-16[LB]E == UTF-32[LB]E.
But UTF-8 is independent of the endianness of the sender and receiver, and as such means
that you can transparently read a UTF-8 stream into ANY endian machine, converting it to
UTF-16 or UTF-32 internally using the native endianness of the machine, manipulate it
using the standard Unicode capabilities (OK, we will still have to deal with surrogates with
UTF-16, and that remains a problem, but the problem is independent of the endianness).
Then you can write the data back out in UTF-8 (native endian to UTF-8), and have an
endian-independent representation again.
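
A minimal sketch of that equivalence, in Python rather than Windows code so it can be run anywhere. The sample string and the variable names are my own illustration. It shows that the three encodings carry the same information, that the two UTF-16 forms differ only in byte order, and that UTF-8 has just one byte order to worry about.

```python
# Sample text: Latin-1 range, CJK (BMP), and a cuneiform sign above U+FFFF.
text = "na\u00efve \u4e2d\u6587 \U00012000"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# Same information content: every form decodes back to the same string.
round_trips = [
    utf8.decode("utf-8"),
    utf16_le.decode("utf-16-le"),
    utf16_be.decode("utf-16-be"),
]

# The UTF-16 forms are byte-swapped copies of each other.
swapped = bytes(
    b
    for i in range(0, len(utf16_le), 2)
    for b in (utf16_le[i + 1], utf16_le[i])
)

print(all(s == text for s in round_trips))  # True
print(utf16_le == utf16_be)                 # False
print(swapped == utf16_be)                  # True
```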
Now Windows uses UTF-16LE as its native encoding. If I give a UTF-8 string to CreateFile,
I will be calling CreateFileA, and get a weird-looking filename if any character in that
string has a numeric value > 127. If I am interfacing to a database system, I need to know
what *its* native representation of a string is. If it stores 8-bit strings, I have to
store a UTF-8 encoding if I want to retain the information content of the original string.
Of course, this means that everyone who is using that database has to understand that the
strings are being stored in UTF-8. The values can be handed around as uninterpreted
character sequences; they can be compared for equality (but not collating sequence),
written back into other string fields of the database, and transmitted out to other
places, all of which have to understand they are dealing with UTF-8. Because it is a
sequence of endian-independent bytes with no embedded 0 bytes (embedded 0 bytes are what
cause a malfunction when a native UTF-16LE or UTF-16BE string is sent to a context that
expects an 8-bit string), it can be handed around as a "black box" value that has very little that can
be done with it; for example, it cannot be displayed, it cannot be compared for collating
order, it cannot be used in any context in which the characters have significance (such as
a file name). Because this can become confusing, and at various points you end up having
to convert to or from UTF-16[BL]E so the appropriate API has a well-formed character
string, it is usually simpler to keep the representation in the program entirely in
UTF-16[BL]E (that is, in a Windows app, UTF-16LE), and convert it only at the boundaries
where it gets transmitted, or in the case of a database limited to 8-bit characters,
stored. Life is simpler if only the external forms are UTF-8.
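
To make the two failure modes concrete, here is a small Python sketch. The helper `as_c_string` is hypothetical and only imitates what a NUL-terminated 8-bit string routine would see, and code page 1252 is my assumption for the "ANSI" code page; a real system may use a different one.

```python
def as_c_string(data: bytes) -> bytes:
    """Imitate an 8-bit C string routine: stop at the first 0 byte."""
    return data.split(b"\x00", 1)[0]

name = "r\u00e9sum\u00e9.txt"

utf8 = name.encode("utf-8")
utf16_le = name.encode("utf-16-le")

# UTF-8 survives an 8-bit, NUL-terminated context untouched.
print(as_c_string(utf8) == utf8)   # True

# UTF-16LE is cut off after the first character, because 'r' is 0x72 0x00.
print(as_c_string(utf16_le))       # b'r'

# UTF-8 bytes interpreted as if they were ANSI (assumed cp1252) text:
# this is the weird-looking filename.
mangled = utf8.decode("cp1252")
print(mangled)
```

The black-box property holds only as long as nobody interprets the bytes; the moment an API treats them as text in some other encoding, you get the mangled form.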
Which leads to the interesting question about database sizes. Suppose I want to store an
80-character string. In ANSI, I need 80 bytes to store that string. But in UTF-16
encodings, I would need 160 bytes. Sounds bad. But realize that to hold 80 *characters*
if I have UTF-8, I could potentially need ***320*** bytes, since each of the UTF-8
encodings could be 4 8-bit bytes. If I must guarantee 80 characters, I have to allow for
80 UTF-8-encoded characters. So, you argue, the number of 4-byte encodings is small, and
applies only to "foreign" languages; because of the richness of the ideographic languages,
while I might need 4 bytes to encode the information, the languages are so rich that I
might need only 20 glyphs in these languages, so my 80-byte field can hold 20 Chinese, or
Sumerian, or Cuneiform glyphs. Well, now try to explain this to someone using one of the
UTF-32 glyphs whose value is > 65535. Now in UTF-16, I would need surrogates, so to
represent 80 glyphs I would need 80 surrogate pairs, which is 160 UTF-16 code units, or
320 bytes. So whether I use UTF-8 or UTF-16 with surrogates, my field has to be 320 bytes
in length. So the file sizes are the same! Note that it is not just dead languages like
Cuneiform that are in that region, so are you going to try to explain to someone whose
native language is expressed in the range > 65535 that they cannot use your database to
represent as many characters as someone who is using, say, American English? Note that
you have to now think about new kinds of abstraction; if you want 80 characters, the limit
has to be computed purely on the actual characters, independent of the encoding. So while
the database field might be 320 bytes in length, I have to artificially impose on it a
limit of 80 *characters*. If you give me a "UTF-8 encoded" string that only has the 7-bit
subset (space thru ~, to simplify the explanation), then you would have to limit the field
to 80 bytes (losing 240 bytes in each record); if you get something that has character
sequences in the range 128..65535, encoded in UTF-16, you would limit it to 80 characters,
by counting the number of characters, or 160 bytes, losing, well, 160 bytes in each
record. In UTF-8, you would have to count the characters rather than the bytes (each one
takes 2 or 3 bytes in that range), and truncate at 80.
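
The arithmetic is easy to check. This is a small Python sketch; the function name `encoded_sizes` and the four sample characters are my own choices, one from each size class.

```python
def encoded_sizes(ch: str, count: int = 80) -> dict:
    """Bytes needed to store `count` copies of `ch` in each encoding."""
    s = ch * count
    return {
        "utf-8": len(s.encode("utf-8")),
        "utf-16": len(s.encode("utf-16-le")),
        "utf-32": len(s.encode("utf-32-le")),
    }

samples = {
    "ASCII 'A'": "A",                # U+0041
    "Latin e-acute": "\u00e9",       # U+00E9
    "CJK U+4E2D": "\u4e2d",          # in the 16-bit range
    "Cuneiform U+12000": "\U00012000",  # above 65535
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
# ASCII 'A'          -> utf-8:  80, utf-16: 160, utf-32: 320
# Latin e-acute      -> utf-8: 160, utf-16: 160, utf-32: 320
# CJK U+4E2D         -> utf-8: 240, utf-16: 160, utf-32: 320
# Cuneiform U+12000  -> utf-8: 320, utf-16: 320, utf-32: 320
```

Only the last row forces the 320-byte field, but a field that must guarantee 80 characters has to be sized for it.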
For codes that require surrogates, you would essentially have to encode in UTF-32, and
count characters. Your lossage would vary between 316 and 0 bytes (1-character UTF-32
to 80 UTF-32 characters in length).
If using UTF-8 or UTF-16, you would have to convert to UTF-32, limit the length to 80
UTF-32 characters, then convert back to UTF-16 or UTF-8 to get the string you want, which
could be somewhere between 1 and 320 bytes in length.
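
A sketch of that convert, limit, convert-back step in Python. The function name `truncate_code_points` is hypothetical. It relies on the fact that a Python `str` is indexed by code point, which plays the role of the UTF-32 intermediate form here. Note the assumption that "character" means one code point; combining sequences would need more care.

```python
def truncate_code_points(data: bytes, encoding: str, limit: int = 80) -> bytes:
    """Limit an encoded string to `limit` code points, keeping its encoding."""
    text = data.decode(encoding)          # str indexes by code point
    return text[:limit].encode(encoding)  # re-encode in the external form

long_ascii = ("A" * 100).encode("utf-8")
long_cuneiform_utf8 = ("\U00012000" * 100).encode("utf-8")
long_cuneiform_utf16 = ("\U00012000" * 100).encode("utf-16-le")

print(len(truncate_code_points(long_ascii, "utf-8")))               # 80
print(len(truncate_code_points(long_cuneiform_utf8, "utf-8")))      # 320
print(len(truncate_code_points(long_cuneiform_utf16, "utf-16-le"))) # 320
```

Truncating the raw bytes at a fixed count instead could cut a multi-byte sequence or a surrogate pair in half, which is why the limit has to be applied to characters.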
So it is not at all clear that if a database supports only 8-bit characters that you would
actually have "smaller" files if you really want to maintain identical functionality
across all possible customer character sets.
Of course, the ultimate answer to making smaller files is to store your database with
variable-length strings. This has the advantage that it saves space, but now you have the
time performance to consider. You cannot locate record N by computing a file offset of
base_offset + N * (record_length), and updates can be more expensive. (For those of us who grew
up with IBM's ISAM variable-length databases, the whole issue of overflow pages is a
natural paradigm, but it is slow, expensive in time and disk space, and in those days it
was essential when BIG disk drives held tens of megabytes.) Today, it is not clear that
disk space has any meaning whatsoever until you get into the gigarecord-sized databases; I
just added 4TB of RAID storage for $320 plus the cost of the iSCSI rack to hold them.
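
To illustrate the tradeoff, here is a rough Python sketch. The function names and the three sample values are hypothetical, and the offset table merely stands in for whatever index a real database would keep. With fixed-length records the offset is pure arithmetic; with variable-length records you need a table built by walking the records.

```python
RECORD_LENGTH = 320  # bytes: 80 characters at up to 4 bytes each

def fixed_record_offset(base_offset: int, n: int,
                        record_length: int = RECORD_LENGTH) -> int:
    """Offset of record n when every record has the same length."""
    return base_offset + n * record_length

def variable_record_offsets(records: list, base_offset: int = 0) -> list:
    """Offset of each variable-length record; requires scanning them all."""
    offsets = []
    position = base_offset
    for record in records:
        offsets.append(position)
        position += len(record)
    return offsets

names = ["joe", "r\u00e9sum\u00e9", "\U00012000\U00012001"]
encoded = [n.encode("utf-8") for n in names]

print(fixed_record_offset(0, 2))         # 640
print(variable_record_offsets(encoded))  # [0, 3, 11]
print(sum(len(r) for r in encoded))      # 19 bytes, versus 960 fixed
```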
joe