
Re: Bug#292330: project: UTF-8 as default



On Sun, Jan 30, 2005 at 11:23:29AM -0200, Henrique de Moraes Holschuh wrote:
> > Please could you explain why?
> 
> Do your homework about Unicode and locales.  Hints for the googling:
> Unicode CJK unification problems.

My recollection is that this is only a problem when displaying, e.g., both
Chinese and Japanese text at once.  If you're a Japanese user, converting
from SJIS to UTF-8 and back always round-trips properly, and rendering
UTF-8 text with a Japanese font should be the same as rendering SJIS
text with the same Japanese font.
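
As a rough illustration of the round-trip claim, here is a minimal Python
sketch.  It assumes Python's built-in "shift_jis" codec; the function name
and the sample string are made up for the example, and vendor variants
such as CP932 are a separate codec that this does not exercise.

```python
def round_trips(text: str, legacy: str = "shift_jis") -> bool:
    """Check that legacy -> UTF-8 -> legacy yields the original bytes."""
    legacy_bytes = text.encode(legacy)
    # Convert the legacy bytes to UTF-8, as a migration would.
    as_utf8 = legacy_bytes.decode(legacy).encode("utf-8")
    # Convert back and compare against what we started with.
    back = as_utf8.decode("utf-8").encode(legacy)
    return back == legacy_bytes


if __name__ == "__main__":
    print(round_trips("日本語のテキスト"))
```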

Problems arise due to different languages wanting different glyphs on
the same codepoint: if all you know is that text is UTF-8, you don't
know whether to display using a Japanese font or a Chinese font (or
something else), so you might display the wrong glyph.  However, a
Japanese user will be configuring his system to use Japanese fonts;
unless he's a multilingual user who wants to display Chinese text
with a Chinese font, this isn't a problem.  (A multilingual user is
no better off with a different locale, though--if you can store files
in different encodings, you have to tag the encoding, and if you can
do that, the locale doesn't matter.)

A similar problem arises when sending mail out: I want to send mail
in local charsets, not in UTF-8, since many popular but broken
mailers don't support it.  If my system knows nothing about my usage,
then it won't know what to use: whether a given piece of CJK UTF-8
text should be converted to ISO-2022-JP or the Chinese equivalent.
However, my system does know: I've configured it to try ISO-8859-1,
ISO-2022-JP and UTF-8, in that order of priority.  This isn't automatic
or trivial, but it's not black magic, either.
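
A priority list like that can be sketched in a few lines of Python.  This
is only an illustration of the idea, not how any particular mailer
implements it; the function name and default list are hypothetical, and
the codec names are Python's.

```python
def pick_send_charset(text, priority=("iso-8859-1", "iso-2022-jp", "utf-8")):
    """Return the first charset in the priority list that can encode text."""
    for charset in priority:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset can't represent the text; try the next
        return charset
    return None  # nothing in the list fits


if __name__ == "__main__":
    for sample in ("café", "日本語", "한국어"):
        print(sample, "->", pick_send_charset(sample))
```

Western European text goes out as ISO-8859-1, Japanese text falls through
to ISO-2022-JP, and anything else ends up as UTF-8.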

I don't claim UTF-8 is usable, yet, for all languages and all environments
(e.g. input method support, perhaps), but I *have* done some research
on CJK unification ("done my homework"), and I haven't seen how it's a
difficult problem for a single-language user.  (I may, of course, have missed
something obvious--feel free to point it out.)

(Even Windows manages to get font selection right: despite the fact that
it's entirely UTF-16--ugh--internally, I nonetheless see Japanese fonts
for Japanese web pages and Chinese fonts for Chinese pages.)

> Also I can assure you 80% of the mail I see getting through the mail servers
> I admin is either latin-1 encoded, or that Windows CP1252 monstruosity
> (often mistagged as latin-1).  Too much of it without any sort of charset
> declarations at all, since too many people use extremely crappy software.
> It is even worse for web pages.

Mutt automatically attempts to guess the charset of incoming data when
it has no content-type--the charset of the local system is completely
irrelevant to that.  Mutt detects the charset of incoming mail the same
way regardless of whether my locale is UTF-8 or ISO-8859-1.  (I don't
know about other mailers, but this seems very basic.)  The same applies
to browsers.
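
To show why the locale needn't enter into it, here is a small Python
sketch of guessing the charset of an untagged message by trying
candidates in order.  This is not Mutt's actual code; the function name
and the candidate list are assumptions for the example.  Note that
nothing in it looks at the locale.

```python
def guess_charset(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate charset that decodes data without error."""
    for charset in candidates:
        try:
            data.decode(charset)
        except UnicodeDecodeError:
            continue  # not valid in this charset; try the next one
        return charset
    return None


if __name__ == "__main__":
    print(guess_charset(b"hello"))
    print(guess_charset(b"caf\xc3\xa9"))  # valid UTF-8
    print(guess_charset(b"caf\xe9"))      # not valid UTF-8
```

ISO-8859-1 goes last because every byte sequence decodes as ISO-8859-1,
so it only works as a fallback.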

There are ambiguous cases, where a web page might be in more than one
character set.  In this case, the software needs a preference; this
preference is based on the user--a Japanese person probably wants a
page which might be SJIS or Big5 to be displayed as if it's SJIS.
Again, the locale is irrelevant here, though some software might use
the locale to determine the default.
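
The ambiguous case can be sketched the same way: some byte sequences are
valid in more than one charset, so the order of the user's preference
list decides.  The names below are hypothetical, the codecs are Python's
"shift_jis" and "big5", and the sample bytes were picked only because
they fall in a range that both encodings use.

```python
def plausible_charsets(data, candidates):
    """Return every candidate charset that decodes data without error."""
    matches = []
    for charset in candidates:
        try:
            data.decode(charset)
        except UnicodeDecodeError:
            continue
        matches.append(charset)
    return matches


def choose_charset(data, preference):
    """Pick the first plausible charset, in the user's preference order."""
    matches = plausible_charsets(data, preference)
    return matches[0] if matches else None


if __name__ == "__main__":
    ambiguous = b"\xe0\x40"  # a valid two-byte sequence in both charsets
    print(plausible_charsets(ambiguous, ("shift_jis", "big5")))
    print(choose_charset(ambiguous, ("shift_jis", "big5")))  # Japanese user
    print(choose_charset(ambiguous, ("big5", "shift_jis")))  # Chinese user
```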

-- 
Glenn Maynard


