
Re: Bug#292330: project: UTF-8 as default



On Sun, Jan 30, 2005 at 06:18:19PM -0200, Henrique de Moraes Holschuh wrote:
> You can do that with some encodings.  I seem to recall many of the ISO ones
> for CJK are quite capable of doing it.  EUC certainly *isn't*, and
> shift-jis isn't either (but then, EUC is a hack, and shift-jis is an ugly
> hack).
and
> Many of us are *NOT* single-language users, as in "single-charset-using"
> language users, even.  But that's not where the problem lies.

What I mean is that you have to tag the document at a separate level from
the encoded text itself (e.g. Content-Type, or XML tags), so you know what
encoding it is.  If you're dealing with multiple (local) character sets,
the default locale doesn't matter--you're not using anything like the
defaults.

Dealing with multiple *remote* character sets is the norm, and everyone
has to be able to handle it as transparently as possible; that's probably
the case that you're worried about.

> > However, it does: I've configured it to send mail as ISO-8859-1,
> > ISO-2022-JP, UTF-8 priority.  This isn't automatic or trivial, but
> > it's not black magic, either.
> 
> We are talking about defaults.  Yes, it is possible depending on the MUA
> (some cannot do it).  But is it a sane *default* ?

I can't speak for all users, but I think Mutt defaulting to the above
priority for Japanese users (e.g. ja_JP.UTF-8) is probably very sane;
for a person sending mostly Japanese mail, it'll send everything in
ISO-2022-JP and he'll never have to care.  I'd be interested to hear
what native Japanese users (in a real-world Japanese environment),
as well as users of other languages, think.
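
As a rough illustration of what such a priority list does (a sketch of
the idea in Python, not Mutt's actual code; the names SEND_PRIORITY and
pick_send_charset are made up for this example): try each character set
in order and use the first one that can represent the whole message.

```python
# Hypothetical sketch: choose the first charset in a priority list that can
# encode the whole message.  This mirrors the idea, not Mutt's implementation.
SEND_PRIORITY = ["iso-8859-1", "iso-2022-jp", "utf-8"]


def pick_send_charset(text, priority=SEND_PRIORITY):
    """Return the first charset in `priority` able to encode `text`."""
    for charset in priority:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            # This charset cannot represent some character; try the next one.
            continue
        return charset
    raise ValueError("no charset in the priority list can encode this text")
```

With this ordering, ASCII and Western European text goes out as
ISO-8859-1, Japanese text as ISO-2022-JP, and anything neither can
represent falls through to UTF-8.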

> The problem is with the idea of using UTF8 as a *default* for *all* locales
> right now.  That means one has to know how to deal with charsets (most
> people don't even know what it is. I hope the 0-8-15 :) Debian user does,
> but...), and that one has to go around fixing the charset setup in most
> applications.  This would make UTF8 a bad default.

I'd say that finding out what configuration still needs to be done
manually to handle language-specific settings, and figuring out how
(and if) to deal with that automatically, is a good step towards making
UTF-8 the default.  (I don't think everything should be shifted to UTF-8
at once, of course--clean transitions don't happen all at once.)

Harald has said that he's actually doing real work towards making UTF-8
a sensible default, remember, not that he wants to flip a switch for
the next release and not actually do anything, or that he thinks it's
ready now.  :)  So, maybe this subthread will be useful to him.

> It is not basic at all :(

I guess I'm spoiled by Mutt.

> Sort of.  Most software will use your default locale unless you configure
> them differently.  This is usually much nicer to broken software that
> *others* have, so chances are you are going to want it that way.

Okay.  So, character set selection, when the user's locale is UTF-8, should
probably be based on the language--for example, in ja_JP.UTF-8, for text which
clearly isn't UTF-8 or ASCII, programs should probably guess SJIS before, say,
Big5.

In practice, this probably means "programs should construct a list of
character sets to try, in a reasonable order, based on the language",
since if you have a plain text file and no clue what the language is,
short of real language analysis it's impossible to distinguish between
eg. ISO-8859-2 and ISO-8859-3, and you probably don't want to guess
at Big5 at all unless there's an indication that the text is probably
Chinese (eg. the user's locale is Chinese).

Figuring out these lists is a difficult project of its own, of course.
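
To make the idea concrete, here's a minimal sketch in Python.  The
per-language lists below are placeholders I made up for illustration
(getting them right is exactly the hard part), and the function names
are hypothetical.  It tries ASCII and UTF-8 first, then only the legacy
encodings associated with the locale's language.

```python
# Hypothetical per-language candidate lists; real lists would need care.
CANDIDATES_BY_LANGUAGE = {
    "ja": ["shift_jis", "euc_jp", "iso2022_jp"],
    "zh": ["big5", "gb2312"],
}


def language_of(locale_name):
    """Extract the language code, e.g. 'ja' from 'ja_JP.UTF-8'."""
    return locale_name.split(".")[0].split("_")[0].lower()


def candidate_charsets(locale_name):
    """ASCII and UTF-8 first, then the legacy charsets for the language."""
    legacy = CANDIDATES_BY_LANGUAGE.get(language_of(locale_name), [])
    return ["ascii", "utf-8"] + legacy


def guess_decode(data, locale_name):
    """Return (charset, text) for the first candidate that decodes cleanly,
    or None if no candidate does."""
    for charset in candidate_charsets(locale_name):
        try:
            return charset, data.decode(charset)
        except UnicodeDecodeError:
            continue
    return None
```

Note that "first candidate that decodes without error" is a crude test:
some byte sequences are valid in more than one encoding, so the order
of each list matters, and a real implementation would likely want a
better heuristic than this.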

> And in that case, you have to configure your application to change charsets
> (even mutt).  If you are going to do that for enough applications, how is
> UTF8 a sane *default* value for that locale?

Well, the only applications I'd have to do it for are those which handle
incoming network data that's untagged or mistagged; in practice--for
my own use--that's entirely email and web pages, and everything else is on
the fringe.  Oh, and file editors, since we'll all be receiving plain text
files in our respective countries' legacy encodings for the foreseeable future.

-- 
Glenn Maynard


