Bug#99933: second attempt at more comprehensive unicode policy

To: 99933@bugs.debian.org
Cc: starner@okstate.edu
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 14 Jan 2003 10:37:20 -0500
Message-id: <1042558640.2901.35.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <522458897.1042529031064.JavaMail.root@dexter.okstate.edu>
References: <522458897.1042529031064.JavaMail.root@dexter.okstate.edu>

On Tue, 2003-01-14 at 02:23, starner@okstate.edu wrote:

> Not acceptable. Filenames are and must be in the locale charset.
> There is no other sane option  [...]

Heh.  I will quote from a previous message of mine about filenames in
the locale charset, which, since you joined the discussion later, you
might not have seen:

On Fri, 2003-01-03 at 18:11, Jochen Voss wrote:
> As I see it, the current (broken?) behaviour is to use the user's
> locale setting (LC_CTYPE) to encode file names.

It appears so, and yes, this behavior is completely and fundamentally
broken.  If you have, say, a Chinese friend who logs onto your computer,
and he sets LANG to something like zh_TW.Big5, then when he tries to
'ls' your files, the non-ASCII names will come out as garbage.  Likewise,
when you try to look at his, it will not work at all.

Moreover, say the system administrator does something like 'find
/home'.  The resulting stream will be a mixture of ISO-8859-X and BIG5,
and impossible to reliably differentiate.  And of course the problem
doesn't just occur when you have a multiuser system; your Chinese friend
could send you a .ogg file named using BIG5, and your Latin 1 system
would simply fail to display the filename correctly.
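
To make the 'find /home' case concrete, here is a minimal Python sketch.
The two filenames are hypothetical, picked only for illustration, and
try_decode is a helper invented for this example.  It shows that a
Latin 1 reader gets no error from the BIG5 name, just wrong characters,
while a strict UTF-8 check rejects both legacy names:

```python
# Hypothetical filenames for illustration; not taken from the thread.
big5_bytes = "中文.ogg".encode("big5")          # as written by a Big5 user
latin1_bytes = "café.txt".encode("iso-8859-1")  # as written by a Latin-1 user


def try_decode(raw, charset):
    """Return the decoded string, or None if raw is invalid in charset."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


# ISO-8859-1 assigns a character to every byte value, so the Big5 name
# decodes "successfully" into the wrong characters instead of failing.
misread = try_decode(big5_bytes, "iso-8859-1")

# Neither legacy name is valid UTF-8, so a strict UTF-8 check flags both.
utf8_view = [try_decode(b, "utf-8") for b in (big5_bytes, latin1_bytes)]
```

The point is only that the bytes carry no label: nothing in the stream
says which charset produced which name.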

And finally, having the encoding of filenames dependent on the current
locale often doesn't make sense even for a single user.  What if you are
a software developer in an ISO-8859-1 locale, and you want to test the
Japanese translation of your software?  You run it with
LANG=ja_JP.ISO-2022-JP or something to get the translations displayed.
As a side effect, every non-ASCII filename on your system stops working.

In summary, UTF-8 is the *only* sane character set to use for
filenames.  Major upstream software for Debian like GNOME is moving
towards requiring UTF-8 for filenames, and we should too. 

>  what do you expect "echo *" to do? 

Quite frankly, I expect it to not work, unless they're using a UTF-8
terminal.

> You can't slap
> filters around everything; it's horribly buggy, and error-prone and would
> take forever to implement, IF everyone wanted to go along with it. 

I am not sure.  I have a feeling we could make "core" programs like 'ls'
and such do conversion, but I agree it would be quite a long time before
we covered "most" of the programs people use.

> The 
> only sane situation is to transition everything as a whole to UTF-8, 
> with filterm or the like for legacy terminals. You can't just change 
> filenames.

I think programs should start expecting UTF-8 filenames today, but be
able to sanely handle filenames in the locale charset.  That way we get
the best of both worlds, and minimize the pain of the transition.
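
One way a program might do this, sketched in Python under my own
assumptions (decode_filename and its fallback parameter are hypothetical
names, not an existing API): try strict UTF-8 first, and only if that
fails fall back to the locale charset.

```python
import locale


def decode_filename(raw, fallback=None):
    """Decode raw filename bytes: UTF-8 first, then the locale charset.

    `fallback` defaults to the current locale's preferred encoding; it is
    a parameter here only so the behaviour can be shown deterministically.
    """
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    try:
        # Strict decoding: legacy 8-bit names rarely pass as valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Legacy name: interpret it in the locale charset, replacing any
        # bytes that charset cannot decode rather than raising.
        return raw.decode(fallback, errors="replace")


new_style = decode_filename("中文.ogg".encode("utf-8"), "iso-8859-1")
old_style = decode_filename(b"caf\xe9.txt", "iso-8859-1")
```

The order matters: trying the locale charset first would never fail for
ISO-8859-X, so UTF-8 names would always be misread.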

Note again that GNOME programs and the like are already creating UTF-8
filenames, because they work completely in UTF-8 internally.  Now, they
*could* try to convert them back to the locale charset.  But I would
argue strongly against this, because the conversion could fail if the
locale's charset isn't able to encode some target characters.  That may
be an "unlikely" scenario, but when you're dealing with something as
fundamental as filenames, you don't want to just ignore "unlikely"
scenarios.
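
A small Python sketch of that failure, using a hypothetical filename and
a helper name (to_locale_charset) made up for this example:

```python
def to_locale_charset(name, charset):
    """Return name encoded in charset, or None if it cannot be encoded."""
    try:
        return name.encode(charset)
    except UnicodeEncodeError:
        return None


# A name Latin 1 can express converts fine...
ok = to_locale_charset("café.txt", "iso-8859-1")

# ...but a Chinese name has no ISO-8859-1 representation at all.
failed = to_locale_charset("中文.ogg", "iso-8859-1")

# Leaving the name in UTF-8 always works.
kept = to_locale_charset("中文.ogg", "utf-8")
```

A program that converted back would have to pick some lossy substitute
for the second case, or refuse to create the file.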
