Bug#99933: second attempt at more comprehensive unicode policy

To: 99933@bugs.debian.org
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 03 Jan 2003 21:50:26 -0500
Message-id: <1041648625.21808.28.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <20030103231158.GB8502@tatonka.pfalz.de>
References: <1041476827.25298.32.camel@space-ghost> <20030102181206.GA24191@atlas15.dnp.fmph.uniba.sk> <1041533855.15063.19.camel@space-ghost> <1041546314.22038.9.camel@space-ghost> <20030103231158.GB8502@tatonka.pfalz.de>

On Fri, 2003-01-03 at 18:11, Jochen Voss wrote:

> Is this meant to apply to programs like "ls", "bash", "touch", and
> "emacs"? 

Yes.

> I imagine that the transition period could be a hard time
> for users who (like me) use non-ASCII characters in file-names.

That is probably true.  But we really have no other choice.  See below.

> As I see it, the current (broken ?) behaviour is, to use the user's
> locale setting (LC_CTYPE) to encode file names.  

It appears so, and yes, this behavior is completely and fundamentally
broken.  Say a Chinese friend logs onto your computer and sets LANG to
something like zh_TW.Big5.  When he tries to 'ls' your files, every
non-ASCII name will be misinterpreted as Big5 and come out as garbage.
Likewise, when you try to look at his files, their names will be
unreadable to you.

Moreover, say the system administrator does something like 'find
/home'.  The resulting stream will be a mixture of ISO-8859-X and Big5,
and there is no reliable way to tell which names are in which encoding.
And of course the problem doesn't just occur on multiuser systems; your
Chinese friend could send you a .ogg file named using Big5, and your
Latin 1 system would simply fail to decode the filename correctly.
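
To make the mismatch concrete, here is a minimal sketch in Python.  The
two filenames are made-up examples, and decode_listing is a hypothetical
helper standing in for any tool that trusts the current locale:

```python
# Two hypothetical filenames as raw bytes, the way a filesystem stores them.
latin1_name = "caf\u00e9.txt".encode("iso-8859-1")   # "cafe" with e-acute
big5_name = "\u4e2d\u6587.ogg".encode("big5")        # two CJK characters

def decode_listing(names, encoding):
    """Decode every raw name with one assumed encoding, as a tool that
    trusts the current locale would do."""
    return [raw.decode(encoding, errors="replace") for raw in names]

listing = [latin1_name, big5_name]
seen_by_latin1_user = decode_listing(listing, "iso-8859-1")
seen_by_big5_user = decode_listing(listing, "big5")

for raw, a, b in zip(listing, seen_by_latin1_user, seen_by_big5_user):
    print(repr(raw), "| latin1:", repr(a), "| big5:", repr(b))
```

Each user sees only his own file correctly.  Note that decoding as
ISO-8859-1 never raises an error, since every byte maps to some
character, so nothing in the stream signals that a name was misread.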

And finally, having the encoding of filenames depend on the current
locale often doesn't make sense even for a single user.  What if you are
a software developer in an ISO-8859-1 locale, and you want to test the
Japanese translation of your software?  You run it with
LANG=ja_JP.eucJP or something to get the translations displayed.
As a side effect, every non-ASCII filename on your system will be
misinterpreted.

In summary, UTF-8 is the *only* sane character encoding to use for
filenames.  Major upstream software for Debian like GNOME is moving
towards requiring UTF-8 for filenames, and we should too.  See for
example:
http://www.gtk.org/gtk-2.0.0-notes.html

Microsoft Windows has used Unicode for filenames for a long time because
of issues like these.  MacOS also uses Unicode.

And like Tollef said, Red Hat 8 has already switched to defaulting to
UTF-8 for new systems.

> During the
> transition period non-ASCII file names will have two possible
> representations in the file system (LC_CTYPE vs. UTF-8).  I think
> we should clarify the following points before introducing the above
> into policy:
> 
>     1) Should interpretation of existing files' names as UTF-8
>        be implemented before the encoding of newly created files'
>        names is switched?

I am not sure what policy can say here.  For people using filenames in
legacy encodings, perhaps policy could suggest that programs fall back
to the user's locale encoding if the filename is not valid UTF-8.
This might become common practice, but I don't think policy should
require it.
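
Such a fallback could look roughly like the following Python sketch.
The function name decode_filename and its legacy_encoding parameter are
assumptions for illustration, not something any existing program is
known to use:

```python
def decode_filename(raw: bytes, legacy_encoding: str) -> str:
    """Interpret a raw filename as UTF-8 if it is valid UTF-8;
    otherwise fall back to the given legacy (locale) encoding."""
    try:
        # Strict decoding: any byte sequence that is not well-formed
        # UTF-8 raises UnicodeDecodeError.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: the caller passes in the locale's encoding.
        return raw.decode(legacy_encoding, errors="replace")

# The same name stored both ways comes out identical.
utf8_bytes = "caf\u00e9".encode("utf-8")
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
print(decode_filename(utf8_bytes, "iso-8859-1"))
print(decode_filename(latin1_bytes, "iso-8859-1"))
```

This works because legacy-encoded text containing non-ASCII characters
is rarely well-formed UTF-8 by accident, though it is a heuristic and
not a guarantee.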

Again, major chunks of upstream software which have Unicode support
(like GNOME) are *already* interpreting filenames as UTF-8 by default.
I am just trying to bring policy in line with best practice in this
regard.

>     2) How should already existing files with non-ASCII names
>        be converted?

There are lots of different options; we could have a package
'unicode-transition' in base which would convert all local filesystems,
or we could do it as part of a base-files upgrade.  But mainly, this is
a technical issue separate from policy, in my opinion.  We can hash out
those detailed plans separately from this proposal.
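
Just to illustrate the kind of conversion involved, here is a rough
Python sketch of a dry run.  The names to_utf8_name and plan_renames are
hypothetical, it assumes a single known legacy encoding for the whole
tree, and it only lists what would be renamed; it is in no way a design
for an actual 'unicode-transition' package:

```python
import os

def to_utf8_name(raw: bytes, legacy_encoding: str):
    """Return the UTF-8 form of a legacy-encoded name, or None if the
    name is already valid UTF-8 (plain ASCII included)."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError:
        # Raises if the name is not valid in the legacy encoding either;
        # such names would need manual attention.
        return raw.decode(legacy_encoding).encode("utf-8")

def plan_renames(root: bytes, legacy_encoding: str):
    """Dry run: collect (old_path, new_path) pairs without renaming."""
    plan = []
    # A bytes argument makes os.walk yield raw, undecoded names.
    # topdown=False lists children before their parent directory, so
    # applying the plan in order never invalidates a later path.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new_name = to_utf8_name(name, legacy_encoding)
            if new_name is not None:
                plan.append((os.path.join(dirpath, name),
                             os.path.join(dirpath, new_name)))
    return plan
```

Something like plan_renames(b"/home/someuser", "iso-8859-1") would then
produce the list to review before any os.rename calls are made.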
