
Bug#99933: second attempt at more comprehensive unicode policy



On Mon, 2003-01-06 at 02:46, Jason Gunthorpe wrote:

> I think you'd need to have all of argv be converted to utf-8 by the shell.

Besides Sebastien's reply, there is another good reason not to do
recoding in the shell: for any program which actually manipulates
filenames, we will need to add Unicode/UTF-8 support *anyway*, even if
the shell did convert everything to UTF-8.  For example, any program
that used to do:

char *c;
for (c = some_function_that_gets_user_input(); *c != '\0'; c++)
  printf("%c\n", *c);  /* advances one byte at a time */

will have to be changed to do something like:

char *c;
for (c = some_function_that_gets_user_input(); *c != '\0'; c = utf8_next_char(c))
  printf("%.*s\n", (int)(utf8_next_char(c) - c), c);  /* one character at a time */

Since we will have to change programs anyway, we might as well fix them
to decode filenames as well.  The shell is kind of tempting as a "quick
fix", but I don't think it will really help us.

> IMHO it can't work any other way. If for instance you have a directory
> with some chinese utf-8 filenames and you do:
> 
> ls <typed filename in latin-1> * 
> 
> The only way ls ever has a hope of working is if it expects all of argv to
> be utf-8. Basically, I don't see any way that ls could ever hope to do
> automatic conversion and have a program that works in all cases. 

Well, let's be clear: nothing we can do will truly work in all cases.
The vast majority of data is untagged, and charsets are not always
reliably distinguishable.  We are just trying to minimize what breaks.

For the case you named above, I think what should happen is that 'ls'
converts all the arguments to UTF-8 for internal processing.  For the
first argument, UTF-8 validation will fail (a Latin-1 string containing
non-ASCII bytes is almost never valid UTF-8), so ls will fall back to
converting from the locale's charset, which will work.  The rest of the
arguments will validate as UTF-8, so ls just goes on its way.
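
Concretely, the per-argument logic could look something like the sketch
below.  utf8_validate() is a stand-in for whatever validation routine
we end up with, error handling is minimal, and it assumes the program
has already called setlocale(LC_ALL, "") so that nl_langinfo(3) reports
the user's charset:

#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

extern int utf8_validate(const char *s);  /* nonzero if s is valid UTF-8 */

/* Return a freshly allocated UTF-8 copy of arg, or NULL on failure. */
char *arg_to_utf8(const char *arg)
{
  if (utf8_validate(arg))
    return strdup(arg);  /* already UTF-8: pass through untouched */

  /* Not valid UTF-8: fall back to converting from the locale's charset. */
  iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
  if (cd == (iconv_t) -1)
    return NULL;

  size_t inleft = strlen(arg);
  size_t outleft = 4 * inleft + 1;  /* generous worst case for UTF-8 */
  char *out = malloc(outleft);
  char *inp = (char *) arg, *outp = out;

  if (out == NULL || iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
    free(out);  /* not valid in the locale's charset either; give up */
    out = NULL;
  } else {
    *outp = '\0';
  }
  iconv_close(cd);
  return out;
}

(Pure-ASCII arguments take the fast path here, since ASCII is valid
UTF-8 by construction.)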

> The shell
> must do it, because only the shell knows the source encoding for each
> argument

I don't think the shell does in all cases.  Think about when arguments
are computed dynamically, e.g. ls `find /some/dir`: those bytes come
from another program's output, not from the user's keyboard, so the
shell has no way to know what charset they are in.

> , and the only character coding that the shell could use to pass
> the information to the program is unicode. 

Generally speaking, I think the shell should just be a conduit for
bytes, and not modify them at all.  Much like 'cat'.

> The problem of output is further complicated, consider for instance:
> 
> find -type f
> find -type f | xargs ls
> 
> With what I just said. The first find must know it is talking to a
> terminal that is not UTF-8 and do conversion, the 2nd must know it is
> talking to a pipe and only output UTF-8.

Well, this situation can already break horribly on systems whose users
use different character encodings.  So we aren't creating a regression
here, in my opinion.
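
For what it's worth, the terminal-vs-pipe test that scheme relies on is
just isatty(3); here is a minimal sketch of the check it would impose
on every program that prints filenames:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
  if (isatty(STDOUT_FILENO))
    printf("stdout is a terminal: recode output to its charset\n");
  else
    printf("stdout is a pipe or file: emit UTF-8 unchanged\n");
  return 0;
}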

> Frankly, I think it's unworkable to try and make individual programs
> responsible for character conversion, except when processing files that it
> knows for certain are in a special locale. The way forward must be to
> implement UTF-8 at the terminal/pty and make that work as well as possible
> for everyone concerned.

We will definitely need UTF-8 support for the terminal.  I know
gnome-terminal works, and uxterm works too.  I don't know about support
for Linux consoles.



