Bug#99933: second attempt at more comprehensive unicode policy

To: Jason Gunthorpe <jgg@debian.org>
Cc: 99933@bugs.debian.org
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 06 Jan 2003 18:30:06 -0500
Message-id: <1041895805.19863.27.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <Pine.LNX.3.96.1030106133622.13059D-100000@wakko.debian.net>
References: <Pine.LNX.3.96.1030106133622.13059D-100000@wakko.debian.net>

[ CC's trimmed, since mail to the bug will reach -policy ]

On Mon, 2003-01-06 at 16:07, Jason Gunthorpe wrote:

> Fixing programs that handle terminal input is a different matter IMHO, it's
> something that should be decided on a more case by case basis, and a lot of
> cases might be effortlessly handled just by extending ncurses/slang

A lot of programs don't use curses...

> I think the philosophy should be that everything should be converted to
> UTF-8 after it is read from the terminal. Programs that interface with the
> terminal need to convert.

I generally agree with that.
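
A minimal sketch of that conversion step in Python, assuming the terminal's
charset is either passed in or taken from the locale. The function name
terminal_to_utf8 is made up for illustration:

```python
import locale

def terminal_to_utf8(raw: bytes, terminal_charset: str = "") -> bytes:
    """Convert bytes read from the terminal into UTF-8 bytes."""
    # Assumption: when no charset is given, the locale's preferred
    # encoding describes what the terminal sends.
    charset = terminal_charset or locale.getpreferredencoding(False)
    return raw.decode(charset).encode("utf-8")

# "café" typed on a Latin-1 terminal arrives as the bytes 63 61 66 E9.
converted = terminal_to_utf8(b"caf\xe9", "latin-1")
print(converted)
```

Everything past this boundary can then assume UTF-8, which is the point
of doing the conversion once, right after reading.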

> Changing programs that handle terminal input is a far smaller scope than
> changing every program that touches argv and every program that does
> terminal input.

If by 'touching argv' you mean 'modifying and creating output based on',
then I hope you agree that we will almost certainly have to make those
programs grok Unicode anyways, as I said before.  UTF-8 is a multibyte
encoding, and traversing and manipulating it correctly generally
requires one to use different string functions (although stuff like
strchr(foo, '.') will still work).
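
A rough illustration in Python, working on raw bytes the way C string
functions would (the sample filename is made up). Searching for an ASCII
byte is safe, but byte lengths and byte offsets stop matching characters:

```python
# Hypothetical sample filename, encoded as UTF-8 bytes.
name = "naïve.txt".encode("utf-8")

# Searching for an ASCII byte is safe: in UTF-8, bytes below 0x80
# never occur inside a multibyte sequence.
dot = name.find(b".")

# Byte length and character count differ once non-ASCII text appears.
byte_len = len(name)
char_len = len(name.decode("utf-8"))

# Cutting at an arbitrary byte offset can split a character in half.
truncated = name[:3]
try:
    truncated.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(dot, byte_len, char_len, split_ok)
```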

> If this route is followed then a huge swath of programs are half correct
> already, their only problem is that they will not be converting utf-8 for
> display. That might be best handled through glibc (again, changing
> *everything* just to get around the lack of utf-8 terminals is insane)

Output is a big problem, I agree.  But how exactly do you propose to
modify glibc?

> Well, that's not true. At the shell level everything is tagged. The shell
> knows things returned from readdir are utf-8 

No, it doesn't!  Even if we force users to run a script which converts
all legacy encodings to UTF-8, people will still have files NFS mounted
readonly on other systems, files that they created using a legacy
program, files on CD-ROM or DVD, etc.

What do you mean anyways that everything on the shell level is tagged? 
How is that possible?

What if I do something like this:

touch $(nc www.random.org 80)

> When I mean 'all cases' I mean the cases that come up in a system with only
> UTF-8 names in the filesystem, not one that has mixed encodings already
> in the filesystem, that's hopeless.

But mixed encodings will happen in the real world.  It is unavoidable. 
There is a lot of legacy data.
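
As a sketch of what a tool faces here, the following Python reads directory
entries as raw bytes and picks out the names that are not valid UTF-8. The
helper names and the sample names are made up for illustration:

```python
import os

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def legacy_names(names):
    """Return the names (as bytes) that are not valid UTF-8."""
    return [n for n in names if not is_valid_utf8(n)]

# Passing a bytes path makes os.listdir return raw bytes names,
# exactly as the filesystem stores them, with no encoding attached.
current = os.listdir(b".")

# Made-up examples: one UTF-8 name, one Latin-1 name, one ASCII name.
sample = ["été".encode("utf-8"), "été".encode("latin-1"), b"README"]
print(legacy_names(sample))
```

Note that this only finds names that fail validation. It says nothing
about which legacy charset they were written in.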

> > For the case you named above, I think what should happen is that 'ls'
> > converts all the arguments to UTF-8 for internal processing.  For the
> > first argument, UTF-8 validation will fail, so ls will try converting
> > from the locale's charset, which will work.  The rest of the arguments
> > will validate as UTF-8, so ls just goes on its way.
> 
> Eww, that's gross, it isn't definite that UTF-8 validation will always
> fail for non UTF-8 text, you could easily get lucky and type in a word
> that is valid UTF-8, but needs conversion! That's a terribly subtle UI
> bug.

I agree, it sucks and it's pretty gross.  But I don't think there is a
better solution.
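
For concreteness, here is a sketch of the validate-then-fall-back approach
in Python, together with the failure case described above. The function
name to_text is made up, and the locale charset is fixed to Latin-1 as an
assumption; a real program would ask the locale:

```python
def to_text(arg: bytes, locale_charset: str = "latin-1") -> str:
    """Decode arg as UTF-8 if it validates, else from the locale charset."""
    try:
        return arg.decode("utf-8")
    except UnicodeDecodeError:
        return arg.decode(locale_charset)

# The fallback works when the legacy bytes are not valid UTF-8.
good = to_text("café".encode("latin-1"))

# The subtle failure: the Latin-1 text "Ã©" is the byte pair C3 A9,
# which is also valid UTF-8 for "é", so it is silently misread.
bad = to_text("Ã©".encode("latin-1"))

print(good, bad)
```

The second case is rare in ordinary Latin-1 text, but nothing in the bytes
themselves lets the program notice it.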

> Consider the shell to be a scripting language just like python/java and
> look at how it's handled there - all internal strings are UTF-8, functions
> that read/write to the terminal convert automatically, functions exist to
> convert arbitrary text/files.

Yes, but even in Python/Java/C# or whatever, you don't always know the
encoding for sure; what if you're opening up a Debian changelog?  By
default the stream will be opened using the user's locale encoding, but
we already mandated that Debian changelogs be UTF-8.
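
A small Python sketch of the changelog case, using a temporary file and a
made-up changelog line. Opening with an explicit encoding gives the right
text; decoding the same bytes as Latin-1 (standing in for a user's locale
charset) raises no error but garbles the name:

```python
import os
import tempfile

# A made-up changelog line with a non-ASCII maintainer name.
entry = " -- José Example <jose@example.org>\n"

# Write the file as UTF-8, as the changelog policy requires.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(entry)

# Reading with an explicit encoding gives the right text regardless of locale.
with open(path, encoding="utf-8") as f:
    explicit = f.read()

# Simulate a user whose locale charset is Latin-1: the same bytes are
# decoded without error, but the name comes out garbled.
with open(path, encoding="latin-1") as f:
    as_latin1 = f.read()

os.remove(path)
print(explicit == entry, as_latin1 == entry)
```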

> You have everything needed to make the shell work uniformly in any
> environment, but some cases might require an iconv, but the iconv is
> required for *all* users, not just those with different locale settings. I
> think that's a good goal.

I don't see how you can make iconv just make everything work.

> The trouble is, the shell interfaces with the terminal, so it is the only
> thing in a position to know how to convert characters coming from the
> terminal to UTF-8, nothing else can do this.

As I said, I don't think the shell knows everything, and even if it did,
I think modifying the shell alone would not fix everything.
