Bug#99933: second attempt at more comprehensive unicode policy

To: 99933@bugs.debian.org
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 08 Jan 2003 01:30:09 -0500
Message-id: <1042007408.3157.81.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <20030107090755.A1623@jbj2.jbj.danware.dk>
References: <1041476827.25298.32.camel@space-ghost> <20030102181206.GA24191@atlas15.dnp.fmph.uniba.sk> <1041533855.15063.19.camel@space-ghost> <1041546314.22038.9.camel@space-ghost> <20030103231158.GB8502@tatonka.pfalz.de> <1041648625.21808.28.camel@space-ghost> <20030106211523.GB1603@tatonka.pfalz.de> <20030107090755.A1623@jbj2.jbj.danware.dk>

On Tue, 2003-01-07 at 03:07, Jakob Bohm wrote:

> I agree, this is the only way to go.  Naive, simple, classic
> UNIX-style programming should continue to "just work",

Naïve, simple, classic UNIX-style programs are ASCII-only.  Then someone
got the idea to bolt this huge "locale" kludge on top of all of it.  It
is not something to be proud of or emulate.

> I like
> the idea that I can download any old program written in a past
> decade and just type make.

Yay for broken software.

> 1. Unless otherwise specified here, or there are very special
> circumstances, all programs and libraries should assume that all
> strings they receive or output (including, but not limited to
> filenames) are in the same encoding, and make no externally
> visible character encoding conversion.  (This is usually trivial
> to do, just do nothing).

This is the way things currently work; it is also exceedingly broken.

> 2. If a program really needs to make assumptions about the
> character encoding of data, it should assume the character
> encoding specified by the locale. 

I think that if you are writing a program today, it is saner to assume
UTF-8, since that is the future direction.

> 3. Unless required for security or other functionality, programs
> and libraries should not object to processing invalid
> characters. (This increases the user's chance of being able to
> deal with data in inconsistent or broken encodings, e.g. with
> commands such as mv M?nch.txt Maench.txt).

I believe that the programs to which you might need to pass invalid
characters are also the programs which never look at or manipulate the
filenames anyway.  'mv' is a good example of a program which we will
*not* need to change.  It basically takes its arguments and passes them
to the rename system call (obviously it is more complicated than that,
but that's the basic idea).
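
A minimal sketch of the point about 'mv', written in Python purely for
illustration (the language and the helper name rename_raw are
assumptions, not anything from this thread).  It assumes a Linux-style
filesystem that stores names as arbitrary bytes.  The name is handed to
the rename system call without being decoded, so a byte sequence that
is invalid in UTF-8 passes through untouched:

```python
import os
import tempfile


def rename_raw(directory: bytes, old: bytes, new: bytes) -> None:
    """Rename a file using raw byte names; nothing is decoded or validated."""
    # os.rename accepts bytes paths and passes them to rename(2) as-is.
    os.rename(os.path.join(directory, old), os.path.join(directory, new))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmpdir = os.fsencode(tmp)
        # b"M\xfcnch.txt" is "Münch.txt" in ISO-8859-1 and is not valid UTF-8.
        broken = b"M\xfcnch.txt"
        with open(os.path.join(tmpdir, broken), "wb"):
            pass
        rename_raw(tmpdir, broken, b"Maench.txt")
        print(os.listdir(tmpdir))
```

A program written this way has no opinion about the encoding of the
names it handles, which is why it would not need to change.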

> 4. The low level software which converts keystrokes (or other
> non-string input) to strings or converts strings to pixels (or
> other non-string output), is responsible for doing so
> consistently with the locale of the programs to which it
> provides this service, unless those programs explicitly specify
> otherwise.

I generally agree.

> For terminal-style input/output, there will be a tool or library
> feature (existing or Debian-created) which does two-way
> conversion of character sets around a pty.  This tool can /
> should be plugged into ssh, telnet, serial line getty and other
> conduits which allow terminal access from terminals that might
> have different locales than preferred on a given Debian system.

Such a tool could save us time (perhaps this tool already exists in the
form of GNU screen, as mentioned by David Starner), but note we can't
really force users to use it.  
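
The conversion at the heart of such a tool might look roughly like the
sketch below.  It only shows the charset translation of one direction
of a byte stream, not the pty plumbing, and it is not how GNU screen
does it; the function name transcode_stream and the choice of Python
are assumptions made for illustration.

```python
import codecs
from typing import Iterable, Iterator


def transcode_stream(chunks: Iterable[bytes], src: str, dst: str) -> Iterator[bytes]:
    """Re-encode a byte stream from charset src to charset dst, chunk by chunk."""
    # An incremental decoder buffers a multi-byte character that is split
    # across two chunks, which is the normal case when reading from a pty.
    decoder = codecs.getincrementaldecoder(src)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            # Characters the target charset cannot represent become "?".
            yield text.encode(dst, errors="replace")
    # Flush whatever is still buffered at end of stream.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode(dst, errors="replace")


if __name__ == "__main__":
    # "Münch" in UTF-8, with the two bytes of "ü" arriving in separate reads.
    utf8_reads = [b"M\xc3", b"\xbcnch"]
    print(b"".join(transcode_stream(utf8_reads, "utf-8", "iso-8859-1")))
```

A real tool would run one such converter in each direction, one for
keystrokes going in and one for output coming back.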

> 5. Software which persists or transports strings outside the
> current process group, such as the name processing in
> filesystems, should convert strings from the current locale to a
> common encoding chosen by the implementor, such as UTF8, UTF16,
> UTF32 or in some cases another encoding.  It must be possible to
> turn off the translation through an extra environment variable,
> no matter what the locale or its character encoding.

Ugh, I am opposed to any sort of environment variable like this.  I
think it will not be necessary, and will complicate the implementation.

> For filenames or other data to which access must be possible
> even if it is improperly encoded, the translation code should
> include a well-defined escaping mechanism for accessing invalid
> character encodings on the medium.  This code must not be
> enabled in other contexts, due to serious security issues (it
> could e.g. allow bad people to bypass code to filter out shell
> metacharacters etc.).  This escape mechanism should allow things
> like tar backups to just work, no matter how confused the
> filenames on a disk.

Not sure how this "escaping mechanism" would be possible, or what it
would even really do.
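
One possible reading of the quoted "escaping mechanism", offered only
as a sketch: map each undecodable byte to a reserved code point on the
way in, and map it back to the same byte on the way out.  Python's
"surrogateescape" error handler (which postdates this message) works
this way and is used here as a stand-in; it is not what Jakob proposed,
and the helper names decode_name and encode_name are hypothetical.

```python
def decode_name(raw: bytes) -> str:
    """Decode a filename as UTF-8, escaping each invalid byte as a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name: str) -> bytes:
    """Inverse of decode_name: restore the exact original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"M\xfcnch.txt"  # ISO-8859-1 bytes, invalid as UTF-8
    name = decode_name(raw)
    # The round trip is lossless, so a backup tool can restore the name.
    assert encode_name(name) == raw
    try:
        # Code that does not opt in to the escape still rejects the string.
        name.encode("utf-8")
    except UnicodeEncodeError:
        print("strict encoding refuses the escaped name")
```

The strict failure at the end is relevant to the security concern in
the quoted text: the escaped form is only accepted where the escape is
explicitly enabled.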

> A mechanism needs to be devised, either in kernel or libc, which
> allows the conversion of filenames and console i/o to and from
> the process locale to indeed match the process locale.  A
> similar or identical mechanism should be put in Xlib.

I think it might make sense to have common library functions in glibc
for this kind of conversion.

> 6.  The base software in sarge, such as libc, Xlib, xterm must
> support UTF8 variants of all locales as soon as possible. 
> Without this, the rest cannot even begin to be implemented.

It already does.  I just tried uxterm again for the first time in a
while, and I'm really impressed with its current level of UTF-8
support.  It can display almost all of UTF-8-demo.txt on my system.

> P.S. I am not a DD, just trying to be helpful and constructive.

Thanks for your comments.
