
Bug#99933: second attempt at more comprehensive unicode policy



Hello everybody,

On Mon, Jan 06, 2003 at 10:15:24PM +0100, Jochen Voss wrote:
> Hello Colin,
> 
> On Fri, Jan 03, 2003 at 09:50:26PM -0500, Colin Walters wrote:
> > In summary, UTF-8 is the *only* sane character set to use for
> > filenames.
> At least I agree to this :-)
> 
> I think that we need filename conversion between UTF-8 and the user's
> character set, because we cannot ban all non-UTF8 terminal types.  In
> my opinion the main problem is, where this conversion should take
> place.
> 
> Because a lot of programs is affected, it would gain us much, if we
> could move this as deep as into libc or even into the kernel.  I
> remember there are some questions about character sets in the kernel
> configuration.  Are there file-systems with in-kernel character set
> conversion?
> 

I agree, this is the only way to go.  Naive, simple, classic
UNIX-style programming should continue to "just work"; I like
the idea that I can download any old program written in a past
decade and just type make.

And yes, there are several filesystems in the Linux kernel
which do character set conversion on the fly.  Specifically,
all the Microsoft/IBM-compatible filesystems (*fat, ntfs, hpfs,
iso9660) allow the DOS-side and unix-side character sets to be
specified as mount options.  Some versions of the smb file
sharing tools also do this.  And I think there is some
conversion code in the text-mode vt implementation (screen and
keyboard) too.

The filesystem character conversions, at least, already use
Unicode as the intermediate format, and thus the kernel includes
an almost complete set of Unicode to/from X conversion tables,
each as a separate module with kerneld autoload support and all.

So here is my idea of how to do it (no, I have not checked what
RH or others do, but I know what MS did wrong 10 years ago, and
as a cross-platform programmer I live with those mistakes every
day).

1. Unless otherwise specified here, or under very special
circumstances, all programs and libraries should assume that all
strings they receive or output (including, but not limited to,
filenames) are in the same encoding, and make no externally
visible character encoding conversion.  (This is usually trivial
to do: just do nothing.)

2. If a program really needs to make assumptions about the
character encoding of data, it should assume the character
encoding specified by the locale.  As a minimum, the following 3
cases must work correctly:
   2.1. UTF-8
   2.2. iso8859-1+, defined as the single-byte encoding where
      each byte is one character, which is its own Unicode
      equivalent, and where all byte values are treated as
      valid, even if the corresponding Unicode codepoint is not
      defined.  (This character set is usually combined with the
      C locale to allow processing of arbitrary binary data in
      any unknown encoding.)
   2.3. any other single-byte encoding where the values 0..127
      are ASCII and 128..255 are graphic characters not
      interpreted in any particular way.
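
To make case 2 concrete, here is a minimal sketch using only
plain POSIX libc (nothing Debian-specific assumed) of how a
program discovers which encoding its locale specifies:

```c
#include <langinfo.h>
#include <locale.h>

/* Ask libc which character encoding the current locale specifies.
 * A program that falls under rule 2 should consult this value
 * instead of hard-coding an encoding. */
const char *locale_codeset(void)
{
    /* Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG). */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);  /* e.g. "UTF-8" or "ISO-8859-1" */
}
```

Under LANG=de_DE.UTF-8 this returns "UTF-8"; under the plain C
locale, glibc reports "ANSI_X3.4-1968", i.e. ASCII.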
      
Support for multi-byte character encodings other than UTF-8 is
not required for sarge and later, but should not be removed if
it is already there.  For new code, either use the libc
character handling functions, or just treat anything that is not
UTF-8 as iso8859-1+ except when converting to/from UTF-8.

Note 2.1: Code which just treats strings as binary data already
satisfies the above.

Note 2.2: Code which just checks for ASCII values such as \n, /
etc. and passes consecutive sequences of high-numbered chars
around as is already satisfies the above, thanks to the design
properties of UTF-8.
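
To illustrate the property Note 2.2 relies on: in UTF-8, every
byte of a multi-byte sequence is >= 0x80, so the byte values
0..127 only ever mean their ASCII selves.  A sketch of
byte-oriented code that therefore stays correct on UTF-8
filenames without doing any decoding at all:

```c
#include <string.h>

/* Because no UTF-8 multi-byte sequence contains a byte below 0x80,
 * scanning for ASCII bytes like '/' cannot fire in the middle of a
 * multi-byte character.  Classic byte-oriented code "just works". */
const char *last_path_component(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

Even if a directory name like "münch" is encoded as multi-byte
UTF-8, the final '/' is still found correctly.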
      
3. Unless required for security or other functionality, programs
and libraries should not object to processing invalid
characters.  (This increases the user's chance of being able to
deal with data in inconsistent or broken encodings, e.g. with
commands such as mv M?nch.txt Maench.txt).

However, no conversion should cause bytes to be treated as an
ASCII control char unless their encoding is exactly that ASCII
byte value alone.  This means not converting the "redundant"
(overlong) UTF-8 encodings to their shortest form, but either
leaving them as is or converting them to something harmless.  ?
is not harmless; in a general context, any ASCII char other than
a-zA-Z is not harmless.
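
Here is a sketch of the check this paragraph calls for: reject
overlong ("redundant") sequences outright instead of normalizing
them.  This is an illustration, not a vetted security routine
(surrogate and codepoint-range checks are omitted for brevity):

```c
#include <stddef.h>

/* Return 1 if the byte sequence is well-formed, shortest-form UTF-8.
 * Overlong sequences such as 0xC0 0xAF, which also decode to '/',
 * are rejected rather than shortened - exactly the kind of input
 * that lets attackers smuggle ASCII metacharacters past filters. */
int utf8_is_shortest_form(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned long min;
        if (b < 0x80) { i++; continue; }            /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) { len = 2; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; min = 0x10000; }
        else return 0;            /* stray continuation or bad lead byte */
        if (i + len > n) return 0;                  /* truncated sequence */
        unsigned long cp = b & (0x7F >> len);
        for (size_t j = 1; j < len; j++) {
            if ((s[i + j] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + j] & 0x3F);
        }
        if (cp < min) return 0;   /* overlong: not the shortest form */
        i += len;
    }
    return 1;
}
```

The classic attack pair 0xC0 0xAF also decodes to '/', which is
why a filter must see it rejected, never silently shortened.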

Note 3.1: This is trivially satisfied by code which does not
convert or check character encodings at all.

4. The low level software which converts keystrokes (or other
non-string input) to strings or converts strings to pixels (or
other non-string output), is responsible for doing so
consistently with the locale of the programs to which it
provides this service, unless those programs explicitly specify
otherwise.

For terminal-style input/output, there will be a tool or library
feature (existing or Debian-created) which does two-way
conversion of character sets around a pty.  This tool can /
should be plugged into ssh, telnet, serial-line getty and other
conduits which allow terminal access from terminals whose
locales might differ from the one preferred on a given Debian
system.

Note 4.1: Editors, libreadline etc. are not covered by this
rule.  Those are just regular software which needs to count
characters (and thus check for multi-byte chars in the specified
encoding).  This rule is about the actual terminal interfaces,
whether text or graphic.
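
For the UTF-8 case, the character counting such software needs
is cheap; a sketch (for other multi-byte locales, the libc
mbrlen()/mbstowcs() family does the equivalent job):

```c
#include <stddef.h>

/* Character (not byte) count of a UTF-8 string: every byte that is
 * not a continuation byte (10xxxxxx) starts a character.  An editor
 * or readline-style library needs counts like this to position the
 * cursor correctly. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```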

5. Software which persists or transports strings outside the
current process group, such as the name processing in
filesystems, should convert strings from the current locale to a
common encoding chosen by the implementor, such as UTF8, UTF16,
UTF32 or in some cases another encoding.  It must be possible to
turn off the translation through an extra environment variable,
no matter what the locale or its character encoding.
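
A sketch of such a conversion layer using libc's iconv(); note
that the DEBIAN_NO_FILENAME_CONV variable name below is purely
my illustration of the proposed off switch, not an agreed
interface:

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Rule 5 in miniature: convert a string from the given encoding to
 * UTF-8, unless the (hypothetical) DEBIAN_NO_FILENAME_CONV variable
 * is set, in which case bytes pass through untouched.  Returns a
 * malloc'd string, or NULL on conversion failure. */
char *to_utf8(const char *from_charset, const char *in)
{
    if (getenv("DEBIAN_NO_FILENAME_CONV"))
        return strdup(in);              /* translation switched off */

    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t inleft = strlen(in);
    size_t outsize = inleft * 4 + 1;    /* worst case for UTF-8 output */
    char *out = malloc(outsize);
    char *inp = (char *)in, *outp = out;
    size_t outleft = outsize - 1;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1) { free(out); return NULL; }
    *outp = '\0';
    return out;
}
```

E.g. converting "Münch" from ISO-8859-1 turns the single byte
0xFC into the two-byte UTF-8 sequence for the same character.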

For filenames or other data to which access must be possible
even if it is improperly encoded, the translation code should
include a well-defined escaping mechanism for accessing invalid
character encodings on the medium.  This code must not be
enabled in other contexts, due to serious security issues (it
could e.g. allow bad people to bypass code that filters out
shell metacharacters etc.).  This escape mechanism should allow
things like tar backups to just work, no matter how confused the
filenames on a disk are.
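
One possible shape for such an escaping, purely as illustration
(the %XX scheme and the function below are mine, not an existing
interface): every byte outside printable ASCII gets a reversible
%XX escape, so a tool like tar always sees a well-defined name:

```c
#include <stdio.h>

/* Render an on-disk name in a reversible, locale-independent form:
 * printable ASCII passes through, '%' and everything else becomes
 * %XX.  Caller must supply out with room for 3*strlen(in)+1 bytes. */
void escape_name(const unsigned char *in, char *out)
{
    for (; *in; in++) {
        if (*in >= 0x20 && *in < 0x7F && *in != '%')
            *out++ = (char)*in;
        else
            out += sprintf(out, "%%%02X", *in);  /* e.g. 0xFC -> "%FC" */
    }
    *out = '\0';
}
```

Because '%' itself is escaped as %25, the mapping is invertible,
which is what makes round-trip backup and restore possible.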

A mechanism needs to be devised, either in the kernel or in
libc, which ensures that the conversion of filenames and console
I/O to and from the process locale actually matches the process
locale.  A similar or identical mechanism should be put in Xlib.

6.  The base software in sarge, such as libc, Xlib and xterm,
must support UTF-8 variants of all locales as soon as possible.
Without this, the rest cannot even begin to be implemented.
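
A quick probe for whether a system's libc already accepts a
UTF-8 locale (the names tried are common spellings I would
check, not an exhaustive or authoritative list):

```c
#include <locale.h>

/* setlocale() returns NULL when the requested locale is not
 * installed or supported, so trying a few UTF-8 names tells us
 * whether point 6 is satisfied on this box. */
const char *find_utf8_locale(void)
{
    static const char *candidates[] =
        { "C.UTF-8", "en_US.UTF-8", "de_DE.UTF-8", NULL };
    for (int i = 0; candidates[i]; i++)
        if (setlocale(LC_CTYPE, candidates[i]))
            return candidates[i];
    return NULL;   /* no UTF-8 locale available */
}
```

On a system that satisfies point 6 this returns a locale name;
on a minimal install it may well return NULL.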

P.S. I am not a DD, just trying to be helpful and constructive.

Cheers,

Jakob

-- 
This message is hastily written, please ignore any unpleasant wordings,
do not consider it a binding commitment, even if its phrasing may
indicate so. Its contents may be deliberately or accidentally untrue.
Trademarks and other things belong to their owners, if any.
