
Bug#99933: second attempt at more comprehensive unicode policy



On Wed, Jan 08, 2003 at 01:30:09AM -0500, Colin Walters wrote:
> On Tue, 2003-01-07 at 03:07, Jakob Bohm wrote:
> 
> > I agree, this is the only way to go.  Naive, simple, classic
> > UNIX-style programming should continue to "just work",
> 
> Naïve, simple, classic UNIX-style programs are ASCII-only.  Then someone
> got the idea to bolt this huge "locale" kludge on top of all of it.  It
> is not something to be proud of or emulate.
> 

Naive, simple, classic UNIX-style programs (if 8 bit clean) will
implicitly handle UTF8, latin-1, latin-2, Korean DBCS, Arabic,
Hebrew, most old DOS codepages, and generally any encoding which
includes ASCII as a proper subset.  The notable exception is
certain Japanese DBCS encodings, which allow ASCII character
encodings to have a different meaning if preceded by the wrong
byte values.  I am not sure if the common Chinese DBCS encodings
are safe like Korean or unsafe like Japanese.
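
The Japanese exception can be made concrete.  As an illustration (a
minimal sketch, not part of the original mail): in Shift-JIS, the
second byte of a double-byte character may be 0x5C, the ASCII code
for "\", so a byte-oriented program can misparse it; UTF8 never does
this, because its continuation bytes are all >= 0x80.

```python
# The character U+8868 (a common Japanese character) encodes in
# Shift-JIS as the two bytes 0x95 0x5C -- the trail byte collides
# with the ASCII backslash.
ch = "\u8868"
sjis = ch.encode("shift_jis")
assert sjis == b"\x95\x5c"
assert b"\\" in sjis              # a naive byte scan "finds" a backslash

# UTF8 keeps ASCII as a proper subset: no byte of a multi-byte
# sequence ever falls in the ASCII range.
utf8 = ch.encode("utf-8")
assert all(b >= 0x80 for b in utf8)
```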

This is what I want to keep working.

But this pleasant situation presumes that all the system
interfaces (terminal, filesystem, Xlib ...) happen to use the
*same* encoding at any given invocation of the program, at least
as far as input/output to that program is concerned.

So my detailed proposal is about getting UTF8 support to work
without breaking this basic programming assumption.

> > I like
> > the idea that I can download any old program written in a past
> > decade and just type make.
> 
> Yay for broken software.
> 

Again, I assume that the program is 8 bit clean or I would have
to restrict my input to ASCII anyway today.  But if I do
restrict my own input to ASCII for such a broken program, the
system should do nothing which may increase the breakage beyond
that manual workaround.



To understand my concrete proposal, it should be seen in the light
of the following general transition plan:

Step S1. Get all the ultra-core software to support UTF8 (items 4
and 6 in the proposal).

Step S2. Now maintainers of other software will have a
reasonable environment in which to start implementing and
testing that their code works with UTF8 variants of locales. 
And users can actually use such locales without massive
breakage.

Step S3. Make all Debian packages work correctly in the presence
of UTF8 locales.  Proposal items 1 to 3 are about making this as
trivial as possible, with more than 90% of current packages (both
source and binary) needing no change at all.

Step S4. While implementing S3, work on creating solutions which
allow processes running in UTF8 locales to interoperate with a
world where some systems and users will continue to use other
encodings for many years to come.

Proposal item 5 says that this is the responsibility of the few
pieces of software actually interfacing with the outside world,
not of the many pieces of neutral software which may or may not
happen to be used in those situations.

Proposal item 4 emphasizes that simply having a user interface
(such as libreadline in the shell, ncurses in some full-screen
text mode programs, or Athena or Motif/lesstif widgets in X
programs) does not put a program in that category.

Thus character conversion should be done at the very edge of the
system: In the local terminals (vt, xterm, Xlib), in remote
terminal access software (ssh, telnet, tty wrappers for serial
lines, Xlib for remote X terminals), and in physical storage
interfaces (already partially in the stock kernel for non-UNIX
filesystems).

Step S5. Make UTF8 locales the default.

Step S6. Subject support for other encodings to bit rot, not
deliberate removal.

> > 1. Unless otherwise specified here, or there are very special
> > circumstances, all programs and libraries should assume that all
> > strings they receive or output (including, but not limited to
> > filenames) are in the same encoding, and make no externally
> > visible character encoding conversion.  (This is usually trivial
> > to do, just do nothing).
> 
> This is the way things currently work; it is also exceedingly broken.

It is very much not broken:  If I set my locale to UTF8, use a
UTF8 terminal and all my filesystems present UTF8 at the system
call level, everything works.  If I set my locale to latin-1,
use a latin1 terminal and all my filesystems present latin1 at
the system call level, everything works too.  If I set my locale
to the predominant Japanese DBCS encoding, use a Japanese DBCS
terminal and all my filesystems present Japanese DBCS at the
system call level, almost everything works, unless I use one of
the few characters whose DBCS encoding reuses the byte values
normally associated with e.g. "/" or "\\".  And yes, I do use
all of these variations on some of my machines, even though I
do not personally speak Japanese.

> 
> > 2. If a program really needs to make assumptions about the
> > character encoding of data, it should assume the character
> > encoding specified by the locale. 
> 
> I think that if you are writing a program today, it is saner to assume
> UTF-8, since that is the future direction.

If the locale says UTF8, then assuming UTF8 is safe.  If the
locale is not UTF8, assuming UTF8 is VERY broken.  My proposal
went on to say that supporting the UTF8 setting correctly is the
most important case to implement, but a neutral 8-bit clean mode
must also be available, which will handle most other encodings
implicitly.  Support for legacy DBCS encodings is not required
at all, because it may be too difficult to add to programs in
some situations, and users of those languages can soon work
around it by using UTF8.
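
As a sketch of "assume the locale's encoding, not UTF-8" (the
function name `decode_user_text` is illustrative, not from the
proposal):

```python
import locale

# Adopt the environment's locale, then ask what encoding it implies
# (e.g. "UTF-8" under a UTF8 locale, "ISO-8859-1" under latin-1).
locale.setlocale(locale.LC_ALL, "")
enc = locale.getpreferredencoding(False)

def decode_user_text(raw: bytes) -> str:
    # Decode in the locale's encoding; undecodable bytes are kept
    # reversibly (8-bit clean) instead of being rejected.
    return raw.decode(enc, errors="surrogateescape")

# ASCII input works under any locale whose encoding is an ASCII
# superset, and arbitrary bytes round-trip unchanged.
assert decode_user_text(b"hello") == "hello"
assert decode_user_text(b"\xff").encode(enc, "surrogateescape") == b"\xff"
```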

> 
> > 3. Unless required for security or other functionality, programs
> > and libraries should not object to processing invalid
> > characters. (This increases the users chance of being able to
> > deal with data in inconsistent or broken encodings, e.g. with
> > commands such as mv M?nch.txt Maench.txt).
> 
> I believe that the programs to which you might need to pass invalid
> characters will also be the programs which will not look at or
> manipulate the filenames anyways.  'mv' is a good example of a program
> which we will *not* need to change.  It just basically takes its
> arguments and passes them to the rename system call (well obviously it
> is more complicated than that, but that's the basic idea).
> 
Here is a simple example:

/bin/more needs to count the number of encoded characters in
order to determine when lines will wrap and thus when to pause
output.  So /bin/more must recognize the UTF8 (or other charset)
byte values which indicate multi-byte encodings representing a
single character.  It may even need to know about zero-width and
double-width characters.  But whatever it does, it should not
refuse to pass through unmodified any non-UTF8 data I might feed
it, because I probably have a reason if I do (maybe my locale
settings say UTF8 by mistake, maybe my super-smart terminal
does dynamic character set recognition, maybe I am piping binary
data through it to be processed by the next filter in line).
The same applies to multi-column /bin/ls output, or to my text
editor.
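
A rough sketch of the width counting a pager needs (the function
`display_width` is illustrative; real pagers use wcwidth()-style
tables): one decoded character can occupy zero, one or two
terminal columns.

```python
import unicodedata

def display_width(text: str) -> int:
    cols = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                      # combining marks: 0 columns
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cols += 2                     # wide / fullwidth CJK: 2 columns
        else:
            cols += 1
    return cols

assert display_width("abc") == 3
assert display_width("\u65e5\u672c\u8a9e") == 6   # three wide characters
assert display_width("e\u0301") == 1              # "e" + combining acute
```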

A very well known example is perl 5.8.  Many existing perl
scripts process pure binary data using string functions.  This
broke unnecessarily when perl 5.8 started to assume that all
string data was valid in the user's character set and performed
non-reversible conversions on it in order to handle UNICODE
internally.  The proposal says that any future changes to
software should not repeat this mistake.
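
The perl 5.8 problem in miniature, sketched in Python: a lossy
"decode to the user's charset" step destroys binary data, while a
reversible escaping scheme (Python's later surrogateescape handler,
PEP 383) does not.

```python
raw = bytes(range(256))                  # arbitrary binary data

# Non-reversible conversion: invalid bytes become U+FFFD and the
# original data cannot be recovered.
lossy = raw.decode("utf-8", errors="replace")
assert lossy.encode("utf-8") != raw

# Reversible escaping: undecodable bytes map to reserved code
# points and round-trip exactly.
lossless = raw.decode("utf-8", errors="surrogateescape")
assert lossless.encode("utf-8", errors="surrogateescape") == raw
```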


> > 4. The low level software which converts keystrokes (or other
> > non-string input) to strings or converts strings to pixels (or
> > other non-string output), is responsible for doing so
> > consistently with the locale of the programs to which it
> > provides this service, unless those programs explicitly specify
> > otherwise.
> 
> I generally agree.
> 
> > For terminal-style input/output, there will be a tool or library
> > feature (existing or Debian-created) which does two-way
> > conversion of character sets around a pty.  This tool can /
> > should be plugged into ssh, telnet, serial line getty and other
> > conduits which allow terminal access from terminals that might
> > have different locales than preferred on a given Debian system.
> 
> Such a tool could save us time (perhaps this tool already exists in the
> form of GNU screen, as mentioned by David Starner), but note we can't
> really force users to use it.  

The idea is that those Debian packages which provide the
interfaces to external terminals (telnet, ssh, serial line
variants of getty) should be packaged to invoke the tool or
feature implicitly by default, thereby making all terminals
look like UTF8 terminals (when the locale charset is UTF8), even
if external computers or hardware terminals really are not.

Since Debian is Free Software, users still have the freedom to
break things, but they should not be broken as shipped.
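
The core of such a pty wrapper is an incremental transcoder between
the terminal's encoding and the locale's, so multi-byte sequences
split across reads are buffered rather than mangled.  A minimal
sketch (the helper `make_transcoder` is hypothetical; luit(1) is an
existing tool built around this idea):

```python
import codecs

def make_transcoder(src_enc: str, dst_enc: str):
    # An incremental decoder keeps partial multi-byte sequences
    # between calls instead of corrupting them.
    decoder = codecs.getincrementaldecoder(src_enc)(errors="replace")
    def transcode(chunk: bytes) -> bytes:
        return decoder.decode(chunk).encode(dst_enc, errors="replace")
    return transcode

# Terminal speaks latin-1, the programs behind the pty expect UTF-8:
to_utf8 = make_transcoder("latin-1", "utf-8")
assert to_utf8(b"M\xfcnch") == b"M\xc3\xbcnch"

# And back: a UTF-8 multi-byte character split across two reads
# still comes out whole.
to_latin1 = make_transcoder("utf-8", "latin-1")
assert to_latin1(b"M\xc3") + to_latin1(b"\xbcnch") == b"M\xfcnch"
```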

> 
> > 5. Software which persists or transports strings outside the
> > current process group, such as the name processing in
> > filesystems, should convert strings from the current locale to a
> > common encoding chosen by the implementor, such as UTF8, UTF16,
> > UTF32 or in some cases another encoding.  It must be possible to
> > turn off the translation through an extra environment variable,
> > no matter what the locale or its character encoding.
> 
> Ugh, I am opposed to any sort of environment variable like this.  I
> think it will not be necessary, and will complicate the implementation.

There are some real world tasks (mostly related to system
administration, crash recovery, backup etc.) where the ability
to directly access the raw encodings of filenames etc. is vital,
but correct graphic display of some characters is not.  Such
tasks need to run with character set translation turned off, and
ditto for any other unwanted "automatic" assistance.  A good
example is your hypothetical script to convert on-disk filenames
to UTF8 by renaming files: this tool obviously needs to bypass
UTF8 translation in order to access the old filenames in the
first place.  Another is tools which relate raw disk blocks to
the output of e.g. /bin/ls or to filenames specified by
"/sbin/fstool *.bak".

This is actually one of the big MS mistakes around 1990.  When
they implemented Windows 2.x/3.x/9x on top of MS-DOS, they
switched from the old IBM/DOS encodings (like 437 and 850) to
early versions of latin-1 and friends (known in the MS world as
ANSI encodings), and they added implicit character conversions
to some of the file system interfaces.  But they forgot to
create a safe and easy way for sysadmins / advanced users to
access and manipulate files whose names contained
non-convertible characters.  Even worse, they mandated that it
was the responsibility of individual programs to invoke
conversion functions at the "right" times.  This meant that a
lot of programs got it wrong, creating a situation where users
had to stick to pure ASCII or risk exposing untested bugs in
strange places.  They never found a way to fix things once the
bad spec had been implemented by all the Windows programs in the
world.  In the 32-bit versions of Windows they removed all the
non-converted system calls and moved those conversions into the
kernel; this removed the problem for the DOS character sets in
filesystems (killing off any differently encoded filenames), but
at the same time they made the same mistake again for UNICODE.

> 
> > For filenames or other data to which access must be possible
> > even if it is improperly encoded, the translation code should
> > include a well-defined escaping mechanism for accessing invalid
> > character encodings on the medium.  This code must not be
> > enabled in other contexts, due to serious security issues (it
> > could e.g. allow bad people to bypass code to filter out shell
> > metacharacters etc.).  This escape mechanism should allow things
> > like tar backups to just work, no matter how confused the
> > filenames on a disk.
> 
> Not sure how this "escaping mechanism" would be possible, or what it
> would even really do.
> 

Assume user X is running on sarge+5, a pure UTF8 setup all the
way through.  Assume that filesystem xyzfs stores filenames in
another character set and is subject to automatic implicit
conversions.

For some reason he mounts a device containing a few non-UTF8
filenames (perhaps only one; perhaps an old removable disc,
perhaps NFS, perhaps a corrupted disc, perhaps a network mount).
Such an escaping mechanism would:

   1. Allow the filename to just appear in all sorts of file
     listings, file open dialogs etc. without those dialogs
     doing anything special because it is all in the conversion
     routine.
     
   2. Allow the file to be opened and manipulated with any tool
     the user might find useful, because the conversion routines
     allow the filename to make it through.
     
   3. Allow the file to be backed up and restored, even if the
     operator is unaware of the presence of corrupted filenames
     on the system.

Technically such a conversion might work as follows:

   1. When converting on-device filenames to/from the
     intermediary format (probably UTF32), reversibly map any
     invalid byte values to some part of the Corporate Zone in
     UNICODE.  The same 256 UNICODE code points can be used for
     all character sets, there may already be a tradition or
     standard indicating what values to use.

   2. When converting locale format (UTF8 or otherwise)
     system call / library call filenames from/to the intermediary
     format, reversibly map any UNICODE code point not in the local
     encoding to a sequence of chars indicating the HEX unicode
     code point.  The locale encoding character indicating this
     escape should be chosen carefully for each family of character
     encodings, as that character will become unusable in filenames
     for users of that encoding.
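
Step 1 of this scheme is essentially what Python later standardized
for filenames (PEP 383): each undecodable byte 0xNN maps reversibly
to a reserved code point U+DCNN, using the same 256 code points for
all character sets.  A sketch using Python's filename convention:

```python
import os

# latin-1 "M<u-umlaut>nch.txt": the 0xFC byte is invalid UTF-8.
bad = b"M\xfcnch.txt"

# os.fsdecode() escapes the invalid byte reversibly, so the name
# can flow through listings, dialogs and backup tools as an
# ordinary string ...
name = os.fsdecode(bad)

# ... and os.fsencode() reaches the original on-disk bytes again.
assert os.fsencode(name) == bad
```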

> > A mechanism needs to be devised, either in kernel or libc, which
> > allows the conversion of filenames and console i/o to and from
> > the process locale to indeed match the process locale.  A
> > similar or identical mechanism should be put in Xlib.
> 
> I think it might make sense to have common library functions to do stuff
> like this in glibc.
> 

NOT library functions; that is the big MS mistake.  It must
happen outside individual programs and libraries in order to
avoid creating an unmaintainable mess where every programmer
must figure out when to apply which conversion to which data,
many create bugs, design improvements become impossible, and all
programmers waste their time doing unnecessary work.

> > 6.  The base software in sarge, such as libc, Xlib, xterm must
> > support UTF8 variants of all locales as soon as possible. 
> > Without this, the rest cannot even begin to be implemented.
> 
> It already does.  I just tried uxterm again for the first time in a
> while, and I'm really impressed with its current level of UTF-8
> support.  It can do almost all of UTF-8-demo.txt on my system.
> 

I already knew that many xterm clones did it right.  But the
item says that ALL the terminal emulators, ALL the local
terminal interfaces (text mode vt, svgatextmode, Xlib text
input/output calls) and ALL the locales defined by the "locales"
package must support UTF8 as the very first step of getting an
environment in which UTF8 versions of packages may ship without
causing massive breakage.

> > P.S. I am not a DD, just trying to be helpful and constructive.
> 
> Thanks for your comments.
 
You're welcome.
 
-- 
This message is hastily written, please ignore any unpleasant wordings,
do not consider it a binding commitment, even if its phrasing may
indicate so. Its contents may be deliberately or accidentally untrue.
Trademarks and other things belong to their owners, if any.


