Bug#99933: second attempt at more comprehensive unicode policy

To: Denis Barbier <barbier@linuxfr.org>, 99933@bugs.debian.org
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 05 Jan 2003 21:12:36 -0500
Message-id: <1041819155.14620.9.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <20030105201303.GA23475@zobe.linuxfr.org>
References: <1041476827.25298.32.camel@space-ghost> <20030102181206.GA24191@atlas15.dnp.fmph.uniba.sk> <1041533855.15063.19.camel@space-ghost> <1041546314.22038.9.camel@space-ghost> <20030103231158.GB8502@tatonka.pfalz.de> <1041648625.21808.28.camel@space-ghost> <87isx4q588.fsf@orcus.priv.at> <1041700241.32717.35.camel@space-ghost> <20030105142317.GB1699@zobe.linuxfr.org> <1041786548.9879.8.camel@space-ghost> <20030105201303.GA23475@zobe.linuxfr.org>

On Sun, 2003-01-05 at 15:13, Denis Barbier wrote:

> Consider a program written in C, which creates new files with open(2);
> if I understand your proposal right, when a filename is not UTF-8
> encoded, it should be converted into UTF-8 according to user's locale.

Well, broadly speaking, there are two cases:

1) Programs which do not look at the contents of filenames, and just
treat them as mostly opaque arguments.  Commands like 'touch' fall into
this category.  They should not need to change at all; you just start
passing them UTF-8 instead of ASCII or ISO-8859-1.  A glibc change that
converted filenames behind their backs would break these programs.

2) Programs which do manipulate filenames.  These are trickier.  There
are several ways to make these programs handle UTF-8.  For some of
them, no change will be required; operations like searching for ASCII
characters still work with UTF-8.  However, if these programs display
filenames to the user on a tty, they will need to convert them to the
user's locale encoding (of course, once we make UTF-8 terminals
standard, programs will not need to do this).  If they put them in a
GUI widget, they will have to be sure to tell the widget that the
strings are in UTF-8 (if necessary).

> I am wondering how to perform this task:
>   a. Let open() perform this conversion.

No.  This would all but guarantee corruption.
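
One plausible way this goes wrong, sketched in Python: a converting
open() cannot know whether the bytes it receives are already UTF-8.
The function naive_open_conversion below is hypothetical and only
models such a conversion; it is not anything glibc does.

```python
def naive_open_conversion(raw: bytes, locale_encoding: str) -> bytes:
    """Model a hypothetical open() that assumes the name is in the
    locale encoding and re-encodes it as UTF-8 before using it."""
    return raw.decode(locale_encoding).encode("utf-8")


# A caller that already did the right thing and passes UTF-8 ("café").
already_utf8 = "caf\u00e9".encode("utf-8")          # b"caf\xc3\xa9"

# In an ISO-8859-1 locale the name is encoded a second time.
on_disk = naive_open_conversion(already_utf8, "iso-8859-1")
```

The file ends up named with four bytes where "é" should be, and the
program that created it can no longer find it under the name it used.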

>   b. Add a utility function in a common library and patch all programs
>      to add calls to this routine.

It depends.  For some programs, instead of converting the filename back
to the user's locale encoding for internal manipulation (which may
fail, remember, since UTF-8 can encode far more than, say, ISO-8859-1),
it would be better to change the program to handle all strings
internally as UTF-8.  For some programs this will be fairly trivial; for
others it may be difficult.  Another alternative is to have a small
library which will first try converting a filename from UTF-8 back into
the user's locale encoding, and only if that fails, take the filename
as-is.  The best approach will depend on the program and on how it
manipulates filenames.
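
A rough sketch of what such a small library routine might look like, in
Python for brevity.  The name to_locale_bytes is made up for this
example, and the locale encoding is passed in explicitly; a real
program would probably look it up from the environment instead.

```python
def to_locale_bytes(raw: bytes, locale_encoding: str) -> bytes:
    """Try to convert a UTF-8 filename into the locale encoding.

    If the name is not valid UTF-8, or contains characters the locale
    encoding cannot represent, return the original bytes unchanged.
    """
    try:
        return raw.decode("utf-8").encode(locale_encoding)
    except UnicodeError:
        # Covers both the decode failure and the encode failure.
        return raw


# Valid UTF-8 that fits in ISO-8859-1: converted.
converted = to_locale_bytes("caf\u00e9".encode("utf-8"), "iso-8859-1")

# Legacy ISO-8859-1 name, not valid UTF-8: taken as-is.
legacy = to_locale_bytes(b"caf\xe9", "iso-8859-1")

# Valid UTF-8 that ISO-8859-1 cannot represent: taken as-is.
unrepresentable = to_locale_bytes("\u65e5\u672c".encode("utf-8"), "iso-8859-1")
```

Falling back to the raw bytes keeps old, non-UTF-8 filenames usable, at
the cost that the caller cannot tell which of the two paths was taken.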

> How do you think your proposal should be implemented?

I hope that helps.
