Bug#99933: second attempt at more comprehensive unicode policy

To: Robert Bihlmeyer <robbe@orcus.priv.at>
Cc: 99933@bugs.debian.org
Subject: Bug#99933: second attempt at more comprehensive unicode policy
From: Colin Walters <walters@debian.org>
Date: 04 Jan 2003 17:27:05 -0500
Message-id: <1041719224.31647.12.camel@space-ghost>
Reply-to: Colin Walters <walters@debian.org>, 99933@bugs.debian.org
In-reply-to: <874r8oppl5.fsf@orcus.priv.at>
References: <1041476827.25298.32.camel@space-ghost> <20030102181206.GA24191@atlas15.dnp.fmph.uniba.sk> <1041533855.15063.19.camel@space-ghost> <1041546314.22038.9.camel@space-ghost> <20030103231158.GB8502@tatonka.pfalz.de> <1041648625.21808.28.camel@space-ghost> <87isx4q588.fsf@orcus.priv.at> <1041700241.32717.35.camel@space-ghost> <874r8oppl5.fsf@orcus.priv.at>

On Sat, 2003-01-04 at 16:33, Robert Bihlmeyer wrote:

> Don't you think this is a common case? I'd even say more common than
> your scenarios. At least common enough that it should be acknowledged.

I agree, it is common enough.  But previously people had no choice but
to use a broken hack; now we have a solution.

> I am not concerned about RC bugs in mine or others packages. My point
> is that ways how things have worked up to now will no longer, and this
> can be avoided.

It only "worked" for specific regions, and specific cases.  We should of
course try to ensure that for people using filenames with legacy
non-ASCII encodings, the transition is as painless as possible.  I fully
understand and agree with that.

> > First of all, there is no need for 'if and only if'.  Programs can
> > always try to decode filenames in UTF-8, and if that fails, then try the
> > locale's charset.
> 
> This will invariably interpret some non-ASCII non-UTF8 filenames wrong.

That may be true.  However, UTF-8 was designed so that the chance of
text in another charset being mistaken for valid UTF-8 is small, and it
shrinks further as the length of the input increases.  See RFC 2279.
That's why it is a good strategy to try decoding as UTF-8 first and, if
that fails, fall back to the locale's encoding.
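
A minimal sketch of that strategy, in Python purely for illustration.
The name decode_filename and its fallback parameter are hypothetical,
not taken from any existing program; the assumption is only that the
filename arrives as raw bytes.

```python
import locale
from typing import Optional


def decode_filename(raw: bytes, fallback: Optional[str] = None) -> str:
    """Decode filename bytes: strict UTF-8 first, then a legacy charset."""
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Assumption: the locale's preferred encoding is the right legacy guess
    # unless the caller names one explicitly.
    encoding = fallback or locale.getpreferredencoding(False)
    # Bytes the fallback charset cannot map become U+FFFD, not an exception.
    return raw.decode(encoding, errors="replace")


print(decode_filename(b"caf\xc3\xa9"))            # valid UTF-8
print(decode_filename(b"caf\xe9", "iso-8859-1"))  # legacy Latin-1 bytes
```

Both calls yield the same four-character name.  The wrong guesses
Robert mentions are legacy names whose bytes happen to form valid UTF-8,
for example the Latin-1 string "Ã©", whose two bytes decode as "é".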

> But it will condone or even suggest broken behaviour like Gnome2's.

The whole point of this proposal is to move Debian more in line with
major chunks of upstream software like GNOME 2.  If you disagree with
their behavior, please suggest an alternative to solve all the problems
I named above.

> > Well, you might have to set G_BROKEN_FILENAMES.
> 
> Considering old standards broken because a newer one exists is just
> ridiculous.

The old "standards", such as they were, were a workaround for the lack
of Unicode support.  Now that we have it, we should stop using the
workaround.

> I still think taking LC_CTYPE unconditionally as a hint is the best
> solution. People who don't care (e.g. USians) are happy with any
> solution. 

No.  Even English-only programmers like me are tired of dealing with
the multitude of national encodings, and of having to make our programs
do things like unreliable charset autodetection.  ISO-8859-1 and BIG5
are not solutions for filenames; they are workarounds.

> People that have it at an older encoding get some slack.
> People like you should already have it at UTF8 and get all the fun
> right away.

I'm not sure what you are saying here.

> That's quite an understatement. The commandline editor can't deal with
> multibyte characters in any way. So for example entering an o umlaut
> and then deleting it gets you in trouble, because zsh does not handle
> the two byte sequence as one character.

Ok.  Well, this should not be impossible to fix, I hope.
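
To make the failure concrete, here is a small sketch (Python, with
hypothetical helper names; this is not zsh's actual line-editor code)
contrasting a byte-wise backspace with one that skips back over UTF-8
continuation bytes, which always have the bit pattern 10xxxxxx.

```python
def backspace_bytes(buf: bytes) -> bytes:
    """Naive backspace: drop exactly one byte."""
    return buf[:-1]


def backspace_utf8(buf: bytes) -> bytes:
    """Drop one whole UTF-8 encoded character from the end of buf."""
    i = len(buf) - 1
    # Step back over continuation bytes (10xxxxxx) to reach the lead byte.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return buf[:max(i, 0)]


line = "fo\u00f6".encode("utf-8")  # 'f', 'o', then o-umlaut as two bytes

print(backspace_utf8(line))   # the whole character is removed
print(backspace_bytes(line))  # a dangling lead byte is left behind
```

The byte-wise version leaves a lone 0xC3 at the end of the buffer,
which is no longer valid UTF-8.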

> FWIW, I am quite content with mandating the contents of some files as
> UTF8. 

Again, there is no mandate involved in my policy proposal.  It is all
just "should"s, except for file names.

> We may want a BOM, at the start, though.

We don't need one for UTF-8.  That's another one of the great things
about it.
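
A short illustration, using only the Python standard library: UTF-8 is
a byte-oriented encoding with no byte-order ambiguity, so an encoder has
nothing to mark, whereas a generic UTF-16 encoder must record which
byte order it chose.  The helper name read_text is hypothetical.

```python
import codecs

text = "\u00f6"  # o-umlaut

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")  # Python's generic UTF-16 codec adds a BOM

# UTF-8 output carries no marker; UTF-16 output starts with one.
print(utf8.startswith(codecs.BOM_UTF8))
print(utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))


def read_text(raw: bytes) -> str:
    """Decode UTF-8, tolerating (and dropping) a leading BOM if present."""
    return raw.decode("utf-8-sig")
```

A reader that wants to tolerate files written by tools that do emit a
UTF-8 BOM can decode with "utf-8-sig", as read_text does.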
