Re: Bug#701081: debian-policy: mandate an encoding for filenames in binary packages

To: debian-policy@lists.debian.org
Subject: Re: Bug#701081: debian-policy: mandate an encoding for filenames in binary packages
From: Simon McVittie <smcv@debian.org>
Date: Thu, 21 Feb 2013 13:03:00 +0000
Message-id: <51261B04.8030700@debian.org>
In-reply-to: <20130221114327.GA19746@alf.mars>
References: <20130221114327.GA19746@alf.mars>

On 21/02/13 11:43, Helmut Grohne wrote:
> The number of exceptions is about 200 contained in about 50 binary
> packages.

Do you have a list handy?

What proportion of them are UTF-8? You can test via, for instance:

  printf '%s\n' "$filename" | isutf8 -q /dev/stdin || printf 'not UTF-8: %s\n' "$filename"

with isutf8(1) from moreutils. In theory this could wrongly accept a
name that is not really UTF-8 (a false positive for "is UTF-8"), but
UTF-8's design makes it unlikely that meaningful strings in ISO-8859-*
happen to be syntactically valid UTF-8.
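
To run the same kind of check over a whole list of names, something like
the following might do. This is only a sketch: it assumes iconv(1) as a
stand-in for isutf8 (converting UTF-8 to UTF-8 fails on an invalid byte
sequence), and the function names is_utf8 and report_non_utf8 are made up
for illustration. It reads one name per line, so names containing
newlines are not handled.

```shell
#!/bin/sh
# is_utf8 NAME: succeed if NAME is syntactically valid UTF-8.
# Assumption: iconv(1) replaces isutf8 here; a UTF-8 to UTF-8 conversion
# exits non-zero when it meets an invalid or truncated sequence.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# report_non_utf8: read names from stdin, one per line, and print the
# ones that fail the check.
report_non_utf8() {
    while IFS= read -r name; do
        is_utf8 "$name" || printf 'not UTF-8: %s\n' "$name"
    done
}
```

It could be fed from any list of paths, for example
"find . -print | report_non_utf8".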

> In those packages some filenames are not representable as
> UTF-8 (for example aspell-is)

I assume you mean "are not UTF-8" (presumably they're ISO-8859-1 or
ISO-8859-15?) rather than "not representable"? (Any Latin1 string is
representable in UTF-8 via transcoding, although the resulting bytes
will obviously be different.)
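
As a small illustration of that transcoding, here is "é" as the single
ISO-8859-1 byte 0xE9, before and after conversion. This is a sketch
assuming iconv(1) and od(1) are available; hex_bytes is a helper made up
for this example.

```shell
#!/bin/sh
# hex_bytes: print stdin as lowercase hex digits with no separators.
hex_bytes() {
    od -An -v -tx1 | tr -d ' \n'
}

# Octal 351 is 0xE9, which is "é" in ISO-8859-1.
printf '\351' | hex_bytes; echo                                  # e9
printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes; echo   # c3a9
```

The same character ends up as two bytes, so the filename on disk changes
even though the text it represents does not.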

> and others don't make any sense in
> ISO-8859-15 (for example ca-certificates).

These do appear to be UTF-8.

> to mandating a particular encoding (such as UTF-8).

I would personally be inclined to recommend/mandate UTF-8.

I certainly don't think any option other than ASCII, UTF-8 or "they're
just bytestrings, deal with it" would make sense - the third of those
options is what we have at the moment, and this bug is basically a
request to reject it.

Tools typically assume that filenames are encoded either according to
the current locale (traditional Unix behaviour, and GNOME with
G_BROKEN_FILENAMES set) or in UTF-8 (probably many tools, but notably
encouraged by GNOME); and I believe Debian has defaulted to UTF-8
locales for quite some time, so the two often coincide.

Also, as far as I know, UTF-8 is the only widely-used encoding that can
represent all Unicode characters and is suitable for Unix filenames.
ISO-8859-* can't represent all characters; UTF-16 and UTF-32 are
unsuitable for Unix filenames because they don't coincide with ASCII
over the ASCII range (their encodings of ASCII text contain 0x00 bytes,
which a Unix filename cannot contain); and UCS-2 manages to have both
problems simultaneously.
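
To make the ASCII-range point concrete, this sketch prints the bytes of
the ASCII string "a/b" in three encodings. It assumes an iconv(1) that
knows the names UTF-16LE and UTF-32LE (glibc's does); hex_bytes and
encode_hex are helpers made up for this example.

```shell
#!/bin/sh
# hex_bytes: print stdin as lowercase hex digits with no separators.
hex_bytes() {
    od -An -v -tx1 | tr -d ' \n'
}

# encode_hex ENCODING TEXT: show TEXT's bytes once converted to ENCODING.
encode_hex() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | hex_bytes
}

for enc in UTF-8 UTF-16LE UTF-32LE; do
    printf '%-9s %s\n' "$enc" "$(encode_hex "$enc" 'a/b')"
done
# UTF-8     612f62
# UTF-16LE  61002f006200
# UTF-32LE  610000002f00000062000000
```

Only the UTF-8 form is byte-for-byte identical to ASCII; the other two
contain 00 bytes.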

    S
