Re: Bug#701081: debian-policy: mandate an encoding for filenames in binary packages
On 21/02/13 11:43, Helmut Grohne wrote:
> The number of exceptions is about 200 contained in about 50 binary
> packages.
Do you have a list handy?
What proportion of them are UTF-8? You can test via, for instance:
printf '%s\n' "$filename" | isutf8 -q /dev/stdin || echo "not UTF-8: $filename"
with isutf8(1) from moreutils. In theory this could have false
positives (a non-ASCII ISO-8859-* name that passes the check), but
UTF-8's design makes it unlikely that meaningful strings in ISO-8859-*
happen to be syntactically valid UTF-8.
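For systems without moreutils, the same check can be sketched with
iconv(1), which fails when asked to convert invalid UTF-8 to UTF-8. This
is only an illustration: the function names is_utf8_name and
list_non_utf8 are made up for this example, and the directory walk
assumes filenames do not contain newlines.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument is syntactically valid UTF-8.
# iconv exits non-zero when the input is not valid in the source encoding.
is_utf8_name() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: report every path under directory $1 that is not
# valid UTF-8. Assumption: no filename contains a newline.
list_non_utf8() {
    find "$1" -print | while IFS= read -r name; do
        is_utf8_name "$name" || printf 'not UTF-8: %s\n' "$name"
    done
}
```

Running list_non_utf8 on the root of an unpacked binary package would
print one line per offending path.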
> In those packages some filenames are not representable as
> UTF-8 (for example aspell-is)
I assume you mean "are not UTF-8" (presumably they're ISO-8859-1 or
ISO-8859-15?) rather than "not representable"? (Any Latin1 string is
representable in UTF-8 via transcoding, although the resulting bytes
will obviously be different.)
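To illustrate the transcoding point, here is a small sketch using
iconv(1) and od(1). The function names latin1_to_utf8 and hex_bytes are
invented for this example, and the sample string is an assumption, not
one of the filenames from the affected packages.

```shell
#!/bin/sh
# Hypothetical helper: transcode ISO-8859-1 on stdin to UTF-8 on stdout.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Hypothetical helper: print stdin as a compact lowercase hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}
```

For instance, printf 'caf\351' | hex_bytes shows the Latin-1 bytes
(ending in e9), while printf 'caf\351' | latin1_to_utf8 | hex_bytes
shows the same text as UTF-8 (ending in c3 a9): the character is
representable, but the bytes change.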
> and others don't make any sense in
> ISO-8859-15 (for example ca-certificates).
These do appear to be UTF-8.
> to mandating a particular encoding (such as UTF-8).
I would personally be inclined to recommend/mandate UTF-8.
I certainly don't think any option other than ASCII, UTF-8 or "they're
just bytestrings, deal with it" would make sense - the third of those
options is what we have at the moment, and this bug is basically a
request to reject it.
Tools typically either assume that filenames are encoded according to
the current locale (traditional Unix behaviour, and GNOME with
G_BROKEN_FILENAMES set) or UTF-8 (probably many tools, but notably
encouraged by GNOME); and I believe Debian has defaulted to UTF-8
locales for quite some time, so the two often coincide.
Also, as far as I know, UTF-8 is the only widely-used encoding that can
represent all Unicode characters and is suitable for Unix filenames.
ISO-8859-* can't represent all characters; UTF-16 and UTF-32 are
unsuitable for Unix filenames because they don't coincide with ASCII
over the ASCII range (their encoded forms routinely contain NUL bytes,
which can't appear in a Unix filename, and can contain 0x2F bytes that
are not meant as '/'); and UCS-2 manages to have both problems
simultaneously.
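The UTF-16 problem is easy to see by dumping the bytes. This is a
sketch only: the function name utf16le_hex is invented for this example,
and it assumes an iconv(1) that knows the encoding name UTF-16LE.

```shell
#!/bin/sh
# Hypothetical helper: read UTF-8 text on stdin and print its UTF-16LE
# encoding as a compact lowercase hex string.
utf16le_hex() {
    iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n'
}
```

Feeding it the three ASCII characters a/b, as in
printf 'a/b' | utf16le_hex, yields a hex string in which every
character is followed by a 00 byte, so the result could not be passed to
the kernel as a path.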
S