
Bug#1026231: debian-policy: document droppage of support for legacy locales



On Fri, 16 Dec 2022 at 19:21:37 +0100, Adam Borowski wrote:
> As of Bookworm, legacy locales are no longer officially supported.

For clarity, I think when you say "legacy locales" you mean locales
whose character encoding is either explicitly or implicitly something
other than UTF-8 ("legacy national encodings"), like en_US (implicitly
ISO-8859-1 according to /usr/share/i18n/SUPPORTED) and en_GB.ISO-8859-15
(explicitly ISO-8859-15 in its name). True?

Many of the non-UTF-8 encodings are single-byte encodings in the
ISO-8859 family, but if I understand correctly, your reasoning applies
equally to multi-byte east Asian encodings like BIG5, GB18030 and EUC-JP.
Also true?

Meanwhile, locales with a UTF-8 character encoding, like en_AG
(implicitly UTF-8 according to /usr/share/i18n/SUPPORTED) or en_US.UTF-8
(explicitly UTF-8), are the ones you are considering to be non-legacy.
Also true?
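
To make the classification above concrete, here is a minimal Python
sketch. It assumes the installed /usr/share/i18n/SUPPORTED uses one
"locale charset" pair per line; the sample text and the helper names
(parse_supported, is_utf8_locale) are hypothetical, chosen only for
illustration, and the sample contains just the locales mentioned above.

```python
# Hypothetical sample in the assumed "locale charset" per-line layout.
SAMPLE_SUPPORTED = """\
en_AG UTF-8
en_GB.ISO-8859-15 ISO-8859-15
en_US ISO-8859-1
en_US.UTF-8 UTF-8
"""


def parse_supported(text):
    """Map each locale name to the charset listed next to it."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments and blanks
        fields = line.split()
        if len(fields) == 2:
            table[fields[0]] = fields[1]
    return table


def is_utf8_locale(name, table):
    """True if the locale's charset is UTF-8, whether implicit or explicit."""
    return table[name].upper() == "UTF-8"


table = parse_supported(SAMPLE_SUPPORTED)
for name in sorted(table):
    kind = "UTF-8" if is_utf8_locale(name, table) else "non-UTF-8"
    print(f"{name}: {kind}")
```

Note that the locale name alone is not enough: en_AG and en_US both lack
an explicit suffix, yet only one of them is a UTF-8 locale.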

I think for Policy use, this would have to say something more precise,
like "locales with a non-UTF-8 character encoding". I wouldn't want to
get en_US speakers trying to argue that en_GB.UTF-8 is a legacy locale,
or en_GB speakers like me trying to argue that en_US.UTF-8 is a legacy
locale :-)

When you say "officially supported" here, do you refer to the extent
to which they are supported by the glibc maintainers, or some other
group? Or are you describing a change request that they *should not*
be officially supported by Debian - something that is not necessarily
true yet, but in this bug you are asking for it to become true?

> * Software may assume they always run in an UTF-8 locale, and emit or
>   require UTF-8 input/output without checking.

I suspect this is already common: for example, ikiwiki is strictly
UTF-8-only and ignores locales' character sets, which is arguably a bug
right now but would become a non-bug with your proposed policy.

This is a "may" so it can't possibly make a package gain bugs. It might
make packages have fewer bugs.
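
For illustration, "emit or require UTF-8 without checking" could look
roughly like the following Python sketch. The function names are
hypothetical, and this is only one way to do it: the program works on
byte streams and never consults the locale's character set.

```python
import io


def emit_utf8(text, stream):
    """Write text to a binary stream as UTF-8, whatever the locale says."""
    stream.write(text.encode("utf-8"))


def read_utf8(stream):
    """Read a binary stream as UTF-8; non-UTF-8 input raises an error."""
    return stream.read().decode("utf-8")


buf = io.BytesIO()
emit_utf8("na\u00efve \u00a35\n", buf)
buf.seek(0)
print(read_utf8(buf), end="")
```

In a real program the stream would typically be sys.stdout.buffer or
sys.stdin.buffer rather than an in-memory buffer.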

> * The execution environment (usually init system or a container) must
>   default to UTF-8 encoding unless explicitly configured otherwise.

Is this already true? This seems like the sort of thing which should be
fixed in at least the major init systems and container managers before it
goes into Policy, in the interests of not making those init systems and
container managers retroactively buggy.

> * Legacy locales are no longer officially supported, and packages may
>   drop support for them and/or exclude them from their testsuites.
> * Packages may retain support for legacy locales, but related bug reports
>   (unless security related) are considered to be of wishlist severity.

Is the C (aka POSIX) locale still a non-UTF-8 locale (if I remember
correctly its character encoding is officially 7-bit ASCII), or has it
been redefined to be UTF-8? Given the special status of the C locale in
defaults and standards, it might be necessary to say that it's the only
supported locale with a non-UTF-8 character encoding.

> * Filesystems may be configured to reject file names that are not valid
>   printable UTF-8 encoded Unicode.

To put this in terms of the requirements that Policy puts on packages,
is this really a should/must in disguise: packages should/must not
assume that they can successfully read/write filenames that are not valid
printable UTF-8-encoded Unicode?

This seems like a change with a wider scope: not only is it excluding
filenames in Latin-1 or whatever, it's also excluding filenames with
non-printable characters (tabs, control characters etc.), or with
the UTF-8 representation of a noncharacter like U+FDEF. Perhaps that
should be a change orthogonal to de-supporting the non-UTF-8 locales?
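
The three kinds of exclusion can be told apart with a short Python
sketch. This is only my reading of "valid printable UTF-8": I am
assuming "printable" means "no control characters (general category
Cc)", and the function name is hypothetical. A real filesystem policy
might draw the line elsewhere.

```python
import unicodedata


def filename_problem(name: bytes):
    """Return why a filename would be rejected, or None if it passes."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    for ch in text:
        cp = ord(ch)
        # Noncharacters: U+FDD0..U+FDEF and the last two code points
        # of every plane (U+xxFFFE and U+xxFFFF).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return "noncharacter"
        # Assumption: "non-printable" means control characters (Cc).
        if unicodedata.category(ch) == "Cc":
            return "control character"
    return None


for candidate in (b"caf\xe9", b"a\tb", "\ufdef".encode("utf-8"),
                  "caf\u00e9".encode("utf-8")):
    print(candidate, "->", filename_problem(candidate))
```

Only the first case has anything to do with non-UTF-8 locales; the other
two rejections would apply equally on an all-UTF-8 system.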

> * Human-readable files outside of packages' private data must be encoded
>   in UTF-8.  This applies especially to files in /usr/share/doc and /etc
>   but applies to eg. executable scripts in /bin or /sbin as well.

It's not immediately obvious to me what "human-readable files" means here.
Text files? Text files in ASCII-compatible encodings? Files intended to be
read and written by standard text editors?

I assume the intention here is to make it a policy violation to ship
documentation, scripts, configuration files, etc. encoded in something
like ISO-8859-1 or EUC-JP?

Is this intended to make it a policy violation to ship documentation, etc.
encoded in UTF-16?
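
As a rough sketch of what a check for this could look like (a heuristic
of my own for illustration, not an existing tool's behaviour), the
following Python function accepts data only if it decodes as UTF-8 and
contains no NUL bytes. The NUL test is an assumption added so that
BOM-less UTF-16 containing only ASCII characters is not mistaken for
UTF-8 text.

```python
def looks_like_utf8_text(data: bytes) -> bool:
    """Heuristic: valid UTF-8 with no NUL bytes counts as UTF-8 text."""
    if b"\x00" in data:
        return False  # typical of UTF-16/UCS-4, or of binary data
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "UTF-8": "caf\u00e9\n".encode("utf-8"),
    "ISO-8859-1": "caf\u00e9\n".encode("iso-8859-1"),
    "EUC-JP": "\u65e5\u672c\u8a9e\n".encode("euc-jp"),
    "UTF-16 with BOM": "hello\n".encode("utf-16"),
    "UTF-16LE without BOM": "hello\n".encode("utf-16-le"),
}
for label, data in samples.items():
    print(f"{label}: {looks_like_utf8_text(data)}")
```

Under this heuristic the UTF-16 samples are rejected along with the
ISO-8859-1 and EUC-JP ones, which is why the question matters.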

> * So-called BOM (U+FEFF) must not be added to plain-text output, and if
>   present, editors/viewers customarily used for editing code should not
>   hide its presence.

This seems to me like it should perhaps be out-of-scope here, and treated
as a separate change: UTF-8 is still UTF-8, whether it starts with U+FEFF
or not, and I think deprecating en_GB in favour of en_GB.UTF-8 (and so on)
is orthogonal to deprecating the use of a U+FEFF prefix on UTF-8 text.

I think "UTF-8 output" is probably a better scope for this than
"plain-text output": my understanding is that when emitting UTF-16, UCS-2
or UCS-4 it's conventional (perhaps even recommended?) to emit a BOM
first, because in those encodings of Unicode, either LE or BE byte order
is reasonable (unlike UTF-8, which is always MSB-first by design). Perhaps
you meant this to be implicit, because to a Unix developer, "plain text"
is implicitly something ASCII-compatible (which rules out every Unicode
encoding except UTF-8), and most legacy national encodings cannot
represent U+FEFF (which rules those out; GB18030 is an exception, since
it covers all of Unicode), leaving UTF-8 as essentially the only "plain
text" encoding where U+FEFF is even representable?
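
A small Python sketch of the byte-order point, using only the standard
codecs (the helper name encoded_bom is hypothetical, and only a few
example encodings are probed): U+FEFF encodes differently in UTF-16LE
and UTF-16BE, always the same way in UTF-8, and not at all in the two
legacy encodings tried here.

```python
import codecs


def encoded_bom(encoding):
    """Return U+FEFF encoded in the given encoding, or None if impossible."""
    try:
        return "\ufeff".encode(encoding)
    except UnicodeEncodeError:
        return None


for enc in ("utf-8", "utf-16-le", "utf-16-be", "iso-8859-1", "euc-jp"):
    print(enc, encoded_bom(enc))

# The "utf-8-sig" codec strips a leading U+FEFF; plain "utf-8" keeps it.
data = codecs.BOM_UTF8 + b"hi"
print(repr(data.decode("utf-8")), repr(data.decode("utf-8-sig")))
```

The last two lines show why a viewer that decodes with something like
"utf-8-sig" hides the prefix, while a strict UTF-8 decoder leaves it in
the text.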

It seems to me that it shouldn't be a Policy violation for things
like text editors and character set converters to have the option to
emit UTF-8-with-U+FEFF-prefix, but maybe it should instead be a Policy
violation for that to be the default.

    smcv

