
Bug#1026231: debian-policy: document droppage of support for legacy locales



On Mon, Dec 19, 2022 at 07:08:09PM +0000, Simon McVittie wrote:
> On Fri, 16 Dec 2022 at 19:21:37 +0100, Adam Borowski wrote:
> > As of Bookworm, legacy locales are no longer officially supported.
> 
> For clarity, I think when you say "legacy locales" you mean locales
> whose character encoding is either explicitly or implicitly something
> other than UTF-8 ("legacy national encodings"), like en_US (implicitly
> ISO-8859-1 according to /usr/share/i18n/SUPPORTED) and en_GB.ISO-8859-15
> (explicitly ISO-8859-15 in its name). True?

Aye.

> Many of the non-UTF-8 encodings are single-byte encodings in the
> ISO-8859 family, but if I understand correctly, your reasoning applies
> equally to multi-byte east Asian encodings like BIG5, GB18030 and EUC-JP.
> Also true?

Aye.  Anything but UTF-8.

> Meanwhile, locales with a UTF-8 character encoding, like en_AG
> (implicitly UTF-8 according to /usr/share/i18n/SUPPORTED) or en_US.UTF-8
> (explicitly UTF-8), are the ones you are considering to be non-legacy.
> Also true?

Right.

> I think for Policy use, this would have to say something more precise,
> like "locales with a non-UTF-8 character encoding". I wouldn't want to
> get en_US speakers trying to argue that en_GB.UTF-8 is a legacy locale,
> or en_GB speakers like me trying to argue that en_US.UTF-8 is a legacy
> locale :-)

English (traditional) vs English (simplified) :p

> When you say "officially supported" here, do you refer to the extent
> to which they are supported by the glibc maintainers, or some other
> group? Or are you describing a change request that they *should not*
> be officially supported by Debian - something that is not necessarily
> true yet, but in this bug you are asking for it to become true?

My primary source is glibc, especially the debconf questions from "locales",
although bit-rot and/or outright droppage is widespread in other packages.

> > * Software may assume they always run in an UTF-8 locale, and emit or
> >   require UTF-8 input/output without checking.
> 
> I suspect this is already common: for example, ikiwiki is strictly
> UTF-8-only and ignores locales' character sets, which is arguably a bug
> right now but would become a non-bug with your proposed policy.

Exactly, I want to declare that a non-bug, thus saving developer time.
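
For illustration, the check that such software skips is usually just a
glance at nl_langinfo(CODESET).  A minimal sketch in C (my own example,
not taken from ikiwiki or any other particular package):

    /* Warn if the environment's locale does not use UTF-8 -- the check
     * that UTF-8-only software currently omits, and which the proposed
     * wording would let it omit legitimately. */
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_ALL, "");           /* adopt the configured locale */
        const char *codeset = nl_langinfo(CODESET);

        if (strcmp(codeset, "UTF-8") != 0)
            fprintf(stderr, "warning: non-UTF-8 locale (codeset %s)\n",
                    codeset);
        return 0;
    }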

> This is a "may" so it can't possibly make a package gain bugs. It might
> make packages have fewer bugs.

Aye.

> > * The execution environment (usually init system or a container) must
> >   default to UTF-8 encoding unless explicitly configured otherwise.
> 
> Is this already true? This seems like the sort of thing which should be
> fixed in at least the major init systems and container managers before it
> goes into Policy, in the interests of not making those init systems and
> container managers retroactively buggy.

Systemd has done so since version 240; sysvinit relies on settings in
/etc/, thus in the case of a bare debootstrap the variables might be
unset -- which is mostly moot since glibc 2.35.  We briefly discussed a
one-line patch to ensure there's a fallback default; it's currently not
applied (but can be).  This would be relevant only for corner cases like
an unconfigured system running non-glibc non-musl binaries that rely on
LC_*.
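
To make "fallback default" concrete, the idea was roughly the following
(a sketch only, with C.UTF-8 as the assumed fallback; this is not the
actual patch we discussed):

    /* Sketch: if the environment carries no locale settings at all,
     * fall back to C.UTF-8 (built into glibc since 2.35) instead of
     * the plain C locale. */
    #include <locale.h>
    #include <stdlib.h>

    static void init_locale(void)
    {
        if (!getenv("LC_ALL") && !getenv("LC_CTYPE") && !getenv("LANG"))
            setenv("LANG", "C.UTF-8", 0);   /* hypothetical fallback */

        setlocale(LC_ALL, "");
    }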

I'm less knowledgeable about containers, but they appear to work.  It might
be due to copying variables from the host or having template defaults...

Anyway, my aim is more to tell packages that they are allowed to
misbehave when the settings are missing than to hunt down misuse
scenarios.  But if such a scenario is found, under the current Policy
there is no recourse, whereas with this rule added it would be a bug.

> > * Legacy locales are no longer officially supported, and packages may
> >   drop support for them and/or exclude them from their testsuites.
> > * Packages may retain support for legacy locales, but related bug reports
> >   (unless security related) are considered to be of wishlist severity.
> 
> Is the C (aka POSIX) locale still a non-UTF-8 locale (if I remember
> correctly its character encoding is officially 7-bit ASCII), or has it
> been redefined to be UTF-8? Given the special status of the C locale in
> defaults and standards, it might be necessary to say that it's the only
> supported locale with a non-UTF-8 character encoding.

Hmm... if I recall correctly, old POSIX left the behaviour of characters
above 126 undefined, making C.UTF-8 _almost_ match the requirements (with
the only exception being iswblank() IIRC), but the current version
specifies ASCII (rather than the C standard's "portable subset") with no
additions to character classes other than cntrl and punct allowed.
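
The difference is easy to probe; a small test program (U+00A0 is picked
purely as an example character, and the output depends on the glibc
version):

    /* Compare classification of a non-ASCII character under "C" and
     * "C.UTF-8" -- the iswblank() point mentioned above. */
    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    static void probe(const char *loc)
    {
        if (setlocale(LC_ALL, loc) == NULL) {
            printf("%-8s: not available\n", loc);
            return;
        }
        printf("%-8s: iswblank(U+00A0)=%d iswpunct(U+00A0)=%d\n",
               loc, iswblank(0x00A0) != 0, iswpunct(0x00A0) != 0);
    }

    int main(void)
    {
        probe("C");
        probe("C.UTF-8");
        return 0;
    }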

This is the locale all processes start with, until they call setlocale().
I'm still undecided whether it should be allowed as the system locale
(i.e., when a process says it wants locale handling enabled).

Having it as the system locale breaks non-ASCII in GUIs and some text
output, causes misalignment, etc.  Thus maybe we can relegate it to the
"you can set it if you want, but if it breaks, you get to keep both
pieces" status?  Which probably needs no explicit mention in the Policy.

> > * Filesystems may be configured to reject file names that are not valid
> >   printable UTF-8 encoded Unicode.
> 
> To put this in terms of the requirements that Policy puts on packages,
> is this really a should/must in disguise: packages should/must not
> assume that they can successfully read/write filenames that are not valid
> printable UTF-8-encoded Unicode?

AFAIK valid Unicode is already required for e.g. remotely mounted SMBFS.
It also used to be required for JFS, but because of (at the time)
widespread non-Unicode encodings it got changed to an unsightly on-disk
double-encoded format.

I wonder how viable a gradual change would be.  For the filesystem behaviour
to change, it would be good to allow it in the Policy, without explicitly
requiring the matching change in packages.
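
Read from the package side, the rule would amount to something like the
sketch below: don't assume that a name which isn't valid UTF-8 can be
created, and be prepared for the filesystem to refuse it (the exact
errno varies by filesystem, so only the failure itself is checked):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* "naïve" encoded in ISO-8859-1: the lone 0xEF byte is not
         * valid UTF-8 in this position. */
        const char *name = "na\xefve.txt";

        int fd = open(name, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0) {
            fprintf(stderr, "cannot create %s: %s\n", name,
                    strerror(errno));
            return 1;    /* fall back or report -- but don't crash */
        }
        close(fd);
        unlink(name);
        return 0;
    }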

> This seems like a change with a wider scope: not only is it excluding
> filenames in Latin-1 or whatever, it's also excluding filenames with
> non-printable characters (tabs, control characters etc.), or with
> the UTF-8 representation of a noncharacter like U+FDEF. Perhaps that
> should be a change orthogonal to de-supporting the non-UTF-8 locales?

Maybe; you have a point.  I do run my boxen with a kernel that disallows
non-printables (especially tabs/newlines/...), and generally the only
failures I see are in testsuites.

Thus you can unsmuggle the word "printable"; I reflexively added it as
it's something I care about, but it is indeed orthogonal to non-UTF-8.

> > * Human-readable files outside of packages' private data must be encoded
> >   in UTF-8.  This applies especially to files in /usr/share/doc and /etc
> >   but applies to eg. executable scripts in /bin or /sbin as well.
> 
> It's not immediately obvious to me what "human-readable files" means here.
> Text files? Text files in ASCII-compatible encodings? Files intended to be
> read and written by standard text editors?

There's no clear threshold for "human-readable".  There are so many
formats that are sometimes meant for the user to read/edit and sometimes
are not.  E.g., there's HTML you may edit and HTML that's the hellish
output of a pile of templates.  Some folks claim that XML is "human
readable".  And shell scripts produced by autoconf are no more meant for
human consumption than the disassembly of a binary executable.

I've thus intentionally left the definition vague.

> I assume the intention here is to make it a policy violation to ship
> documentation, scripts, configuration files, etc. encoded in something
> like ISO-8859-1 or EUC-JP?

Exactly.  They can't be conveniently read by users; and even without the
Policy change that's a problem for 99.9% of users today.
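
If this goes in, the natural enforcement point is a lintian-style check
that shipped text files actually decode as UTF-8.  A rough sketch of
such a check using glibc's own decoder (not the actual lintian code,
and it assumes the C.UTF-8 locale is available):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    /* Returns 1 if the file decodes as UTF-8, 0 if not, -1 on I/O error. */
    static int is_valid_utf8_file(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f)
            return -1;

        mbstate_t st;
        memset(&st, 0, sizeof st);
        char buf[4096];
        size_t len;
        int ok = 1;

        while (ok && (len = fread(buf, 1, sizeof buf, f)) > 0) {
            for (size_t i = 0; i < len; ) {
                size_t r = mbrtowc(NULL, buf + i, len - i, &st);
                if (r == (size_t)-1) { ok = 0; break; }   /* invalid byte */
                if (r == (size_t)-2) { i = len; break; }  /* continues in next read */
                i += r ? r : 1;                           /* r == 0: NUL byte */
            }
        }
        if (!mbsinit(&st))   /* file ended in the middle of a sequence */
            ok = 0;
        fclose(f);
        return ok;
    }

    int main(int argc, char **argv)
    {
        setlocale(LC_CTYPE, "C.UTF-8");   /* decode as UTF-8 regardless of caller */
        for (int i = 1; i < argc; i++)
            if (is_valid_utf8_file(argv[i]) != 1)
                printf("%s: not valid UTF-8\n", argv[i]);
        return 0;
    }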

> Is this intended to make it a policy violation to ship documentation, etc.
> encoded in UTF-16?

For a Unix person, UTF-16 doesn't make a text file.  Users already can't
read that without special tools, so that's no change.

> > * So-called BOM (U+FEFF) must not be added to plain-text output, and if
> >   present, editors/viewers customarily used for editing code should not
> >   hide its presence.
> 
> This seems to me like it should perhaps be out-of-scope here, and treated
> as a separate change: UTF-8 is still UTF-8, whether it starts with U+FEFF
> or not, and I think deprecating en_GB in favour of en_GB.UTF-8 (and so on)
> is orthogonal to deprecating the use of a U+FEFF prefix on UTF-8 text.

While I stand by my point that BOMs are harmful, you have a point that this
may be a separate change.  Agreed.
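
For what it's worth: writer-side the rule boils down to "never emit the
bytes EF BB BF first", while reader-side the pragmatic behaviour is to
tolerate a stray BOM on input.  A small sketch of the latter,
illustrative only:

    #include <stdio.h>
    #include <string.h>

    /* Skip a UTF-8 BOM (EF BB BF) at the start of a seekable stream,
     * if one happens to be there; never write one back out. */
    static void skip_utf8_bom(FILE *f)
    {
        unsigned char buf[3];
        size_t n = fread(buf, 1, 3, f);

        if (n == 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
            return;                     /* BOM consumed */

        /* Not a BOM: rewind so the caller sees every byte.  (A pipe
         * would need buffering instead of seeking.) */
        fseek(f, -(long)n, SEEK_CUR);
    }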

> I think "UTF-8 output" is probably a better scope for this than
> "plain-text output": my understanding is that when emitting UTF-16, UCS-2
> or UCS-4 it's conventional (perhaps even recommended?) to emit a BOM
> first, because in those encodings of Unicode, either LE or BE byte order
> is reasonable (unlike UTF-8, which is always MSB-first by design). Perhaps
> you meant this to be implicit, because to a Unix developer, "plain text"
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> is implicitly something ASCII-compatible (which rules out every Unicode
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> encoding except UTF-8), and legacy national encodings cannot represent
> U+FEFF (which rules those out), leaving UTF-8 as the only "plain text"
> encoding where U+FEFF is even representable?

Exactly, UTF-16/UCS-2/UCS-4 are not plain text.  Any other definition is
Wrong™ and its proponents need to be burned at the stake. :)

> It seems to me that it shouldn't be a Policy violation for things
> like text editors and character set converters to have the option to
> emit UTF-8-with-U+FEFF-prefix, but maybe it should instead be a Policy
> violation for that to be the default.

The distinction between a programmers' editor and a normal person's
editor is vague.  Only the former cares -- but there's no gain for the
latter in having a BOM, either.

An option to allow the user to do what he wants is not a sin, indeed:
after all, it's the user who owns the computer.  I'm speaking of defaults.

And I just tested Windows 11 notepad.exe: it defaults to UTF-8, and when
saving it offers "ANSI", "UTF-16 LE", "UTF-16 BE", "UTF-8" (default) and
"UTF-8 with BOM".  Thus, as even the Great Enemy has switched to no-BOM
UTF-8, I see no reason to do otherwise.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Quis trollabit ipsos trollos?
⠈⠳⣄⠀⠀⠀⠀

