
Bug#1026231: debian-policy: document droppage of support for legacy locales



On Wed, 18 Jan 2023 at 16:30:46 -0700, Anthony Fok wrote:
> In their mind, GB 18030 encompasses a lot more than just
> a character encoding mapping table.  It is the full support package
> (including fonts, display, printing, input methods, etc.) for Han
> Chinese and all other minority languages used in China.

If I'm reading correctly, the character encoding part of GB 18030-2022 is
a subset of a sufficiently new version of Unicode, in the same way that
(say) ISO-8859-15 is a subset of Unicode: for every character representable
in GB 18030-2022, you can point at an equivalent Unicode character and say
"this is the GB 18030-2022 encoding of U+4E00" or similar? Is that true?
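One rough way to probe that question for individual characters is a round trip through a GB 18030 codec. The sketch below uses Python's built-in "gb18030" codec; the helper name gb18030_round_trips is made up for illustration, and the codec's mapping tables are whatever ships with the interpreter, which may predate GB 18030-2022, so it is evidence rather than proof.

```python
# Sketch only: Python's "gb18030" codec uses the mapping tables bundled with
# the interpreter, which are not guaranteed to match GB 18030-2022.

def gb18030_round_trips(text: str) -> bool:
    """Return True if text survives Unicode -> GB 18030 -> Unicode unchanged."""
    encoded = text.encode("gb18030")
    return encoded.decode("gb18030") == text


han_one = "\u4e00"  # U+4E00, the example code point from the paragraph above
# Print the GB 18030 byte sequence (as hex) and whether it round-trips.
print(han_one.encode("gb18030").hex(), gb18030_round_trips(han_one))
```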

If that's the case, then supporting text files written in GB 18030
does not *necessarily* require the internal representation or the
system locale to be GB 18030, the same way I can still work with legacy
en_GB.ISO-8859-15 files on my en_GB.UTF-8 system: it could equally well
be done by using iconv() or equivalent to transcode to UTF-8, UTF-16 or
UCS-4 on input, doing all text editing operations on that Unicode, and
then transcoding back into GB 18030 on output. Most language frameworks
already do this as a matter of API: Qt, Java and Windows tend to work
with UTF-16 internally, while GLib/GTK uses UTF-8 internally.
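As a minimal sketch of that "transcode on input, edit as Unicode, transcode on output" pattern, here is the same idea in Python rather than with iconv() directly. The function name edit_legacy_text and its parameters are hypothetical, and Python's str stands in for the internal Unicode representation.

```python
from typing import Callable


def edit_legacy_text(raw: bytes,
                     edit: Callable[[str], str],
                     encoding: str = "gb18030") -> bytes:
    """Apply a Unicode-level edit to bytes stored in a legacy encoding."""
    text = raw.decode(encoding)    # transcode to Unicode on input
    edited = edit(text)            # all editing happens on Unicode text
    return edited.encode(encoding)  # transcode back on output


# The stored bytes stay GB 18030 throughout; only the in-memory form is Unicode.
original = "\u4e2d\u6587".encode("gb18030")
updated = edit_legacy_text(original, lambda s: s + "!")
print(updated.decode("gb18030"))
```

Nothing here depends on the system locale: the encoding is passed explicitly, which is the point of the argument above.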

iconv() seems very unlikely to drop support for GB 18030, ISO-8859-15 and
other non-Unicode encodings altogether. What this bug report is about is
dropping support for locales whose associated encoding is non-Unicode,
such as en_GB.ISO-8859-15 and zh_CN.GB18030, so that the data stream
between a CLI program and the terminal emulator will be assumed to be UTF-8
instead of ISO-8859-15 or GB18030.

The main problem I can see for GB 18030 users if the zh_CN.GB18030
locale were dropped is that various programs might assume that the
locale encoding is the right one to use when loading an existing file
whose encoding they cannot guess, or when writing new files by default.
Users who have moved from zh_CN.GB18030 to zh_CN.UTF-8 might therefore
find themselves unintentionally producing new UTF-8 files.
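To illustrate the kind of fallback I mean, here is a sketch of a program that uses the locale encoding unless the caller names one. The helpers default_file_encoding and save_text are hypothetical, not taken from any particular program; locale.getpreferredencoding() is the real Python call that reports the locale's encoding.

```python
import locale
from typing import Optional


def default_file_encoding(explicit: Optional[str] = None) -> str:
    """Use the caller's encoding if given, else fall back to the locale's."""
    return explicit or locale.getpreferredencoding(False)


def save_text(path: str, text: str, encoding: Optional[str] = None) -> None:
    """Write text to path, transcoding from Unicode to the chosen encoding."""
    with open(path, "w", encoding=default_file_encoding(encoding),
              newline="") as f:
        f.write(text)
```

With encoding left unset, the output follows the locale, so a move from zh_CN.GB18030 to zh_CN.UTF-8 silently changes the bytes written. Passing encoding="gb18030" explicitly keeps new files in GB 18030 regardless of the locale.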

Preferring to use Unicode does seem to be the direction that all of
computing is going in, as a simplifying assumption - for example W3C
advice for HTML is "You should always use the UTF-8 character encoding"[1]
- and as we know, things that aren't tested usually don't work. So I
think the level of functionality for non-UTF-8 locales and encodings in
the software we package is going to decline over time, whether Debian
wants it to or not.

    smcv

[1] https://www.w3.org/International/questions/qa-html-encoding-declarations

