
Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

Thorsten Glaser wrote:
For the mksh regression tests, I need a UTF-8 locale working; most
systems either provide “en_US.UTF-8” or “en_US.utf8” with the former
being recommended.

Build-depending on locales-all has worked for me so far, except it
won’t do in Kubuntu where said package does not exist (workaround
is to run 「locale-gen en_US.UTF-8」 in a pbuilder hook, but that’s
almost certainly not allowed in debian/rules *and* requires root),
and fails on hurd-i386 recently (locales-all fails to install).

The promise of the etch release to bring UTF-8 support was not met
because a standard installation of etch does not supply any locale
which can be used for LC_CTYPE with UTF-8 support; only installing
locales-all, or installing locales and debconfing one will do so.
I do not know about lenny, though, I have to admit.

The most light-weight solution would be to
• introduce a “C.UTF-8” locale, as some other OSes did, which is
  equivalent to the “C” (POSIX) locale in all respects *except*
  for LC_CTYPE, where it uses UTF-8 instead of a 7/8-bit character
  set or encoding
• deliver the “C.UTF-8” locale with the base system
• allow Debian packages to depend on its existence, both at
  build and run time
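
Until such a guarantee exists, a test harness can only probe for a
usable UTF-8 locale at run time. The following is a minimal Python
sketch of that probing, not something mksh or Debian ships; the
function name first_usable_locale and the candidate list are
assumptions chosen for illustration (the names are the ones mentioned
in this message).

```python
import locale


def first_usable_locale(candidates=("C.UTF-8", "en_US.UTF-8", "en_US.utf8")):
    """Return the first candidate accepted for LC_CTYPE, or None.

    Hypothetical helper: a name counts as "usable" if setlocale()
    accepts it. The process locale is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # not installed on this system, try the next one
            return name
        return None
    finally:
        # Always put the original LC_CTYPE back.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    found = first_usable_locale()
    print(found if found is not None else "no UTF-8 locale available")
```

On a system without any of the candidates installed, the function
returns None, which is exactly the situation the proposal above is
meant to rule out.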

A more controversial solution would be to do the second and third
point of the above with the “en_US.UTF-8” locale, but that would
be favouring US americanism. (On the other hand, it’s *the* one
most widely spread UTF-8 capable locale available, and as such,
the mksh regression tests use it upstream already.)

I don't understand the problem.
In POSIX, the choice of locale and charset is made by the user
(from the list of locales and charsets the system supports).
The default is the "C" locale (alias "POSIX").

If you need a specific locale (as mksh seems to; I am not
sure whether that is a bug in the program), you need to set it.
Why does mksh need UTF-8? What is wrong with other charsets
or with plain 7-bit ASCII?

Debian's goal is that all programs should support (and
possibly display) UTF-8 input and output. Mandating
UTF-8 as the default (instead of C/POSIX) would probably
be worse (and not POSIX-conformant).

About "C.UTF-8": I really think it is a mistake. If a user
needs a locale, they should set it with the right language
(maybe "en_US.UTF-8").
"C" doesn't mean "default" or "English"; it specifies a particular
output format, usually for automatic processing. (Check the POSIX
standard and its output requirements for the "C" locale.) en_US could
be more user-friendly, but "C" means "old sysadmin jargon".
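
To illustrate what "a specific output" means in practice, here is a
small Python sketch (the function name c_locale_snapshot is made up
for this example). It switches to the "C" locale and reads back a few
values that scripts commonly parse; in the "C" locale these are fixed
regardless of the user's language settings.

```python
import locale
import time


def c_locale_snapshot():
    """Return a few formatting values as seen under the "C" locale.

    Hypothetical helper for illustration; restores the previous
    locale settings before returning.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current settings
    try:
        locale.setlocale(locale.LC_ALL, "C")
        conv = locale.localeconv()
        epoch = time.gmtime(0)  # 1970-01-01 00:00:00 UTC
        return {
            "decimal_point": conv["decimal_point"],
            "thousands_sep": conv["thousands_sep"],
            "weekday": time.strftime("%A", epoch),
            "month": time.strftime("%B", epoch),
        }
    finally:
        locale.setlocale(locale.LC_ALL, saved)


if __name__ == "__main__":
    for key, value in c_locale_snapshot().items():
        print(f"{key}: {value!r}")
```

Under a language-specific locale the weekday and month names, and
possibly the decimal point, would differ, which is why tools meant to
be parsed by other programs are usually run under "C".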

So, if I understand your problem correctly, the right solution is:
- mksh should allow all locales and charsets
and one of:
- Debian should mandate (or possibly just recommend) en_US.UTF-8
  [ I think that is right for a standard installation, but IMHO
  it could be too strong for a minimal essential base (chroot)]
- or a package dependency providing "en_US.UTF-8" should be required.

