
Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale



Roger Leigh wrote:
On Tue, Apr 07, 2009 at 10:36:20AM +0200, Giacomo A. Catenazzi wrote:

>> Roger Leigh wrote:

I can't help but feel that your reply completely missed the
purpose of what I want to do, and why.  I hope the following
response clears things up.

I know that I missed the original point, but IMHO you were, and still
are, misunderstanding locale, charset and C language behaviour.

So I'm trying to explain how these things work, and after
that we can get to the real problem.
[Note: maybe I am the one in the wrong here. Standards are often not
so consistent on these behaviours, so maybe I have interpreted them
wrongly.]




- the input charset is the source charset (used to parse the C code)
- the exec charset is the charset of the target machine (which runs the program).

That's pretty much what I said.

- C99 must support Unicode identifiers (written with \uxxxx or in some other
  non-portable, implementation-defined way)

OK.  But that has really nothing to do with the fact that you can use
UTF-8 sources directly.  It's akin to having to support trigraphs,
but we don't use trigraphs because they are bloody annoying and nowadays
completely unnecessary.  But mainly, it doesn't affect the exec charset
whether you use UTF-8 encoded sources or \uxxxx.

ok.

- standard library functions can use locales (but only if you have
  initialised the locale), though not all functions, and not in all uses.
- wide characters are yet another thing (as you note in your example,
  the wide string is not UTF-8 but, I think, UTF-32)

Having the same input and exec charset really means: don't translate strings
(e.g. in
   if (c == 'a') printf("bcde\n");
 'a' and "bcde\n" will have the same byte values as in the input file;
 otherwise the binary will contain their representation in the exec charset)

Of course.  However, the test program I posted showed that if the
locale has been appropriately initialised, there is an additional
translation between the exec charset and the output charset specified
by the locale (see the Latin characters correctly preserved and output
as ISO-8859-1 in an ISO-8859-1 locale).

No ;-)  OK, it took me some modifications of your program and a look
at POSIX to discover the reason.

You forgot to check error codes. In this case we get
"Invalid or incomplete multibyte or wide character" in the
non-UTF-8 locale.

So, looking at POSIX:
"Wide-character codes for other characters are locale and implementation-defined."
You (and I) compiled the code with UTF-8, so the binary contains a
wchar representation that is invalid in a non-UTF-8 locale.

Note that it is locale-dependent, so the same charset with a different
language could give different results (I don't know whether there are
such cases in glibc).


Usually the interpretation of bytes is done by the terminal, not by the compiler.

It's done at several points:
compiler: source->exec
runtime: locale-dependent exec->output (and optional use of gettext)
terminal: output->display

To get to the point: what is the problem in mksh?
At which level does it fail?


Yes, in a perfect world we would need only one charset (and maybe only
one language and one locale). Of all the proposals to reach this
target, Unicode and UTF-8 seem the best solution.
But... for now, take care with locales and don't assume UTF-8,
or you will cause trouble for a lot of non-UTF-8 users.
Converting a locale (from non-UTF-8 to UTF-8) is simple for
English and a few European languages, but it is tedious work
for many users: it needs a "flag day" on which I would have to convert
all my files to UTF-8, or annotate every file with the right
encoding (most editors and tools understand such annotations).

I have never *ever* suggested that we only use one charset.  I'm only
suggesting that the *C locale* must be UTF-8 in order to allow for
full UTF-8 support.  Normal user locales can use whatever charset
they like.

(see the other mail: what does "full UTF-8" mean?)


Non-UTF-8 users won't be disadvantaged because the UTF-8 exec charset
will be recoded to their locale-specific output codeset, either by
libc or gettext.

I'm not sure I understand. Debian is moving all files to UTF-8
(manual pages, documentation, Debian control files, ...).
So I totally agree.
But wasn't that the point of the original problem?


The C locale is special in that normal users won't use it, but
system programs and code needing locale independence do use it.
Any program wanting to work correctly in a C locale must only use
ASCII or it *breaks*.  This means we are /de facto/ restricted to
ASCII unless we take special effort to work around the fact (and
this was the point of my l10n/i18n comments above).

Most programs do need to work correctly in a C locale, and so can't
use UTF-8 either as a source or exec charset.  This is a severe
limitation.

No. A "locale" is not really a charset. A program can use
any charset for input and output (note: most editors handle
different file charsets independently of the locale).
The problem is the terminal. If you print a non-ASCII char,
the terminal will get confused. That is the reason for "libncurses"
(though it is perhaps more oriented to controlling terminals than to charsets).

Debian's target is to support UTF-8 in all programs, but
the problem is that people connect to Debian machines from
outside Debian and also the contrary: I connect
from my Debian machine to other machines.

So a program which supports only UTF-8 could cause problems
for such users, and that is outside Debian's control.

In the long term I can imagine that UTF-8 will become the de facto
standard, but I think we should wait for other distributions
and vendors before making such a big jump.
For now, though, UTF-8 is nearly the default in Debian.

But if mksh doesn't work in "C", I'm very worried.
Are the problems on input or on output?


 A C.UTF-8 would
be a solution to this problem, and allow full use of the *existing*
UTF-8 string handling which all sources are built with, yet only a
tiny fraction dare to use.  Note that gettext is *completely disabled*
if used in a C locale, and this does additional mangling in addition
to the plain libc damage, resulting in *no output at all*!  (I would
need to double check that; this was the case when I last looked,
and the reason I had to abandon use of UTF-8 string literals.)
Use "en_US.UTF-8".

Why?  Did you actually understand the rationale I provided above?
I could use en_US.UTF-8, or any locale.  But the point is that the
code works in all locales *except* C.

Ah. This is strange (considering the huge list of locales).
Why doesn't it work in C?


"C.UTF-8" is a bad name. Locale "C" means "no locale, old behaviour,
for machines".

No.  "C" and "POSIX" mean the /default/ POSIX-specified locale.  And
there's nothing written in the standard that restricts that locale
to 7-bit ASCII as its codeset.  There are UNIX systems out there
right now using UTF-8 in their C locale.

No. The default locale is "". "C" is a precise locale which fulfils
precise rules. Yes, it can be UTF-8, but why does it matter?
(See the other mail: what do you need from UTF-8?)


Do we need to translate all strings in C.UTF-8 too?

Of course not.  We don't do any translation in the C locale.  The
only difference is the character encoding, which is backward
compatible with ASCII in any case.

"C" and "en" could have different translations. POSIX mandates the
output format in the "C" locale, usually by providing a printf-like
format string, so that it can be used in scripts.
I think users should use en_* or another language, and only scripts
should use "C".
So the recurring question: why do scripts need a UTF-8 locale? ;-)


Which alphabetic characters?  Which numeric characters?  Which
alphabetic order? etc. etc.  You see: it is difficult to create
a new locale, and people must understand the meaning of such a locale
(without reading the whole locale definition).

For a minimal locale it could just use strict numerical ordering.
It should probably copy what existing systems using UTF-8 C locales do.

Different languages use different Unicode ranges.


Regarding the standards conformance of using a UTF-8 C locale:
I've spent some time reading the standards (SUSv3), and see no reason
why C can't use UTF-8 as its default codeset and still remain strictly
conforming.
UTF-8 has a lot of characters (alphabetic, numeric, white).
The C locale requires that whitespace be only SPACE and TAB.

Where is this requirement?  Can you point me to the SUSv3 definition?

7.3.1:
"In the POSIX locale, only the <space> and <tab> shall be included."

OK. I confused "blank" with "white". Anyway, in 7.3.1 you see the
requirements for "C". So a UTF-8 C locale is OK, but which definition?
It needs to be simple (people don't want something en_US-like, because
of collation and other complex rules).

(...)

ciao
	cate


