Bug#603914: Please drop non-UTF8 locales

To: Thorsten Glaser <tg@mirbsd.de>
Cc: 603914@bugs.debian.org
Subject: Bug#603914: Please drop non-UTF8 locales
From: Roger Leigh <rleigh@codelibre.net>
Date: Sun, 9 Jan 2011 23:48:35 +0000
Message-id: <20110109234835.GF11671@codelibre.net>
Reply-to: Roger Leigh <rleigh@codelibre.net>, 603914@bugs.debian.org
In-reply-to: <Pine.BSM.4.64L.1101092219210.17509@herc.mirbsd.org>
References: <Pine.BSM.4.64L.1011281721290.27885@herc.mirbsd.org> <20110108123254.GB25780@codelibre.net> <Pine.BSM.4.64L.1101092219210.17509@herc.mirbsd.org>

On Sun, Jan 09, 2011 at 10:21:50PM +0000, Thorsten Glaser wrote:
> Roger Leigh dixit:
> 
> >From my reading of the standards a UTF-8 C locale would be required
> >to behave identically to the existing ASCII C locale:
> >
> >• will consider all byte sequences valid
> 
> I think it wouldn’t (since UTF-8 mbrtowc/wcrtomb don’t work
> this way, and it can’t be done with “just” the POSIX API
> anyway because they aren’t allowed to not read any input
> byte when outputting; in MirBSD, I’ve added a sister function
> to mbrtowc which can do that), so not everything can
> be accepted in all situations.

If you are using the multibyte functions, then I agree these are special
cases.  To function correctly, they do require valid input, so they
would of course fail on invalid sequences in a UTF-8 C locale.  However,
they should fail in an ASCII C locale as well (I should test this), given
that the wide character representation is always UCS-4 on GNU/Linux
and a sequence encoded in e.g. latin1 wouldn't be valid UTF-8.
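
A rough sketch of how this could be tested, not a definitive answer.
The helper name decodes_in_locale() is made up for illustration, and
the locale name "C.UTF-8" is an assumption: it is not installed on
every system, so the helper reports that case separately.  What the
plain "C" locale does with bytes above 0x7f is exactly the open
question here, so the sketch makes no claim about that result.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode every character of `s` with mbrtowc()
 * under the LC_CTYPE locale `name`.
 * Returns  1 if the whole string decodes,
 *          0 if mbrtowc() rejects or cannot complete a sequence,
 *         -1 if the locale is not available on this system. */
int decodes_in_locale(const char *name, const char *s)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* initial conversion state */

    size_t left = strlen(s);
    int ok = 1;
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2) {
            ok = 0;                     /* invalid or incomplete sequence */
            break;
        }
        if (n == 0)
            break;                      /* decoded a NUL; nothing left to do */
        s += n;
        left -= n;
    }

    setlocale(LC_CTYPE, "C");           /* restore the default locale */
    return ok;
}
```

Calling decodes_in_locale("C", "\xe9t\xe9") with a latin1 string shows
what a given libc does in the ASCII C locale; the same call with
"C.UTF-8" should return 0 (or -1 if that locale is missing), because
0xe9 followed by 't' is not a valid UTF-8 sequence.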

I think the "all byte sequences valid" requirement applies mainly to
narrow character I/O, i.e. printf/puts etc. won't alter, drop or
otherwise mangle any non-7-bit-ASCII codes.  I think the intent was to
ensure 8-bit cleanliness in a 7-bit locale.  This naturally extends
to UTF-8.  I'm not sure that wide character support is implied here,
given that it implicitly requires correct byte sequences to function
where the narrow character I/O does not (all 8-bit codes are correct).
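
To illustrate what 8-bit cleanliness means for narrow I/O, here is a
minimal sketch.  The helper name narrow_printf_is_8bit_clean() is
hypothetical; it formats a string with "%s" in the "C" locale and
checks that the bytes come out unchanged.  It only exercises
snprintf(), on the assumption that printf/puts share the same
byte-oriented path.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format `in` with "%s" in the "C" locale and
 * report whether the output bytes are identical to the input bytes.
 * Returns 1 if identical, 0 otherwise (including on truncation). */
int narrow_printf_is_8bit_clean(const char *in)
{
    char out[64];

    setlocale(LC_ALL, "C");             /* the 7-bit locale under discussion */

    int n = snprintf(out, sizeof out, "%s", in);
    if (n < 0 || (size_t)n >= sizeof out)
        return 0;                       /* error, or input too long for buffer */

    /* No byte may have been altered, dropped or added. */
    return (size_t)n == strlen(in) && memcmp(out, in, (size_t)n) == 0;
}
```

Passing a latin1 string such as "\xe9t\xe9", or bytes that are not
valid in any common encoding such as "\xff\xfe\x80", should return 1:
the narrow functions copy the bytes without interpreting them.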

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.

Attachment: signature.asc
Description: Digital signature