Re: default character encoding for everything in debian

To: debian-devel@lists.debian.org
Subject: Re: default character encoding for everything in debian
From: Roger Leigh <rleigh@codelibre.net>
Date: Wed, 12 Aug 2009 11:30:50 +0100
Message-id: <20090812103050.GA7714@codelibre.net>
In-reply-to: <20090812075649.GQ5487@const.famille.thibault.fr>
References: <20090811183800.GE5487@const.famille.thibault.fr> <200908111940.n7BJeZQO067901@neskaya.eckenfels.net> <20090811202423.GA31394@wavehammer.waldi.eu.org> <4A825B32.2000009@debian.org> <20090812075649.GQ5487@const.famille.thibault.fr>

On Wed, Aug 12, 2009 at 09:56:49AM +0200, Samuel Thibault wrote:
> Giacomo A. Catenazzi, le Wed 12 Aug 2009 08:03:30 +0200, a écrit :
> > Bastian Blank wrote:
> > > On Tue, Aug 11, 2009 at 09:40:35PM +0200, Bernd Eckenfels wrote:
> > >> In article <20090811183800.GE5487@const.famille.thibault.fr> you wrote:
> > >>> Not necessarily.  Any sane implementation should just use wchar_t
> > >> Which could be UTF16 and therefore still has complicated length semantics.
> > > 
> > > No, wchar_t is UCS-4 (or UCS-2 in esoteric implementations like
> > > Windows).
> > 
> > No wchar_t is locale dependent (per POSIX).
> 
> What do you mean?  The compiler can't know the locale in advance for
> the width and endianness.  The value might depend on the locale, yes,
> but that's not a problem as long as you convert into UTF-8 before
> communicating with other applications.
> 
> On sane systems (Debian systems are), it's just always UCS-4.

Specifically, __STDC_ISO_10646__ is defined to indicate that wchar_t
is always UCS-4 in all locales.

> > BTW on gcc:
> > 
> > -fwide-exec-charset=charset
> >     Set the wide execution character set, used for wide string and
> > character constants.
> 
> It hurts when I shoot myself in the foot.

This feature of GCC is one of the more obscure areas of locale
handling.  How does the encoding of strings at the level of
individual translation units work with a single per-process
global locale and C formatted I/O?  Curious minds would like to
know!

> > The default is UTF-32 or UTF-16, whichever corresponds to the width of
> > wchar_t.
> 
> This documentation is bogus BTW.  It should read "UCS-4 or UCS-2".

It's "strictly" correct according to the standard.  See
http://en.wikipedia.org/wiki/UTF-32/UCS-4 for an overview.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
