Re: What does charset in locale setting affect?

To: "Dan B." <danb@kempt.net>
Cc: "debian-user@lists.debian.org" <debian-user@lists.debian.org>
Subject: Re: What does charset in locale setting affect?
From: Roger Leigh <rleigh@codelibre.net>
Date: Mon, 3 Sep 2012 12:13:23 +0100
Message-id: <20120903111323.GI3198@codelibre.net>
In-reply-to: <50441FFC.7040507@kempt.net>
References: <50429B20.7030305@kempt.net> <20120902095315.GD3198@codelibre.net> <50441FFC.7040507@kempt.net>

On Sun, Sep 02, 2012 at 11:11:56PM -0400, Dan B. wrote:
> Roger Leigh wrote:
> >On Sat, Sep 01, 2012 at 07:32:48PM -0400, Dan B. wrote:
> >...
> >
> >>Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
> >>different based on the charset portion of the locale setting?
> >
> >All of them, in short.
> >
> >When you run a terminal emulator such as xterm, it will get the
> >encoding to use inside the emulator using nl_langinfo(3).    ...
> 
> What about the virtual consoles?

Virtual consoles are slightly different.  Because they start up
/before/ you log in, they switch unicode mode on or off depending
on the default system locale (/etc/default/locale).  See
unicode_start_stop in /etc/init.d/console-screen.kbd.sh.  You can
switch them into unicode mode with unicode_start, which sends an
escape sequence to select the ISO-2022 UTF-8 charset.
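
To make "depending on the default system locale" concrete, here is a
minimal Python sketch (not taken from the init script, which is shell
and may well decide differently).  It reads a LANG= line in the style
of /etc/default/locale and reports whether the charset part names
UTF-8.  All function names are hypothetical, chosen for illustration.

```python
def lang_from_default_locale(text):
    """Return the LANG value from /etc/default/locale-style text, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("LANG="):
            # The value may or may not be quoted.
            return line[len("LANG="):].strip("\"'")
    return None


def codeset_of(locale_name):
    """Return the charset part of language[_TERRITORY][.CODESET][@modifier]."""
    name = locale_name.split("@", 1)[0]   # drop any @modifier
    if "." not in name:
        return None                       # e.g. "C" or "en_US"
    return name.split(".", 1)[1]


def is_utf8(locale_name):
    """True if the locale name explicitly names a UTF-8 charset."""
    codeset = codeset_of(locale_name or "")
    return codeset is not None and codeset.lower().replace("-", "") == "utf8"


sample = '# example file contents\nLANG="en_US.UTF-8"\n'
print(is_utf8(lang_from_default_locale(sample)))
print(is_utf8("en_GB.ISO-8859-1"))
```

A name with no charset part (such as plain "C") is reported as not
UTF-8 here; that is a choice made for this sketch only.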

> Whether I choose a default system locale of UTF-8 or None (in the
> dialog for "dpkg-reconfigure locales"), and log out and log in (to
> make sure the shell has a chance to get fresh settings), then
> 
>   echo $'\xC2\xA2'
> 
> displays the same thing (the cent sign).

"None" might result in UTF-8 as a default.  Try ISO-8859-1 to
explicitly specify a non-unicode locale.  Note that you'll
need to generate a suitable locale e.g. en_GB.ISO-8859-1 with
locale-gen/localedef.
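
As an illustration of why those two bytes make a good probe, here is a
small Python sketch (not part of the original exchange).  The same
bytes decode to one character under UTF-8 but to two under
ISO-8859-1, so a console genuinely in ISO-8859-1 mode would be
expected to show something other than a lone cent sign, fonts
permitting.

```python
# The two bytes that  echo $'\xC2\xA2'  writes (ignoring the newline).
raw = b"\xC2\xA2"

# Interpreted as UTF-8: a single character, U+00A2 CENT SIGN.
as_utf8 = raw.decode("utf-8")

# Interpreted as ISO-8859-1: two characters, U+00C2 followed by U+00A2.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])
```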

> Is the virtual console supposed to follow the locale's character
> encoding?  If so, does something else (e.g., something in /etc/init.d/)
> need to be run to make a difference?

/etc/init.d/console-screen.kbd.sh as above.

> Actually, what I really want to know is how to revert the sorting of
> file names from ls (and Emacs dired listings) from the order caused
> by having "en_US" in LANG=en_US.UTF-8 back to the traditional (old)
> Unix order (e.g., what LANG=C would yield) without messing up all the
> UTF-8 support that's all over Linux now.

> First of all, can UTF-8 be combined with the "C" locale as in
> LANG=C.UTF-8?

Yes (and no).  You can certainly generate such a locale.  In fact, I'm
a strong proponent of having a C.UTF-8 locale as the default locale
in glibc.  However, right now if you generate it, it's not completely
compatible with a real C locale (i.e. one conformant with the C and
POSIX standards).  Hopefully that will change in the future.

> Do I probably want something closer to LANG=en_US.UTF-8 LC_COLLATE=C
> (in order to reduce the amount of locale settings I'm overriding)?

Just set LC_COLLATE=C.  That way you keep the UTF-8 LC_CTYPE, but the
sort order is taken from C.  However, this will likely mis-sort any
character outside the ASCII range, since C is a 7-bit ASCII locale.
[Note: you probably do not want this!]  In general, I would advise
using the default collation for your locale, though in code it's
common to switch to C for locale-independent sorting.
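
A rough Python sketch of what C collation does to a file listing,
assuming (as is the case with glibc) that the C locale compares
strings byte by byte.  The file names and the helper name are made up
for illustration.  Note where the accented name ends up.

```python
def c_collation_order(names):
    """Sort the way a byte-wise (C locale) comparison would.

    Comparing the UTF-8 encoded bytes mimics a strcmp-style ordering.
    """
    return sorted(names, key=lambda s: s.encode("utf-8"))


# Hypothetical directory contents.
names = ["banana", "Apple", "éclair", "apple", "Zebra", "_cache"]

for name in c_collation_order(names):
    print(name)
```

Uppercase names come first, then "_", then lowercase, and anything
outside ASCII lands at the very end regardless of its base letter.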

> >When you run sed/grep, the encoding will affect how it processes the
> >text.
> 
> Are you sure about sed?
> 
> I tried probing how LANG= vs. LANG=en_US.UTF-8 affected whether
> the regular expression "[a-z]" matched "X".  Grep seems to be
> affected as expected, but sed never matched.  (That's on Squeeze.)

It's the same version in wheezy, so I would not expect a change here.
I'm not sure how [a-z] matches--I'd have to check if it's locale-
independent.  In general, I'd use POSIX character classes inside a
bracket expression, like [[:alpha:]], [[:upper:]] and [[:lower:]],
to work properly in all locales.
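
A loose analogy in Python, offered only as a sketch: Python's re
module has no POSIX bracket classes and does not consult the locale
for ranges, so str.islower() stands in here for a class-based test
such as [[:lower:]].  It shows the general point that a literal range
covers only what it spells out, while a class follows the character
properties.

```python
import re

samples = ["x", "X", "é", "É"]

# Literal range: covers only the ASCII letters a through z.
ascii_range = [s for s in samples if re.fullmatch(r"[a-z]", s)]

# Property-based test, used here as a stand-in for [[:lower:]].
lower_class = [s for s in samples if s.islower()]

print(ascii_range)
print(lower_class)
```

This says nothing about how sed or grep on Squeeze treat [a-z]; that
still needs checking against the tools themselves.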

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux    http://people.debian.org/~rleigh/
 `. `'   schroot and sbuild  http://alioth.debian.org/projects/buildd-tools
   `-    GPG Public Key      F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800
