
Re: What does charset in locale setting affect?



On Sun, Sep 02, 2012 at 11:11:56PM -0400, Dan B. wrote:
> Roger Leigh wrote:
> >On Sat, Sep 01, 2012 at 07:32:48PM -0400, Dan B. wrote:
> >...
> >
> >>Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
> >>different based on the charset portion of the locale setting?
> >
> >All of them, in short.
> >
> >When you run a terminal emulator such as xterm, it will get the
> >encoding to use inside the emulator using nl_langinfo(3).    ...
> 
> What about the virtual consoles?

Virtual consoles are slightly different.  Because they start up
/before/ you log in, they switch unicode mode on or off depending
on the default system locale (/etc/default/locale).  See
unicode_start_stop in /etc/init.d/console-screen.kbd.sh.  You can
switch them into unicode mode with unicode_start, which sends an
escape sequence to select the ISO-2022 UTF-8 charset.
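
As a rough sketch of what that escape sequence looks like (the function
names below are my own invention, not taken from the kbd scripts):
console_codes(4) documents ESC % G as "select UTF-8" and ESC % @ as
"select default".  These only mean anything on a Linux virtual console.

```shell
# Hypothetical helpers: emit the Linux console escape sequences that
# switch the character set handling (see console_codes(4)).
# '%%' is a literal '%' in a printf format string.
console_utf8_on()  { printf '\033%%G'; }   # ESC % G : select UTF-8
console_utf8_off() { printf '\033%%@'; }   # ESC % @ : select default
```

Calling console_utf8_on on a virtual console should put it into UTF-8
mode; console_utf8_off should switch it back.  This is an illustration
only, and for real use unicode_start/unicode_stop are the right tools.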

> Whether I choose a default system locale of UTF-8 or None (in the
> dialog for "dpkg-reconfigure locales"), and log out and log in (to
> make sure the shell has a chance to get fresh settings), then
> 
>   echo $'\xC2\xA2'
> 
> displays the same thing (the cent sign).

"None" might result in UTF-8 as a default.  Try ISO-8859-1 to
explicitly specify a non-Unicode locale.  Note that you'll
need to generate a suitable locale, e.g. en_GB.ISO-8859-1, with
locale-gen/localedef.
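
To make the difference concrete, here is a minimal sketch of how the
same two bytes look when read as ISO-8859-1 instead of UTF-8.  The
helper name is made up, and it assumes iconv and od are installed.  The
localedef line is commented out because it needs root.

```shell
# Hypothetical helper: treat stdin as ISO-8859-1, convert it to UTF-8,
# and dump the resulting bytes as a hex string.
latin1_view_hex() {
    iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# C2 A2 is the UTF-8 encoding of the cent sign (written in octal so it
# works with any POSIX printf, not just bash).
printf '\302\242' | latin1_view_hex; echo
# prints c382c2a2: two characters (U+00C2, U+00A2), not one cent sign

# Generating the locale itself (as root):
# localedef -i en_GB -f ISO-8859-1 en_GB.ISO-8859-1
```

Once the locale exists, "LANG=en_GB.ISO-8859-1 locale charmap" should
report ISO-8859-1 rather than UTF-8.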

> Is the virtual console supposed to follow the locale's character
> encoding?  If so, does something else (e.g., something in /etc/init.d/)
> need to be run to make a difference?

/etc/init.d/console-screen.kbd.sh as above.

> Actually, what I really want to know is how to revert the sorting of
> file names from ls (and Emacs dired listings) from the order caused
> by having "en_US" in LANG=en_US.UTF-8 back to the traditional (old)
> Unix order (e.g., what LANG=C would yield) without messing up all the
> UTF-8 support that's all over Linux now.

> First of all, can UTF-8 be combined with the "C" locale as in
> LANG=C.UTF-8?

Yes (and no).  You can certainly generate such a locale.  In fact, I'm
a strong proponent of having a C.UTF-8 locale as the default locale
in glibc.  However, right now a generated C.UTF-8 is not completely
compatible with a real C locale (i.e. one conformant with the C and
POSIX standards).  Hopefully that will be fixed in the future.

> Do I probably want something closer to LANG=en_US.UTF-8 LC_COLLATE=C
> (in order to reduce the amount of locale settings I'm overriding)?

Just set LC_COLLATE=C.  That way you keep the UTF-8 LC_CTYPE, but the
sort order is taken from C.  However, this will likely mis-sort any
character outside the ASCII range, since C is a 7-bit ASCII locale.
[Note: you probably do not want this!]  In general, I would advise
using the default collation for your locale, though in code it's
common to switch to C for locale-independent sorting.
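
A small sketch of the effect, using sort rather than ls so the input is
fixed (the helper name is hypothetical).  One thing to watch: LC_ALL,
if set, overrides LC_COLLATE, so the helper unsets it first.

```shell
# Hypothetical helper: sort stdin with C collation only, leaving every
# other locale category (LC_CTYPE etc.) as inherited from the environment.
# LC_ALL overrides LC_COLLATE, so it is unset in a subshell first.
sort_c_collate() {
    ( unset LC_ALL; LC_COLLATE=C sort )
}

printf 'b\nA\na\nB\n' | sort_c_collate
# prints A, B, a, b (one per line): all uppercase before any lowercase

printf 'z\n\303\251\nb\n' | sort_c_collate
# prints b, z, then e-acute (UTF-8 bytes C3 A9): non-ASCII lands after 'z'
```

The same applies to ls: "LC_COLLATE=C ls" in a shell where LC_ALL is
unset should give the traditional ordering while LANG stays untouched.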

> >When you run sed/grep, the encoding will affect how it processes the
> >text.
> 
> Are you sure about sed?
> 
> I tried probing how LANG= vs. LANG=en_US.UTF-8 affected whether
> the regular expression "[a-z]" matched "X".  Grep seems to be
> affected as expected, but sed never matched.  (That's on Squeeze.)

It's the same version in wheezy, so I would not expect a change here.
I'm not sure how [a-z] matches; I'd have to check whether it's
locale-independent.  In general, I'd use POSIX character classes like
[:alpha:], [:upper:] and [:lower:] (written inside a bracket
expression, e.g. [[:alpha:]]) to work properly in all locales.
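
For illustration, a minimal sketch with grep and sed (the helper name
is made up).  Note that the class has to sit inside a bracket
expression, hence the doubled brackets.

```shell
# Hypothetical helper: succeed only if the argument consists entirely
# of lowercase letters according to the current locale.
all_lower() { printf '%s\n' "$1" | grep -Eq '^[[:lower:]]+$'; }

all_lower abc && echo "abc: all lowercase"
all_lower X   || echo "X: not lowercase"

# The same class syntax works in sed:
printf 'abc\nX\n' | sed -n '/^[[:upper:]]/p'
# prints X
```

Unlike a range such as [a-z], whose meaning can depend on the collation
order, these classes are defined by the locale's LC_CTYPE.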


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux    http://people.debian.org/~rleigh/
 `. `'   schroot and sbuild  http://alioth.debian.org/projects/buildd-tools
   `-    GPG Public Key      F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800

