
Re: What does charset in locale setting affect?



On Sat, Sep 01, 2012 at 07:32:48PM -0400, Dan B. wrote:
> In a locale setting such as en_US.UTF-8 (e.g., LANG=en_US.UTF-8),
> what exactly does the charset/character encoding part (UTF-8) affect?

This affects the character encoding that programs use for input
and output.  For example, if you want to print the character
‘á’ (Unicode code point U+00E1), a program running in a UTF-8 locale
outputs it as the byte sequence
  0xc3 0xa1
However, in a Latin-1 (ISO-8859-1) locale, it would be output as the
single byte
  0xe1
and in other encodings, it will be a different byte sequence yet
again.
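
To make the two byte sequences concrete, here is a minimal sketch of
both encodings in C.  The function names utf8_encode and latin1_encode
are made up for this illustration; they are not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point
 * (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx followed by 3 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode one code point as Latin-1 (ISO-8859-1): the first 256 code
 * points map directly to one byte, everything else is unrepresentable.
 * Returns 1 on success, 0 if the character does not exist in Latin-1. */
size_t latin1_encode(uint32_t cp, unsigned char *out)
{
    if (cp > 0xFF)
        return 0;
    out[0] = (unsigned char)cp;
    return 1;
}
```

Calling utf8_encode(0xE1, buf) fills buf with 0xc3 0xa1, while
latin1_encode(0xE1, buf) stores the single byte 0xe1.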

> Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
> different based on the charset portion of the locale setting?

All of them, in short.

When you run a terminal emulator such as xterm, it will determine the
encoding to use inside the emulator by calling nl_langinfo(3) with the
CODESET item, which returns the name of the character encoding used by
the locale.  The emulator then knows which encoding the programs running
inside it use, so it can display their output correctly and encode the
input it sends to them.  If the encoding were wrong, it would display
garbage.
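
As a rough sketch of that lookup (this is not xterm's actual code, and
the wrapper name locale_codeset is made up for the example), a program
can ask for the encoding of the user's locale like this:

```c
#include <langinfo.h>
#include <locale.h>

/* Return the name of the character encoding of the user's locale,
 * e.g. "UTF-8" or "ISO-8859-1".  The exact spelling of the name is
 * up to the C library. */
const char *locale_codeset(void)
{
    /* Adopt the locale from the environment (LANG, LC_ALL, LC_CTYPE).
     * If this fails, the program stays in the default "C" locale. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

Printing the returned string from main() under LANG=en_US.UTF-8 should
show the UTF-8 codeset name, provided that locale is installed.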

When you run sed/grep, the encoding affects how they process the
text: for example, which byte sequences count as a single character
when matching ‘.’ or a bracket expression.  It's therefore important to
use the same encoding in your files as you have set in your locale.
Before we had UTF-8, files in the old 8-bit encodings didn't necessarily
match your locale, and nothing in a file told you which encoding it was
supposed to be, so using UTF-8 everywhere has been a massive
improvement.
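
To illustrate why the same bytes are processed differently (this is only
a sketch, not how sed or grep are implemented, and count_chars is a name
made up here), the following function counts the characters in a byte
string according to the current locale's encoding:

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in a NUL-terminated byte string, interpreting
 * the bytes in the encoding of the current LC_CTYPE locale.  The caller
 * is expected to have called setlocale() beforehand.
 * Returns (size_t)-1 if the bytes are not valid in that encoding. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t n = 0;
    size_t left = strlen(s);

    memset(&st, 0, sizeof st);             /* initial conversion state */
    while (left > 0) {
        /* Length in bytes of the next character; NULL means we do not
         * need the decoded wide character itself. */
        size_t len = mbrtowc(NULL, s, left, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;             /* invalid or truncated sequence */
        if (len == 0)
            break;                         /* NUL character: end of string */
        s += len;
        left -= len;
        n++;
    }
    return n;
}
```

In a UTF-8 locale, the two bytes 0xc3 0xa1 count as one character; in a
Latin-1 locale the same two bytes are two characters, which is why a
pattern such as ‘.’ can match differently on identical input.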

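This is generally completely transparent.  For example, consider the
following C program:

#include <stdio.h>
#include <locale.h>

int main(void)
{
   /* Adopt the user's locale, and with it the encoding, from the
      environment. */
   setlocale(LC_ALL, "");
   /* A wide string is converted to the locale's encoding on output. */
   printf("%ls\n", L"á");
   return 0;
}

This will work correctly in any locale whose encoding can represent
‘á’.  On a GNU/Linux system, GCC by default treats the source file as
UTF-8 and stores the wide string as Unicode code points; the C library
then translates it to the user's locale encoding on output.  Note that
a narrow string such as "á" is not translated: its bytes are written
unchanged, so it only displays correctly in a UTF-8 locale.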
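
The translation to the locale's encoding can also be invoked directly.
The following sketch converts a single wide character with wcrtomb(3);
it assumes that wchar_t holds Unicode code points, as it does with glibc,
and the function name to_locale_bytes is made up for the example.

```c
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Convert one wide character to the encoding of the current LC_CTYPE
 * locale.  out must have room for MB_LEN_MAX bytes.  The caller is
 * expected to have called setlocale() beforehand.
 * Returns the number of bytes written, or 0 if the character cannot be
 * represented in the locale's encoding. */
size_t to_locale_bytes(wchar_t wc, char *out)
{
    mbstate_t st;
    size_t len;

    memset(&st, 0, sizeof st);             /* initial conversion state */
    len = wcrtomb(out, wc, &st);
    return len == (size_t)-1 ? 0 : len;
}
```

With U+00E1 as input, this should yield 0xc3 0xa1 in a UTF-8 locale and
the single byte 0xe1 in a Latin-1 locale.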

Nowadays, there's little reason to use any encoding other than UTF-8.
UTF-8 can encode every Unicode character, and the character sets of
virtually all the other encodings are subsets of Unicode, so they are
only present for legacy and compatibility reasons.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux    http://people.debian.org/~rleigh/
 `. `'   schroot and sbuild  http://alioth.debian.org/projects/buildd-tools
   `-    GPG Public Key      F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800

