Re: BoF at Debconf5 about glibc locale file format

To: debian-glibc@lists.debian.org, Petter Reinholdtsen <pere@hungry.com>
Subject: Re: BoF at Debconf5 about glibc locale file format
From: Denis Barbier <barbier@linuxfr.org>
Date: Mon, 27 Jun 2005 22:45:34 +0200
Message-id: <20050627204534.GA7661@linuxfr.org>
Mail-followup-to: barbier@linuxfr.org, debian-glibc@lists.debian.org, Petter Reinholdtsen <pere@hungry.com>
Reply-to: barbier@linuxfr.org, debian-glibc@lists.debian.org, Petter Reinholdtsen <pere@hungry.com>
In-reply-to: <20050531200517.GB6783@linuxfr.org>
References: <20050531200517.GB6783@linuxfr.org>

On Tue, May 31, 2005 at 10:05:17PM +0200, I wrote:
> Hi,
> 
> I just submitted a proposal to give a tutorial about glibc locale file
> format at Debconf5.

Here is a rough overview of the proposed talk.  I will convert it
from mgp to some LaTeX format, as my mgp skills seem to be too
limited.  Please let me know if you believe that some issues
should be discussed or discarded, or if you have any other comments.
As mentioned before, co-authors are welcome if someone wants to
talk about a specific part.

Denis

%deffont "standard"   xfont "serif", tfont "standard.ttf"
%deffont "thick"      xfont "sans-serif", tfont "arial.ttf"
%deffont "typewriter" xfont "monospace", tfont "typewriter.ttf"
%default 1 center, size 5, fore "yellow", font "standard"
%default 2 leftfill, size 4, fore "white", font "standard"
%tab 1 size 5, prefix "  ", icon box "green" 30
%tab 2 size 4, prefix "      ", icon arc "yellow" 30
%tab 3 size 3, prefix "            ", icon delta3 "white" 20
%tab category size 4, prefix "  ", icon box "green" 30, font "typewriter"
%tab def size 4, prefix "            "
%tab lst size 3, font "typewriter", prefix 10
%tab cmt size 4, prefix "  ", icon box "green" 30
%page
GNU libc locale data format
%nodefault, center, size 5
Denis Barbier <barbier@debian.org>
Debconf 5

%page
POSIX locale categories

%font "typewriter", size 3
  http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

%size 4
&category LC_COLLATE
&def Collation order
&category LC_CTYPE
&def Character classification and case conversion
&category LC_MESSAGES
&def Formats of informative and diagnostic messages and interactive responses.
&category LC_MONETARY
&def Monetary formatting
&category LC_NUMERIC
&def Numeric, non-monetary formatting
&category LC_TIME
&def Date and time formats

%page
Extra categories from ISO 14652 draft

#  New drafts contain LC_VERSIONS but glibc is based on older drafts
#  with LC_IDENTIFICATION.  Note also that the meaning of 'category'
#  keywords in this section is plain wrong, it should
#    a. list all present (and only present) categories
#    b. define the specification to which this category is conforming.
#       Valid values are posix:1993 and i18n:1998 (or any later
#       drafts)
&category LC_VERSIONS LC_IDENTIFICATION
&def Versions and status of categories
&category LC_ADDRESS
&def Format of postal addresses
&category LC_MEASUREMENT
&def Information on measurement system
&category LC_NAME
&def Format of writing personal names
&category LC_PAPER
&def Paper format
&category LC_TELEPHONE
&def Format for telephone numbers, and other telephone information

%page
POSIX locale data specifications

%leftfill, size 4
	One or more locale category source definitions (LC_*)
%leftfill, size 4
	A category source definition contains either the definition of a category or a 'copy' directive
%leftfill, size 4
	Lines beginning with the comment character are ignored
%leftfill, size 4
	A trailing escape character is a continuation character
%leftfill, size 4
	All characters shall be represented using symbolic names (<name>)
%leftfill, size 4
	They may also be represented by the characters themselves or by numeric constants (not portable)
%leftfill, size 4
	Each line contains an identifier, followed by zero or more operands
%leftfill, size 4
	Strings shall be enclosed in double-quotes.
%leftfill, size 4
	When a keyword is followed by more than one operand, the operands shall be separated by semicolons

%page
Old locale data format in GNU libc 2.0

&lst escape_char     /
&lst comment_char    %
&lst repertoiremap mnemonic.ds
&lst 
&lst % Finnish language locale for Finland
&lst ...
&lst LC_MONETARY
&lst int_curr_symbol      "<F><I><M><SP>"
&lst currency_symbol      "<m><k>"
&lst mon_decimal_point    "<,>"

%size 4
From /usr/share/i18n/repertoiremaps/mnemonic.ds

&lst <SP>                   <U0020> SPACE
&lst <A>                    <U0041> LATIN CAPITAL LETTER A
&lst <B>                    <U0042> LATIN CAPITAL LETTER B
&lst <C>                    <U0043> LATIN CAPITAL LETTER C

%page
New locale data format since GNU libc 2.1

New format, inspired by drafts of ISO14652
	Predefined symbolic character names: <Uxxxx> and <Uxxxxxxxx>
	Ellipsis for ranges, <U0030>..<U0039>, optionally with a step, <U0100>..(2)..<U0136>
	Transliteration
	New categories

&lst LC_MONETARY
&lst int_curr_symbol      "<U0045><U0055><U0052><U0020>"
&lst currency_symbol      "<U20AC>"
&lst mon_decimal_point    "<U002C>"

	Less readable and more error-prone: we need dedicated tools!
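
As an illustration of what such a tool could do, here is a minimal sketch that decodes a string of <Uxxxx> symbolic names into UTF-8 so that a human can read it.  The function names utf8_encode and decode_symbolic are made up for this example, and the parser is deliberately lenient (it relies on strtoul); it is not how localedef itself works.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Encode one code point as UTF-8.  Returns the number of bytes
   written (1 to 4), or 0 if cp is beyond U+10FFFF.
   Surrogates are not rejected in this sketch. */
size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char) (0xE0 | (cp >> 12));
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (char) (0xF0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode a sequence such as "<U0045><U0055>" into UTF-8.
   Returns 0 on success, -1 on malformed input or if dst is too small.
   Only <U...> names are understood; old-style names like <m> fail. */
int decode_symbolic(const char *src, char *dst, size_t dstlen)
{
    size_t used = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char *end;
        char tmp[4];
        unsigned long cp;
        size_t n;

        if (src[0] != '<' || src[1] != 'U'
            || !isxdigit((unsigned char) src[2]))
            return -1;
        cp = strtoul(src + 2, &end, 16);    /* hexadecimal digits */
        if (*end != '>')
            return -1;
        n = utf8_encode(cp, tmp);
        if (n == 0 || used + n + 1 > dstlen)
            return -1;
        memcpy(dst + used, tmp, n);
        used += n;
        src = end + 1;
    }
    if (used + 1 > dstlen)
        return -1;
    dst[used] = '\0';
    return 0;
}
```

With this sketch, decoding the int_curr_symbol value above gives "EUR " and the currency_symbol value gives the UTF-8 encoding of the euro sign.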

%page
Overview of /usr/share/i18n/locales/fi_FI

&lst escape_char /
&lst comment_char %

&cmt The slash is an escape character (default is backslash \\)
&cmt Comments begin with a percent sign (default is number sign #)

&lst % Finnish language locale for Finland
&lst % sorting according to SFS 4600 (1986-06-09)
&lst ...
&lst % Application: general
&lst % Users: general
&lst % Charset: ISO-8859-1
&lst % Distribution and use is free, also
&lst % for commercial purposes.
&lst %
&lst % Useful sources:
&lst %   Locale info for Finnish in Finland
&lst %     http://std.dkuug.dk/cultreg/registrations/narrative/fi_FI,_1.0

%page
Locale info for Finnish in Finland
  http://std.dkuug.dk/cultreg/registrations/narrative/fi_FI,_1.0

Ordering of characters:
     a,  b,  c,  d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s,
     s-caron, t, u, v, w, x, y, z, z-caron, å ,ä, ö

%page
LC_IDENTIFICATION

	This category appeared in old ISO 14652 drafts and has since been replaced by LC_VERSIONS, but GNU libc locale data files have not been updated.
	The "category" field should define what specification the category is claiming conformance to.  Valid values are "posix:1993" and "i18n:1998" (referring to ISO 14652 drafts); the year may be changed to indicate updates.
	But GNU localedef handles only its own format!

%page
LC_IDENTIFICATION (cont'd)

Example: fi_FI

&lst LC_IDENTIFICATION
&lst title      "Finnish locale for Finland"
&lst source     "RAP"
&lst address    "Sankt J<U00F8>rgens Alle 8, DK-1615 K<U00F8>benhavn V, Danmark"
&lst ...
&lst revision   "1.0"
&lst date       "2000-06-29"
&lst %
&lst category  "fi_FI:2000";LC_IDENTIFICATION
&lst category  "fi_FI:2000";LC_CTYPE
&lst category  "fi_FI:2000";LC_COLLATE
&lst category  "fi_FI:2000";LC_TIME
&lst category  "fi_FI:2000";LC_NUMERIC
&lst category  "fi_FI:2000";LC_MONETARY
&lst category  "fi_FI:2000";LC_MESSAGES
&lst category  "fi_FI:2000";LC_PAPER
&lst category  "fi_FI:2000";LC_NAME
&lst category  "fi_FI:2000";LC_ADDRESS
&lst category  "fi_FI:2000";LC_TELEPHONE
&lst END LC_IDENTIFICATION

%page
LC_CTYPE

	Predefined classes:
		upper lower alpha alnum (POSIX) digit space cntrl punct graph print xdigit blank outdigit (ISO14652)
	Predefined mappings:
		toupper tolower tosymmetric (ISO14652)
	User defined classes:
		charclass (POSIX) class (ISO14652)
	User defined mappings:
		charconv (POSIX) map (ISO14652)
	Transliteration

%page
LC_CTYPE (cont'd)

Example: i18n

&lst upper /
&lst % TABLE 1 BASIC LATIN/
&lst    <U0041>..<U005A>;/
&lst % TABLE 2 LATIN-1 SUPPLEMENT/
&lst    <U00C0>..<U00D6>;<U00D8>..<U00DE>;/
&lst % TABLE 3 LATIN EXTENDED-A/
&lst    <U0100>..(2)..<U0136>;/
&lst    <U0139>..(2)..<U0147>;/
&lst ...
&lst toupper /
&lst    (<U0061>,<U0041>);(<U0062>,<U0042>);(<U0063>,<U0043>);(<U0064>,<U0044>);/
&lst    (<U0065>,<U0045>);(<U0066>,<U0046>);(<U0067>,<U0047>);(<U0068>,<U0048>);/
&lst ...
&lst class "combining"; /
&lst    <U0300>..<U034F>;<U0360>..<U036F>;<U0483>..<U0486>;<U0488>..<U0489>;/
&lst    <U0591>..<U05A1>;<U05A3>..<U05B9>;<U05BB>..<U05BD>;<U05BF>;/


%page
LC_CTYPE (cont'd)

Example: ja_JP

&lst charclass   jspace;jhira;jkata;jkanji;jdigit
&lst 
&lst charconv    tojhira;tojkata
&lst 
&lst jspace  <U3000>
&lst 
&lst jdigit  <UFF10>;<UFF11>;<UFF12>;<UFF13>;<UFF14>;/
&lst         <UFF15>;<UFF16>;<UFF17>;<UFF18>;<UFF19>
&lst ...
&lst tojhira (<U30A1>,<U3041>);(<U30A2>,<U3042>);(<U30A3>,<U3043>);/
&lst         (<U30A4>,<U3044>);(<U30A5>,<U3045>);(<U30A6>,<U3046>);/
&lst         (<U30A7>,<U3047>);(<U30A8>,<U3048>);(<U30A9>,<U3049>);/
&lst ...
&lst tojkata (<U3041>,<U30A1>);(<U3042>,<U30A2>);(<U3043>,<U30A3>);/
&lst         (<U3044>,<U30A4>);(<U3045>,<U30A5>);(<U3046>,<U30A6>);/
&lst         (<U3047>,<U30A7>);(<U3048>,<U30A8>);(<U3049>,<U30A9>);/

%page
LC_CTYPE (cont'd)

Example: fa_IR

&lst % Persian uses the alternate digits U+06F0..U+06F9
&lst outdigit <U06F0>..<U06F9>
&lst 
&lst % This is used in the scanf family of functions to read Persian numbers
&lst % using "%Id" and such.
&lst map to_inpunct; /
&lst   (<U0030>,<U06F0>); /
&lst   (<U0031>,<U06F1>); /

TIMTOWTDI!

%page
LC_CTYPE (cont'd)

Example: da_DK

&lst LC_CTYPE
&lst copy "i18n"
&lst 
&lst translit_start
&lst 
&lst include "translit_combining";""
&lst 
&lst % Danish.
&lst % LATIN CAPITAL LETTER A WITH RING ABOVE.
&lst <U00C5> "<U0041><U030A>";"<U0041><U0041>"
&lst % LATIN SMALL LETTER A WITH RING ABOVE.
&lst <U00E5> "<U0061><U030A>";"<U0061><U0061>"
&lst 
&lst translit_end
&lst 
&lst END LC_CTYPE

%page
LC_MONETARY

&lst LC_MONETARY
&lst int_curr_symbol      "<U0045><U0055><U0052><U0020>"
&lst currency_symbol      "<U20AC>"
&lst mon_decimal_point    "<U002C>"
&lst mon_thousands_sep    "<U00A0>"
&lst mon_grouping         3;3
&lst positive_sign        ""
&lst negative_sign        "<U002D>"
&lst int_frac_digits      2
&lst frac_digits          2
&lst % 1 if currency symbol precedes amount for positive values
&lst p_cs_precedes        0
&lst % 1 if a space separates currency symbol from the positive values
&lst p_sep_by_space       2
&lst % 1 if currency symbol precedes amount for negative values
&lst n_cs_precedes        0
&lst % 1 if a space separates currency symbol from the negative values
&lst n_sep_by_space       2
&lst p_sign_posn          1
&lst n_sign_posn          1
&lst END LC_MONETARY

%page
LC_NUMERIC

Example: ta_IN

&lst LC_NUMERIC
&lst decimal_point        "<U002E>"
&lst thousands_sep        "<U002C>"
&lst grouping             3;2
&lst END LC_NUMERIC

    1234567.89 --> 12,34,567.89
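
The grouping rule can be sketched as follows: group sizes are consumed from the right, and the last size is reused for all remaining digits.  This is only an illustration; group_digits is a made-up helper, it handles the integer part only, and it ignores the special value -1 (no further grouping) that a locale may use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Insert sep into a string of digits according to the group sizes.
   groups[0] applies to the rightmost group; the last entry repeats.
   Returns 0 on success, -1 on bad arguments or if out is too small. */
int group_digits(const char *digits, const int *groups, size_t ngroups,
                 char sep, char *out, size_t outlen)
{
    size_t len, i, g = 0, o = 0;
    int left;   /* digits still to be placed in the current group */

    assert(digits != NULL && groups != NULL && out != NULL);
    if (ngroups == 0 || groups[0] <= 0 || outlen == 0)
        return -1;
    len = strlen(digits);
    left = groups[0];

    /* Build the result reversed, walking the digits right to left. */
    for (i = len; i > 0; i--) {
        if (left == 0) {
            if (g + 1 < ngroups)
                g++;                /* otherwise reuse the last size */
            if (groups[g] <= 0 || o + 1 >= outlen)
                return -1;
            out[o++] = sep;
            left = groups[g];
        }
        if (o + 1 >= outlen)
            return -1;
        out[o++] = digits[i - 1];
        left--;
    }
    out[o] = '\0';

    /* Reverse in place to restore the reading order. */
    for (i = 0; i < o / 2; i++) {
        char t = out[i];
        out[i] = out[o - 1 - i];
        out[o - 1 - i] = t;
    }
    return 0;
}
```

With the sizes {3, 2} and a comma as separator, "1234567" becomes "12,34,567"; with the single size {3} it becomes "1,234,567".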

%page
LC_TIME

	Highly controversial:
		Changes in ISO 14652 are incompatible with POSIX
		Additions in ISO 14652 are too complex
	ISO 14652 promotes solutions which are not in use

Example: abday
	POSIX: first string corresponds to Sunday, then Monday, etc.
	ISO 14652: first string corresponds to the first day of the week

GNU libc developers have no intention of following ISO 14652 there.

%page
LC_TIME
Definition of the first day of the week in ISO 14652

%leftfill
week         Shall be used to define the number of days in a
             week, which is the first weekday - the first
             weekday has the value 1, and which week is to be
             considered the first in a year. The first operand
             is an integer specifying the number of days in the
             week, The second operand is an integer specifying
             the gregorian date in the format YYYYMMDD with a
             leading <hyphen-minus> if before Christ. The third
             operand is an integer specifying the weekday number
             to be contained in the first week of the year. If
             the keyword is not specified the values are taken
             as 7,  19971130 (a Sunday), and 7 (Saturday),
             respectively. ISO 8601 conforming applications
             should use the values 7, 19971201 (a Monday), and 4
             (Thursday), respectively. 

%page
LC_TIME (cont'd)

Example: fr_FR

&lst #  "dim";"lun";"mar";"mer";"jeu";"ven";"sam"
&lst abday   "<U0064><U0069><U006D>";"<U006C><U0075><U006E>";/
&lst         "<U006D><U0061><U0072>";"<U006D><U0065><U0072>";/
&lst         "<U006A><U0065><U0075>";"<U0076><U0065><U006E>";/
&lst         "<U0073><U0061><U006D>"
&lst #  "dimanche";"lundi";"mardi";"mercredi";"jeudi";"vendredi";"samedi"
&lst day     "<U0064><U0069><U006D><U0061><U006E><U0063><U0068><U0065>";/
&lst         "<U006C><U0075><U006E><U0064><U0069>";/
&lst         "<U006D><U0061><U0072><U0064><U0069>";/
&lst         "<U006D><U0065><U0072><U0063><U0072><U0065><U0064><U0069>";/
&lst         "<U006A><U0065><U0075><U0064><U0069>";/
&lst         "<U0076><U0065><U006E><U0064><U0072><U0065><U0064><U0069>";/
&lst         "<U0073><U0061><U006D><U0065><U0064><U0069>"

%page
LC_TIME (cont'd)

Example: fr_FR

&lst abmon   "<U006A><U0061><U006E>";"<U0066><U00E9><U0076>";/
&lst         "<U006D><U0061><U0072>";"<U0061><U0076><U0072>";/
&lst         "<U006D><U0061><U0069>";"<U006A><U0075><U006E>";/
&lst         "<U006A><U0075><U0069>";"<U0061><U006F><U00FB>";/
&lst         "<U0073><U0065><U0070>";"<U006F><U0063><U0074>";/
&lst         "<U006E><U006F><U0076>";"<U0064><U00E9><U0063>"
&lst mon     "<U006A><U0061><U006E><U0076><U0069><U0065><U0072>";/
&lst         "<U0066><U00E9><U0076><U0072><U0069><U0065><U0072>";/
&lst         "<U006D><U0061><U0072><U0073>";/
&lst         "<U0061><U0076><U0072><U0069><U006C>";/
&lst         "<U006D><U0061><U0069>";/
&lst         "<U006A><U0075><U0069><U006E>";/
&lst         "<U006A><U0075><U0069><U006C><U006C><U0065><U0074>";/
&lst         "<U0061><U006F><U00FB><U0074>";/
&lst         "<U0073><U0065><U0070><U0074><U0065><U006D><U0062><U0072><U0065>";/
&lst         "<U006F><U0063><U0074><U006F><U0062><U0072><U0065>";/
&lst         "<U006E><U006F><U0076><U0065><U006D><U0062><U0072><U0065>";/
&lst         "<U0064><U00E9><U0063><U0065><U006D><U0062><U0072><U0065>"


%page
LC_TIME (cont'd)

Example: fr_FR

&lst # "%a %d %b %Y %T %Z"
&lst d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020>/
&lst <U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
&lst # "%d.%m.%Y"
&lst d_fmt   "<U0025><U0064><U002E><U0025><U006D><U002E><U0025><U0059>"
&lst # "%T"
&lst t_fmt   "<U0025><U0054>"
&lst am_pm   "";""
&lst t_fmt_ampm ""
&lst # "%a %b %e %H:%M:%S %Z %Y"
&lst date_fmt "<U0025><U0061><U0020><U0025><U0062><U0020><U0025><U0065>/
&lst <U0020><U0025><U0048><U003A><U0025><U004D><U003A><U0025><U0053><U0020>/
&lst <U0025><U005A><U0020><U0025><U0059>"

%size 3
  $ date +"%a %b %e %H:%M:%S %Z %Y"
  ven jun 17 23:50:03 CEST 2005

%size 4
Not really a French format!

%page
LC_TIME (cont'd)
Most common date specifiers

See strftime(3) for full list

%leftfill, size 3
  %a abday
  %A day
  %b abmon
  %B mon
  %d the day of the month as a decimal number (range 01 to 31)
  %e like %d, but a leading zero is replaced by a space
  %H the hour as a decimal number using a 24-hour clock (range 00 to 23)
  %m the month as a decimal number (range 01 to 12)
  %M the minute as a decimal number (range 00 to 59)
  %p AM/PM string
  %S the second as a decimal number (range 00 to 60)
  %T the time in 24-hour notation (%H:%M:%S)
  %Y the year as a decimal number including the century
  %Z the time zone name or abbreviation
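
To try these specifiers, a small sketch around strftime(3) may help.  The helper name format_local is an assumption made for this example; it fills a struct tm from calendar fields, lets mktime(3) compute the weekday, and then formats the result.  Since it never calls setlocale(), names such as %a come out in the C locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Format a local date and time with a strftime() format string.
   Returns 0 on success, -1 on failure.  Note that strftime() also
   returns 0 for a legitimately empty result; this sketch treats
   that case as a failure too. */
int format_local(const char *fmt, int year, int month, int day,
                 int hour, int min, int sec, char *out, size_t outlen)
{
    struct tm tm;

    assert(fmt != NULL && out != NULL);
    memset(&tm, 0, sizeof tm);
    tm.tm_year = year - 1900;   /* years since 1900 */
    tm.tm_mon = month - 1;      /* months are counted from 0 */
    tm.tm_mday = day;
    tm.tm_hour = hour;
    tm.tm_min = min;
    tm.tm_sec = sec;
    tm.tm_isdst = -1;           /* let mktime() decide about DST */

    /* mktime() normalizes the fields and fills in tm_wday. */
    if (mktime(&tm) == (time_t) -1)
        return -1;
    if (strftime(out, outlen, fmt, &tm) == 0)
        return -1;
    return 0;
}
```

For 2005-06-17 23:50:03, the d_fmt value "%d.%m.%Y" shown earlier gives "17.06.2005" and "%T" gives "23:50:03".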

%page
LC_MESSAGES

Example: fi_FI

&lst LC_MESSAGES
&lst #  "^[KkJjYy].*"
&lst yesexpr  "<U005E><U005B><U004B><U006B><U004A><U006A>/
&lst <U0059><U0079><U005D><U002E><U002A>"
&lst #  "^[NnEe].*"
&lst noexpr   "<U005E><U005B><U004E><U006E><U0045><U0065>/
&lst <U005D><U002E><U002A>"
&lst END LC_MESSAGES

Note: the trailing ".*" is useless, since the expression is only anchored at the start
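
A quick way to check this is to compile both variants with the POSIX regcomp(3) interface and compare the results.  The helper name matches_expr is made up for this sketch; a real program would rather obtain the expression from the current locale, for instance with nl_langinfo(YESEXPR).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Match an answer against an extended regular expression such as
   yesexpr or noexpr.  Returns 1 on match, 0 on no match, and -1 if
   the expression does not compile. */
int matches_expr(const char *expr, const char *answer)
{
    regex_t re;
    int rc;

    assert(expr != NULL && answer != NULL);
    if (regcomp(&re, expr, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, answer, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Both "^[KkJjYy].*" and "^[KkJjYy]" accept "joo" and "Kylla" and reject "ei", because only the first character decides the match.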

%page
LC_COLLATE

ISO 14651
   http://www.open-std.org/jtc1/sc22/wg20/docs/14651.html (old draft)
Unicode Collation Algorithm
   http://www.unicode.org/reports/tr10/
Application for Tibetan script
   http://www.columbia.edu/~ph2046/iats/it/Chilton_slides.pdf

Key point: sequences of characters are replaced by other sequences
of characters so that strcmp() returns the expected result.

%page
LC_COLLATE

Example: one level
  A <- 1   B <- 2  C <- 3 ... Y <- 25  Z <- 26

  C A R    <  C A T
  3 1 18      3 1 20

Handling character case

  Two solutions:
    Interleave lowercase and uppercase characters.  Does not scale!
    Add another level

          L1     L2             L1     L2
        +----+  +---+         +----+  +---+
   car: 3 1 18  1 1 1    cat: 3 1 20  1 1 1
   CAR: 3 1 18  2 2 2    CAT: 3 1 20  2 2 2
   Car: 3 1 18  2 1 1    Cat: 3 1 20  2 1 1

    car < Car < CAR < cat < Cat < CAT
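
The two-level scheme can be sketched in a few lines of C.  The helper name make_key is an assumption for this example, it handles ASCII letters only, and all weights are shifted by one compared to the table above (a is 2, lowercase is 2, uppercase is 3) so that 1 can separate the levels and 0 can terminate the key, which lets strcmp() compare keys directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build a two-level sort key for a word made of ASCII letters:
   level 1 weights (letter identity), a separator byte 1, then
   level 2 weights (case).  Returns 0 on success, -1 otherwise. */
int make_key(const char *word, char *key, size_t keylen)
{
    size_t n, i;

    assert(word != NULL && key != NULL);
    n = strlen(word);
    if (2 * n + 2 > keylen)
        return -1;
    for (i = 0; i < n; i++) {
        char c = word[i];

        if (c >= 'a' && c <= 'z') {
            key[i] = (char) (c - 'a' + 2);  /* level 1: a=2 ... z=27 */
            key[n + 1 + i] = 2;             /* level 2: lowercase */
        } else if (c >= 'A' && c <= 'Z') {
            key[i] = (char) (c - 'A' + 2);  /* same level 1 weight */
            key[n + 1 + i] = 3;             /* level 2: uppercase */
        } else {
            return -1;                      /* not an ASCII letter */
        }
    }
    key[n] = 1;                 /* level separator, below any weight */
    key[2 * n + 1] = '\0';
    return 0;
}
```

Comparing the keys of two words with strcmp() then yields the order shown above: the case level only matters when the letters are the same.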

%page
LC_COLLATE

Example:
&lst LC_COLLATE
&lst collating-symbol <MIN>
&lst collating-symbol <CAP>
&lst collating-symbol <a>
&lst collating-symbol <c>
&lst collating-symbol <r>
&lst collating-symbol <t>
&lst 
&lst %  Level 2
&lst <MIN>   % 2
&lst <CAP>   % 3
&lst 
&lst %  Level 1
&lst <a>     % 2
&lst <c>     % 3
&lst <r>     % 4
&lst <t>     % 5

%page
LC_COLLATE (cont'd)

&lst order_start forward;forward
&lst <U0061> <a>;<MIN>
&lst <U0041> <a>;<CAP>
&lst <U0063> <c>;<MIN>
&lst <U0043> <c>;<CAP>
&lst <U0072> <r>;<MIN>
&lst <U0052> <r>;<CAP>
&lst <U0074> <t>;<MIN>
&lst <U0054> <t>;<CAP>
&lst order_end
&lst END LC_COLLATE

%page
LC_COLLATE (cont'd)

Test program to display collation weights: tst-show-weights.c
&lst #include <stdio.h>
&lst #include <string.h>
&lst #include <locale.h>
&lst int main(int argc, char *argv[])
&lst {
&lst         char buf[4096];
&lst         int i;
&lst         unsigned char *cp;
&lst         setlocale(LC_COLLATE, "");
&lst         for (i = 1; i < argc; i++) {
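&lst                 /* print the word, then each byte of its collation key */
&lst                 printf("%s: ", argv[i]);
&lst                 strxfrm(buf, argv[i], sizeof buf);
&lst                 for (cp = (unsigned char *) buf; *cp != 0; cp++)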
&lst                         printf(" %u", *cp);
&lst                 printf("\\n");
&lst         }
&lst         return 0;
&lst }

  $ LC_ALL=TEST tst-show-weights car cat CAR CAT Car Cat
  car:  3 2 4 1 2 2 2
  cat:  3 2 5 1 2 2 2
  CAR:  3 2 4 1 3 3 3
  CAT:  3 2 5 1 3 3 3
  Car:  3 2 4 1 3 2 2
  Cat:  3 2 5 1 3 2 2

    car < Car < CAR < cat < Cat < CAT

%page
LC_COLLATE (cont'd)

Handling of diacritics

&lst LC_COLLATE
&lst collating-symbol <c>
&lst collating-symbol <e>
&lst collating-symbol <o>
&lst collating-symbol <t>
&lst collating-symbol <BASE>
&lst collating-symbol <CIRC>
&lst collating-symbol <ACUTE>
&lst 
&lst <BASE>  % 2
&lst <CIRC>  % 3
&lst <ACUTE> % 4
&lst <c>     % 2
&lst <e>     % 3
&lst <o>     % 4
&lst <t>     % 5

%page
LC_COLLATE (cont'd)

&lst order_start forward;forward
&lst <U0063> <c>;<BASE>
&lst <U0065> <e>;<BASE>
&lst <U00E9> <e>;<ACUTE>
&lst <U006F> <o>;<BASE>
&lst <U00F4> <o>;<CIRC>
&lst <U0074> <t>;<BASE>
&lst order_end
&lst END LC_COLLATE

  $ LC_ALL=TEST tst-show-weights cote coté côté côte
  cote:  2 4 5 3 1 2 2 2 2
  coté:  2 4 5 3 1 2 2 2 4
  côté:  2 4 5 3 1 2 3 2 4
  côte:  2 4 5 3 1 2 3 2 2

   cote < coté < côte < côté

%page
LC_COLLATE (cont'd)

BUT:  In French, accent differences between quasi-homographs are compared from right to left:
   cote < côte < coté < côté

This is achieved by replacing
&lst order_start forward;forward
by
&lst order_start forward;backward
(NOTE: GNU localedef is broken: see BZ645)

  $ LC_ALL=TEST tst-show-weights cote coté côté côte
  cote:  2 4 5 3 1 2 2 2 2
  coté:  2 4 5 3 1 4 2 2 2
  côté:  2 4 5 3 1 4 2 3 2
  côte:  2 4 5 3 1 2 2 3 2

%page
LC_COLLATE (cont'd)

We need at least 3 levels: base characters, diacritics, caseness.
A fourth level is added to make punctuation unambiguous.

The more levels, the longer the keys and the slower the collation.
In practice, four levels are sufficient, but locales can define
more if needed.

%page
LC_COLLATE (cont'd)

Example:
&lst collating-symbol <MIN>
&lst collating-symbol <CAP>
&lst collating-symbol <BAS>
&lst collating-symbol <ACA>
&lst collating-symbol <GRA>
&lst collating-symbol <a>
&lst <CAP>
&lst <MIN>
&lst <BAS>
&lst <ACA>
&lst <GRA>
&lst <a>
&lst order_start forward;forward;forward;forward,position
&lst <U0061> <a>;<BAS>;<MIN>;IGNORE
&lst <U00E1> <a>;<ACA>;<MIN>;IGNORE
&lst <U00E0> <a>;<GRA>;<MIN>;IGNORE
&lst <U0041> <a>;<BAS>;<CAP>;IGNORE
&lst <U00C1> <a>;<ACA>;<CAP>;IGNORE
&lst <U00C0> <a>;<GRA>;<CAP>;IGNORE
&lst order_end

%page
LC_COLLATE (cont'd)

Collation cannot always be defined on single characters:
	In German, ß is sorted like the sequence "ss"
	In traditional Spanish, the sequence "ch" is a letter sorted between c and d

Examples: one-to-many mapping
&lst <U00DF> "<s><s>";"<LIG><LIG>";"<MIN><MIN>";IGNORE
&lst <U00E6> "<a><e>";"<LIG><LIG>";"<MIN><MIN>";IGNORE

%page
LC_COLLATE (cont'd)

Example: many-to-one mapping
&lst collating-element <C-H> from "<U0043><U0048>"
&lst collating-element <c-h> from "<U0063><U0068>"
&lst collating-element <C-h> from "<U0043><U0068>"
&lst collating-element <c-H> from "<U0063><U0048>"
&lst collating-symbol <ch>
&lst reorder-after <MIN>
&lst <MIN-CAP>
&lst <CAP-MIN>
&lst reorder-after <c>
&lst <ch>
&lst reorder-after <U0063>
&lst <c-H>   <ch>;<BAS>;<MIN-CAP>;IGNORE
&lst <c-h>   <ch>;<BAS>;<MIN>;IGNORE
&lst reorder-after <U0043>
&lst <C-H>   <ch>;<BAS>;<CAP>;IGNORE
&lst <C-h>   <ch>;<BAS>;<CAP-MIN>;IGNORE
&lst reorder-end

%page
LC_COLLATE (cont'd)

Many-to-many mappings can be obtained by combining both ways, but
keys can become quite large and sorting takes much longer.

%page
LC_COLLATE (cont'd)

The POSIX standard is very rigid: an LC_COLLATE section cannot load
values from another locale and override only a few of them.
ISO 14651 promoted the creation of a default table (called iso14651_t1
in GNU libc), which can be modified for specific locales.
ISO 14652 proposed such a tailoring scheme for the LC_COLLATE section.

  reorder-after reorder-end

%page
LC_COLLATE (cont'd)

Example: fi_FI

&lst LC_COLLATE
&lst copy "iso14651_t1"
&lst
&lst collating-symbol <a-ring>
&lst collating-symbol <a-diaerisis>
&lst collating-symbol <o-diaerisis>
&lst
&lst reorder-after <z>
&lst <a-ring>
&lst <a-diaerisis>
&lst <o-diaerisis>
&lst
&lst reorder-after <U005A>
&lst <U00E5> <a-ring>;<BAS>;<MIN>;IGNORE
&lst <U00C5> <a-ring>;<BAS>;<CAP>;IGNORE
&lst <U00E4> <a-diaerisis>;<BAS>;<MIN>;IGNORE
&lst <U00C4> <a-diaerisis>;<BAS>;<CAP>;IGNORE
&lst <U00E6> <a-diaerisis>;<REU>;<MIN>;IGNORE
&lst <U00C6> <a-diaerisis>;<REU>;<CAP>;IGNORE
&lst <U00F6> <o-diaerisis>;<BAS>;<MIN>;IGNORE
&lst <U00D6> <o-diaerisis>;<BAS>;<CAP>;IGNORE
&lst <U00F8> <o-diaerisis>;<U00D8>;<MIN>;IGNORE
&lst <U00D8> <o-diaerisis>;<U00D8>;<CAP>;IGNORE
&lst <U00F5> <o-diaerisis>;<TIL>;<MIN>;IGNORE
&lst <U00D5> <o-diaerisis>;<TIL>;<CAP>;IGNORE
&lst reorder-end
&lst 
&lst END LC_COLLATE

#LC_PAPER
#height   297
#width    210
#END LC_PAPER
#
#LC_TELEPHONE
#tel_int_fmt    "<U002B><U0025><U0063><U0020><U0025><U0061><U0020><U0025>/
#<U006C>"
#int_prefix     "<U0033><U0035><U0038>"
#int_select     "<U0030><U0030>"
#END LC_TELEPHONE
#
#LC_MEASUREMENT
#measurement    1
#END LC_MEASUREMENT
#
#LC_NAME
#name_fmt    "<U0025><U0064><U0025><U0074><U0025><U0067><U0025><U0074>/
#<U0025><U006D><U0025><U0074><U0025><U0066>"
#END LC_NAME
#
#LC_ADDRESS
#postal_fmt    "<U0025><U0066><U0025><U004E><U0025><U0061><U0025><U004E>/
#<U0025><U0064><U0025><U004E><U0025><U0062><U0025><U004E><U0025><U0073>/
#<U0020><U0025><U0068><U0020><U0025><U0065><U0020><U0025><U0072><U0025>/
#<U004E><U0025><U0025><U007A><U0020><U0025><U0054><U0025>/
#<U004E><U0025><U0063><U0025><U004E>"
#country_ab2 "<U0046><U0049>"
#country_ab3 "<U0046><U0049><U004E>"
#country_num 246
#END LC_ADDRESS
#
