
Locales and belocs-locales-data, some explanations



I began working on belocs-locales-data in October 2004, to include most
patches sent to the Debian BTS and upstream bugzilla.  From the
beginning, it was pretty obvious to me that I also needed to fork
localedef, because having iso-*.def hardcoded in localedef makes
transitions much harder, e.g. users cannot change their currency on an
old system when a country makes such a change.  For a similar reason, I
also dislike the strong correlation between the locales and libc6
packages.
When working on improving locales, it became more and more obvious that
localedef is really buggy.  The first problem I encountered was with the
Dzongkha locale: it needs many collating elements, and GNU localedef
loops forever when it encounters more than 256 of them.
After sending patches to BZ368, I implemented these changes as a proof
of concept, and so far I have been told that the Dzongkha locale works
fine with belocs-locales-data.
After more digging into GNU localedef internals about collation, I filed
BZ645 with test cases.  I did not receive any answer to the bug reports
sent against localedef, and from then on I did not send all my patches
upstream.

In this mail, I will describe the patches applied against
belocs-locales-bin (which ships the locale, localedef and locale-gen
programs), so that we can discuss which ones could be merged into libc6
or pushed upstream.  I will prepare dpatches to apply to glibc-package
for the issues that you believe are worth taking from
belocs-locales-bin.
In a later mail, I will explain the changes applied against
belocs-locales-data, but some changes (like the Dzongkha locale) need
patches to be applied against localedef, so I prefer to discuss the
latter first.

Instead of discussing several issues in the same thread, it would
certainly be a good idea to have an issue per reply.  Feel free to start
a new thread if you prefer.

Patches in belocs-locales-bin are maintained with quilt, which means
that there is a debian/patches/series file listing patches in the
desired order.  The debian/patches directory is temporarily online at
  http://people.debian.org/~barbier/tmp/belocs-locales-bin/patches/

  A. Changes in locale-gen
  =-=-=-=-=-=-=-=-=-=-=-=-

As locale-gen is not an upstream program, there is no patch here.
The current script is temporarily available at
  http://people.debian.org/~barbier/tmp/belocs-locales-bin/locale-gen
The main differences with locale-gen from the locales package are:

  * It accepts a few command-line options, and can also be driven
    by a configuration file /etc/belocs/locale-gen.conf:
  --purge        remove existing locales before processing
  --archive      store compiled locale data inside a single archive
  --no-archive   do not store compiled locale data inside a single archive
                 (default)
  --aliases=FILE read locale aliases from FILE. (Default: /etc/locale.alias)

  * It detects the magic number currently used by GNU libc for compiled
    locale data, and tells localedef to write compiled locale data
    in this format, if it is supported.  E.g. my localedef has
    supported both the 20000828 (glibc >= 2.1.96) and 20031115
    (glibc >= 2.3.3) magic numbers for a long time; when I upgraded to
    glibc 2.3.5-3, I was not forced to upgrade belocs-locales-data, I
    only needed to re-run locale-gen so that locales were compiled into
    the right format.
    Of course it would be much better if this rebuild were triggered by
    libc6, but this is another story ;)

  * It keeps track of dependencies between generated locales and locale
    source files, so that a locale is regenerated only if some source
    file needed for this locale has been modified, or if the magic
    number changed.  The --purge command-line flag tells locale-gen to
    purge everything before processing locales.
    This is very convenient on slow machines (yes, mine is very slow).

  * By default, locales are written in the old format (not into an
    archive file).  My motivation was that if someone needs to add a
    local locale, she can compile her locale into $HOME/share/locale
    and set LOCPATH to $HOME/share/locale:/usr/lib/locale if she
    wants to use either her preferred locale or a system one, e.g.
    with LANGUAGE=xx_XX:de
    But this works only if system locales are compiled in the old
    style, not into an archive.  I also ran benchmarks to see if the
    archive was faster, and IIRC noticed no significant difference.
    This behavior can be overridden by the --archive flag.
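
The dependency tracking described in the third item above (regenerate a
locale only when one of its source files changed or when the magic
number changed) can be sketched in shell as follows.  This is only an
illustration of the idea, not the code of the belocs locale-gen script:
the needs_rebuild function and the stamp file recording the magic number
are hypothetical names, and the -nt test operator is assumed to be
available (dash, bash and busybox sh provide it).

```shell
#!/bin/sh
# Sketch only: needs_rebuild and the stamp-file layout are hypothetical,
# they are not taken from the belocs locale-gen script.

# needs_rebuild OUTPUT STAMP MAGIC SOURCE...
# Succeeds (exit status 0) when OUTPUT has to be regenerated.
needs_rebuild() {
    output=$1
    stamp=$2
    magic=$3
    shift 3

    # No compiled locale yet, or no record of the format it was built for.
    [ -e "$output" ] || return 0
    [ -f "$stamp" ] || return 0

    # The magic number recorded at build time differs from the current one.
    [ "$(cat "$stamp")" = "$magic" ] || return 0

    # At least one source file is newer than the compiled output.
    for src in "$@"; do
        if [ "$src" -nt "$output" ]; then
            return 0
        fi
    done

    # Everything is up to date.
    return 1
}
```

A caller would run localedef for a given locale only when needs_rebuild
succeeds, and write the current magic number into the stamp file after a
successful build.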

  B. General changes to compile locale and locale-gen
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  * autotoolize.diff standalone_build.diff
    These 2 patches are needed to compile belocs-locales-bin outside of
    glibc, and are of no interest to you.

  C. Changes in locale
  =-=-=-=-=-=-=-=-=-=-

  * locale_print_LANGUAGE.diff
    This patch prints the LANGUAGE variable on output, if it is set,
    when locale is called without arguments.

  D. Changes in localedef
  =-=-=-=-=-=-=-=-=-=-=-=

  * debian-localedef-fix-trampoline.diff
    Stolen from your localedef-fix-trampoline.dpatch

  * compatibility_magic_number.diff
    Add a --magic command-line flag, to specify output format.  Add
    black magic to write data into the right format.

  * read_isocodes_at_run_time.diff
    If /etc/belocs/iso-{3166,4217,639}.def files are found at run-time,
    their content is parsed to override compile-time defaults.  This
    way, users can change values if needed without having to recompile
    localedef.  Another way would be to fully remove checks on these
    values, but I am not yet decided on the best approach.

  * allow_duplicate_country_num.diff
    Allow several countries to share the same country number in
    iso-3166.def; this may help transitions when country numbers do
    change.  Again, these checks may alternatively be fully removed.

  * localedef_LC_COLLATE_do_not_copy_locales.diff
    In GNU localedef, if the first keyword of an LC_COLLATE section is
    "copy", the matching locale is not parsed; its compiled data is
    loaded directly into memory if this locale exists.  This may cause
    some mismatch with my cache system, because the loaded locale may
    be outdated.  Moreover this memory loading is much slower with very
    large archive files, so dropping it causes no performance loss
    (well, this is a moot argument since archive files are not used by
    default ;)).  And third, the state machine used when parsing
    LC_COLLATE is cleaner without this special casing of "copy"; it was
    pretty difficult to understand why om_ET does not fail with 2 copy
    keywords whereas other locales have a hard time using only one
    "copy".
    For all these reasons, with this patch locales are never copied
    from compiled data into memory by "copy" keywords.

  * localedef_complex_collate.diff               [BZ368]
    Allow more than 256 collating-element definitions.  This is needed
    for dz_BT.

  * localedef_fix_LC_COLLATE_rules.diff          [BZ645]
    As shown in this bug report, localedef does not respect order_start
    keywords: the same ruleset is assigned to all scripts.

  * localedef_preprocessor_collate.diff          [BZ686]
    ISO 14652 defines preprocessor-like directives to help tailor
    tables.  E.g. in belocs-locales-data, locales which sort uppercase
    before lowercase do
      define UPPERCASE_FIRST
      copy "iso14651_t1"
    because my "iso14651_t1" replaced
      <RES-1>
      <MIN>
      <ANO>
      ...
      <AME>
      <CAP>
    by
      <RES-1>
      ifdef UPPERCASE_FIRST
      <CAP>
      else
      <MIN>
      endif
      <ANO>
      ...
      <AME>
      ifdef UPPERCASE_FIRST
      <MIN>
      else
      <CAP>
      endif
    In the locales package, "iso14651_t1" has to be copied into such
    locales and edited to swap 2 lines.
    These keywords were already recognized by GNU localedef; I only
    assigned actions to them.

  * localedef_LC_COLLATE_keywords_ordering.diff  [BZ690]
    The current state machine is too strict, e.g. it should allow
      copy "iso14651_t1"
      script <FOOBAR>
      order_start <FOOBAR>;forward;forward;forward;forward,position
      ...
      order_end
    so that scripts not yet in "iso14651_t1" can be added to it.
    This patch is also needed to allow preprocessor directives before
    "copy" keywords, see above.
    
  * localedef_LC_IDENTIFICATION_optional_fields.diff
    In LC_IDENTIFICATION, audience, application and abbreviation
    keywords are optional, thus do not report an error (with -v flag)
    if they are not defined.

  * localedef_fix_exhausted_memory.diff
    Localedef aborts if a symbol name has exactly 55 characters in a
    charmap file or in an LC_COLLATE section:
$ cat << EOT | localedef -i - -f UTF-8 -c /tmp/FOO
LC_COLLATE
collating-symbol <abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc>
order_start forward
<abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc>
order_end
END LC_COLLATE
EOT
memory clobbered past end of allocated block
    (Actually the message was "memory exhausted" when I wrote this patch)

  * localedef_check_unknown_symbols.diff
    Detect and report undeclared symbols in collation rules.  They
    are always a sign that something went wrong: a typo was made,
    some declarations were erroneously removed, etc.  These checks
    let me find several bugs in collation rules.

  * localedef_fix_lang_lib_test.diff
    I wrote this patch to fix compilation of nds_DE, a locale shipped
    by Mandriva, but I am no longer convinced that this is the right
    fix.  Please do not consider it for now; more investigation is
    needed on my side.
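
To exercise the 256-element limit that localedef_complex_collate.diff
addresses, an input with many collating-element definitions can be
generated mechanically.  The sketch below is an assumption of mine about
what such an input could look like; it is not the test case attached to
BZ368, gen_collate_test is a hypothetical helper, and the generated
section has not been checked against any particular localedef version.

```shell
#!/bin/sh
# Sketch only: gen_collate_test is a hypothetical helper, and its output
# is not the test case attached to BZ368.

# gen_collate_test COUNT
# Print an LC_COLLATE section declaring COUNT collating elements
# <E1> .. <ECOUNT>, then listing them in one order_start block.
gen_collate_test() {
    count=$1

    echo 'LC_COLLATE'

    # One definition per element; each element is "a" followed by a
    # distinct character starting after U+4E00 (19968 in decimal).
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'collating-element <E%d> from "<U0061><U%04X>"\n' \
            "$i" $((19968 + i))
        i=$((i + 1))
    done

    echo 'order_start forward'
    i=1
    while [ "$i" -le "$count" ]; do
        printf '<E%d>\n' "$i"
        i=$((i + 1))
    done
    echo 'order_end'

    echo 'END LC_COLLATE'
}
```

Following the invocation used in the localedef_fix_exhausted_memory.diff
item above, the output could be fed to localedef with something like
"gen_collate_test 300 | localedef -i - -f UTF-8 -c /tmp/FOO".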

  E. Changes out of my scope
  =-=-=-=-=-=-=-=-=-=-=-=-=-

  Please fix BZ968/BTS#310635: strxfrm can segfault when the above
  localedef bugs are fixed.

Denis


