Locales and belocs-locales-data, some explanations
I began working on belocs-locales-data in October 2004, to include most
patches sent to Debian BTS and upstream bugzilla. From the beginning,
it was pretty obvious to me that I also needed to fork localedef,
because having iso-*.def hardcoded in localedef makes transitions much
harder: users cannot change their currency on an old system when
a country makes such a change. For a similar reason, I also dislike the
tight coupling between the locales and libc6 packages.
When working on improving locales, it became more and more obvious that
localedef is really buggy. The first problem I encountered was with the
Dzongkha locale: it needs many collating elements, and GNU localedef
loops forever when it encounters more than 256 of them.
After sending patches to BZ368, I implemented these changes as a proof
of concept, and so far I have been told that the Dzongkha locale works
fine with belocs-locales-data.
After more digging into GNU localedef internals about collation, I filed
BZ645 with test cases. I did not receive any answer to the bug reports
I sent against localedef, and from then on I did not send all my
patches upstream.
In this mail, I will describe the patches applied against
belocs-locales-bin (which ships locale, localedef and locale-gen
programs), so that we can discuss which ones could be merged into libc6
or pushed upstream. I will prepare dpatches to apply to glibc-package
on the issues which you believe are worth getting from
belocs-locales-bin.
In a later mail, I will explain the changes applied against
belocs-locales-data, but some changes (like the Dzongkha locale) need
patches to be applied against localedef, so I prefer to discuss these
ones first.
Instead of discussing several issues in the same thread, it would
certainly be a good idea to have an issue per reply. Feel free to start
a new thread if you prefer.
Patches in belocs-locales-bin are maintained with quilt, which means
that there is a debian/patches/series file listing patches in the
desired order. The debian/patches directory is temporarily online at
http://people.debian.org/~barbier/tmp/belocs-locales-bin/patches/
A. Changes in locale-gen
=-=-=-=-=-=-=-=-=-=-=-=-
As locale-gen is not an upstream program, there is no patch here.
The current script is temporarily available at
http://people.debian.org/~barbier/tmp/belocs-locales-bin/locale-gen
The main differences with locale-gen from the locales package are:
* It accepts a few command-line options, and can also be driven
by a configuration file /etc/belocs/locale-gen.conf:
    --purge          remove existing locales before processing
    --archive        store compiled locale data inside a single archive
    --no-archive     do not store compiled locale data inside a single
                     archive (default)
    --aliases=FILE   read locale aliases from FILE (default:
                     /etc/locale.alias)
* It detects the magic number currently used by GNU libc for compiled
locale data, and tells localedef to write compiled locale data
suitable for this format, if it is supported. E.g. my localedef
has supported both the 20000828 (glibc >= 2.1.96) and 20031115
(glibc >= 2.3.3) magic numbers for a long time; when I upgraded to
glibc 2.3.5-3, I was
not forced to upgrade belocs-locales-data, I only needed to re-run
locale-gen so that locales are compiled into the right format.
Of course it would be much better if this rebuild was triggered by
libc6, but this is another story ;)
* It keeps track of dependencies between generated locales and locale
source files, so that locales are generated only if some source
files needed for this locale have been modified, or if the magic
number changed. The --purge command-line flag tells locale-gen to
purge everything before processing locales.
This is very convenient on slow machines (yes, mine is very slow).
* By default, locales are written into the old format (not into an
archive file). My motivation was that if someone needs to add a
local locale, she can compile her locale into $HOME/share/locale
and set LOCPATH to $HOME/share/locale:/usr/lib/locale if she
wants to use either her preferred locale or a system one, e.g.
with LANGUAGE=xx_XX:de
But this will work only if system locales are compiled in the old
style, not into an archive. I also ran benchmarks to see whether the
archive was faster, and IIRC noticed no significant difference.
This behavior can be overridden by the --archive flag.
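
The rebuild decision described in the dependency-tracking item above
can be sketched roughly as follows. This is only an illustration in
Python, not the actual locale-gen script: the function name, its
parameters, and the use of file modification times to detect changed
sources are all assumptions made for the example.

```python
import os


def needs_rebuild(compiled_path, source_paths, recorded_magic, current_magic):
    """Return True if a compiled locale should be regenerated.

    Hypothetical helper: compiled_path is the compiled locale output,
    source_paths are the locale source files it depends on,
    recorded_magic is the magic number it was compiled for, and
    current_magic is the one the installed libc currently expects.
    """
    # Nothing compiled yet: always generate.
    if not os.path.exists(compiled_path):
        return True
    # libc expects another compiled-data format: regenerate.
    if recorded_magic != current_magic:
        return True
    # Regenerate if any source file is newer than the compiled output.
    compiled_mtime = os.path.getmtime(compiled_path)
    return any(os.path.getmtime(src) > compiled_mtime for src in source_paths)
```

With such a check, a --purge run simply amounts to deleting the compiled
output first, so that the first test always succeeds.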
B. General changes to compile locale and locale-gen
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
* autotoolize.diff standalone_build.diff
These 2 patches make it possible to compile belocs-locales-bin outside
of glibc, and are of no interest to you.
C. Changes in locale
=-=-=-=-=-=-=-=-=-=-
* locale_print_LANGUAGE.diff
This patch prints the LANGUAGE variable on output, if it is set,
when locale is called without arguments.
D. Changes in localedef
=-=-=-=-=-=-=-=-=-=-=-=
* debian-localedef-fix-trampoline.diff
Stolen from your localedef-fix-trampoline.dpatch
* compatibility_magic_number.diff
Add a --magic command-line flag, to specify output format. Add
black magic to write data into the right format.
* read_isocodes_at_run_time.diff
If /etc/belocs/iso-{3166,4217,639}.def files are found at run-time,
their content is parsed to override compile-time defaults. This
way, users can change values if needed without having to recompile
localedef. Another way would be to fully remove checks on these
values, but I have not yet decided on the best approach.
* allow_duplicate_country_num.diff
Allow several countries to share the same country number in
iso-3166.def; this may help transitions when country numbers do
change. Again, these checks may alternatively be fully removed.
* localedef_LC_COLLATE_do_not_copy_locales.diff
In the LC_COLLATE section, if the first keyword is "copy", the matching
locale is not parsed but instead loaded directly into memory, if
this locale exists. This may cause some mismatch with my cache
system, because the loaded locale may be outdated. Moreover, this
memory loading is much slower with very large archive files, so
dropping it causes no performance loss (well, this is a moot argument
since archive files are not used by default ;)). Third, the state
machine used when parsing LC_COLLATE is cleaner without this
special casing of "copy"; it was pretty difficult to understand why
om_ET does not fail with 2 copy keywords whereas other locales have
a hard time using only one "copy".
For all these reasons, locales are never copied from compiled data
into memory by "copy" keywords.
* localedef_complex_collate.diff [BZ368]
Allow more than 256 collating-element definitions. This is needed
for dz_BT.
* localedef_fix_LC_COLLATE_rules.diff [BZ645]
As shown in this bug report, localedef does not respect order_start
keywords: the same ruleset is assigned to all scripts.
* localedef_preprocessor_collate.diff [BZ686]
ISO 14652 defines preprocessor-like directives to help tailoring
tables. E.g. in belocs-locales-data, locales which sort uppercase
before lowercase do
    define UPPERCASE_FIRST
    copy "iso14651_t1"
because my "iso14651_t1" replaced
    <RES-1>
    <MIN>
    <ANO>
    ...
    <AME>
    <CAP>
by
    <RES-1>
    ifdef UPPERCASE_FIRST
    <CAP>
    else
    <MIN>
    endif
    <ANO>
    ...
    <AME>
    ifdef UPPERCASE_FIRST
    <MIN>
    else
    <CAP>
    endif
In the locales package, "iso14651_t1" has to be copied into such
locales and edited to swap 2 lines.
These keywords were already recognized by GNU localedef, I only
assigned actions to them.
* localedef_LC_COLLATE_keywords_ordering.diff [BZ690]
The current state machine is too strict, e.g. it should allow
    copy "iso14651_t1"
    script <FOOBAR>
    order_start <FOOBAR>;forward;forward;forward;forward,position
    ...
    order_end
so that scripts not yet in "iso14651_t1" can be added to it.
This patch is also needed to allow preprocessor directives before
"copy" keywords, see above.
* localedef_LC_IDENTIFICATION_optional_fields.diff
In LC_IDENTIFICATION, the audience, application and abbreviation
keywords are optional, so no error is reported (with the -v flag)
if they are not defined.
* localedef_fix_exhausted_memory.diff
Localedef aborts if a symbol name has exactly 55 characters in
charmap file or in LC_COLLATE section:
    $ cat << EOT | localedef -i - -f UTF-8 -c /tmp/FOO
    LC_COLLATE
    collating-symbol <abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc>
    order_start forward
    <abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc>
    order_end
    END LC_COLLATE
    EOT
    memory clobbered past end of allocated block
(Actually the message was "memory exhausted" when I wrote this patch)
* localedef_check_unknown_symbols.diff
Detect and report undeclared symbols in collation rules. They are
always a sign that something went wrong: a typo was made,
some declarations were erroneously removed, etc. These checks
let me find several bugs in collation rules.
* localedef_fix_lang_lib_test.diff
I wrote this patch to fix compilation of nds_DE, a locale shipped
by Mandriva, but I am no longer convinced that this is the right fix.
Please do not consider it; more investigation is needed on my side.
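
To make the effect of the define/ifdef/else/endif directives from
localedef_preprocessor_collate.diff more concrete, here is a minimal
sketch in Python. It is not the code from the patch (localedef is
written in C); the function name and the line-based handling are
assumptions for illustration only, and the table is abbreviated to the
symbols shown above, leaving out the elided part.

```python
def preprocess(lines, defined=None):
    """Expand define/ifdef/else/endif directives in a list of lines.

    Hypothetical helper: returns the lines that remain once the
    directives have been evaluated and removed.
    """
    defined = set(defined or ())
    output = []
    stack = []  # one flag per open ifdef: is its current branch taken?
    for line in lines:
        words = line.split()
        keyword = words[0] if words else ""
        active = all(stack)  # True when every enclosing branch is taken
        if keyword == "define" and len(words) == 2:
            if active:
                defined.add(words[1])
        elif keyword == "ifdef" and len(words) == 2:
            stack.append(words[1] in defined)
        elif keyword == "else" and len(words) == 1:
            if not stack:
                raise ValueError("else without ifdef")
            stack[-1] = not stack[-1]
        elif keyword == "endif" and len(words) == 1:
            if not stack:
                raise ValueError("endif without ifdef")
            stack.pop()
        elif active:
            output.append(line)
    if stack:
        raise ValueError("unterminated ifdef")
    return output


# Abbreviated stand-in for the modified "iso14651_t1" table.
TABLE = [
    "<RES-1>",
    "ifdef UPPERCASE_FIRST", "<CAP>", "else", "<MIN>", "endif",
    "<ANO>",
    "<AME>",
    "ifdef UPPERCASE_FIRST", "<MIN>", "else", "<CAP>", "endif",
]

# A locale that defines UPPERCASE_FIRST before copying the table.
upper_first = preprocess(["define UPPERCASE_FIRST"] + TABLE)
# A locale that copies the table without defining anything.
lower_first = preprocess(TABLE)
```

In this sketch, upper_first places <CAP> right after <RES-1> and <MIN>
last, while lower_first keeps the original order, so a single shared
table can serve both kinds of locales.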
E. Changes out of my scope
=-=-=-=-=-=-=-=-=-=-=-=-=-
Please fix BZ968/BTS#310635: strxfrm can segfault once the above
localedef bugs are fixed.
Denis