
Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences



Hi again,

Odd names for collating elements
--------------------------------

I wrote:

>  $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.ch.]]/<MATCHED>/'
>  sed: -e expression #1, char 21: Invalid collation character
> 
> Odd, no?

It did seem odd, especially since the POSIX documentation uses
examples like this all the time (usually [.ch.] from pre-1994
Spanish).  For example [1]:

 collating-element <ch> from "<c><h>"
 collating-element <e-acute> from "<acute><e>"
 collating-element <ll> from "ll"

I was missing something obvious: in GNU locales, the collating element
has a hyphenated name.

>  collating-symbol  <zs>
>  collating-element <z-s> from "<U007A><U0073>"

 $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.c-h.]]/<MATCHED>/'
 <MATCHED> and more

So there’s the workaround.  I think this is a real bug: POSIX.1-2008
says [2]:

	A collating symbol is a collating element enclosed within
	bracket-period ( "[." and ".]" ) delimiters. Collating
	elements are defined as described in Collation Order .
	Conforming applications shall represent multi-character
	collating elements as collating symbols when it is
	necessary to distinguish them from a list of the
	individual characters that make up the multi-character
	collating element. For example, if the string "ch" is a
	collating element defined using the line:

	collating-element <ch-digraph> from "<c><h>"

	in the locale definition, the expression "[[.ch.]]" shall
	be treated as an RE containing the collating symbol 'ch',
	while "[ch]" shall be treated as an RE matching 'c' or
	'h' . Collating symbols are recognized only inside
	bracket expressions. If the string is not a collating
	element in the current locale, the expression is invalid.

In other words, in the “collating-element <z-s> from "<U007A><U0073>"”
line, it is not the name z-s but the string "zs" that should name the
collating symbol in regexps.

This makes sense, since otherwise how could anyone write portable
regular expressions?

Writing [:alpha:] in Hungarian
------------------------------

Andras wrote:

> "zs" in particular is causing trouble for grep:
>
> % echo zs | LANG=C grep '^[^a-z]*$'
> % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
> zs

Any program using such constructions without LC_COLLATE=C or similar
is IMHO buggy because of exactly this problem.  With some C libraries
(though not current glibc, luckily), in English, [^a-z] matches A but
not Z or vice versa [3].  (Current POSIX leaves the behavior
unspecified.)

The [. .] notation seems to work here:

 % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-[.z-s.]]*$'
 %

Once the regexp engine is fixed, that regexp would become
'^[^a-[.zs.]]*$'.

Bracket expressions match collating elements
--------------------------------------------

Andras wrote:

> % echo ty | LANG=C grep '^[s-u]*$'
> % echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
> ty

POSIX is unambiguous about this: bracket expressions match collating
elements, not characters.

I can imagine situations where this would be helpful and situations
where it would be unhelpful.  Mostly, it just seems difficult to do
any other way, since otherwise what would the ranges mean?  The
simplest workaround is to use LC_COLLATE=C (or en_US.UTF-8, or C.UTF-8
once glibc learns that, or whatever locale has the behavior you want).
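A minimal sketch of that workaround, applied to both of the range
examples quoted above.  The grep_c helper is a made-up name, not an
existing tool.  It sets LC_ALL rather than LC_COLLATE because LC_ALL
overrides every other locale variable; note that this also switches
LC_CTYPE to C, which may not be what you want for non-ASCII input.

```shell
# Hypothetical helper: run grep in the C locale, where [s-u] and [a-z]
# are plain ranges of single characters and there are no
# multi-character collating elements such as "ty" or "zs".
grep_c() {
    LC_ALL=C grep "$@"
}

# grep prints nothing and exits non-zero when no line matches,
# so fall back to a marker string in that case.
ty_result=$(echo ty | grep_c '^[s-u]*$' || echo 'no match')
zs_result=$(echo zs | grep_c '^[^a-z]*$' || echo 'no match')

echo "ty: $ty_result"
echo "zs: $zs_result"
```

Both lines should report "no match", which is the LANG=C behavior
Andras showed.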

Computers are dumb
------------------

Andras wrote:

> 1. grep has no way of knowing whether a "zs" sequence is a "single letter"
> or two letters, because the combination can occur in compound words without
> becoming a "zs" letter; for example, in "fúvószenekar" ("fúvós" +
> "zenekar"), it's simply an "s" and a "z" letter next to each other. There
> may even exist words that make (a different) sense either way, but I can't
> think of any right now.

Are there simple heuristics that would make this condition easy to
discover?  For example, vowels that would never appear before a true
"sz" letter, things like that?  I am just curious; please feel free to
e-mail me privately about this.

This sounds like a (hard to fix) bug in the collation algorithm, but
not a reason not to make 'sort' follow the conventions of the language.

An argument could be made that although 'sort' should use the
customary collation order, regexp matching should not.  The strongest
counterargument I know of is that it is hard to find a different rule
that would be useful for regular expressions in, e.g., Hebrew.

. matches a character
---------------------

Andras wrote:

> % echo zs | LANG=hu_HU.UTF-8 grep "^[a-z]*$"
> zs
> % echo azsa | LANG=hu_HU.UTF-8 grep "^a.a$"
> % echo azsa | LANG=hu_HU.UTF-8 grep "^a[^a-z]a$"
> azsa

POSIX is unambiguous about this, too: . matches a single character,
not a collating element.

I assume this is mostly for speed.  If you want to match an arbitrary
collating element, it is not obvious to me how to do so.  [[:print:]]
would capture the most important ones.
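For comparison, here is a small sketch of the same three patterns run
in the C locale, where every collating element is a single character
and so . and bracket expressions cannot disagree.  The variable names
are just for illustration.

```shell
# grep -c prints the number of matching lines; it exits non-zero when
# that number is 0, hence the "|| true".
three=$(echo azsa | LC_ALL=C grep -c '^a.a$' || true)
four=$(echo azsa | LC_ALL=C grep -c '^a..a$' || true)
range=$(echo azsa | LC_ALL=C grep -c '^a[^a-z]a$' || true)

echo "a.a: $three, a..a: $four, a[^a-z]a: $range"
```

Here only '^a..a$' matches: "zs" counts as two characters for . and as
two ordinary lowercase letters for [^a-z].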

A related collating element bug
-------------------------------

There is some other ugliness: any single-byte character like [.e.]
works fine, but multi-byte characters like [.é.] do not.

 $ echo 'é and more' | LANG=en_US.UTF-8 sed 's/[[.é.]]/<MATCHED>/'
 sed: -e expression #1, char 21: Invalid collation character

I think a fix to the [[.zs.]] bug would automatically fix this as
well.

Hope that helps,
Jonathan

[1] http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_02_01
[2] item 4 from
http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
[3] http://mail-index.netbsd.org/tech-userlevel/2008/08/08/msg000986.html


