
Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences



On Mon, Feb 22, 2010 at 11:07:21AM +0100, Andras Korn wrote:

> 1. grep has no way of knowing whether a "zs" sequence is a "single letter"
> or two letters, because the combination can occur in compound words without
> becoming a "zs" letter; for example, in "fúvószenekar" ("fúvós" +
> "zenekar"), it's simply an "s" and a "z" letter next to each other. There
> may even exist words that make (a different) sense either way, but I can't
> think of any right now.

Uh, sorry, wrong example ("sz" instead of "zs"). Some examples for "zs" are
"község", "egészség" (especially interesting because it contains an "sz"
followed by an "s", not an "s" followed by a "zs"), "gazság" etc.

> 2. "zs" is the last letter of the Hungarian alphabet; therefore, no sane
> character range in a regular expression can include it ("[a-zs]" would be
> ambiguous because there isn't a "zs" glyph).

It actually gets even more confusing, because grep's behaviour is
inconsistent:

% echo zs | LANG=hu_HU.UTF-8 grep "^[a-z]*$"
zs
% echo azsa | LANG=hu_HU.UTF-8 grep "^a.a$"
% echo azsa | LANG=hu_HU.UTF-8 grep "^a[^a-z]a$"
azsa

So is "zs" a member of the [a-z] class or not? In the first example, "z" and
"s" are presumably matched individually, since "zs" as a whole doesn't match
"." (as the second example shows). In the last example, however, "zs" matches
[^a-z], which is likewise supposed to match only a single character.
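
To make the comparison easier to repeat, here is a rough sketch that runs the
same patterns against "azsa" under both the C locale and hu_HU.UTF-8. The
try_match helper is my own name for illustration, not part of grep, and the
"^a..a$" pattern is an extra control that was not in the tests above. It uses
LC_ALL rather than LANG so that other LC_* settings can't interfere. It assumes
the hu_HU.UTF-8 locale has actually been generated (check with "locale -a");
if it hasn't, grep will most likely just behave as in the C locale. I'm not
listing expected output for the Hungarian locale, since that depends on the
installed locale data.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): run one grep
# under a given locale and report whether the input line matched.
try_match() {
    # $1 = locale name, $2 = pattern, $3 = input line
    if printf '%s\n' "$3" | LC_ALL="$1" grep -q -- "$2" 2>/dev/null; then
        echo "match"
    else
        echo "no match"
    fi
}

# Compare the C locale with the Hungarian one for each pattern.
for loc in C hu_HU.UTF-8; do
    for pat in '^[a-z]*$' '^a.a$' '^a..a$' '^a[^a-z]a$'; do
        printf '%-12s %-14s %s\n' "$loc" "$pat" "$(try_match "$loc" "$pat" azsa)"
    done
done
```

In the C locale, "^a..a$" should match and "^a[^a-z]a$" should not; any
difference in the hu_HU.UTF-8 rows would then point at the locale's collation
data rather than at grep itself.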

sed(1) is affected in the same way:

% echo azsa | LANG=hu_HU.UTF-8 sed -n "/^a[^a-z]a$/p"
azsa

Therefore, I believe this is a bug in locales, not grep.

Andras

-- 
 Andras Korn <korn at elan.rulez.org> - <http://chardonnay.math.bme.hu/~korn/>
                Never say 'OOPS!' Always say 'Ah, Interesting!'


