Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences

To: 570929@bugs.debian.org
Cc: Pirity Tamas Gabor <ptg@apaczai.elte.hu>
Subject: Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences
From: Andras Korn <korn-debbugs@elan.rulez.org>
Date: Wed, 24 Feb 2010 00:11:33 +0100
Message-id: <20100223231133.GA21466@hellgate.intra.guy>
Reply-to: Andras Korn <korn-debbugs@elan.rulez.org>, 570929@bugs.debian.org
In-reply-to: <20100222100721.GA3620@hellgate.intra.guy>
References: <20100222100721.GA3620@hellgate.intra.guy>

On Mon, Feb 22, 2010 at 11:07:21AM +0100, Andras Korn wrote:

> 1. grep has no way of knowing whether a "zs" sequence is a "single letter"
> or two letters, because the combination can occur in compound words without
> becoming a "zs" letter; for example, in "fúvószenekar" ("fúvós" +
> "zenekar"), it's simply an "s" and a "z" letter next to each other. There
> may even exist words that make (a different) sense either way, but I can't
> think of any right now.

Uh, sorry, wrong example (it shows "sz" rather than "zs"). Words in which
"zs" is a "z" followed by an "s", not the letter "zs", include "község",
"egészség" (especially interesting because it contains an "sz" followed by
an "s", not an "s" followed by a "zs") and "gazság".

> 2. "zs" is the last letter of the Hungarian alphabet; therefore, no sane
> character range in a regular expression can include it ("[a-zs]" would be
> ambiguous because there isn't a "zs" glyph).

It actually gets even more confusing, because grep's behaviour is
inconsistent:

% echo zs | LANG=hu_HU.UTF-8 grep "^[a-z]*$"
zs
% echo azsa | LANG=hu_HU.UTF-8 grep "^a.a$"
% echo azsa | LANG=hu_HU.UTF-8 grep "^a[^a-z]a$"
azsa

So is "zs" a member of the [a-z] class or not? In the first example, z and
s must have been matched individually, because "zs" as a unit doesn't match
"." (as shown in the second example). In the last example, however, "zs" as
a unit matches [^a-z], which is likewise only supposed to match a single
character.
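
For comparison, here is a minimal sketch of a control run: the same three
grep tests with the locale forced to "C", where "zs" cannot be a collating
element. The variable names r1..r3 are only for illustration. In this
locale I would expect the first test to print the line and the other two to
print nothing, so any difference from the transcript above should come from
the locale.

```shell
#!/bin/sh
# Control run: repeat the three grep tests with the locale forced to "C".
# LC_ALL overrides LANG and every other LC_* variable.
# r1, r2, r3 are illustrative names, not part of the original report.

# z and s are each inside a-z, so the whole line matches.
r1=$(echo zs | LC_ALL=C grep '^[a-z]*$')

# "zs" is two characters here, and "." matches exactly one.
r2=$(echo azsa | LC_ALL=C grep '^a.a$')

# Neither z nor s is outside a-z, and [^a-z] matches exactly one character.
r3=$(echo azsa | LC_ALL=C grep '^a[^a-z]a$')

# Brackets make an empty result visible.
printf 'test 1: [%s]\n' "$r1"
printf 'test 2: [%s]\n' "$r2"
printf 'test 3: [%s]\n' "$r3"
```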

The problem affects sed(1) in the same way:

% echo azsa | LANG=hu_HU.UTF-8 sed -n "/^a[^a-z]a$/p"
azsa

Therefore, I believe this is a bug in locales, not grep.
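
One way to probe this further, as a rough sketch only: keep the Hungarian
locale for everything except collation by setting LC_COLLATE=C, and repeat
the [^a-z] test with both tools. This assumes the "zs" collating element
comes from the locale's LC_COLLATE definition; if that is right, neither
command should print anything. The variable names out_grep and out_sed are
hypothetical.

```shell
#!/bin/sh
# Diagnostic sketch: Hungarian locale for everything except collation.
# Assumption: the multi-character "zs" element is defined in LC_COLLATE.

# LC_ALL, if set, would override LC_COLLATE, so clear it first.
unset LC_ALL

# Same pattern as in the report, with only the collation rules switched to C.
out_grep=$(echo azsa | LANG=hu_HU.UTF-8 LC_COLLATE=C grep '^a[^a-z]a$')
out_sed=$(echo azsa | LANG=hu_HU.UTF-8 LC_COLLATE=C sed -n '/^a[^a-z]a$/p')

# Brackets make an empty result visible.
printf 'grep: [%s]\n' "$out_grep"
printf 'sed:  [%s]\n' "$out_sed"
```

If both results are empty here while the plain hu_HU.UTF-8 run still prints
"azsa", that would point at the collation data rather than at grep or sed.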

Andras

-- 
 Andras Korn <korn at elan.rulez.org> - <http://chardonnay.math.bme.hu/~korn/>
                Never say 'OOPS!' Always say 'Ah, Interesting!'

Follow-Ups:
- Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences
  - From: Jonathan Nieder <jrnieder@gmail.com>

References:
- Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences
  - From: Andras Korn <korn-debbugs@elan.rulez.org>
