
Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences



Package: locales
Version: 2.10.2-6
Severity: normal

Hi,

in Hungarian, "zs" (as well as "sz", "cs", "ty", "dz", "dzs", "gy" and "ly")
is part of the alphabet, and each such combination is considered to be a
single letter. However, each is written with two or three characters;
there are no single glyphs for them.

"zs" in particular is causing trouble for grep:

% echo zs | LANG=C grep '^[^a-z]*$'
% echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
zs

It's possible to come up with expressions that lead to similarly unexpected
results for the other multi-char letters as well, but these don't occur
frequently:

% echo ty | LANG=C grep '^[s-u]*$'
% echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
ty
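
As a stopgap, and only assuming that byte-order ranges are acceptable for the
pattern at hand, the locale can be overridden for a single grep invocation.
The helper name below is made up for illustration; it is not part of grep or
glibc:

```shell
# Hypothetical helper (the name is an assumption, not an existing tool).
# Runs grep with LC_ALL=C so that bracket ranges such as [a-z] are
# evaluated by byte value instead of the locale's collation rules.
grep_c_ranges() {
    LC_ALL=C grep "$@"
}

# With the C locale forced, "zs" is two characters that both fall inside
# a-z, so the negated range does not match and grep prints nothing.
echo zs | grep_c_ranges '^[^a-z]*$' || echo "no match"
```

Note that LC_ALL=C also makes grep treat UTF-8 input as plain bytes, so
accented letters such as "ú" are no longer seen as single characters. This
works around the symptom for ASCII-only patterns; it doesn't fix the locale.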

This is undesirable and dumb, for several reasons:

1. grep has no way of knowing whether a two-character sequence such as "zs"
or "sz" is a single letter or two letters, because the same sequence can
occur in compound words without forming the multi-char letter; for example,
in "fúvószenekar" ("fúvós" + "zenekar"), the "sz" is simply an "s" followed
by a "z". There may even exist words that make (a different) sense either
way, but I can't think of any right now.

2. "zs" is the last letter of the Hungarian alphabet; therefore, no sane
character range in a regular expression can include it ("[a-zs]" would be
ambiguous because there isn't a "zs" glyph).

"zs" and the other multi-char letters play an important role in sorting
("zs" has to be sorted after "za" and so on), but please can we treat them
as two characters in all other contexts?
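
To illustrate the split I'm asking for: as far as I understand it, range
expressions are driven by LC_COLLATE while character handling comes from
LC_CTYPE, so a user can already approximate the desired behaviour per command.
This is a minimal sketch under that assumption; the wrapper name is
hypothetical:

```shell
# Hypothetical wrapper (the name is an assumption).
# Only LC_COLLATE is switched to C, so ranges should follow code point
# order while LC_CTYPE (and thus UTF-8 decoding) stays as configured.
# LC_ALL is unset inside the subshell because it would override LC_COLLATE.
grep_codepoint_ranges() {
    (
        unset LC_ALL
        LC_COLLATE=C grep "$@"
    )
}

# "zs" should be seen as "z" followed by "s", both inside a-z,
# so the negated range should not match.
echo zs | grep_codepoint_ranges '^[^a-z]*$' || echo "no match"
```

Because the override is scoped to the subshell, sort(1) and anything else
run from the same session keep the Hungarian collation order.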

I can also make a socio-ergonomic point: I think most people who deal with
regular expressions don't expect Hungarian multi-character "letters" to be
treated as single characters in regular expressions, whether they are
Hungarian or not.

Andras

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32.7-vs2.3.0.36.28-hellgate (SMP w/3 CPU cores; PREEMPT)
Locale: LANG=C, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages locales depends on:
ii  debconf [debconf-2.0]         1.5.28     Debian configuration management sy
ii  libc6 [glibc-2.10-1]          2.10.2-2   GNU C Library: Shared libraries

locales recommends no packages.

locales suggests no packages.

-- debconf information:
* locales/default_environment_locale: None
* locales/locales_to_be_generated: en_GB ISO-8859-1, en_GB.ISO-8859-15 ISO-8859-15, en_GB.UTF-8 UTF-8, en_US ISO-8859-1, en_US.ISO-8859-15 ISO-8859-15, en_US.UTF-8 UTF-8, hu_HU ISO-8859-2, hu_HU.UTF-8 UTF-8

-- 
 Andras Korn <korn at elan.rulez.org> - <http://chardonnay.math.bme.hu/~korn/>
                A stitch in time would have confused Einstein.


