
Bug#570929: Hungarian locale: "zs" is treated as a single letter, with undesirable consequences



On Fri, Feb 26, 2010 at 12:38:08AM -0600, Jonathan Nieder wrote:

> Computers are dumb
> ------------------
> 
> Andras wrote:
> 
> > 1. grep has no way of knowing whether a "zs" sequence is a "single letter"
> > or two letters, because the combination can occur in compound words without
> > becoming a "zs" letter; for example, in "fúvószenekar" ("fúvós" +
> > "zenekar"), it's simply an "s" and a "z" letter next to each other. There
> > may even exist words that make (a different) sense either way, but I can't
> > think of any right now.
> 
> Are there simple heuristics that would make this condition easy to
> discover?  For example, vowels that would never appear before a true
> "sz" letter, things like that?  I am just curious; please feel free to
> e-mail me privately about this.
> 
> This sounds like a (hard to fix) bug in the collation algorithm, but
> not a reason not to make 'sort' follow the conventions of the language.

Sorting is actually also tricky with dumb computers, because there is no way
for sort to know whether e.g. "nyolcszáz" contains a "cs" collating symbol
followed by "z" or a "c" followed by an "sz" collating symbol (the latter is
in fact the case).

cs+z would be sorted after "cz" (because "cs" comes after "c"), but c+sz
would be sorted _before_ "cz" because "sz" precedes "z".
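
To make the difference concrete, here is a minimal Python sketch. It is an
illustration only, not how any libc implements collation: the sort key is
simply each letter's position in the Hungarian alphabet, greedy_tokens() is a
hypothetical helper that always prefers the longest multi-character letter,
and "nyolcz" is a made-up reference string rather than a real word.

```python
# Hungarian alphabet in order; multi-character entries count as one letter.
ALPHABET = [
    "a", "á", "b", "c", "cs", "d", "dz", "dzs", "e", "é", "f", "g", "gy",
    "h", "i", "í", "j", "k", "l", "ly", "m", "n", "ny", "o", "ó", "ö", "ő",
    "p", "q", "r", "s", "sz", "t", "ty", "u", "ú", "ü", "ű", "v", "w", "x",
    "y", "z", "zs",
]
# Simplified key: position in the alphabet (real collation has more levels).
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
# Multi-character letters, longest first, so "dzs" is tried before "dz".
MULTI = sorted((l for l in ALPHABET if len(l) > 1), key=len, reverse=True)


def greedy_tokens(word):
    """Split left to right, always taking the longest matching letter."""
    tokens = []
    i = 0
    while i < len(word):
        for m in MULTI:
            if word.startswith(m, i):
                tokens.append(m)
                i += len(m)
                break
        else:
            # No multi-character letter starts here: take one character.
            tokens.append(word[i])
            i += 1
    return tokens


def sort_key(tokens):
    """Map a token list to a list of alphabet positions."""
    return [RANK[t] for t in tokens]


greedy = greedy_tokens("nyolcszáz")              # what a dumb program sees
correct = ["ny", "o", "l", "c", "sz", "á", "z"]  # nyolc + száz, split by hand
reference = greedy_tokens("nyolcz")              # made-up string ending in c+z

print(greedy)
print(sort_key(greedy) > sort_key(reference))    # cs+z reading vs. "cz"
print(sort_key(correct) < sort_key(reference))   # c+sz reading vs. "cz"
```

The greedy split produces the cs+z reading, so the same string lands on
opposite sides of "cz" depending on which segmentation is used.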

I'd say this is unfixable. There is no way, short of understanding the
natural language, for a program to determine whether two (or three)
characters represent a single collating symbol or themselves.
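
The ambiguity can be shown by listing every way a string could be split into
letters. This is a rough Python sketch; segmentations() is a hypothetical
helper and the MULTI set is my assumption of the multi-character letters to
consider. Nothing in the string itself says which split is the right one.

```python
# Multi-character letters of the Hungarian alphabet (assumed set).
MULTI = {"cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"}


def segmentations(word):
    """Return every split of word into single characters and MULTI letters."""
    if not word:
        return [[]]
    results = []
    for size in (1, 2, 3):
        head = word[:size]
        # A head of one character is always allowed; longer heads only if
        # they are complete and form a known multi-character letter.
        if len(head) == size and (size == 1 or head in MULTI):
            for rest in segmentations(word[size:]):
                results.append([head] + rest)
    return results


for candidate in segmentations("nyolcszáz"):
    print("-".join(candidate))
```

Both ny-o-l-cs-z-á-z and ny-o-l-c-sz-á-z show up in the output, and picking
between them requires knowing that the word is "nyolc" + "száz".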

Clearly, the Hungarian language must be fixed, either by introducing
separate glyphs for the composite "letters", or by no longer insisting that
something represented by more than one character is "a single letter". :)

Andras

-- 
 If Chuck Norris had been Spartan, the movie would simply have been called 1.

