Re: console translator set without encoding
Today at 14:04, Samuel Thibault wrote:
> Normalized form take care of glyphs that really can be coded several
> different ways: for instance, latin e with acute accent may be directly
> coded as 'é', but in unicode, may also be coded as 'e' followed by the
> combining acute accent. These are really *two* ways to code *exactly*
> the same thing (on the displaying point of view: an 'e' with an acute
> accent above it). Hence normalization is needed to match both.
You missed my point: POSIX is not concerned with this, and they're not
the same if you're asking POSIX (try doing a stat() on two such files,
and let me know the results :). It's up to the "handler" to choose
one normalisation form and make the most of it (i.e. optimise
around it). They are the same from the ISO 10646 and Unicode POV,
but just like "A" and "a" are the same from the ISO 14651 POV (in
*certain* contexts, e.g. when you're doing case-insensitive collation), it's
What are you going to do when you come across a filesystem where you
have two files whose names differ only in the normalisation form
used (i.e. fully decomposed or fully composed)? Yeah, you can ensure
that no filesystem created via GNU/Hurd is going to have such
instances, but what about filesystems created elsewhere?
Are you going to treat such filesystems as erroneous?
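
A minimal Python sketch of the point above, under stated assumptions: it
only compares the byte strings that would be handed to the filesystem, so
it does not depend on how any particular filesystem behaves. The helper
names as_filename_bytes and normalisation_collisions are hypothetical,
made up for illustration, and filenames are assumed to be UTF-8 encoded.

```python
import unicodedata
from collections import defaultdict

# Two ways to code "e with acute accent".
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def as_filename_bytes(name):
    """Bytes a POSIX call such as stat() would see (assuming UTF-8)."""
    return name.encode("utf-8")


def normalisation_collisions(names, form="NFC"):
    """Group names that become identical once normalised to `form`.

    Returns only the groups with more than one member, i.e. names that
    are distinct as byte strings but equal after normalisation.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Different byte sequences, hence different filenames for POSIX.
    print(as_filename_bytes(composed))
    print(as_filename_bytes(decomposed))
    # The same character once a normalisation form is chosen.
    print(unicodedata.normalize("NFC", decomposed) == composed)
    # A directory listing holding both forms has one colliding pair.
    listing = [composed + ".txt", decomposed + ".txt", "plain.txt"]
    print(normalisation_collisions(listing))
```

A handler that normalises on lookup would have to decide what to do with
every group this function returns; a handler that leaves names alone
simply sees two unrelated files.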
Filenames are 8-bit, ASCII-compatible strings (UTF-8 was FSS-UTF, as in
"filesystem-safe", originally), and that's all you need to know to make
My example above was simply this: if you go the route of treating
several different things as one (from an implementation POV), you'll end
up with the mess Microsoft has on Windows with case-insensitive