
Re: console translator set without encoding

Hi Marcus,

Yesterday at 5:56, Marcus Brinkmann wrote:

> At 21 Jan 2005 19:31:13 -0800,
> Thomas Bushnell BSG wrote:
>> Marcus Brinkmann <marcus.brinkmann@ruhr-uni-bochum.de> writes:
>> > UTF-8 is an insanely complex standard, if you start to look down its
>> > depths.  
>> UTF-8 is a complex standard.  It is not insanely so.  It is complex
>> because it is representing a very complex problem.  

Now, UTF-8 itself is an extremely simple standard; it is Unicode that
is not :)  Proper UTF-8 transformation functions usually take no more
than a couple of dozen lines, and that's including error checking :)

Or I may be missing which UTF-8 standard you're talking about (RFC
something :).
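
To make the "couple of dozen lines" claim concrete, here is a rough
sketch of an encoder in C.  The name utf8_encode and its return
convention are mine, purely for illustration, and I am assuming the
modern limits (code points up to U+10FFFF, surrogates rejected), so
take it as a sketch rather than a reference implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out,
   which must have room for 4 bytes.  Returns the number of bytes
   written, or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* error: surrogate half */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* error: beyond U+10FFFF */
}
```

A decoder is about the same size again; it additionally has to reject
overlong forms and stray continuation bytes.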

> Oh, sure.  The insanity starts if you talk about using "UTF-8" for
> things like filenames without being very exact in what you mean by
> that.  The implications of putting the complex system UTF-8 into
> POSIX-like operating systems as they exist today are not well
> understood, and the resulting loose ends, conflicts, etc are not
> resolved as of today.

POSIX has never used "equivalences" for characters (e.g.
case differences), so I don't see what's so different in using
UTF-8 instead of ISO-8859-1 for filenames: after all, one can treat
UTF-8 bytes as ISO-8859-1 without any problem at all, so from the POSIX
point of view, it all works, it just displays as gibberish :)
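
The reason this works is that every byte of a multi-byte UTF-8
sequence is 0x80 or above, so neither '/' nor NUL can ever appear
inside one.  A small sketch in C (count_components is a made-up
helper, not anything from libdiskfs) that splits a path purely on
bytes and so handles UTF-8 names without knowing about them:

```c
#include <stddef.h>

/* Hypothetical helper: count the non-empty components of a path by
   splitting on the '/' byte only.  No character set knowledge is
   needed, because UTF-8 never uses bytes below 0x80 inside a
   multi-byte sequence. */
size_t count_components(const char *path)
{
    size_t n = 0;
    int in_component = 0;

    for (const unsigned char *p = (const unsigned char *)path; *p; p++) {
        if (*p == '/') {
            in_component = 0;         /* separator ends the component */
        } else if (!in_component) {
            in_component = 1;         /* first byte of a new component */
            n++;
        }
    }
    return n;
}
```

For a path such as "/home/" followed by a Cyrillic directory name and
"/a.txt", this counts three components, exactly as it would for an
ISO-8859-1 name of the same byte length.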

Using normalized forms would then simply be up to the writer and
reader, just as it is up to the writer and reader today to check for
all of "Music", "music", "mUSIC" and similar when a user actually
searches for his music directory.  Of course, going a step further and
doing this in libdiskfs or wherever is nice as well.

Users expect to be able to use their own characters.  The character
set is only an implementation detail, and whoever cares about it is
not a regular user, but a technical computer user (most commonly a
programmer).  UTF-8 in that sense simplifies the implementation,
instead of complicating it (as you seem to be suggesting), and it
further improves portability.

Of course, UTF-8 is no hammer for every nail, as you put it, but it's 
clearly an improvement over any 8-bit character set in the POSIX world.

Well, this is just my opinion at least :)

