
Re: console translator set without encoding

At 21 Jan 2005 19:31:13 -0800,
Thomas Bushnell BSG wrote:
> Marcus Brinkmann <marcus.brinkmann@ruhr-uni-bochum.de> writes:
> > UTF-8 is an insanely complex standard, if you start to look down its
> > depths.  
> UTF-8 is a complex standard.  It is not insanely so.  It is complex
> because it is representing a very complex problem.  

Oh, sure.  The insanity starts if you talk about using "UTF-8" for
things like filenames without being very exact in what you mean by
that.  The implications of putting the complex system UTF-8 into
POSIX-like operating systems as they exist today are not well
understood, and the resulting loose ends, conflicts, etc. are not
resolved as of today.

So, the phrase "do the right thing with UTF-8" is subject to
substantial interpretation.  My summary was intended to show that
given today's understanding of the above situation, I believe we do the
"right thing with UTF-8".  More specifically (and please also see the
quote below), we only support specific scripts at Unicode Level 1
(ISO 10646-1).

I don't think we disagree, and I am not really ranting, so there is
not much left to say I guess.  But just to be clear: I am just as much
as any geek gung ho about seeing Tibetan quotations in a Russian mail
about some math problems that's in my inbox along with the Korean spam
- and everything out of the box on the text console.  The essence of
what I wrote is just that neither is UTF-8 the hammer for every nail
(you will always find people who feel their script is misrepresented
in Unicode), nor is it really clear what practical UTF-8 support means
nowadays.  To some substantial amount, it is still experimental and
work in progress.

People are working on it of course, and if POSIX demands that file
name lookups are done by comparing the Normalization Form C of each
string, we should and will implement this in libdiskfs etc.  We should
walk this march in lock-step with the rest of the world, and let them
do the work for us figuring out what needs to be done.  No more and no
less, I think.
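
To make the Normalization Form C point concrete, here is a minimal
sketch in Python.  It is only an illustration, not code from libdiskfs
or any POSIX text: the helper name same_name_nfc and the sample names
are made up for this example.  It shows two spellings of the same
visible name that differ byte-for-byte in UTF-8 but compare equal once
both are normalized to NFC.

```python
import unicodedata


def same_name_nfc(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC.

    Hypothetical helper, for illustration only.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two ways to spell the same visible name:
precomposed = "caf\u00e9"   # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # plain 'e' followed by COMBINING ACUTE ACCENT

# A lookup that compares raw UTF-8 bytes treats them as different names.
bytewise_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# A lookup that compares the NFC form of each string treats them as equal.
nfc_equal = same_name_nfc(precomposed, decomposed)

print("bytewise equal:", bytewise_equal)
print("NFC equal:", nfc_equal)
```

Whether a file system should do this at lookup time, at creation time,
or not at all is exactly the kind of question that is still open.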

The UTF-8 and Unicode FAQ for Unix/Linux can be found here:


One paragraph is particularly interesting:

"Full Unicode functionality with all bells and whistles
(e.g. high-quality typesetting of the Arabic and Indic scripts) can
only be expected from sophisticated multi-lingual word-processing
packages. What Linux supports today on a broad base is far simpler and
mainly aimed at replacing the old 8- and 16-bit character sets. Linux
terminal emulators and command line tools usually only support a Level
1 implementation of ISO 10646-1 (no combining characters), and only
scripts such as Latin, Greek, Cyrillic, Armenian, Georgian, CJK, and
many scientific symbols are supported that need no further processing
support. At this level, UCS support is very comparable to ISO 8859
support and the only significant difference is that we have now
thousands of different characters available, that characters can be
represented by multibyte sequences, and that ideographic
Chinese/Japanese/Korean characters require two terminal character
positions (double-width)."

We don't have support for ideographic CJK characters.  I didn't know
how to implement that and thought it would be better left to somebody
who actually writes such things (I still try to write my ideographic
CJK characters with a calligraphy brush).  But apart from that, this is
about the level at which we support things, and I did that pretty
purposefully.
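
For anyone who wants to look at the double-width part, a rough sketch
of the column accounting the FAQ paragraph describes might look like
this in Python.  It is not how the console is implemented; the function
name display_columns is invented here, and the rule (two cells for East
Asian wide or fullwidth characters, no cell for combining marks, one
cell for everything else) is a simplification that ignores control
characters and other special cases.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimate how many terminal cells a string occupies.

    Hypothetical helper: a simplified rule, not a full wcwidth().
    """
    cols = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks attach to the previous character and take
            # no cell of their own.  A Level 1 implementation does not
            # handle them at all.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cols += 2   # ideographic CJK and fullwidth forms
        else:
            cols += 1
    return cols


sample = "\u65e5\u672c\u8a9e"   # three CJK ideographs
print(len(sample), "characters,",
      len(sample.encode("utf-8")), "bytes in UTF-8,",
      display_columns(sample), "columns")
```

The point is that character count, byte count and column count are
three different numbers, and a terminal has to track all of them.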

Take it easy ;)

