
Re: Make Unicode bugs release critical?



On Fri, Feb 11, 2011 at 12:59:46PM +0100, Klaus Ethgen wrote:
> Am Fr den 11. Feb 2011 um 10:37 schrieb Lars Wirzenius:
> > The first Unicode standard was published in 1991. That's twenty years
> > ago. Any software that processes text at all and is incapable of dealing
> > with UTF-8 should be considered with extreme suspicion. Making all such
> > bugs be release critical (which includes the notion that release
> > managers may ignore the bug in particular cases) sounds like a good way
> > to get things under control.
> 
> I think you are mixing stuff together. First there is unicode. There are
> several definitions for unicode (unicode-16, unicode-32, ...) but UTF-8
> is not unicode it is just one implementation of unicode and in my eyes
> the most problematic as it has undefined states and is variable length.

There is just one definition of Unicode; new versions merely add extra
characters, collation rules, and so on.

There are several ways to represent Unicode as a stream of bytes.  Only one
of them is fit for external storage, and that's UTF-8 since it doesn't break
the assumptions that are true for text files:
1. no null bytes
2. newlines and other ASCII characters are always themselves, never part
   of a bigger character (not true for some ancient multibyte encodings)
3. not affected by endianness or any other internal detail
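Those three properties are easy to check mechanically.  A minimal sketch
(the sample string is arbitrary):

```python
# Sketch: UTF-8 keeps the classic text-file assumptions intact.
# Every lead and continuation byte of a multibyte sequence is >= 0x80,
# so no ASCII byte (NUL, newline, '/') ever appears inside one.
text = "naïve żółć\nsecond line\n"
data = text.encode("utf-8")

assert b"\x00" not in data                     # 1. no null bytes
assert data.count(b"\n") == text.count("\n")   # 2. 0x0A only where the
                                               #    text has a newline
assert data.decode("utf-8") == text            # 3. round-trips with no
                                               #    BOM or byte order
```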

Also, _all_ Unicode encodings are of variable length.
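A quick sketch of that point: bytes per code point vary in UTF-8 and
UTF-16, and even UTF-32's fixed-size code points don't give you
fixed-size characters, since one on-screen character can be a sequence
of code points (the composed/decomposed "é" is a standard example):

```python
import unicodedata

# Bytes per code point vary in UTF-8 and UTF-16:
assert len("a".encode("utf-8")) == 1
assert len("ż".encode("utf-8")) == 2
assert len("a".encode("utf-16-le")) == 2
assert len("\U0001D11E".encode("utf-16-le")) == 4   # surrogate pair

# Even UTF-32 is variable length per *character*: one accented letter
# can decompose into two code points, i.e. two 32-bit units.
decomposed = unicodedata.normalize("NFD", "é")      # 'e' + combining acute
assert len(decomposed) == 2
assert len(decomposed.encode("utf-32-le")) == 8
```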

> However, UTF-8 was created to allow using unicode in non-unicode
> environments. For me that was always a pointless plan and the unreadable
> UTF-8 characters all around buggy software that cannot handle encodings
> correct (and there are many around) and ignorant users who are using
> UTF-8 in environments that are not specified for multibyte charsets
> (IRC) is the most annoying one.

UTF-8 was never meant as merely a tool to "allow using unicode in
non-unicode environments".

UTF-32 is useful only as an internal representation, and only if you
care about a string of code points.  Since a single character can
consist of multiple such code points, it doesn't give you much unless
you have to pass every code point through a function like wcwidth() --
ie, you are implementing something low-level that cares about the
properties of characters and their parts.  You should never write
UTF-32 to external storage unless that storage is private to your
program and can never be moved to another machine.
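To illustrate what that low-level, per-code-point work looks like, here
is a sketch using Python's unicodedata as a rough stand-in for the
wcwidth()-style property lookups mentioned above (a Python str already
behaves as the sequence of code points UTF-32 would give you):

```python
import unicodedata

# Walk a string code point by code point, asking about each one's
# properties -- the one job a code-point representation is good for.
s = "e\u0301"                      # 'e' + COMBINING ACUTE ACCENT
categories = [unicodedata.category(cp) for cp in s]
assert categories == ["Ll", "Mn"]  # a letter, then a combining mark

# The combining mark occupies no display column of its own, which is
# why a per-code-point pass is needed at this level at all:
assert unicodedata.combining("\u0301") != 0
```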

UTF-16 is never, ever useful.  It is a sad trap for win32 and Java
developers, due to a bad engineering decision suggested, as I was told,
by delegates from Microsoft and Sun, who wanted to "conserve disk space
and memory" by storing code points and a language tag separately -- ie,
exactly the thing Unicode was supposed to rid us of.  Even on day one
it was known that you can't fit all characters into 16 bits, and the
decision to put all "rare characters" into a "private" area that needs
out-of-band information was pretty ridiculous.  The end result is an
encoding with all the downsides of UTF-8 but none of the advantages.
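To make that concrete, a minimal sketch (the G clef is just a
convenient non-BMP character):

```python
# UTF-16 keeps the null-byte and endianness problems UTF-8 avoids...
assert "A".encode("utf-16-le") == b"A\x00"       # NUL even for ASCII
assert "A".encode("utf-16-le") != "A".encode("utf-16-be")

# ...yet is still variable length: anything outside the BMP takes a
# surrogate pair, ie two 16-bit units:
clef = "\U0001D11E"                              # MUSICAL SYMBOL G CLEF
assert len(clef.encode("utf-16-le")) == 4
```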

Since neither UTF-16 nor UTF-32 can be considered text, all UNIX
systems decided to use UTF-8 in the libc API in all Unicode locales.
Otherwise you'd need parallel APIs like FooBarA()/FooBarW() on Windows,
which cause no end of problems.
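The reason a single byte-oriented API suffices is a deliberate UTF-8
property: bytes below 0x80 never occur inside a multibyte sequence, so
existing code that scans for ASCII delimiters such as '/' keeps working
on UTF-8 data unchanged.  A sketch (the file name is made up):

```python
# '/' (0x2F) can never be a continuation byte in UTF-8, so naive
# byte-level path splitting stays correct on UTF-8 file names:
path = "dir/żółć.txt".encode("utf-8")
head, _, tail = path.rpartition(b"/")
assert head == b"dir"
assert tail.decode("utf-8") == "żółć.txt"
```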

> So specifying to be UTF-8 capable is somewhat inconsequent. Software has
> to be capable to handle every encoding as long as they are specified for
> that encodings.

No, there is only one encoding left, as long as you don't have to talk
to Windows.  We can start purging all the support for ancient charsets
from places that do not need to handle foreign data.  Debian has used
UTF-8 as the default for five releases already; if you try to use an
ancient locale, do not expect good results, since no one bothers fixing
bugs there.  Maintaining unused code costs time and carries a risk of
bugs, so good riddance!

-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.

