
Bug#99324: Default charset should be UTF-8



On Thu, May 31, 2001 at 09:06:10PM +0200, Christian Kurz wrote:
> On 01-05-30 Cesar Eduardo Barros wrote:
> > Package: debian-policy
> > Version: 3.5.4.0
> > Severity: wishlist
> > 
> > I think Debian should start to move into using UTF-8 by default everywhere.
> 
> May I ask why we want to choose UTF-8 instead of UTF-5 or UTF-16? And
> why should we exactly switch to Unicode? How many real world system or
> applications currently support and/or use unicode?

Ask the IETF. They seem to like UTF-8 a lot.

Ask Linus too. UTF-8 support has been in the kernel since, what, 2.0.x?

Seriously, though:

- UTF-8 preserves the meaning of the whole 0-127 range, so there are no
  problems with special chars and escapes (as long as you use only ASCII you
  don't even notice it's there). The other UTF encodings weren't designed for
  that. The RFC which describes UTF-8 (RFC 2279) says:

  # UTF-8, the object of this memo, has
  # the characteristic of preserving the full US-ASCII range, providing
  # compatibility with file systems, parsers and other software that rely
  # on US-ASCII values but are transparent to other values.

- UTF-8 allows the full range of Unicode (which means that any charset should
  be a subset of it; all the other UTFs do that too).
- Where did you find that UTF-5? I only knew of UTF-7, UTF-8 and UTF-16...
  UTF-7 is a standard which doesn't use the high bit, so if an app isn't aware
  of it, it might misinterpret the high characters as a string of normal chars.
  UTF-16, AFAIK, is a standard for encoding the 32-bit UCS-4 in 16-bit words.
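
To make the points above concrete, here is a rough Python sketch that encodes
the same strings in the three encodings. The sample strings and the helper
name high_bit_bytes are my own illustration, not anything from the RFC; it
only assumes the codecs that ship with the Python standard library.

```python
# Illustrative sketch: how UTF-8, UTF-16 and UTF-7 treat the same text.
# The sample strings and high_bit_bytes() are made up for this example.

def high_bit_bytes(data: bytes) -> int:
    """Count the bytes that have the high bit (0x80) set."""
    return sum(1 for b in data if b & 0x80)

ascii_text = "/usr/share/doc"   # pure ASCII
mixed_text = "S\u00e3o Paulo"   # contains U+00E3 (a with tilde)

# Pure ASCII is byte-for-byte identical when encoded as UTF-8.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# In UTF-8 a non-ASCII character becomes a run of bytes that all have the
# high bit set, so an ASCII-only parser never sees a stray '/' or NUL.
utf8 = mixed_text.encode("utf-8")
print(utf8)                    # b'S\xc3\xa3o Paulo'
print(high_bit_bytes(utf8))    # 2

# UTF-16 works in 16-bit units, so even ASCII text picks up zero bytes...
utf16 = ascii_text.encode("utf-16-be")
print(len(utf16), utf16.count(0))   # 28 14

# ...and code points above U+FFFF need two 16-bit units (a surrogate pair).
print("\U00010000".encode("utf-16-be"))   # b'\xd8\x00\xdc\x00'

# UTF-7 never sets the high bit; non-ASCII text is written as a run of
# ordinary-looking characters, which an unaware app will read literally.
utf7 = mixed_text.encode("utf-7")
print(utf7, high_bit_bytes(utf7))
```

The zero bytes in the UTF-16 output are the sort of thing that trips up C
string handling, which is the compatibility problem the RFC text is getting
at.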

As to why to switch to UTF-8: first, it's only a default (Debian is all about
choice) and a standard charset for /usr/share/doc to be in. Second, there's a
general tendency to move away from having to switch charsets all the time and
towards using Unicode.

The applications I know of which currently use Unicode are Mozilla (since
UTF-8 is the default charset of XML IIRC, and is used in a lot of places on
the net) and the Win32 systems (including Wine. M$ seems to have done it in a
horribly messy way, but I can't judge since I'm not a Windows programmer). I'm
sure there are many more. But support is not a problem; we have the source,
don't we? Using UTF-8 as a default is more of a long-term suggestion, to be
done in 1 or 2 Debian releases (which probably means at least four full
years =) )

-- 
Cesar Eduardo Barros
cesarb@nitnet.com.br
cesarb@dcc.ufrj.br


