Re: default character encoding for everything in debian

To: debian-devel@lists.debian.org
Subject: Re: default character encoding for everything in debian
From: Roger Leigh <rleigh@codelibre.net>
Date: Mon, 10 Aug 2009 21:04:37 +0100
Message-id: <20090810200436.GB5869@codelibre.net>
In-reply-to: <4A800D54.7050203@debian.org>
References: <200908101309.22076.thomas@koch.ro> <4A800D54.7050203@debian.org>

On Mon, Aug 10, 2009 at 02:06:44PM +0200, Giacomo A. Catenazzi wrote:
> Thomas Koch wrote:
> >I've an issue, that I forgot to set the character encoding of
> >tomcat to utf-8 after reinstalling a server.
> >Now, before I report a wishlist(?) bug to tomcat, I want to ask
> >(and invite to discuss) shouldn't utf8 be the default character
> >set everywhere? So when installing a package from Debian I can
> >assume that where a character encoding can be set, it's set to
> >utf8.
> >MySQL would be another example, which to my knowledge uses isoXYZ
> >as default character encoding.
> 
> Future debian systems will have a UTF-8 charset as default.
> Look at debian-policy archives.

For system users, yes, assuming you are talking about the C.UTF-8
proposal.  For normal users, UTF-8 has been the default since
Lenny.

If system services need a C.UTF-8 locale to be always available in
order to fully support UTF-8, then that locale needs adding to glibc.
For a locale that only has to be available once /usr is mounted, a
simple localedef invocation is all that's needed; for one available at
all times, i.e. from the moment init starts, the tables need compiling
into glibc, as is done for the standard C locale.  I've been looking at
how to do the latter, but I'm not an expert on the "3-level" locale
tables and other glibc internals, so if anyone who knows the details of
glibc locales could provide me with assistance/guidance here, that
would be much appreciated.
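
As a rough illustration of what "available" means from a program's
point of view, the following Python sketch asks the C library whether
it accepts a given locale name.  The helper name locale_available is
made up for this example, and the result depends entirely on which
locales the running system provides.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if setlocale() accepts `name` for LC_CTYPE.

    Hypothetical helper for illustration only; the previous LC_CTYPE
    setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    # "C" always exists; whether "C.UTF-8" does is system-dependent.
    for candidate in ("C", "C.UTF-8"):
        print(candidate, locale_available(candidate))
```

A service started before the locale data is reachable would see the
second check fail, which is the case the compiled-in tables are meant
to cover.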

For reference, this is bug #522776.  This would be great to have as a
release goal for Squeeze, and (speculatively) a native C UTF-8 locale
for Squeeze+1 to give us a default pure UTF-8 system from end-to-end.

> A lot of debian files will be encoded in utf-8 (control, changelog
> and manpages), and converted to the needed charset at runtime.

I think "will" here implies it's something to be done in the future,
but it's a requirement right now, and all but a few exceptions are
already converted.

> But for databases there are different issues. I think the best solution
> is to do it as mediawiki does: the UTF-8 data is put in as a binary
> blob: it is difficult to keep database engines and system libraries
> synchronized, and it is also difficult to implement support for all
> Unicode characters.

PostgreSQL seems to manage it without problems.  Putting text in as a
binary blob defeats most of the point of having it in a database in the
first place.  Sorting, indexing and querying require the database to be
able to read it!
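
To make the difference concrete, here is a minimal sketch that stores
the same word once as text and once as a blob.  It uses SQLite through
Python's sqlite3 module purely because it is self-contained; it is a
stand-in, not a statement about how MySQL or PostgreSQL behave, and the
table and function names are invented for the example.

```python
import sqlite3


def text_vs_blob_lengths(word: str) -> tuple:
    """Store `word` as TEXT and as a UTF-8 BLOB, return both lengths."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE demo (as_text TEXT, as_blob BLOB)")
        conn.execute(
            "INSERT INTO demo VALUES (?, ?)",
            (word, word.encode("utf-8")),  # str -> TEXT, bytes -> BLOB
        )
        row = conn.execute(
            "SELECT length(as_text), length(as_blob) FROM demo"
        ).fetchone()
        return row
    finally:
        conn.close()


if __name__ == "__main__":
    # "z\u00fcrich" is "zürich": six characters, seven bytes in UTF-8.
    print(text_vs_blob_lengths("z\u00fcrich"))
```

For the text column the engine counts characters; for the blob it can
only count bytes, because it has no idea the bytes are UTF-8.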

Note that there are separate client and server (database) encodings for
text as well.  You may well get recoding between what the user sees and
what's actually stored in the database, potentially at several points.
Having UTF-8 on the server does not require it on the client (and vice
versa).
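
A small sketch of such a recoding step, assuming a server that stores
UTF-8 and a client that wants ISO-8859-1.  The recode function is a
hypothetical stand-in for whatever the database driver does at the
connection boundary, not any particular database's API.

```python
def recode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source_encoding).encode(target_encoding)


# What the server stores: "café" encoded as UTF-8 (5 bytes).
stored = "caf\u00e9".encode("utf-8")

# What a Latin-1 client receives after recoding (4 bytes).
sent = recode(stored, "utf-8", "iso-8859-1")

# The round trip back to the server encoding is lossless here.
assert recode(sent, "iso-8859-1", "utf-8") == stored
```

The conversion only works for characters the client encoding can
represent; a euro sign, for instance, would raise an error when encoded
as ISO-8859-1, which is one reason to prefer UTF-8 on the server side.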

> But let us concentrate on the first task: having good UTF-8 support
> in all programs/terminals/etc.

I think that part was already done quite some time ago.  Any program
that doesn't support UTF-8 is an exception, and should be fixed or
removed.

For the specific case of databases, what's being proposed here is
making the default UTF-8.  Existing databases should not be affected,
since they would retain their current encoding.  New databases should,
however, use UTF-8.  If a specific application needs a specific
encoding in order to function correctly, then it's that application's
responsibility to specify it when creating the database, i.e. to
override the default.  If it doesn't do that already, it's already
broken, since the encoding it gets is currently unspecified.
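
As a sketch of what "specify it when creating the database" could look
like, the following hypothetical helper builds a PostgreSQL-style
CREATE DATABASE statement that defaults to UTF8 unless the application
asks for something else.  The function name, the validation rules and
the use of template0 are assumptions made for this example; a real
setup may also need matching locale settings.

```python
import re

DEFAULT_ENCODING = "UTF8"  # the proposed default for new databases


def create_database_sql(name: str, encoding: str = DEFAULT_ENCODING) -> str:
    """Build a CREATE DATABASE statement with an explicit encoding.

    Illustrative only: names are restricted to simple identifiers so
    they can be interpolated into the statement safely.
    """
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        raise ValueError(f"unsupported database name: {name!r}")
    if not re.fullmatch(r"[A-Za-z0-9_]+", encoding):
        raise ValueError(f"unsupported encoding name: {encoding!r}")
    return (
        f"CREATE DATABASE {name} "
        f"WITH ENCODING '{encoding}' TEMPLATE template0"
    )


if __name__ == "__main__":
    print(create_database_sql("wiki"))              # takes the default
    print(create_database_sql("legacy", "LATIN1"))  # explicit override
```

An application that always passes its required encoding keeps working
no matter what the packaged default is.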

Regards,
Roger
-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
