Re: default character encoding for everything in debian

To: debian-devel@lists.debian.org
Subject: Re: default character encoding for everything in debian
From: Adam Borowski <kilobyte@angband.pl>
Date: Tue, 11 Aug 2009 23:34:39 +0200
Message-id: <20090811213439.GA1310@angband.pl>
In-reply-to: <20090810200436.GB5869@codelibre.net>
References: <200908101309.22076.thomas@koch.ro> <4A800D54.7050203@debian.org> <20090810200436.GB5869@codelibre.net>

On Mon, Aug 10, 2009 at 09:04:37PM +0100, Roger Leigh wrote:
> If having a C.UTF-8 locale always available for system services is
> required for them to fully support UTF-8, then that needs adding to
> glibc.

It would also bring a significant speed increase.  Since just about everything
calls setlocale(), having the locale built in speeds up the typical process
startup sequence by 20%!  And that's 20% of the whole thing from fork(),
through linking, up to getopt(), so it's not a speedup to sneeze at.
I'm speaking about having the locale supported natively by glibc, of course;
what the udeb does is merely ship a generated locale file.
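
A rough sketch of how one might compare a built-in locale with one loaded
from disk.  It uses Python's locale module as a stand-in for a C program
calling setlocale(); the helper name time_first_setlocale is made up for
this example.  It assumes that only the first call for a given locale
touches the disk, so each measurement should be run in a fresh process, and
it only times the setlocale() call itself, not the whole startup sequence.

```python
import locale
import time


def time_first_setlocale(name):
    """Time one setlocale(LC_ALL, name) call in seconds.

    Returns None if the locale is not available on this system.
    Hypothetical helper for illustration; run it once per fresh process,
    since later calls for the same locale may be served from memory.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        start = time.perf_counter()
        locale.setlocale(locale.LC_ALL, name)
        return time.perf_counter() - start
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous setting


if __name__ == "__main__":
    # "C" is built into the C library; the other names are only examples
    # and may not exist on a given machine.
    for name in ("C", "C.UTF-8", "en_US.UTF-8"):
        print(name, time_first_setlocale(name))
```

The absolute numbers will be dominated by interpreter overhead, so only the
relative difference between "C" and a disk-loaded locale is of interest.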

> For a locale available after /usr is mounted, a simple localedef
> invocation is all that's needed; for all times, after starting init,
> it needs the tables compiling into glibc as for the standard C locale.
> I've been looking at how to do the latter, but I'm not expert with the
> "3-level" locale tables and other glibc internals, so if anyone who
> knows the details of glibc locales could provide me with
> assistance/guidance here, that would be much appreciated.
> 
> For reference, this is bug #522776.  This would be great to have as a
> release goal for Squeeze, and (speculatively) a native C UTF-8 locale
> for Squeeze+1 to give us a default pure UTF-8 system from end-to-end.

I'm not an expert on glibc internals either, but a couple of years ago I
researched the issue a bit.  Apparently, there are only two first-class
locales, C and POSIX; all others get loaded from disk.  In the past,
en_US.ISO-8859-1 and ru_RU.KOI8-R were first-class ones as well, but
that's no longer the case.  What I'd propose is making C.UTF-8 built in.

Another possible optimization would be building the table used by the 8-bit
isalpha() etc. on the fly for all locales.  Iconving 128 characters is
certainly faster than opening a file on disk, and (sanely) glibc doesn't
support character classifications that contradict Unicode, so this could
result in completely nuking all LC_CTYPE files for other locales as well.
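
To illustrate the idea, here is a minimal sketch in Python rather than glibc
code: it decodes each byte of an 8-bit charset to Unicode and classifies the
result, which is roughly what building the isalpha table on the fly would
mean.  The function name build_isalpha_table is made up for this example,
Python's codecs stand in for iconv, and Python's str.isalpha() stands in for
the Unicode-derived classification; glibc's own tables may differ in detail.
For simplicity it fills all 256 entries, although only the upper 128 depend
on the charset.

```python
def build_isalpha_table(charset):
    """Return a 256-entry list: table[b] is True if byte b is alphabetic.

    Hypothetical helper for illustration only.  Bytes that the charset
    leaves unassigned are treated as non-alphabetic.
    """
    table = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            table.append(False)  # unassigned or invalid single byte
            continue
        table.append(ch.isalpha())  # classification comes from Unicode
    return table


koi8r = build_isalpha_table("koi8_r")
latin1 = build_isalpha_table("iso8859_1")

# 0xC1 is CYRILLIC SMALL LETTER A in KOI8-R; 0x80 is a box-drawing character.
print(koi8r[0xC1], koi8r[0x80])
# 0xE9 is e-acute in ISO-8859-1; 0xD7 is the multiplication sign.
print(latin1[0xE9], latin1[0xD7])
```

The same loop works for any single-byte charset the codec library knows
about, which is why no per-locale LC_CTYPE data would be needed for this
particular table.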

-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
