Re: default character encoding for everything in debian

On Mon, Aug 10, 2009 at 09:04:37PM +0100, Roger Leigh wrote:
> If having a C.UTF-8 locale always available for system services is
> required for them to fully support UTF-8, then that needs adding to
> glibc.

It would also bring a significant speed increase.  Since just about everything
calls setlocale(), having the locale built in speeds up the typical process
startup sequence by 20%!  And that's 20% of the whole thing, from fork()
through dynamic linking up to getopt(), so it's not a speedup to sneeze at.
I'm speaking about having the locale supported natively by glibc, of course;
all the udeb does is ship a generated locale file.
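
For anyone who wants a rough feel for the cost, here is a minimal sketch that
times a single setlocale() call from Python.  The helper name time_setlocale
is made up for illustration, and this is not how the 20% figure above was
measured: it times one in-process call rather than the whole fork-to-getopt()
startup, and repeat calls may be served from caches, so treat the numbers as
indicative only.

```python
import locale
import time


def time_setlocale(name):
    """Time one setlocale(LC_CTYPE, name) call.

    Returns the elapsed seconds, or None if the locale is not available.
    The previous LC_CTYPE setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    start = time.perf_counter()
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return time.perf_counter() - start
    except locale.Error:
        # The locale is not installed or not generated on this system.
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    # "C" is built in; "C.UTF-8" may or may not exist on a given system.
    for name in ("C", "C.UTF-8"):
        print(name, time_setlocale(name))
```

A fairer comparison would start a fresh process per locale, so that each
measurement includes any loading from disk.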

> For a locale available after /usr is mounted, a simple localedef
> invocation is all that's needed; for all times, after starting init,
> it needs the tables compiling into glibc as for the standard C locale.
> I've been looking at how to do the latter, but I'm not expert with the
> "3-level" locale tables and other glibc internals, so if anyone who
> knows the details of glibc locales could provide me with
> assistance/guidance here, that would be much appreciated.
> For reference, this is bug #522776.  This would be great to have as a
> release goal for Squeeze, and (speculatively) a native C UTF-8 locale
> for Squeeze+1 to give us a default pure UTF-8 system from end-to-end.

I'm not an expert on glibc internals either, but a couple of years ago I
researched the issue a bit.  Apparently, there are only two first-class
locales, C and POSIX; all others get loaded from disk.  In the past,
en_US.ISO-8859-1 and ru_RU.KOI8-R were first-class as well, but
that's no longer the case.  What I'd propose is making C.UTF-8 built in.

Another possible optimization would be building the table used by the 8-bit
isalpha() and friends on the fly for all locales.  Running 128 characters
through iconv is certainly faster than opening a file on disk, and (sanely)
glibc doesn't support character classification that contradicts Unicode, so
this could let us completely nuke the LC_CTYPE files for other locales as well.
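
To illustrate the on-the-fly table idea, here is a rough sketch in Python
rather than glibc's C.  It converts every byte value of an 8-bit charset to
Unicode and asks the Unicode database whether the result is alphabetic.  The
function name build_ctype_table is hypothetical, Python's codecs stand in for
iconv, and str.isalpha() stands in for glibc's own Unicode classification, so
this only shows the shape of the approach.

```python
def build_ctype_table(charset):
    """Return a 256-entry list: entry b is True if byte b is alphabetic.

    Each byte is decoded through the given 8-bit charset and classified
    by its Unicode properties.  Bytes the charset leaves undefined are
    treated as non-alphabetic.
    """
    table = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            table.append(False)
            continue
        table.append(ch.isalpha())
    return table


if __name__ == "__main__":
    koi8 = build_ctype_table("koi8_r")
    # 0xC1 is CYRILLIC SMALL LETTER A in KOI8-R; 0x80 is a box-drawing char.
    print(koi8[0xC1], koi8[0x80])
```

Only the upper 128 entries differ between ASCII-compatible charsets, which is
why converting 128 characters is enough in practice; the sketch fills all 256
for simplicity.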

1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
