
Re: Squeeze can't fit on 512MiB



On Fri, Oct 29, 2010 at 02:09:32PM +0100, Roger Leigh wrote:
> On Fri, Oct 29, 2010 at 11:36:59AM +0200, Adam Borowski wrote:
> > 
> > I really wonder why you still need to install "locales" to get UTF-8.  Even
> > in current glibc, it's a second-class citizen.  Several years ago, I
> > benchmarked a mockup of hard-coding UTF-8 the way ISO-8859-1 and KOI8-R
> > were done in the past, and it shaved 20% off the whole
> > fork-exec-ld-setlocale-getopt-...-exit sequence almost every program does.
> > The character classification tables are needlessly duplicated for every
> > locale as well -- try an ISO-8859-1 locale and look at iswfoo() for chars
> > above 0xFF: even though there's a separate copy per locale, it's identical
> > for all but C and POSIX.
> 
> #522776 has quite a bit of information about basic UTF-8 support without
> locales (creation of C.UTF-8).

C.UTF-8 would carry another copy of that big table and provide no
performance benefits, but indeed, having a guaranteed UTF-8 locale would be
really, really useful.

I've read #522776 and it provides compelling reasons to add C.UTF-8 right
now, for squeeze -- we can discuss better implementations later.
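
For what it's worth, the appeal is that any program could then request
UTF-8 semantics unconditionally.  A minimal sketch, assuming C.UTF-8
actually exists on the system (which is the whole point) -- not code
from the bug report:

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* With a guaranteed C.UTF-8, the fallback branch would never
           trigger; today it does on most installs. */
        if (!setlocale(LC_CTYPE, "C.UTF-8"))
            setlocale(LC_CTYPE, "C");
        printf("LC_CTYPE: %s\n", setlocale(LC_CTYPE, NULL));
        return 0;
    }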

> From the end of the report, there was talk of getting C.UTF-8 into
> squeeze, but I'm not sure what the status of that work is at present (it's
> a trivial glibc tweak to generate and package the additional locale).

Especially since it's already been done for a udeb.

> Do you still have your patch for hard-coding UTF-8?  I did start doing this,
> but didn't get as far as having a working locale.  It might be a good
> starting point if it still works with current glibc.

1. It was several major glibc versions ago.
2. It was merely a mockup, not proper code.  These were stubs like:
     if (ch < 128)
         return value_for_C(ch);   /* ASCII: same data as the C locale */
     else
         return 0;                 /* everything above ASCII unhandled */
I assumed that making the library bigger by a large table in its data
segment would not make a noticeable difference in speed, as it's merely
mmapping a bigger chunk without even a single additional syscall.  (A
slightly fuller sketch follows after this list.)

3. I did not investigate anything but character classification.  I suspect
uppercasing would work, but I didn't test that.
4. It broke legacy locales.
5. I don't seem to have that anymore, just some test programs for character
classes.

> I agree the duplication of character tables in glibc is totally insane; a
> single copy of each character set is more than plenty, and having both
> ASCII and UTF-8 hard-coded into glibc would be a major performance
> improvement, though it would require eliminating the duplication on locale
> loading.  Having the entire UTF-8 table duplicated for each different
> locale you use is just mad.

At least in that version (unstable in July 2006), all wctype() functions
returned the same values for all loadable locales.  The two hardcoded ones,
C and POSIX, had data for characters 0..127 only, and were the only ones
that differed.  The only function I found that was actually locale-dependent
was wcwidth().
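
A reconstruction of the kind of test program I mean (the locale names
are assumptions -- substitute whatever `locale -a' shows as installed):

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void)
    {
        /* U+03B1 GREEK SMALL LETTER ALPHA: above 0xFF, so outside the
           range the hardcoded C/POSIX tables cover. */
        const char *locs[] = { "C", "en_US.UTF-8", "ru_RU.KOI8-R" };
        for (int i = 0; i < 3; i++) {
            if (!setlocale(LC_CTYPE, locs[i])) {
                fprintf(stderr, "%s: not installed\n", locs[i]);
                continue;
            }
            printf("%-14s iswalpha(U+03B1) = %d\n",
                   locs[i], iswalpha(0x3B1) != 0);
        }
        return 0;
    }

If the above still holds, both loadable locales print 1 and C prints 0.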

For 8-bit classification routines, legacy locales would need to iconv at
most 128 characters -- the API can't support multibyte CJK encodings
anyway.  That's still a lot faster than opening a file.
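
Something along these lines (a sketch only; the charset name and error
handling are illustrative) would do the conversion once per setlocale():

    #include <iconv.h>
    #include <wchar.h>

    /* One-shot byte -> Unicode table for the upper half of an 8-bit
       charset; the 8-bit is*() routines could then be answered from
       the shared Unicode classification data. */
    static wchar_t high_half[128];

    static int build_high_half(const char *charset)
    {
        iconv_t cd = iconv_open("WCHAR_T", charset);
        if (cd == (iconv_t)-1)
            return -1;
        for (int i = 0; i < 128; i++) {
            unsigned char byte = 0x80 + i;
            char *in = (char *)&byte, *out = (char *)&high_half[i];
            size_t inleft = 1, outleft = sizeof(wchar_t);
            if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
                high_half[i] = 0;   /* byte unmapped in this charset */
        }
        iconv_close(cd);
        return 0;
    }

E.g. build_high_half("KOI8-R") when the locale is set up, and the table
never needs touching again.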


Meow?
-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.

