
Re: Squeeze can't fit on 512MiB



On Fri, Oct 29, 2010 at 02:09:32PM +0100, Roger Leigh wrote:
> On Fri, Oct 29, 2010 at 11:36:59AM +0200, Adam Borowski wrote:
> > 
> > I really wonder why you still need to install "locales" to get UTF-8.  Even
> > in current glibc, it's a second-class citizen.  Several years ago, I
> > benchmarked a mockup of hard-coding UTF-8 the way ISO-8859-1 and KOI8-R
> > were done in the past, and it shaved 20% off the whole
> > fork-exec-ld-setlocale-getopt-...-exit sequence almost every program does.
> > The character classification tables are needlessly duplicated for every
> > locale as well -- try an ISO-8859-1 locale and look at iswfoo() for chars
> > above 0xFF: even though there's a separate copy per locale, it's identical
> > for all but C and POSIX.
> 
> #522776 has quite a bit of information about basic UTF-8 support without
> locales (creation of C.UTF-8).

C.UTF-8 would carry another copy of that big table and provide no
performance benefits, but indeed, having a guaranteed UTF-8 locale would be
really, really useful.

I've read #522776 and it provides compelling reasons to add C.UTF-8 right
now, for squeeze -- we can discuss better implementations later.
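
For what it's worth, the appeal is that any program could then request
UTF-8 semantics unconditionally.  A minimal sketch, assuming C.UTF-8
actually exists on the system (which is the whole point) -- not code
from the bug report:

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* With a guaranteed C.UTF-8, the fallback branch would never
           trigger; today it does on most installs. */
        if (!setlocale(LC_CTYPE, "C.UTF-8"))
            setlocale(LC_CTYPE, "C");
        printf("LC_CTYPE: %s\n", setlocale(LC_CTYPE, NULL));
        return 0;
    }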

> From the end of the report, there was talk of getting C.UTF-8 into
> squeeze, but I'm not sure what the status of that work is at present (it's
> a trivial glibc tweak to generate and package the additional locale).

Especially since it's already been done for a udeb.

> Do you still have your patch for hard-coding UTF-8?  I did start doing this,
> but didn't get as far as having a working locale.  It might be a good
> starting point if it still works with current glibc.

1. It was several major glibc versions ago.
2. It was merely a mockup, not proper code.  These were stubs like:
     if (ch < 128)
         return value_for_C(ch);   /* ASCII: same data as the C locale */
     else
         return 0;                 /* everything above ASCII unhandled */
I assumed that making the library bigger by a large table in its data
segment would not make a noticeable difference in speed, as it's merely
mmapping a bigger chunk without even a single additional syscall.  (A
slightly fuller sketch follows after this list.)

3. I did not investigate anything but character classification.  I suspect
uppercasing would work, but I didn't test that.
4. It broke legacy locales.
5. I don't seem to have that anymore, just some test programs for character
classes.

> I agree the duplication of character tables in glibc is totally insane; a
> single copy of each character set is more than plenty, and having both
> ASCII and UTF-8 hard-coded into glibc would be a major performance
> improvement, though it would require eliminating the duplication on locale
> loading.  Having the entire UTF-8 table duplicated for each different
> locale you use is just mad.

At least in that version (unstable in July 2006), all wctype() functions
returned the same values for all loadable locales.  The two hardcoded ones,
C and POSIX, had data for characters 0..127 only, and were the only ones
that differed.  The only function I found that was actually locale-dependent
was wcwidth().
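
A reconstruction of the kind of test program I mean (the locale names
are assumptions -- substitute whatever `locale -a' shows as installed):

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void)
    {
        /* U+03B1 GREEK SMALL LETTER ALPHA: above 0xFF, so outside the
           range the hardcoded C/POSIX tables cover. */
        const char *locs[] = { "C", "en_US.UTF-8", "ru_RU.KOI8-R" };
        for (int i = 0; i < 3; i++) {
            if (!setlocale(LC_CTYPE, locs[i])) {
                fprintf(stderr, "%s: not installed\n", locs[i]);
                continue;
            }
            printf("%-14s iswalpha(U+03B1) = %d\n",
                   locs[i], iswalpha(0x3B1) != 0);
        }
        return 0;
    }

If the above still holds, both loadable locales print 1 and C prints 0.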

For 8-bit classification routines, legacy locales would need to iconv at
most 128 characters -- the API can't support multibyte CJK encodings
anyway.  That's still a lot faster than opening a file.
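
Something along these lines (a sketch only; the charset name and error
handling are illustrative) would do the conversion once per setlocale():

    #include <iconv.h>
    #include <wchar.h>

    /* One-shot byte -> Unicode table for the upper half of an 8-bit
       charset; the 8-bit is*() routines could then be answered from
       the shared Unicode classification data. */
    static wchar_t high_half[128];

    static int build_high_half(const char *charset)
    {
        iconv_t cd = iconv_open("WCHAR_T", charset);
        if (cd == (iconv_t)-1)
            return -1;
        for (int i = 0; i < 128; i++) {
            unsigned char byte = 0x80 + i;
            char *in = (char *)&byte, *out = (char *)&high_half[i];
            size_t inleft = 1, outleft = sizeof(wchar_t);
            if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
                high_half[i] = 0;   /* byte unmapped in this charset */
        }
        iconv_close(cd);
        return 0;
    }

E.g. build_high_half("KOI8-R") when the locale is set up, and the table
never needs touching again.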


Meow?
-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.

