Re: Bug#467249: man-db/groff and locales

To: debian-devel@lists.debian.org
Cc: 467249@bugs.debian.org, linux4michelle@freenet.de
Subject: Re: Bug#467249: man-db/groff and locales
From: Adam Borowski <kilobyte@angband.pl>
Date: Fri, 29 Feb 2008 00:32:29 +0100
Message-id: <20080228233229.GA17631@angband.pl>
Mail-followup-to: debian-devel@lists.debian.org, 467249@bugs.debian.org, linux4michelle@freenet.de
In-reply-to: <20080228221032.GA3668@crustytoothpaste.ath.cx>
References: <20080228094230.GN3053@freenet.de> <20080228202141.GA9675@angband.pl> <20080228213055.GP16526@riva.ucam.org> <20080228221032.GA3668@crustytoothpaste.ath.cx>

On Thu, Feb 28, 2008 at 10:10:32PM +0000, brian m. carlson wrote:
> On Thu, Feb 28, 2008 at 09:30:55PM +0000, Colin Watson wrote:
> >On Thu, Feb 28, 2008 at 09:21:41PM +0100, Adam Borowski wrote:
> >man-db really does have some special-casing here. Trust me. It was
> >necessary at the time. There are a finite number of known aliases for
> >the very small number of locales in question, and until it becomes
> >unnecessary I will simply support those.

Of course, special-casing the encodings of _source_ pages is something we
can't get away from.

But for all intermediate steps, I don't see any reason to not go to a
well-known encoding, do everything there and finally convert to whatever
locale is set -- and you don't even need to name the charset there.

Special-casing _output_ locales seems quite strange to me.
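
To illustrate the flow I mean, here is a rough Python sketch (not man-db
code; render_for_locale and its parameters are made-up names).  It assumes
a POSIX system where nl_langinfo(CODESET) is available and returns a
charset name the codec library recognizes.  Only the source charset is
named by the caller; the output charset is taken from the locale.

```python
import locale


def render_for_locale(source_bytes, source_charset, output_charset=None):
    """Decode a page from its source charset, then encode it for output.

    Python's str plays the role of the well-known intermediate encoding.
    """
    if output_charset is None:
        # Adopt the user's locale (this changes process-wide state), then
        # ask it for its charset, so no locale name is ever parsed here.
        locale.setlocale(locale.LC_CTYPE, "")
        output_charset = locale.nl_langinfo(locale.CODESET)
    text = source_bytes.decode(source_charset)
    # Characters missing from the output charset become '?' rather than
    # raising an error.
    return text.encode(output_charset, errors="replace")
```

Calling it with an explicit output charset, e.g.
render_for_locale(page, "iso-8859-2", "utf-8"), skips the locale query,
which is handy for testing.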

> >(And I agree that it should go away, but can't easily just yet.)

Could you tell us what keeps us stuck with all the old cruft?  By adding a
groff-1.19-like -K<charset> option to our groff, I was able to replace all
special-casing except for source.  In my ugly preliminary code most
functions in src/encodings.c start with 'return "UTF-8";' -- and it seems to
work just fine in all locales I tested, which include zh_CN.GB2312 and
similar.

It's very likely I missed something, as I hardly know anything about groff,
but at least at first glance, ripping out most of the file seems to be a
win.

> Is there some way to query what character set a locale uses?  If not, I 
> think that man-db should default to UTF-8 (since that *is* the standard 
> on Debian) and handle exceptions to that.  Processing an ASCII manpage 
> as UTF-8 is a no-op.  And it's pretty easy to tell if something isn't 
> valid UTF-8, and man-db can handle that as it normally would.

AOL.  I agree with Brian 100%.  As you already added code to detect whether
the source is valid UTF-8 or not, all that needs to be done is to use UTF-8
instead of ISO-8859-1 as the intermediate format.
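
The "try UTF-8 first, handle exceptions otherwise" idea can be sketched in
a few lines of Python (an illustration only, not what man-db does
internally; decode_page and legacy_charset are hypothetical names, and
ISO-8859-1 as the fallback is just an assumption for the example):

```python
def decode_page(raw, legacy_charset="iso-8859-1"):
    """Return (text, charset_used) for the raw bytes of a manual page."""
    try:
        # Strict decoding rejects malformed sequences.  Pure ASCII input
        # always passes, so treating it as UTF-8 costs nothing.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the page's old-style charset.
        return raw.decode(legacy_charset), legacy_charset
```

Text in a legacy 8-bit charset rarely happens to form valid UTF-8
sequences, which is why the check is a usable heuristic.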

> >>Too bad, groff doesn't have real Unicode support, and supports only
> >>several special-cased locales (which may then be transcoded as UTF-8,
> >>but they still get wrapped into their old-style charsets).
> 
> AIUI, PostScript doesn't have UTF-8 support either, yet it seems to work 
> just fine.  Anyway, newer versions of groff have a conversion tool that 
> maps UTF-8 (or any arbitrary character set) input into glyph names.

I see.  So, in the very short term, groff would be able to output PostScript
only for a limited set of locales.  That's no regression.

And on tty and html, which are 99.99% of the uses of man, bugs like
"man iso-8859-2", Kanji names in English manpages, regressions in KOI8-R
(#424655) or the lack of support for Indic scripts would all disappear
overnight with a minimal patch.

> >Are you working with Brian M. Carlson on this?

Not yet, I preferred to have some code to show first.

> >He has been working on a solution acceptable to groff upstream, which is,
> >frankly, the only way I want to go now. He has already made substantial
> >progress with character class support.

Sounds great.  And that's the way to go.

For example, when selecting width, groff 1.18 does:
  u2E00..u9FFF 48 0
  uAC00..uD7AF 48 0
  uFF00..uFFEF 48 0
which supports only CJK.

My temporary solution has a hard-coded table (to minimize patching code):
  u0100..u10FF 24 0
  u1100..u115F 48 0
  u1160..u2328 24 0
  u2329..u232A 48 0
  u232B..u2E7F 24 0
  [...]
  u10000..u1FFFD 24 0
  u20000..u2FFFD 48 0
  u30000..u3FFFD 48 0
  u40000..u10FFFF 24 0
This covers all other code ranges as well, and stays forward-compatible once
proper character class support and other goodies go in.
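
For anyone who wants to play with such a table, here is a small Python
sketch of the range lookup (illustration only; parse_ranges and width_of
are made-up names, and I'm assuming 24 and 48 stand for one and two
terminal cells).  It contains only the rows quoted above; the elided
"[...]" rows are deliberately left out, so code points falling there
yield None.

```python
import re

# Rows copied from the table above; the "[...]" part is not filled in.
TABLE = """\
u0100..u10FF 24 0
u1100..u115F 48 0
u1160..u2328 24 0
u2329..u232A 48 0
u232B..u2E7F 24 0
u10000..u1FFFD 24 0
u20000..u2FFFD 48 0
u30000..u3FFFD 48 0
u40000..u10FFFF 24 0
"""

ROW = re.compile(r"u([0-9A-F]+)\.\.u([0-9A-F]+)\s+(\d+)")


def parse_ranges(table):
    """Turn 'uXXXX..uYYYY width ...' rows into sorted (first, last, width)."""
    ranges = []
    for line in table.splitlines():
        match = ROW.match(line.strip())
        if match:
            first = int(match.group(1), 16)
            last = int(match.group(2), 16)
            ranges.append((first, last, int(match.group(3))))
    return sorted(ranges)


def width_of(char, ranges):
    """Return the table width for one character, or None if no row covers it."""
    code_point = ord(char)
    for first, last, width in ranges:
        if first <= code_point <= last:
            return width
    return None


RANGES = parse_ranges(TABLE)
```

A linear scan is plenty for a table this small; a binary search over the
sorted ranges would be the obvious next step if it ever grew.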

> Please be aware that I have little time with school right now, so this 
> may not be implemented soon.  In fact, it may not be ready in time for 
> lenny's release.  I will sit down and work on it some more soon, but my 
> time is limited.  If people want more information on my plan of attack, 
> please do let me know, and I'll be happy to share.

Likewise, I'm nearly unavailable for the next two days.  I'll be able to
help later, but bear in mind that groff is not my area of expertise, and I
plan only minimal changes.

-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
