Re: Bug#467249: man-db/groff and locales

To: debian-devel@lists.debian.org, 467249@bugs.debian.org, linux4michelle@freenet.de
Subject: Re: Bug#467249: man-db/groff and locales
From: Colin Watson <cjwatson@debian.org>
Date: Sat, 1 Mar 2008 23:56:28 +0000
Message-id: <20080301235628.GC16526@riva.ucam.org>
Mail-followup-to: debian-devel@lists.debian.org, 467249@bugs.debian.org, linux4michelle@freenet.de
In-reply-to: <20080228233229.GA17631@angband.pl>
References: <20080228094230.GN3053@freenet.de> <20080228202141.GA9675@angband.pl> <20080228213055.GP16526@riva.ucam.org> <20080228221032.GA3668@crustytoothpaste.ath.cx> <20080228233229.GA17631@angband.pl>

On Fri, Feb 29, 2008 at 12:32:29AM +0100, Adam Borowski wrote:
> On Thu, Feb 28, 2008 at 10:10:32PM +0000, brian m. carlson wrote:
> > On Thu, Feb 28, 2008 at 09:30:55PM +0000, Colin Watson wrote:
> > >man-db really does have some special-casing here. Trust me. It was
> > >necessary at the time. There are a finite number of known aliases for
> > >the very small number of locales in question, and until it becomes
> > >unnecessary I will simply support those.
> 
> Of course, encodings for _source_ pages are those we can't get away with.
> 
> But for all intermediate steps, I don't see any reason to not go to a
> well-known encoding, do everything there and finally convert to whatever
> locale is set -- and you don't even need to name the charset there.
> 
> Special-casing _output_ locales seems quite strange to me.

        /* An ugly special case is needed here. The utf8 device normally
         * takes ISO-8859-1 input. However, with the multibyte patch, when
         * recoding from CJK character sets it takes UTF-8 input instead.
         * This is evil, but there's not much that can be done about it
         * apart from waiting for groff 2.0.
         */
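
Spelled out, the rule that comment describes amounts to roughly the
following. This is an illustrative sketch only, not the actual man-db
source: the function names are hypothetical, and the list of CJK charsets
is an assumption (only EUC-JP is named in this message).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: is this source charset one of the CJK encodings
 * handled by the multibyte patch?  The list below is an assumption made
 * for illustration; it is not copied from man-db or groff. */
static bool is_cjk_charset(const char *charset)
{
    static const char *const cjk[] = { "EUC-JP", "EUC-KR", "BIG5", "GBK" };
    for (size_t i = 0; i < sizeof cjk / sizeof cjk[0]; i++) {
        if (strcmp(charset, cjk[i]) == 0)
            return true;
    }
    return false;
}

/* Hypothetical helper: which encoding should be fed to groff's utf8
 * device for a page whose source charset is source_charset?
 * Normally ISO-8859-1; UTF-8 when recoding from a CJK charset. */
static const char *utf8_device_input_encoding(const char *source_charset)
{
    return is_cjk_charset(source_charset) ? "UTF-8" : "ISO-8859-1";
}
```

The point is that the same output device expects two different input
encodings depending on where the page came from, which is exactly the
sort of thing that is awkward to generalise.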

> > >(And I agree that it should go away, but can't easily just yet.)
> 
> Could you tell us what keeps us with all the old cruft?

Sanity. I am not interested in making the groff package even more
incredibly difficult to update to a new upstream in the future.

Official groff does not yet support proper CJK typography. Until that is
in place it is not a viable replacement.

(I'm also really fed up of explaining this again and again. I think I'm
fairly clearly active in man-db; could you please accept that I have my
reasons beyond laziness, and look up what has been said on this topic
over and over again in the past?)

> > Is there some way to query what character set a locale uses?  If not, I 
> > think that man-db should default to UTF-8 (since that *is* the standard 
> > on Debian) and handle exceptions to that.  Processing an ASCII manpage 
> > as UTF-8 is a no-op.  And it's pretty easy to tell if something isn't 
> > valid UTF-8, and man-db can handle that as it normally would.
> 
> AOL.  I agree with Brian 100%.  As you already added code to detect if the
> source is valid UTF-8 or not, all that needs to be done is using UTF-8
> instead of ISO-8859-1 as the intermediate format.
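
For reference, the detection step being discussed looks roughly like the
following. This is a minimal sketch under stated assumptions, not man-db's
actual code: is_valid_utf8() and guess_source_encoding() are hypothetical
names, and the fallback charset is whatever the caller would otherwise
have assumed for the page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..len) is well-formed UTF-8.
 * Rejects truncated sequences, overlong forms, surrogates, and code
 * points above U+10FFFF.  Pure ASCII passes trivially. */
static bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest code point legal at this length */

        if (c < 0x80) {                 /* ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;               /* stray continuation or bad lead */
        }

        if (len - i - 1 < need)
            return false;               /* sequence runs off the end */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;               /* overlong, too large, surrogate */

        i += need + 1;
    }
    return true;
}

/* Hypothetical helper: treat the page as UTF-8 if it validates,
 * otherwise fall back to the legacy charset supplied by the caller. */
static const char *guess_source_encoding(const unsigned char *buf,
                                         size_t len, const char *fallback)
{
    return is_valid_utf8(buf, len) ? "UTF-8" : fallback;
}
```

Note that this only answers the question of how to read the source page;
it says nothing about what groff is able to do with the result.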

There is a lot more to it than that or upstream would be recommending
that already; the version of groff we are using does not have the
internal capabilities that are needed (our changes are a band-aid at
best). Reading this thread may be a helpful summary:

  http://www.mail-archive.com/groff@gnu.org/msg01378.html

In short, I am not interested in doing this on top of our current groff
package. I want to do it on top of a whole new upstream that actually
has the features we need with an upstream maintainer prepared to support
them (note that nobody has stepped forward to do any maintenance work on
the Debian multibyte patch for years). Doing that without also
forward-porting our patches for features such as kinsoku shori would
introduce regressions. Forward-porting these patches hackily is
incredibly difficult (I've tried). Forward-porting those patches in a
way that is consistent with upstream's direction (i.e. reimplementing
them) is essentially Brian's work.

> I see.  So, in very short term, groff would be able to output PostScript
> only for limited locales.  That's no regression.
> 
> And on tty and html, which are 99.99% of uses of man, suddenly all bugs like
> "man iso-8859-2", Kanji names in English manpages, regressions in KOI8-R
> (#424655) or no support for Indic scripts would disappear overnight with a
> minimal patch.

I would love to have these new features, but I want them on top of a
sane, supportable upstream release. I am sick of the mess we have now
and don't want to make it worse. I also want to actually have us
contribute something useful to groff upstream beyond confused users
showing up on their mailing list and having to be told that this is a
weirdness of Debian's groff package.

I am honestly not willing to support a backport of -K/preconv to our
groff package, with all of the other Unicode support that should come
along with it in order to do a good job. I also enjoy maintaining this
stuff too much to resign. Therefore I must encourage you to help
upstream with the last few pieces needed in order to get this all merged
properly.

Finally, I suspect you'll find that e.g. the specialised kerning code
that's in Debian's groff for proper rendering of ASCII/EUC-JP boundaries
will cause problems with generalised UTF-8 rendering unless properly
forward-ported. I'm fairly sure there are more such examples; that's
just the first I could find easily having been away from that particular
code for a while. If you don't speak all the languages in question, you
might not notice this kind of thing on casual inspection of the output.
Typography involves more than just getting all the characters into the
right encoding.

> > >He has been working on a solution acceptable to groff upstream, which is,
> > >frankly, the only way I want to go now. He has already made substantial
> > >progress with character class support.
> 
> Sounds great.  And that's the way to go.

Of course. But wholesale, not with temporary hacks that just make my
life harder. I am still the maintainer and have to consider my ability
to merge future upstream releases, which is already all but impossible;
introducing yet more divergence will make it even less likely that we'll
ever get to a clean upstream state.

I appreciate your research into this. But please, I beg you, focus your
energies on upstream. There is really not much left to do; Brian's done
the heavy lifting of character class support (or most of it, anyway),
and now somebody just needs to take the specialised typographic rules
and make them sufficiently general for inclusion.

> Likewise, I'm nearly unavailable for the next two days.  I'll be able to
> help later, but bear in mind that groff is not my area of expertise, and I
> plan only minimal changes.

I hope you will take my advice born of nearly seven years of maintaining
groff in Debian.

Thanks,

-- 
Colin Watson                                       [cjwatson@debian.org]
