Re: UTF-8 manual pages

To: debian-devel@lists.debian.org
Cc: debian-i18n@lists.debian.org
Subject: Re: UTF-8 manual pages
From: Adam Borowski <kilobyte@angband.pl>
Date: Fri, 12 Oct 2007 15:57:03 +0200
Message-id: <20071012135703.GB30157@angband.pl>
In-reply-to: <20071011112218.GC1666@riva.ucam.org>
References: <20071011112218.GC1666@riva.ucam.org>

On Thu, Oct 11, 2007 at 12:22:18PM +0100, Colin Watson wrote:
> I have uploaded man-db 2.5.0-1, which includes the following changes of
> note:
[...] 
> 
> groff does not yet support UTF-8 input, so at the moment this is
> implemented by recoding in man. For the time being, the implementation
> requires that the page be convertible to the legacy encoding for the
> language using iconv (it uses //TRANSLIT so that it will make an attempt
> at characters that aren't directly convertible, but that isn't perfect);
> so a German manual page should avoid using UTF-8 characters without an
> equivalent in ISO-8859-1. I do not expect this to be particularly
> onerous for the time being, though there are a few cases (particularly
> proper names) where it may be a problem. I ask for your patience in
> those cases. If you need to use a character not in the corresponding
> legacy encoding, then I recommend using named character escapes as
> documented in groff_char(7).
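
To make the restriction above concrete, here is a rough sketch (not man-db
code) of how one might list the characters in a page that have no equivalent
in the language's legacy encoding.  The function name and the ISO-8859-1
default are my own assumptions for illustration; man-db itself relies on
iconv with //TRANSLIT rather than anything like this.

```python
def unrepresentable(text, legacy="iso-8859-1"):
    """Return the distinct characters of `text` that cannot be encoded
    in the legacy charset (hypothetical helper, for illustration only)."""
    missing = []
    for ch in text:
        try:
            ch.encode(legacy)
        except UnicodeEncodeError:
            # No direct equivalent in the legacy encoding.
            if ch not in missing:
                missing.append(ch)
    return missing
```

Any character this reports would need either transliteration or a character
escape as documented in groff_char(7).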

Actually, groff is _almost_ capable of supporting UTF-8.  It already
understands it internally; the problems are confined to input and output.
For input, a minimal patch can be as simple as:

--- src/libs/libgroff/encoding.cc       (revision 6)
+++ src/libs/libgroff/encoding.cc       (revision 8)
@@ -369,6 +369,9 @@
   // groff 1 defines ISO-8859-1 as the input encoding, so this is required
   // for compatibility. groff 2 will define UTF-8 (or possibly officially
   // allow it to be switchable?)
+  select_input_encoding_handler("UTF-8");
+  select_output_encoding_handler("UTF-8");
+  return;
   select_input_encoding_handler("ISO-8859-1");
   select_output_encoding_handler("C");
  (no longer relevant special cases for CJK follow)

and then, instead of:
source -[?]-manconv-[ISO-8859-1]-> groff -[ISO-8859-1]-iconv-[$LOCALE]-> less
man-db could do:
source -[?]-manconv-[UTF-8]-> groff -[UTF-8]-iconv-[$LOCALE]-> less
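
The second pipeline can be sketched roughly as follows.  This is only an
illustration of the data flow, not man-db or groff code: the function names
are made up, the "try UTF-8 first, then fall back to the legacy charset"
guess stands in for whatever manconv really does with the unknown source
encoding, and the formatter is a placeholder for a UTF-8-capable groff.

```python
def recode_page(raw, legacy="iso-8859-1"):
    """manconv-like step (assumed behaviour): source in unknown
    encoding -> UTF-8 bytes."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the language's legacy charset.
        text = raw.decode(legacy)
    return text.encode("utf-8")


def to_terminal(formatted, target):
    """iconv-like step: UTF-8 formatter output -> locale charset.
    Characters missing from the target become '?'."""
    return formatted.decode("utf-8").encode(target, errors="replace")


def render(raw, target, formatter=lambda data: data):
    """source -> recode -> formatter (stand-in for groff) -> locale."""
    return to_terminal(formatter(recode_page(raw)), target)
```

The point is that nothing between the two recoding steps has to know about
the legacy encoding any more; only the final conversion to $LOCALE can lose
characters.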


Unfortunately, output is harder.  By adjusting char widths
(http://angband.pl/deb/man/groff-devutf8.diff) I've got terminal output
working neatly for everything but Arabic/Hebrew (not a regression), but I
have neither the time nor the knowledge to fix PostScript and the like.
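
To show why char widths matter for terminal output, here is a small sketch
of counting terminal columns.  It is not taken from the patch above; it is
an approximation using Python's unicodedata module, and the function name is
made up for illustration.

```python
import unicodedata


def terminal_width(text):
    """Approximate number of terminal columns `text` occupies."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth (mostly CJK) characters take two columns.
            columns += 2
        else:
            columns += 1
    return columns
```

A formatter that assumes one column per character will misalign any line
containing wide or combining characters, which is the kind of thing that
has to be right before the output looks neat.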

Still, since the current groff supports only ISO-8859-? and CJK, I guess a
change that at least causes no regressions should be easy to do.

-- 
1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.
