Re: UTF-8 manual pages

On Thu, Oct 11, 2007 at 12:22:18PM +0100, Colin Watson wrote:
> I have uploaded man-db 2.5.0-1, which includes the following changes of
> note:
> groff does not yet support UTF-8 input, so at the moment this is
> implemented by recoding in man. For the time being, the implementation
> requires that the page be convertible to the legacy encoding for the
> language using iconv (it uses //TRANSLIT so that it will make an attempt
> at characters that aren't directly convertible, but that isn't perfect);
> so a German manual page should avoid using UTF-8 characters without an
> equivalent in ISO-8859-1. I do not expect this to be particularly
> onerous for the time being, though there are a few cases (particularly
> proper names) where it may be a problem. I ask for your patience in
> those cases. If you need to use a character not in the corresponding
> legacy encoding, then I recommend using named character escapes as
> documented in groff_char(7).

Actually, groff is _almost_ capable of supporting UTF-8.  It already handles
it internally; the problems are confined to input and output.  For input, a
minimal patch can be as simple as:

--- src/libs/libgroff/encoding.cc       (revision 6)
+++ src/libs/libgroff/encoding.cc       (revision 8)
@@ -369,6 +369,9 @@
   // groff 1 defines ISO-8859-1 as the input encoding, so this is required
   // for compatibility. groff 2 will define UTF-8 (or possibly officially
   // allow it to be switchable?)
+  select_input_encoding_handler("UTF-8");
+  select_output_encoding_handler("UTF-8");
+  return;
  (no longer relevant special cases for CJK follow)

and then instead of:
source -[?]-manconv-[ISO-8859-1]-> groff -[ISO-8859-1]-iconv-[$LOCALE]-> less
man-db could do:
source -[?]-manconv-[UTF-8]-> groff -[UTF-8]-iconv-[$LOCALE]-> less
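
To illustrate what the first pipeline gives up: its manconv step has to fit
the page into the legacy charset, so anything outside that charset is left to
//TRANSLIT.  Below is a minimal C++ sketch of that check, assuming ISO-8859-1
is the legacy encoding.  fits_latin1 is a hypothetical helper written for this
mail, not a function from man-db or groff, and it does not try to reject
overlong UTF-8 forms.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only, not man-db or groff code).
// Returns true if every character of the UTF-8 string `text` has a code
// point of at most U+00FF, i.e. it fits ISO-8859-1 without transliteration.
// Truncated or otherwise malformed sequences are reported as not fitting.
bool fits_latin1(const std::string &text)
{
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + extra >= text.size())
            return false;   // sequence is cut off at the end of the input

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(text[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp > 0xFF)
            return false;   // outside ISO-8859-1, would need //TRANSLIT

        i += extra + 1;
    }
    return true;
}
```

A German word such as "Größe" passes this check, while a Polish name
containing U+0142 does not.  With the second pipeline no such check is needed
before groff, since the page stays in UTF-8 until the final iconv.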

Unfortunately, output is harder.  By adjusting char widths
(http://angband.pl/deb/man/groff-devutf8.diff) I've got terminal output
working neatly for everything but Arabic/Hebrew (not a regression), but I
have neither the time nor the knowledge to fix PostScript and such.

Still, since the current groff supports only ISO-8859-? and CJK, I guess a
change that at least causes no regressions could be easy to do.

1KB		// Microsoft corollary to Hanlon's razor:
		//	Never attribute to stupidity what can be
		//	adequately explained by malice.

