
Re: UTF-8 manual pages



Sorry I couldn't reply to this earlier.

On 12/10/2007, at 11:27 PM, Adam Borowski wrote:

On Thu, Oct 11, 2007 at 12:22:18PM +0100, Colin Watson wrote:
I have uploaded man-db 2.5.0-1, which includes the following changes of
note:
[...]

groff does not yet support UTF-8 input, so at the moment this is
implemented by recoding in man. For the time being, the implementation
requires that the page be convertible to the legacy encoding for the
language using iconv (it uses //TRANSLIT so that it will make an attempt
at characters that aren't directly convertible, but that isn't perfect);
so a German manual page should avoid using UTF-8 characters without an
equivalent in ISO-8859-1. I do not expect this to be particularly
onerous for the time being, though there are a few cases (particularly
proper names) where it may be a problem. I ask for your patience in
those cases. If you need to use a character not in the corresponding
legacy encoding, then I recommend using named character escapes as
documented in groff_char(7).
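
The constraint above can be checked mechanically before a page is shipped. What follows is a minimal sketch, not part of man-db or groff: the function name outside_latin1 is hypothetical, and ISO-8859-1 is assumed as the legacy encoding, as in the German example. ISO-8859-1 maps exactly onto code points U+0000 to U+00FF, so any decoded code point above U+00FF has no direct equivalent and would be left to //TRANSLIT or to a named escape.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 and collect every code point above
// U+00FF, i.e. every character with no direct ISO-8859-1 equivalent.
// Malformed bytes are skipped; this sketch does not reject overlong
// encodings or surrogates, so it is a screening aid, not a validator.
std::vector<std::uint32_t> outside_latin1(const std::string& utf8) {
    std::vector<std::uint32_t> result;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { ++i; continue; }  // stray continuation or invalid lead byte

        if (i + extra >= utf8.size()) break;  // truncated final sequence

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { ++i; continue; }  // resynchronise on the next byte

        i += extra + 1;
        if (cp > 0xFF) result.push_back(cp);
    }
    return result;
}
```

Under these assumptions, a German word such as "Straße" yields an empty result, while a Vietnamese name such as "Nguyễn" reports U+1EC5, which is the kind of character the advice about named escapes is aimed at.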

Actually, groff is _almost_ capable of supporting UTF-8. It understands it
internally, and has problems just on input and output. For input, a minimal
patch can be as simple as:

--- src/libs/libgroff/encoding.cc       (revision 6)
+++ src/libs/libgroff/encoding.cc       (revision 8)
@@ -369,6 +369,9 @@
   // groff 1 defines ISO-8859-1 as the input encoding, so this is required
   // for compatibility. groff 2 will define UTF-8 (or possibly officially
   // allow it to be switchable?)
+  select_input_encoding_handler("UTF-8");
+  select_output_encoding_handler("UTF-8");
+  return;
   select_input_encoding_handler("ISO-8859-1");
   select_output_encoding_handler("C");
  (no longer relevant special cases for CJK follow)

and then instead of:
source -[?]-manconv-[ISO-8859-1]-> groff -[ISO-8859-1]-iconv-[$LOCALE]-> less
man-db could do:
source -[?]-manconv-[UTF-8]-> groff -[UTF-8]-iconv-[$LOCALE]-> less


Too bad, output is harder. By adjusting char widths
(http://angband.pl/deb/man/groff-devutf8.diff) I've got terminal output
working neatly for everything but Arabic/Hebrew (not a regression), but I
have neither the time nor the knowledge to fix PostScript and such.

Yet, since the current groff supports only ISO-8859-? and CJK, I guess at
least a no-regression change could be easy to do.

Can we package groff-utf8 [1] instead?

Another note about Colin's original, and very well-thought-out, post: I think Yelp _does_ support UTF-8. I'm pretty sure I tested my pilot Vietnamese manpage in it (as well as in groff-utf8) a year or two back. There was quite a lot of discussion about UTF-8 display on the GNOME i18n list at the time.

Thank you for all your efforts to get UTF-8 manpages supported and encouraged. Manpages have notoriously lagged behind other translations in this regard. It's time they caught up.

from Clytie

Vietnamese Free Software Translation Team
http://vnoss.net/dokuwiki/doku.php?id=projects:l10n

[1] http://www.haible.de/bruno/packages-groff-utf8.html
