Re: UTF-8 manual pages
On Thu, Oct 11, 2007 at 12:22:18PM +0100, Colin Watson wrote:
> I have uploaded man-db 2.5.0-1, which includes the following changes of
> note:
[...]
>
> groff does not yet support UTF-8 input, so at the moment this is
> implemented by recoding in man. For the time being, the implementation
> requires that the page be convertible to the legacy encoding for the
> language using iconv (it uses //TRANSLIT so that it will make an attempt
> at characters that aren't directly convertible, but that isn't perfect);
> so a German manual page should avoid using UTF-8 characters without an
> equivalent in ISO-8859-1. I do not expect this to be particularly
> onerous for the time being, though there are a few cases (particularly
> proper names) where it may be a problem. I ask for your patience in
> those cases. If you need to use a character not in the corresponding
> legacy encoding, then I recommend using named character escapes as
> documented in groff_char(7).
Actually, groff is _almost_ capable of supporting UTF-8. It handles it
internally; the problems are only on input and output. For input, a minimal
patch can be as simple as:
--- src/libs/libgroff/encoding.cc (revision 6)
+++ src/libs/libgroff/encoding.cc (revision 8)
@@ -369,6 +369,9 @@
  // groff 1 defines ISO-8859-1 as the input encoding, so this is required
  // for compatibility. groff 2 will define UTF-8 (or possibly officially
  // allow it to be switchable?)
+ select_input_encoding_handler("UTF-8");
+ select_output_encoding_handler("UTF-8");
+ return;
  select_input_encoding_handler("ISO-8859-1");
  select_output_encoding_handler("C");
(no longer relevant special cases for CJK follow)
and then instead of:
source -[?]-manconv-[ISO-8859-1]-> groff -[ISO-8859-1]-iconv-[$LOCALE]-> less
man-db could do:
source -[?]-manconv-[UTF-8]-> groff -[UTF-8]-iconv-[$LOCALE]-> less
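To illustrate the recoding steps on either side of groff in the pipelines
above, here is a rough sketch using iconv(3). It is only an illustration:
recode() is a hypothetical helper, not the actual manconv or man-db code, and
the encoding names are assumed to be ones the local iconv implementation knows.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of man-db or groff): recode `input` from the
// encoding `from` to the encoding `to` using POSIX iconv(3).
std::string recode(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open: unsupported conversion");

    std::string output;
    char buf[4096];
    // iconv() takes a non-const char**, but it does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buf;
        size_t out_left = sizeof buf;
        size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        int err = errno;  // save before any other call can overwrite it
        assert(out_left <= sizeof buf);
        output.append(buf, sizeof buf - out_left);
        // E2BIG only means the output buffer filled up; loop and continue.
        if (rc == (size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or incomplete input");
        }
    }

    // Flush any pending shift sequence (matters only for stateful encodings).
    char* out = buf;
    size_t out_left = sizeof buf;
    iconv(cd, nullptr, nullptr, &out, &out_left);
    output.append(buf, sizeof buf - out_left);

    iconv_close(cd);
    return output;
}
```

With this, the input side of the second pipeline would correspond to something
like recode(page, "ISO-8859-1", "UTF-8"), and the output side to a conversion
from UTF-8 to whatever the locale's charset is. A page that cannot be
represented in the target encoding makes recode() throw, which is the same
limitation the quoted text describes for the legacy encodings.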
Unfortunately, output is harder. By adjusting character widths
(http://angband.pl/deb/man/groff-devutf8.diff) I've got terminal output
working neatly for everything but Arabic/Hebrew (not a regression), but I
have neither the time nor the knowledge to fix PostScript and such.
Still, since the current groff supports only ISO-8859-? and CJK, I guess at
least a no-regression change could be easy to do.
--
1KB // Microsoft corollary to Hanlon's razor:
// Never attribute to stupidity what can be
// adequately explained by malice.