Re: groff: radical re-implementation
[I'm CC'ing this mail to the groff@ mailing list. May I ask that the
discussion about improvements and changes to groff move to this list?]
From: Tomohiro KUBOTA <email@example.com>
Subject: groff: radical re-implementation
Date: Mon, 16 Oct 2000 11:35:20 +0900
> This thread is about the 'radical re-implementation' of groff.
> We first thought that we had to fix the Japanese patch so that it
> checks EUC encoding only in Japanese mode (-Tnippon). However, we
> immediately found that the original design of groff is very
> confusing. I think the developer of the Japanese patch could not
> avoid inheriting this confusion.
> Why are 'ascii' and 'latin1' treated as 'device types'? The device
> type should be 'tty' or so. Because of this confusing design, we
> have no way to treat, for example, Japanese X11 output or Korean
> PostScript. You know, THIS IS NOT DUE TO LACK OF IMPLEMENTATION BUT
> DUE TO CONFUSED DESIGN. Should we type 'groff -Tlatin1 -Tx75' for
> X11 output with latin1 encoding? Entirely No!
As you may know, this confusion has historical origins. I'm not
willing to add new `devices' like `latin-2' or even `nippon' because
I plan to separate input encodings, output encodings, and character
sets from devices. Then, we will have real devices like tty, ps, or
dvi. Input characters will be converted to glyph names by troff, and
these glyph names will be mapped to output encodings (for ttys)
resp. fonts (for everything else) according to the device and font
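
To make the intended separation concrete, here is a minimal sketch in
Python. It is not groff code: the table contents, the glyph names
(borrowed from the Adobe naming style), and the function names are all
assumptions made for illustration. It only shows the three stages
described above: input encoding to characters, characters to glyph
names, glyph names to a tty output encoding.

```python
# Illustrative tables only; names and layout are assumptions, not groff data.
GLYPH_NAMES = {"A": "A", "é": "eacute", "ß": "germandbls"}

# For a tty device, each output encoding maps glyph names to bytes.
TTY_OUTPUT = {
    "latin1": {"A": b"A", "eacute": b"\xe9", "germandbls": b"\xdf"},
    "utf8": {"A": b"A", "eacute": b"\xc3\xa9", "germandbls": b"\xc3\x9f"},
    "ascii": {"A": b"A", "eacute": b"e", "germandbls": b"ss"},
}


def decode_input(data, input_encoding):
    """Stage 1: turn input bytes into characters (input encoding only)."""
    return data.decode(input_encoding)


def to_glyph_names(text):
    """Stage 2: map each input character to a device-independent glyph name."""
    return [GLYPH_NAMES[ch] for ch in text]


def render_tty(glyphs, output_encoding):
    """Stage 3: map glyph names to bytes in the chosen tty output encoding."""
    table = TTY_OUTPUT[output_encoding]
    return b"".join(table[name] for name in glyphs)


# The same latin1 input rendered for two different tty encodings.
glyphs = to_glyph_names(decode_input(b"A\xe9", "latin-1"))
utf8_out = render_tty(glyphs, "utf8")
ascii_out = render_tty(glyphs, "ascii")
```

With this split, the device (tty) and the encodings vary independently,
so no `latin1' or `nippon' device is needed.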
> The ideal implementation will be using 'wchar_t' for reading.
But this will fail for some compilers...
> Abolish device types of 'ascii', 'ascii8', 'latin1', 'nippon', and
> 'utf8' and introduce a new device type such as 'tty'.
This will come. Of course, patches are welcome :-) If you or your
partners have serious interest in improving groff I can give you write
access to the CVS.
> Ukai has roughly surveyed the source code of groff and posted
> a list of the work needed (to the firstname.lastname@example.org
> mailing list, in Japanese).
> Fortunately, fgetwc(), putwchar(), wprintf(), swprintf(), and so on
> are available in the new Glibc 2.2. mbstowcs() and so on have been
> available since older Glibc. These functions are locale-sensitive
> and can handle any encodings. Note that they can also treat UTF-8
> under UTF-8 locale, though the current Debian locales package does
> not include any UTF-8 locales. We should not give UTF-8 special
> treatment. Discussion is in progress about this new design of groff
> at email@example.com mailing list (in Japanese) and personal
Please bear in mind that groff shall work on non-GNU systems also! My
idea is to only accept UTF8, ascii, latin1, and ebcdic as input
encodings (the latter three for historical reasons only).
Maybe on systems with a recent glibc, iconv() and friends can be used
to do more, but generally I prefer an iconv-preprocessor so that groff
itself does not have to deal with encoding conversions.
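
As a rough sketch of what such a preprocessor does (written in Python
rather than with iconv() itself; the function name `to_utf8' is made up
for this example), the conversion step is nothing more than decoding
from a declared source encoding and re-encoding as UTF-8:

```python
def to_utf8(data, source_encoding):
    """Convert bytes in source_encoding to UTF-8 bytes.

    Raises UnicodeDecodeError if data is not valid in source_encoding.
    """
    return data.decode(source_encoding).encode("utf-8")


# A latin1 e-acute (0xE9) becomes the two-byte UTF-8 sequence 0xC3 0xA9.
latin1_converted = to_utf8(b"caf\xe9", "latin-1")

# An EUC-JP input is converted the same way; only the declared encoding differs.
eucjp_converted = to_utf8("日本語".encode("euc_jp"), "euc_jp")
```

The formatter behind such a filter would then only ever see one input
encoding.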
> --- * --- * ---
> However, I found a big problem. I can set my LANG as ja_JP.eucJP,
> ja_JP.SJIS, or ja_JP.UTF-8. An American person can set LANG as
> en_US.ISO8859-1 or en_US.UTF-8. Note that changing only the LANG
> variable must change the behavior of the OS. However, the sources
> of manpages are written in a certain encoding for each language.
> For example, when a user sets LANG as en_US.ISO8859-1 and the
> manpages are written in UTF-8, Groff will read manpages as
> ISO8859-1 and the manpages will not be displayed properly.
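
The failure described here is easy to reproduce outside groff. The
following Python fragment is only an illustration (the sample word is
arbitrary): it decodes UTF-8 bytes under the wrong assumption that they
are ISO 8859-1, which is what happens when the reader trusts LANG
instead of the file's actual encoding.

```python
# Bytes of a manpage fragment actually written in UTF-8.
page_bytes = "café".encode("utf-8")

# Reader wrongly assumes ISO 8859-1 because of LANG=en_US.ISO8859-1.
# Each byte of the two-byte UTF-8 sequence becomes its own character.
misread = page_bytes.decode("iso-8859-1")

# Reader is told the real encoding and recovers the text.
correct = page_bytes.decode("utf-8")
```

Here `misread' comes out as "cafÃ©" while `correct' is "café".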
> The solution may be like this:
> - Groff to have a new command option, like '--input-encoding'
> to specify the encoding of the input file.
Yes, an iconv preprocessor for GNU systems to convert input files to
> - Man-db to specify the encoding when invoking Groff.
> - To have a Policy on encoding of manpages for each language.
> It would be one of the following:
> (a) All manpages to be written in UTF-8.
> (b) To decide the encoding for each language. This may be
> the same as the present state.
> This Policy will be implemented in the Man-db package.
> I can imagine another solution:
> - Groff assumes the input as the encoding of current locale.
This is probably not correctly set everywhere.
> - Man-db to invoke Groff through iconv(1).
> The problem with the latter idea is that the current version of the
> locales package does not have any UTF-8 locale. How can UTF-8 -->
> wchar_t conversion be achieved without a UTF-8 locale? WE MUST NOT
> ASSUME THAT INTERNAL EXPRESSION OF WCHAR_T IS UCS-4, though this
> is true for Glibc.
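
For what it is worth, UTF-8 is defined independently of any locale, so
it can be decoded to code points without locale support at all. The
following is a minimal sketch in Python, not a proposal for groff's
actual code; the function name `decode_utf8' is hypothetical. How the
resulting code points are then stored (wchar_t or something else) is a
separate question that this sketch does not address.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of Unicode code points.

    No locale is consulted. Raises ValueError on malformed input.
    """
    # Smallest code point allowed for each sequence length (rejects overlongs).
    minimum = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, value = 1, lead
        elif 0xC2 <= lead <= 0xDF:
            length, value = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, value = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4:
            length, value = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % i)
            value = (value << 6) | (byte & 0x3F)
        if value < minimum[length] or value > 0x10FFFF:
            raise ValueError("overlong or out-of-range sequence at offset %d" % i)
        if 0xD800 <= value <= 0xDFFF:
            raise ValueError("surrogate code point at offset %d" % i)
        code_points.append(value)
        i += length
    return code_points


# Three Japanese characters, three bytes each in UTF-8.
points = decode_utf8("日本語".encode("utf-8"))
```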