groff: radical re-implementation
This thread is about the 'radical re-implementation' of groff.
We first thought that we had to fix the Japanese patch so that
it checks for EUC encoding only in Japanese mode (-Tnippon). However,
we immediately found that the original design of groff is
very confusing. I think the developer of the Japanese patch could
not avoid inheriting that confusion.
Why are 'ascii' and 'latin1' treated as 'device types'? The device
type should be 'tty' or the like. Because of this confusing design,
we have no way to handle, for example, Japanese X11 output or
Korean PostScript. Note that THIS IS NOT DUE TO A LACK OF
IMPLEMENTATION BUT TO A CONFUSED DESIGN. Should we type
'groff -Tlatin1 -TX75' for X11 output with latin1 encoding?
The ideal implementation would use 'wchar_t' for reading input,
abolish the device types 'ascii', 'ascii8', 'latin1', 'nippon',
and 'utf8', and introduce a new device type such as 'tty'.
Ukai has roughly surveyed the source code of groff and posted
a terse but long list of the work needed (to the email@example.com
mailing list, in Japanese).
Fortunately, fgetwc(), putwchar(), wprintf(), swprintf(), and so on
are available in the new Glibc 2.2. mbstowcs() and so on have been
available since older versions of Glibc. These functions are
locale-sensitive and can handle any encoding for which a locale is
available. Note that they can also handle UTF-8
under a UTF-8 locale, though the current Debian locales package does
not include any UTF-8 locales. We should not give UTF-8 special
treatment. Discussion of this new design of groff is in progress
on the firstname.lastname@example.org mailing list (in Japanese)
and in personal communication.
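
To make the locale-sensitive approach concrete, here is a minimal
sketch, not groff code. The helper names use_user_locale() and
decode_in_locale() are made up for illustration; only setlocale()
and mbstowcs() are real library calls. The point is that the
program never names an encoding: the LC_CTYPE locale decides how
the bytes are interpreted.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: adopt the encoding implied by LANG/LC_ALL/LC_CTYPE.
 * Returns 1 on success, 0 if the requested locale is not installed. */
int use_user_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: convert a multibyte string, encoded as the current
 * LC_CTYPE locale dictates, into wide characters.
 * Writes at most dst_len - 1 wide characters plus a terminator.
 * Returns the number of wide characters written (terminator excluded),
 * or (size_t)-1 if src contains an invalid multibyte sequence. */
size_t decode_in_locale(const char *src, wchar_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL && dst_len > 0);

    size_t n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1)
        return n;

    dst[n] = L'\0';   /* mbstowcs() omits the terminator when dst is full */
    return n;
}
```

A caller would run use_user_locale() once at startup and then pass
every input line through decode_in_locale(). Under ja_JP.eucJP the
same code reads EUC-JP, and under a UTF-8 locale it reads UTF-8,
with no special case for either.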
--- * --- * ---
However, I found a big problem. I can set my LANG to ja_JP.eucJP,
ja_JP.SJIS, or ja_JP.UTF-8. An American user can set LANG to
en_US.ISO8859-1 or en_US.UTF-8. Note that changing the LANG
variable alone must be enough to change the behavior of the OS.
However, the sources of manpages are written in one fixed encoding
for each language.
For example, when a user sets LANG to en_US.ISO8859-1 and the
manpages are written in UTF-8, Groff will read the manpages as
ISO8859-1 and they will not be displayed properly.
The solution may be like this:
- Groff gets a new command-line option, like '--input-encoding',
  to specify the encoding of the input file.
- Man-db specifies the encoding when invoking Groff.
- A Policy is established on the encoding of manpages for each
  language. It will be one of these:
  (a) All manpages are written in UTF-8.
  (b) An encoding is decided for each language. This may be
      the same as the present state.
  This Policy will be implemented in the Man-db package.
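
As a rough illustration of how option (b) could look on the Man-db
side, here is a sketch of a per-language lookup. Everything in it is
an assumption made for illustration: the function name
manpage_encoding(), the table layout, and the table entries
themselves. It is not taken from Man-db. The result would be what
Man-db passes to the proposed '--input-encoding' option.

```c
#include <assert.h>
#include <string.h>

/* One row of a hypothetical Policy table for option (b). */
struct lang_encoding {
    const char *lang;      /* language part of the locale name */
    const char *encoding;  /* encoding of manpage sources for that language */
};

/* Illustrative entries only; the real Policy would decide these. */
static const struct lang_encoding policy[] = {
    { "en", "ISO-8859-1" },
    { "ja", "EUC-JP" },
    { "ko", "EUC-KR" },
};

/* Look up the manpage source encoding for a locale name such as
 * "ja_JP.UTF-8". Only the language part before '_', '.' or '@' matters,
 * so the result does not depend on the user's own charset.
 * Returns NULL when the Policy has no entry for the language. */
const char *manpage_encoding(const char *locale_name)
{
    assert(locale_name != NULL);

    size_t lang_len = strcspn(locale_name, "_.@");
    for (size_t i = 0; i < sizeof policy / sizeof policy[0]; i++) {
        if (strlen(policy[i].lang) == lang_len &&
            strncmp(policy[i].lang, locale_name, lang_len) == 0)
            return policy[i].encoding;
    }
    return NULL;
}
```

With such a table, ja_JP.eucJP, ja_JP.SJIS and ja_JP.UTF-8 all map
to the same source encoding, which is exactly the property the
Policy is meant to guarantee.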
I can imagine another solution:
- Groff assumes that the input is in the encoding of the current locale.
- Man-db invokes Groff through iconv(1).
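
The conversion step in this second solution can be sketched with
iconv(3), the library counterpart of iconv(1). The function name
recode() is hypothetical and nothing here is Man-db code; the sketch
converts a single buffer in one call and treats any error as
failure, which a real filter would have to handle more carefully.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in_len bytes of `in` from encoding `from`
 * to encoding `to`, writing the result into `out` (capacity out_cap).
 * Returns the number of bytes written, or (size_t)-1 if the encoding pair
 * is unsupported, the input is invalid or truncated, or `out` is too small. */
size_t recode(const char *from, const char *to,
              const char *in, size_t in_len,
              char *out, size_t out_cap)
{
    assert(from && to && in && out);

    iconv_t cd = iconv_open(to, from);   /* note the order: target first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)in;   /* iconv() wants char **, input is not modified */
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}
```

Man-db would call something like this with the manpage source
encoding as `from` and the encoding of the current locale as `to`,
then hand the result to Groff.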
The problem with the latter idea is that the current version of the
locales package does not have any UTF-8 locale. How can UTF-8 --> wchar_t
conversion be achieved without a UTF-8 locale? WE MUST NOT
ASSUME THAT THE INTERNAL REPRESENTATION OF WCHAR_T IS UCS-4, though this
is true for Glibc.
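
One possible answer, sketched under my own assumptions, is to decode
UTF-8 by hand into Unicode code points held in a uint32_t and to
store them in a wchar_t only when the C implementation promises that
wchar_t is ISO 10646. The standard macro __STDC_ISO_10646__ is that
promise. The names utf8_decode() and wchar_is_ucs() are invented for
this sketch, and the decoder accepts only the current Unicode range
(up to U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most len bytes) into *cp.
 * Needs no locale at all. Returns the number of bytes consumed, or 0 if
 * the input is empty, truncated, overlong, a surrogate, or out of range. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    size_t need;          /* number of continuation bytes expected */
    uint32_t value, min;  /* min = smallest value allowed for this length */

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 1; value = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 2; value = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 3; value = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;         /* stray continuation byte or invalid lead byte */
    }

    if (len < need + 1)
        return 0;
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return need + 1;
}

/* Returns 1 only if wchar_t values are guaranteed to be ISO 10646
 * code points on this implementation, 0 otherwise. */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

When wchar_is_ucs() returns 0, the decoded code points cannot simply
be cast to wchar_t. They would have to be converted to the locale's
multibyte encoding first (for example with iconv) and then read
through the normal wchar_t functions.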
Tomohiro KUBOTA <email@example.com>