Re: groff: radical re-implementation
At Mon, 16 Oct 2000 16:41:35 +0200 (CEST),
Werner LEMBERG <firstname.lastname@example.org> wrote:
> [I'm CC'ing this mail to the groff@ mailing list. May I ask to move
> the discussion about improvements/changes of groff to this list?]
OK, I have joined the email@example.com mailing list, though I am also
sending this message to the debian-i18n list to note that I agreed to
move the discussion.
>> The ideal implementation will be using 'wchar_t' for reading.
> But this will fail for some compilers...
Nowadays wchar_t is supported by many systems; it is mandated by ISO C.
The merit of wchar_t is that you write the code once and it works for
every encoding, including UTF-8. Otherwise, you have to write similar
source code many times over for Latin-1, EBCDIC, UTF-8, and so on.
In particular, I insist that groff should support the EUC-* multibyte
encodings for CJK languages; this is something the current groff
cannot handle at all. (CJK people also use the ISO-2022-* encodings.)
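As a sketch of this write-once property (not groff's actual code), the
same mbrtowc(3)-based loop counts characters, rather than bytes, in
whatever encoding the current locale selects; the function name here
is illustrative:

```c
#include <stdlib.h>
#include <stddef.h>
#include <wchar.h>

/* Count characters (not bytes) in a multibyte string.  Because the
 * conversion goes through wchar_t, this one piece of code handles
 * Latin-1, EUC-JP, UTF-8, or any other encoding selected by the
 * current locale -- no per-encoding code paths. */
size_t count_chars(const char *s)
{
    mbstate_t st = {0};   /* conversion state, for stateful encodings */
    size_t n = 0;
    wchar_t wc;
    size_t len;

    while ((len = mbrtowc(&wc, s, MB_CUR_MAX, &st)) != 0) {
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;   /* invalid or truncated sequence */
        s += len;
        n++;
    }
    return n;
}
```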
The other merit of wchar_t is user-friendliness. Once a user sets the
LANG variable, every program works in the specified encoding; without
it, you have to configure the encoding for every program separately.
We do not want a ~/.groffrc, ~/.greprc, ~/.bashrc, ~/.xtermrc, and so
on, each specifying 'encoding=ISO8859-1' or 'encoding=UTF-8'.
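How a program picks this up can be sketched in a few lines (the
function name is mine, not a real groff interface): setlocale(3) with
an empty string selects the locale from LANG/LC_*, and
nl_langinfo(CODESET) reports the encoding that locale implies.

```c
#include <locale.h>
#include <langinfo.h>

/* Return the encoding implied by the user's environment (LANG or
 * LC_CTYPE), e.g. "UTF-8" or "EUC-JP".  No rc file needed: one
 * environment variable configures every locale-aware program. */
const char *locale_encoding(void)
{
    setlocale(LC_CTYPE, "");       /* "" means: read the environment */
    return nl_langinfo(CODESET);   /* never NULL; "C" locale -> ASCII */
}
```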
>> Abolish device types of 'ascii', 'ascii8', 'latin1', 'nippon', and
>> 'utf8' and introduce a new device type such as 'tty'.
I suppose you don't know about the 'ascii8' device. It is a local
patch in Debian's groff that is 8-bit clean (like latin1) but does
not assume that the 8-bit range is Latin-1. For example, '-' is used
for hyphenation and '\(co' is rendered as '(C)'. It is meant for
8-bit encodings other than Latin-1, e.g. ISO8859-2, ISO8859-3, ...,
and KOI8-R. (Not for CJK multibyte encodings.)
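The kind of fallback table such an encoding-agnostic tty device needs
can be sketched like this; \(co and \(hy are real groff glyph names,
but the table and lookup function are illustrative, not the actual
Debian patch:

```c
#include <string.h>
#include <stddef.h>

struct fallback { const char *glyph; const char *ascii; };

/* Named glyphs with no safe single-byte code are approximated in
 * plain ASCII; bytes 0x80-0xFF pass through to the terminal
 * untouched, whatever 8-bit encoding the user happens to run. */
static const struct fallback fallbacks[] = {
    { "co", "(C)" },    /* \(co, copyright sign */
    { "rg", "(R)" },    /* \(rg, registered sign */
    { "hy", "-"   },    /* \(hy, hyphen */
};

const char *ascii_fallback(const char *glyph)
{
    for (size_t i = 0; i < sizeof fallbacks / sizeof fallbacks[0]; i++)
        if (strcmp(fallbacks[i].glyph, glyph) == 0)
            return fallbacks[i].ascii;
    return NULL;   /* no ASCII approximation known */
}
```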
> Please bear in mind that groff shall work on non-GNU systems also! My
> idea is to only accept UTF8, ascii, latin1, and ebcdic as input
> encodings (the latter three for historical reasons only).
I wrote about glibc because that message went to a Debian mailing
list. Of course I think about portability; wchar_t is portable. I
recommend implementing wchar_t as the new architecture, with ascii,
latin-1, and ebcdic kept as historical encodings. (We may add 'UTF8'
as a historical encoding as well.) I think what is really 'historical'
is the systems that do not support wchar_t.
> Maybe on systems with a recent glibc, iconv() and friends can be used
> to do more, but generally I prefer an iconv-preprocessor so that groff
> itself does not have to deal with encoding conversions.
I think this would work well. However, who invokes the
iconv-preprocessor, the user or some wrapper software? And what
determines the encoding options it is invoked with?
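One possible answer is a small wrapper: locale(1) reports the
encoding of the user's locale, and iconv(1) converts before groff
ever sees the input. The file name and groff options below are
examples only, not a proposed interface:

```shell
# Detect the encoding implied by LANG/LC_CTYPE, then convert the
# document to the single encoding groff itself would accept.
doc_encoding=$(locale charmap)          # e.g. EUC-JP, UTF-8
iconv -f "$doc_encoding" -t UTF-8 paper.ms | groff -Tutf8 -ms
```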
>> - Groff assumes the input is in the encoding of the current locale.
> This is probably not correctly set everywhere.
How else can a user configure the system, if not through the locale?
A user who wants to specify his/her language and encoding will set
the LANG variable. The alternatives are to keep many ~/.foobarrc
files, one per program, or to pass --encoding=foobar every time a
program is invoked. I think setting LANG is the reasonable way.
One possible compromise:
- Use UCS-4 for internal processing, not wchar_t.
- Make only a small part of the input and output code encoding-aware.
- Add command-line options for the input and output encodings.
- Introduce a compile-time option, I18N.
- When I18N is off, the default input and output encodings are
  both Latin-1.
- When I18N is on, the default input and output encodings follow
  the LC_CTYPE locale.
- Of course, these defaults can be overridden by the command-line
  options above.
- Groff can be compiled with I18N off on systems that lack
  internationalization functions such as setlocale().
- iconv(3) is used to convert between the input/output encodings
  and the internal UCS-4 encoding, where available (I18N on).
- When I18N is off, the conversions are hard-coded for Latin-1,
  EBCDIC, and UTF-8.
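The I18N=on path above can be sketched as follows. The function is
hypothetical; I use "UCS-4BE" (big-endian) as the iconv target so the
byte layout is deterministic, though a real implementation would pick
the host byte order for its internal form:

```c
#include <iconv.h>
#include <stdint.h>
#include <stddef.h>

/* Convert 'inlen' bytes of text in encoding 'enc' into UCS-4
 * (big-endian) via iconv(3).  Returns the number of 32-bit code
 * points written to 'out', or (size_t)-1 on error. */
size_t to_ucs4(const char *enc, const char *in, size_t inlen,
               uint32_t *out, size_t outmax)
{
    iconv_t cd = iconv_open("UCS-4BE", enc);
    if (cd == (iconv_t)-1)
        return (size_t)-1;   /* encoding unknown to this system */

    char *inp = (char *)in;
    char *outp = (char *)out;
    size_t outleft = outmax * sizeof(uint32_t);

    size_t r = iconv(cd, &inp, &inlen, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return (size_t)-1;   /* invalid input or output buffer full */
    return (outmax * sizeof(uint32_t) - outleft) / sizeof(uint32_t);
}
```

With I18N off, the same interface could sit in front of the
hard-coded Latin-1/EBCDIC/UTF-8 converters instead.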
Do you think this can be achieved?
Tomohiro KUBOTA <firstname.lastname@example.org>