Re: groff: radical re-implementation

To: debian-i18n@lists.debian.org, cjh@kr.FreeBSD.org, groff@ffii.org
Subject: Re: groff: radical re-implementation
From: Tomohiro KUBOTA <tkubota@riken.go.jp>
Date: Tue, 17 Oct 2000 10:27:12 +0900
Message-id: <14827.43760.126423.72159A@surfchem0.riken.go.jp>
In-reply-to: In your message of "Mon, 16 Oct 2000 16:41:35 +0200 (CEST)" <20001016.164135.88348722.wl@gnu.org>
References: <14826.25637.472475.26482Z@surfchem0.riken.go.jp> <14826.26984.796923.29321F@surfchem0.riken.go.jp> <20001016.164135.88348722.wl@gnu.org>

Hi,

At Mon, 16 Oct 2000 16:41:35 +0200 (CEST),
Werner LEMBERG <wl@gnu.org> wrote:

> [I'm CC'ing this mail to the groff@ mailing list.  May I ask to move
> the discussion about improvements/changes of groff to this list?]

OK, I have joined the groff@ffii.org mailing list.  I am also sending
this message to the debian-i18n list to let people know that I have
agreed to move the discussion.

>> The ideal implementation will be using 'wchar_t' for reading.
> But this will fail for some compilers...

wchar_t is now supported by many systems.  It is mandatory for
internationalization.

The merit of wchar_t is that you write the code once and it works
for every encoding, including UTF-8.  Otherwise, you have to write
similar source code many times: for Latin-1, EBCDIC, UTF-8, and
so on.  In particular, I insist that Groff should support the
EUC-* multibyte encodings for CJK languages.  This is exactly
what the current Groff cannot handle.  (CJK people also use
ISO-2022-* encodings.)
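
To illustrate the "write once" point, here is a minimal sketch (not
code from groff) of an input routine built on the standard mbrtowc()
function.  The name read_wide is hypothetical.  The loop contains no
encoding-specific logic; it decodes whatever encoding the current
LC_CTYPE locale specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Decode a multibyte string in the encoding of the current LC_CTYPE
 * locale into wide characters.  Returns the number of characters
 * stored in out (at most out_cap), or (size_t)-1 if the input holds
 * an invalid or truncated sequence.
 * read_wide is a hypothetical name, not an existing groff function. */
size_t read_wide(const char *in, wchar_t *out, size_t out_cap)
{
    mbstate_t state;
    size_t left;
    size_t count = 0;

    assert(in != NULL && out != NULL);
    left = strlen(in);
    memset(&state, 0, sizeof state);    /* initial conversion state */

    while (left > 0 && count < out_cap) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, in, left, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;                      /* decoded a NUL: end of text */
        out[count++] = wc;
        in += n;                        /* skip the bytes just consumed */
        left -= n;
    }
    return count;
}
```

The caller is expected to run setlocale(LC_CTYPE, "") once at startup;
without that call the program stays in the "C" locale and only ASCII
input is guaranteed to decode.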

The other merit of wchar_t is user-friendliness.  Once a user
sets the LANG variable, every program works in the specified
encoding.  Otherwise, you have to specify the encoding separately
for every program.  We don't want to have ~/.groffrc, ~/.greprc,
~/.bashrc, ~/.xtermrc, and so on just to specify
'encoding=ISO8859-1' or 'encoding=UTF-8'.
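
As a small sketch of how a program can pick up the user's setting
without any per-program configuration file: setlocale() reads LC_ALL,
LC_CTYPE and LANG from the environment, and the POSIX function
nl_langinfo(CODESET) reports the resulting encoding name.  The helper
name locale_codeset is hypothetical, and nl_langinfo() is POSIX rather
than ISO C, so this assumes a POSIX-like system.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Adopt the user's locale for character handling and return the name
 * of its encoding (for example "UTF-8" or "EUC-JP"; the exact
 * spelling is system-dependent).  If the environment names a locale
 * that is not installed, fall back to the "C" locale.
 * locale_codeset is a hypothetical helper name. */
const char *locale_codeset(void)
{
    const char *codeset;

    if (setlocale(LC_CTYPE, "") == NULL)
        setlocale(LC_CTYPE, "C");

    codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return codeset;
}
```

With LANG=ja_JP.eucJP in the environment, for instance, this would
typically report an EUC-JP codeset, provided that locale is installed.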

>> Abolish device types of 'ascii', 'ascii8', 'latin1', 'nippon', and
>> 'utf8' and introduce a new device type such as 'tty'.

I suppose you don't know about the 'ascii8' device.  It is a local
patch in Debian's Groff.  The device is 8-bit clean (like latin1)
but does not assume that the 8-bit part is Latin-1.  For example,
'-' is used for hyphenation and '\(co' is converted into '(C)'.
It is meant for 8-bit encodings other than Latin-1, i.e.,
ISO8859-2, -3, ..., and KOI8-R (not for CJK multibyte encodings).

> Please bear in mind that groff shall work on non-GNU systems also!  My
> idea is to only accept UTF8, ascii, latin1, and ebcdic as input
> encodings (the latter three for historical reasons only).

I wrote about Glibc because the message was addressed to a Debian
mailing list.  Of course I care about portability.  wchar_t is
portable.  I recommend implementing wchar_t support as the new
architecture and keeping ascii, latin-1, and ebcdic as historical
encodings.  (We may add 'UTF8' as a historical one.)

I think what is really 'historical' is any system that doesn't
support wchar_t.

> Maybe on systems with a recent glibc, iconv() and friends can be used
> to do more, but generally I prefer an iconv-preprocessor so that groff
> itself has not to deal with encoding conversions.

I think this would work well.  However, who invokes the iconv
preprocessor: the user or some wrapper software?  And what
determines the command options passed to iconv?
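
Whoever ends up invoking it, the core of such a preprocessor is small.
The following is only a sketch using the standard iconv(3) interface;
the helper name to_utf8 is hypothetical, and UTF-8 as the target is an
assumption taken from the proposal quoted above.  Note that the source
encoding name still has to come from somewhere (an option, the locale,
or a wrapper), which is exactly the open question.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert in[0..in_len) from the encoding named `from` to UTF-8.
 * Writes at most out_cap bytes to out and returns the number of
 * bytes written, or (size_t)-1 on failure (unknown encoding name,
 * invalid input, or output buffer too small).
 * to_utf8 is a hypothetical helper, not part of groff. */
size_t to_utf8(const char *from, const char *in, size_t in_len,
               char *out, size_t out_cap)
{
    iconv_t cd;
    char *src = (char *)in;     /* iconv() wants char **, input is not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;
    size_t rc;

    assert(from != NULL && in != NULL && out != NULL);

    cd = iconv_open("UTF-8", from);     /* arguments: target, then source */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

Which encoding names are accepted depends on the iconv implementation
of the system, so names such as "EUC-JP" may or may not be available.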

>> - Groff assumes the input as the encoding of current locale.
> This is probably not correctly set everywhere.

How else can a user configure a system, other than through the
locale?  A user who wants to specify his/her language and encoding
will set the LANG variable.  The alternative is to keep a separate
~/.foobarrc for every program, or to specify --encoding=foobar
every time (s)he invokes a program.  I think setting LANG is the
reasonable way.

One compromise would be:
 - use UCS-4 for internal processing, not wchar_t.
 - make only a small part of the input and output code
   encoding-aware.
 - add command options for the input and output encodings.
 - introduce a compile-time option I18N.
 - when I18N is off, the default input is latin-1 and the default
   output is also latin-1.
 - when I18N is on, the default input and output encodings follow
   the LC_CTYPE locale.
 - of course, these default encodings can be overridden by command
   options.
 - Groff can be compiled with I18N off for systems without
   internationalization functions such as setlocale().
 - use iconv(3), if available, to convert between the input/output
   encodings and the internal UCS-4 encoding (I18N=true).
 - if I18N is false, hard-code the conversion for Latin-1,
   EBCDIC, and UTF-8.
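
The selection rules in this list could be sketched roughly as follows.
This is only an illustration under stated assumptions: the function
names default_encoding and utf8_to_ucs4 are hypothetical, the I18N
macro is the compile-time option proposed above, and the decoder
follows the current four-byte definition of UTF-8.  Latin-1 needs no
decoder at all, since each byte value is already its UCS-4 code point;
EBCDIC would need a lookup table and is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef I18N
#include <langinfo.h>
#include <locale.h>
#endif

/* Pick the input (or output) encoding name: an explicit command
 * option wins; otherwise follow LC_CTYPE when built with I18N, and
 * fall back to Latin-1 when built without it. */
const char *default_encoding(const char *option)
{
    if (option != NULL && strlen(option) > 0)
        return option;
#ifdef I18N
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
#else
    return "ISO-8859-1";
#endif
}

/* Hard-coded UTF-8 decoder for builds without iconv(3).  Decodes one
 * character from s[0..len) into *out and returns the number of bytes
 * consumed, or 0 if the sequence is invalid or truncated. */
size_t utf8_to_ucs4(const unsigned char *s, size_t len, uint32_t *out)
{
    /* smallest code point that legitimately needs 1..4 bytes */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t need, i;

    assert(s != NULL && out != NULL);
    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { c = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; need = 4; }
    else
        return 0;                   /* stray continuation or bad lead byte */
    if (len < need)
        return 0;                   /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates, and out-of-range values */
    if (c < min_value[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *out = c;
    return need;
}
```

Keeping the decoder independent of the locale means the I18N-off build
needs neither setlocale() nor iconv(3), which matches the portability
goal of the list.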

Do you think this can be achieved?

---
Tomohiro KUBOTA <kubota@debian.org>
http://surfchem0.riken.go.jp/~kubota/
