Re: [Groff] Re: groff: radical re-implementation

To: tkubota@riken.go.jp
Cc: debian-i18n@lists.debian.org, cjh@kr.FreeBSD.org, groff@ffii.org
Subject: Re: [Groff] Re: groff: radical re-implementation
From: Werner LEMBERG <wl@gnu.org>
Date: Tue, 17 Oct 2000 09:20:37 +0200 (CEST)
Message-id: <20001017.092037.48403856.wl@gnu.org>
In-reply-to: <14827.43760.126423.72159A@surfchem0.riken.go.jp>
References: <14826.26984.796923.29321F@surfchem0.riken.go.jp> <20001016.164135.88348722.wl@gnu.org> <14827.43760.126423.72159A@surfchem0.riken.go.jp>

> The merit of wchar_t is this: write once and it works for every
> encoding, including UTF-8.  Otherwise, you have to write similar
> source code many times for Latin-1, EBCDIC, UTF-8, and so on.
> In particular, I insist that Groff should support EUC-* multibyte
> encodings for CJK languages.  This is what the current Groff cannot
> handle at all.  (CJK people also use ISO-2022-* encodings.)

Well, I insist that GNU troff doesn't support multibyte encodings at
all :-) troff itself should work on a glyph basis only.  It has to
work with *glyph names*, be it CJK entities or whatever.  Currently,
the conversion from input encoding to glyph entities and the further
processing of glyphs is not clearly separated.  From a modular point
of view it makes sense if troff itself is restricted to a single input
encoding (UTF-8) which is basically only meant as a wrapper to glyph
names (cf. \U'xxxx' to enter Unicode encoded characters).  Everything
else should be moved to a preprocessor.

> The other merit of wchar_t is user-friendliness.  Once a user sets
> the LANG variable, all software works under the specified encoding.
> If not, you have to specify the encoding for each program.  We don't
> want to have ~/.groffrc, ~/.greprc, ~/.bashrc, ~/.xtermrc, and so on
> to specify 'encoding=ISO8859-1' or 'encoding=UTF-8'.

I must admit that I've never worked with wchar_t and friends.  I see
the benefits, though.  And with glibc, we have a nice ready-to-use
library which enormously simplifies the implementation.

> >> Abolish device types of 'ascii', 'ascii8', 'latin1', 'nippon', and
> >> 'utf8' and introduce a new device type such as 'tty'.
> 
> I suppose you don't know about the 'ascii8' device.

I know it :-)  I've studied the Japanese patch in detail.
 
> Of course I think of portability.  wchar_t is portable.  I recommend
> implementing wchar_t as a new architecture and ascii, latin-1, and
> ebcdic as historical encodings.  (We may add 'UTF8' as a historical
> one.)

Yes.

> > Maybe on systems with a recent glibc, iconv() and friends can be
> > used to do more, but generally I prefer an iconv preprocessor so
> > that groff itself does not have to deal with encoding conversions.
> 
> I think this works well.  However, who invokes the iconv preprocessor?

groff (the program).

> A user or wrapper-software?

groff *is* the wrapper of troff.

> What determines the command option for iconv?

I suggest various possibilities.

  . By default, locales should be used.  A --locale option might be a
    good idea also since this has implications on text processing;
    example: Portuguese doesn't use fi and fl ligatures.  The problem
    is that a locale influences both input encoding and text
    processing which might not always be the right choice.

  . Another approach to influence text processing (not input encoding)
    is the OpenType way: --script, --language, --feature.

  . Command-line options should be able to override the input
    encoding (--input-encoding).

  . Finally, we need to split the -T option into --device and
    --output-encoding.
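
A minimal sketch of how these options could interact, assuming the
precedence "explicit --input-encoding, then --locale, then the
environment's locale, then Latin-1".  The function names, the `tty'
default, and the fallback order are hypothetical; only the option
names and the Portuguese ligature example come from the list above.

```python
import argparse


def split_locale(name):
    """Split 'pt_PT.ISO-8859-1@euro' into ('pt', 'ISO-8859-1').

    Either part is None if the locale name does not carry it.
    """
    if not name:
        return None, None
    base = name.split("@", 1)[0]            # drop a modifier such as '@euro'
    lang_part, _, codeset = base.partition(".")
    language = lang_part.split("_", 1)[0] or None
    return language, (codeset or None)


def resolve_settings(argv, env_locale=None):
    """Resolve encoding and text-processing settings (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="groff-sketch")
    parser.add_argument("--locale")
    parser.add_argument("--input-encoding")
    parser.add_argument("--device", default="tty")   # assumed default
    parser.add_argument("--output-encoding")
    args = parser.parse_args(argv)

    # A locale supplies both a language (text processing) and a codeset
    # (input encoding); the explicit option overrides only the latter.
    language, codeset = split_locale(args.locale or env_locale)
    return {
        "input_encoding": args.input_encoding or codeset or "latin-1",
        "output_encoding": args.output_encoding,     # no locale default here
        "device": args.device,
        "language": language,
        # Example from the text: Portuguese does not use fi/fl ligatures.
        "ligatures": language != "pt",
    }
```

A caller would pass something like the value of LANG as `env_locale`.
Keeping the language and the codeset apart is what lets
--input-encoding change the input side without touching text
processing.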

> > > - Groff assumes the input as the encoding of current locale.
> > This is probably not correctly set everywhere.
> 
> How can a system be configured by a user, in ways other than locale?
> A user who wants to specify his/her language and encoding will set
> the LANG variable.  Or should there be many ~/.foobarrc files, one
> per program, or --encoding=foobar every time (s)he invokes a
> program?  I think setting LANG is a reasonable way.

Believe me, most professional UNIX users in Germany don't have LANG
set correctly (including me).  For example, I don't like to see German
error messages since I'm used to the English ones.  In fact, I never
got used to the German way of handling computers.  The German keyboard
is awkward for programmers...

Note that I consider groff not only a tool to format man pages but
also a text processing tool which must be able to work completely
independently of the locale.

> One compromise is that:
>  - to use UCS-4 for internal processing, not wchar_t.

The internal processing within groff should use glyph IDs or glyph
names.

>  - a small part of input and output to be encoding-sensible.

???  Please explain.

>  - command options for encodings of input and output to be added.

Yes.

>  - a compile-time option I18N to be introduced.

Yes.  The `iconv' preprocessor would then do some trivial, hard-coded
conversion.

>  - when I18N is off, default input is latin-1 and default output
>    is also latin-1.

Sounds reasonable.

>  - when I18N is on, default input and default output are sensitive
>    to the LC_CTYPE locale.

Default output?  Really?  Please explain.

>  - iconv(3) to be used for converting between input/output encodings
>    and internal UCS-4 encoding, if available (I18N=true).
>  - if I18N is false, conversion process to be hard-coded for
>    Latin-1, EBCDIC, and UTF-8.

Yes.
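
A rough sketch of that split, assuming a flag `i18n` stands in for the
compile-time option and Python's codec machinery stands in for
iconv(3).  The function names are hypothetical, the hard-coded path
covers only Latin-1 and UTF-8 (EBCDIC would need a lookup table), and
the UTF-8 decoder is deliberately not fully validating.

```python
def latin1_to_ucs4(data):
    """Latin-1 bytes map one-to-one onto the first 256 code points."""
    return list(data)


def utf8_to_ucs4(data):
    """Hand-written UTF-8 decoder; it does not reject every overlong
    form or surrogate, which a real implementation should."""
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            cp, extra = b, 0
        elif 0xC2 <= b <= 0xDF:
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b <= 0xEF:
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b <= 0xF4:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"invalid UTF-8 lead byte 0x{b:02X} at offset {i}")
        if i + extra >= n and extra:
            raise ValueError(f"truncated UTF-8 sequence at offset {i}")
        for c in data[i + 1:i + 1 + extra]:
            if c & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte near offset {i}")
            cp = (cp << 6) | (c & 0x3F)
        out.append(cp)
        i += extra + 1
    return out


HARD_CODED = {"latin-1": latin1_to_ucs4, "utf-8": utf8_to_ucs4}


def to_ucs4(data, encoding, i18n):
    """Convert input bytes to a list of UCS-4 code points."""
    if i18n:
        # Stand-in for iconv(3): any encoding the library knows.
        return [ord(ch) for ch in data.decode(encoding)]
    try:
        return HARD_CODED[encoding.lower()](data)
    except KeyError:
        raise ValueError(f"{encoding!r} needs I18N support") from None
```

With `i18n` off, a request for EUC-JP fails early instead of producing
garbage, while Latin-1 and UTF-8 give the same code points on both
paths.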

> Do you think this can be achieved?

Yes it can, but not tomorrow...

My roadmap is roughly as follows.  It won't introduce radical changes
immediately; this will probably result in even more work overall.
But given my time constraints I believe this route is better.

Volunteers are highly welcome.  And please comment whether you like my
ideas at all.

I'll start with that after I've finished my reimplementation of the
mdoc package.  Don't expect this to happen too quickly :-(

  . Separate input encodings from font encodings.  Remove all
    references to charXXX from the font definition files.  Provide
    input encoding files similar to LaTeX.

  . Make troff work internally with 32-bit glyph entities.  I believe
    this is the hardest of all the subtasks due to the complexity of
    the changes.

  . Write the input encoding preprocessor.  If you like, you can
    immediately start with it since it is completely independent from
    troff itself.

  . Update all groff utilities.
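
Since the preprocessor is independent of troff, its core can be
sketched in a few lines.  This assumes it only has to recode a byte
stream from the input encoding to UTF-8; the function name and chunk
size are hypothetical, and Python's incremental decoders stand in for
iconv.

```python
import codecs


def recode_to_utf8(src, dst, input_encoding, chunk_size=4096):
    """Copy the binary stream src to dst, recoding it to UTF-8.

    An incremental decoder is used so that a multibyte character
    (EUC-JP, for example) split across two chunks is still decoded
    correctly.
    """
    decoder = codecs.getincrementaldecoder(input_encoding)(errors="strict")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush the decoder; an incomplete trailing sequence raises here.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Used as a filter, `src` and `dst` would be the binary standard input
and output, with the result fed to troff.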

My ideas follow Omega, one of TeX's successors.  I don't intend to
make groff that powerful (for example, I don't have plans to support
various writing directions), but the conceptual separation will be
similar: The iconv preprocessor will be equivalent to Omega's input OTPs.
Options like --language or --feature will influence troff's internal
engine.  If necessary (for tty devices only, I think), iconv-like
routines are used to map glyph names back to characters so that groff
can act as a filter.
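
The mapping back from glyph names to characters for tty output might
look roughly like the sketch below.  The two tables are tiny
illustrative samples, the `uXXXX' name form and the `?' fallback are
assumptions, and the function names are hypothetical.

```python
# Illustrative sample: glyph name -> Unicode text.
GLYPH_TO_TEXT = {
    "fi": "\ufb01",   # fi ligature
    ":a": "\u00e4",   # a with diaeresis
    "ss": "\u00df",   # sharp s
    "em": "\u2014",   # em dash
}

# Replacements used when the output encoding lacks the character.
ASCII_FALLBACK = {"fi": "fi", ":a": "a", "ss": "ss", "em": "--"}


def glyph_to_char(name):
    """Map one glyph name to text; plain characters name themselves."""
    if name in GLYPH_TO_TEXT:
        return GLYPH_TO_TEXT[name]
    if len(name) == 1:
        return name
    if name.startswith("u") and len(name) >= 5:
        try:
            return chr(int(name[1:], 16))     # assumed 'uXXXX' form
        except ValueError:
            pass
    raise KeyError(f"unknown glyph name {name!r}")


def render_for_tty(glyphs, output_encoding):
    """Turn a sequence of glyph names into bytes for a tty device."""
    out = bytearray()
    for name in glyphs:
        text = glyph_to_char(name)
        try:
            out += text.encode(output_encoding)
        except UnicodeEncodeError:
            # The character does not exist in the output encoding.
            out += ASCII_FALLBACK.get(name, "?").encode(output_encoding)
    return bytes(out)
```

The fallback table is where a ligature such as fi turns back into two
plain letters when the output encoding has no such character.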


    Werner
