Re: [Groff] Re: groff: radical re-implementation
On 17-Oct-00 Werner LEMBERG wrote:
> Well, I insist that GNU troff doesn't support multi-byte encodings at
> all :-) troff itself should work on a glyph basis only. It has to
> work with *glyph names*, be it CJK entities or whatever. Currently,
> the conversion from input encoding to glyph entities and the further
> processing of glyphs is not clearly separated. From a modular point
> of view it makes sense if troff itself is restricted to a single input
> encoding (UTF-8) which is basically only meant as a wrapper to glyph
> names (cf. \U'xxxx' to enter Unicode encoded characters). Everything
> else should be moved to a preprocessor.
I have now managed to read this correspondence (16-20 Oct) and think
about it for a bit. Sorry to have been obliged to leave it on one
side until this week ended.
On the whole I go with the view that Werner has expressed, here and
in other mails. Also, I think that some people are not clear about
the distinction between "character" and "glyph" (not sure that I am,
in all cases, come to that ... ).
I would like to present an even more conservative view than Werner
has stated. And, by the way, in the following "troff" means the
main formatting program "gtroff". "groff" denotes the whole package.
A.1. At present troff accepts 8-bit input, i.e. recognises 256 distinct
entities in the input stream (with a small number of exceptions which
It does not really matter that these are interpreted, by default, as
iso-latin-1. They could correspond to anything on your screen when you
are typing, and you can set up translation macros in troff to make them
correspond to anything else (using either the ".char" request or the
traditional ".tr" or the new ".trnt" requests).
A.2. The direct correspondence between input bytes and characters is
defined in the font files for the device. In addition, groups of bytes
(such as, represented in ASCII, "\[Do]") can be made to correspond to
specific characters named in the font files.
A.3. What gets printed or displayed is a "glyph" which is defined by the
current font definition for the device. (Even in English, a character
such as "A" could be printed as a Times-Roman "A" glyph, a Helvetica
BoldItalic "A" glyph, a ZapfChancery-MediumItalic glyph, ... ).
Troff uses the glyph-metric information in the font file to compute
A.4. Troff is not, and was never intended to be, WYSIWYG. Its concept
is that you prepare an input stream (using whatever interface pleases
you, and if this shows you, say, kanji characters then that's fine,
so long as you don't expect troff to "see" them as kanji) which,
when interpreted by troff, produces printed/displayed output which
bears the marks that you want. I don't see anything wrong (except
possibly in ease of use) in creating an ASCII input stream in
order to generate Japanese output. Preparation of an output
stream to drive a device capable of rendering the output is
the job of the post-processor (and, provided you have installed
appropriate font definition files, I cannot think of anything
that would be beyond the PostScript device "devps").
A: It follows that troff is already language-independent, for all
languages whose typographic conventions can be achieved by the primitive
mechanisms already present in troff. For such languages, there is no
need to change troff at all. For some other languages, there are
minor extra requirements which would require small extensions to
troff which would not interact with existing mechanisms.
Major exceptions to language-independence, at present, include all
the "right-to-left" languages (Hebrew, Arabic, ... ). I have been
studying Dan Berry's implementation of "ffortid" ["ditroff" backwards]
which is a post-processor that allows right-to-left text to be
correctly printed. I believe that a port to groff is quite feasible.
Dan Berry has also done "triroff" [tri-directional troff] for traditional
UNIX troff which can in addition do the top-to-bottom printing for
Chinese etc. To my untutored eye, the results look OK. This could also be
ported to groff.
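
To make the idea of right-to-left reordering concrete, here is a toy
sketch in Python. It is NOT Berry's code or algorithm: a real
post-processor would work on troff's output and positioned glyphs,
whereas this works on plain strings. The function name and the choice
to treat only the Hebrew letters U+05D0..U+05EA as right-to-left are
assumptions made purely for illustration.

```python
import re

# Assumption: only Hebrew letters (U+05D0..U+05EA) count as right-to-left.
# A run is one or more Hebrew words separated by single spaces.
RTL_RUN = re.compile("[\u05d0-\u05ea]+(?: [\u05d0-\u05ea]+)*")


def to_visual_order(line: str) -> str:
    """Reverse each right-to-left run so it reads correctly when the
    output device lays glyphs out strictly left to right."""
    return RTL_RUN.sub(lambda m: m.group(0)[::-1], line)
```

Left-to-right text around a run is left untouched; only the run itself
is reversed, spaces included, so word order within the run flips too.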
Extra complications can arise in some languages, such as special
hyphenation rules (as has been mentioned); presence or absence of
particular ligatures [and I think that troff's hard-wired set
of ligatures should be replaced by a user-definable set] (e.g. in
Turkish you never use the "fi" ligature since this suppresses the
distinction between dotless-i and i-with-dot); some characters may
not end, or may not begin, a line; some characters have different glyphs
at the beginning, the middle, or the end of words; and so on.
The above are cases where minor extensions of troff are required,
but they do not interact with other features of troff and require
no radical re-implementation.
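
Pending a user-definable ligature set, the Turkish case can be
approximated outside troff. The sketch below is a hypothetical
preprocessor step in Python (the function name is mine). It relies on
the documented troff behaviour that the zero-width escape \& placed
between two letters prevents them from forming a ligature, and it
assumes that any line starting with "." or "'" is a control line to be
left alone. It does not try to protect escapes inside text lines, such
as \[fi], so treat it as an illustration only.

```python
def break_fi_ligature(line: str) -> str:
    """Insert troff's zero-width \\& between 'f' and 'i' on text lines."""
    if line.startswith((".", "'")):
        # Control line (a request or macro call such as ".fi"): untouched.
        return line
    return line.replace("fi", "f\\&i")
```

For example, the text line "fiyat" becomes "f\&iyat", while the
request line ".fi" passes through unchanged.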
Some of the complications with specific languages (such as the extra
space separating punctuation marks in French) can be set up on
a language-specific basis by suitable macros, and require no change at
all in troff itself.
B: Troff should be able to cope with multi-lingual documents, where
several different languages occur in the same document. I do NOT
believe that the right way to do this is to extend troff's capacity
to recognise thousands of different input encodings covering all the
languages which it might be called upon to typeset (e.g. by Unicode or
Troff's multi-character naming convention means that anything you could
possibly need can be defined, and given a name in the troff input
"character set" whenever you really need it, so long as you have the
device resources to render the appropriate glyph. If you want to use a
multi-byte encoding in your input-preparation software, you can
pre-process this with a suitable filter to generate the troff
input-sequences you need (I have done this with WordPerfect
multinational characters, for instance, which are two-byte entities).
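
A minimal sketch of such a filter, in Python. It is not the
WordPerfect filter mentioned above. It assumes the input has already
been decoded to Unicode text, and it assumes groff's \[uXXXX] form for
naming a glyph by its Unicode code point (the quoted message mentions
only \U'xxxx'). The function name is hypothetical.

```python
def to_troff(text: str) -> str:
    """Replace every non-ASCII character with a groff \\[uXXXX] name,
    leaving plain ASCII (including troff requests) as it is."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                  # ASCII passes straight through
        else:
            out.append("\\[u%04X]" % cp)    # upper-case hex, at least 4 digits
    return "".join(out)
```

Run over each input line before it reaches troff, this turns "café"
into "caf\[u00E9]", so troff itself only ever sees 7-bit input plus
glyph names.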
C: Error messages and similar communications with the user (which
have nothing directly to do with troff's real job) are irrelevant to
the question of revising groff. If people would like these to appear in
their own language then I'm sure it can be arranged in a way which
would require no change whatever in the fundamental workings of troff.
CONCLUSION: Troff certainly needs some extensions to cope with the
typesetting demands of some languages (of which the major ones that
I can think of have been mentioned above). I also believe that there
are some features of troff which need to be changed in any case, but
these have nothing to do with language or "locale".
Apart from this, I believe that troff has all the primitive functionality
needed to cope with different languages and that any user can define their
own resources for specific languages (including multi-lingual documents).
There is certainly a strong argument for people who are expert both
in troff and in specific languages to prepare _definitive_ language-
specific resources, rather than have different users all doing different
and more-or-less adequate jobs on their own; but that is another issue
and still does not involve any radical re-design of groff.
Therefore, I suggest, troff can basically be left alone; it does not
need radical re-implementation.
Best wishes to all,
E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 284 7749
Date: 20-Oct-00 Time: 20:32:16
------------------------------ XFMail ------------------------------