Re: [Groff] Re: groff: radical re-implementation


At Wed, 18 Oct 2000 00:46:46 +0200 (CEST),
Werner LEMBERG <wl@gnu.org> wrote:

>>  - GNU troff will support UTF-8 only.  Thus, multibyte encodings
>>    will be not supported.  [Though UTF-8 is multibyte :-p ]
> This was a typo, sorry.  I've meant that I don't want to support
> multiple multibyte encodings.
>>  - Groff handles glyph, not character.
>> I don't understand relationship between these two.  UTF-8 is a code
>> for character, not glyph.  ISO8859-1 and EUC-JP are also codes for
>> character.  No difference among UTF-8, ISO8859-1, and EUC-JP.

Ah, by 'these two' I meant the relationship between Groff supporting
UTF-8 only and Groff processing glyphs.  Sorry for my poor English.

However, thank you for explaining glyphs.  I can also see that you
understand the problems of Japanese character codes well.

I also understand the basic design of groff, though I had to read
the source code of groff myself...  I wonder why people who know it
better than I do don't join this list and discuss about

Note that CJK ideographs also have a distinction between character and
glyph.  The most famous example is the two variants of the 'tall or high'
character.  Japanese people regard the two as the same in daily use,
but as different when they are used in personal names and the like.
I don't know how Chinese and Korean people treat them.  It may be
different.  However, IMHO, we should set this problem aside for now,
since there is so far no standard for treating these variants properly.
Though it is important, it is not in our scope.

> A `glyph code' is just an arbitrary registration number for a glyph
> specified in the font definition file.

Then the 'font definition file' will be unreasonably large.  I think
at least CJK ideographs and Korean precomposed Hangul have to be
treated in a different way.  (Ukai has already pointed out this problem.
jgroff uses 'wchar<EUCcode>' for glyph names of Japanese characters.)

> For tty devices, the route is as follows.  Let's assume that the input
> encoding is Latin-1.  Then the input character code `0xa9' will be
> converted to Unicode character `U+00a9' (by the preprocessor). 
> A hard-coded table maps this character code to a glyph with the name
> `co'.  Now troff looks up the metric info in the font definition file.


> If the target device is an ASCII-capable terminal, the width is three
> characters (the glyph `co' is defined with the .char request to be
> equal to `(C)'); if it is a Unicode-capable terminal, the width is one
> character.  After formatting, a hard-coded table maps the glyphs back
> to Unicode.

How does troff know whether the tty device is ASCII-capable or
Unicode-capable?
 --- OK, I understand it from the next line:

>  -m ascii --device=tty --output-encoding=ascii

'-m ascii' tells it that.  '--output-encoding' will be passed through
to the postprocessor.
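
To check my understanding of the tty route, here is a minimal sketch in
Python (not groff code).  Only the 0xa9 input, the glyph name `co`, and
the `(C)` fallback come from the description above; the table and
function names are hypothetical, and the width is simply a character
count, which only works because this toy has no wide characters.

```python
# Toy model of the proposed tty route; every name here is illustrative.

# Hard-coded table: Unicode code point -> glyph name.
CHAR_TO_GLYPH = {0x00A9: "co"}

# Inverse table used after formatting: glyph name -> Unicode code point.
GLYPH_TO_CHAR = {name: cp for cp, name in CHAR_TO_GLYPH.items()}

# What an "ascii" macro package would define with .char: co -> (C)
ASCII_FALLBACKS = {"co": "(C)"}


def decode_input(data, encoding="latin-1"):
    """Preprocessor step: input bytes -> list of Unicode code points."""
    return [ord(ch) for ch in data.decode(encoding)]


def render(code_points, ascii_only):
    """Map code points to glyphs, then to output text and its width."""
    out = []
    for cp in code_points:
        glyph = CHAR_TO_GLYPH.get(cp)
        if glyph is None:
            out.append(chr(cp))                 # no glyph entry: pass through
        elif ascii_only:
            out.append(ASCII_FALLBACKS[glyph])  # .char replacement text
        else:
            out.append(chr(GLYPH_TO_CHAR[glyph]))  # map glyph back to Unicode
    text = "".join(out)
    return text, len(text)
```

With the input byte 0xa9, `render(decode_input(b"\xa9"), ascii_only=True)`
gives a width of three and `ascii_only=False` gives a width of one, which
matches the two cases described above.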

One problem.  When groff is compiled on an internationalized OS, the
names of encodings (for iconv(3) and so on) are implementation-dependent
(as you know, there are many implementation-dependent items in
standard C/C++).  A solution: we can have a hard-coded
translation table between implementation-dependent encoding names
and macro names for -m.  The table must be adapted to each OS (by the
'./configure' script or the like).  A minimal table would translate
every implementation-dependent encoding name into the 'ascii' macro,
since almost all encodings in the world are supersets of ASCII.  A full
table for an OS would cover the list generated by 'iconv --list'.
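
A rough sketch of such a table, in Python for brevity (the real thing
would live in groff's own source and be generated per OS).  The encoding
names and macro names below are only examples I made up for
illustration; the one firm rule is the fallback to 'ascii'.

```python
# Hypothetical per-OS table that ./configure might generate.
# Keys: the platform's encoding names (upper-cased for lookup).
# Values: macro names to pass with -m.  All entries are examples only.
ENCODING_TO_MACRO = {
    "ISO-8859-1": "latin1",
    "UTF-8": "utf8",
    "EUC-JP": "eucjp",
}


def macro_for_encoding(name, table=ENCODING_TO_MACRO):
    """Return the macro name for an encoding name, defaulting to 'ascii'."""
    return table.get(name.upper(), "ascii")
```

The minimal table is just the empty one: `macro_for_encoding(name, {})`
returns 'ascii' for every name, so an OS with unknown encoding names
still gets usable output.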

# Though I think some standardization of names for encoding is needed,
# it is not our topic now.

Since the '-m' option is generated by groff and passed to troff,
groff has to have '#ifdef I18N' code.  (Or, the code can be
integrated into the preprocessor if we design the preprocessor to invoke

Tomohiro KUBOTA <kubota@debian.org>
