Re: [Groff] Re: groff: radical re-implementation
At Tue, 17 Oct 2000 22:19:09 +0900,
Tomohiro KUBOTA <email@example.com> wrote:
> Though you may already know, please note that
> - Japanese and Chinese text contains few whitespace characters.
> (Japanese and Chinese words are not separated by whitespace).
> Therefore, different line-breaking algorithm should be used.
> (Hyphen character is not used when a word is broken into lines.)
> (Modern Korean language contains whitespace characters between
> words --- though not words, strictly speaking.)
As for the line-breaking algorithm, I think we need some more cflags,
at least for Japanese. That is,
- lines must not be broken before the character
- lines must not be broken after the character
These seem to be implemented as PRE_KINSOKU and POST_KINSOKU in jgroff,
but they are hardcoded. I think this should be done in tmac.<lang>,
so it would be a good idea to have some mechanism for loading
language-specific tmac files.
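
To make the two rules concrete, here is a rough sketch in Python (not
groff code). The flag names, the character table, and the greedy
filling are all assumptions made for illustration. A real table would
come from a language-specific macro file, and every character is
treated as one column wide for simplicity.

```python
# Hypothetical per-character flags mirroring the two rules above.
NO_BREAK_BEFORE = 1  # a line must not start with this character
NO_BREAK_AFTER = 2   # a line must not end with this character

# Illustrative table only; not taken from jgroff.
CFLAGS = {}
for ch in "、。，．）」』":
    CFLAGS[ch] = NO_BREAK_BEFORE
for ch in "（「『":
    CFLAGS[ch] = NO_BREAK_AFTER


def can_break(text, i):
    """Return True if a break is allowed between text[i-1] and text[i]."""
    if i <= 0 or i >= len(text):
        return False
    if CFLAGS.get(text[i - 1], 0) & NO_BREAK_AFTER:
        return False
    if CFLAGS.get(text[i], 0) & NO_BREAK_BEFORE:
        return False
    return True


def break_lines(text, width):
    """Greedy filling: at most `width` characters per line, backing up
    to the nearest permitted break position."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        while end > start + 1 and not can_break(text, end):
            end -= 1
        if not can_break(text, end):
            end = start + width  # no legal break found: force one
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines
```

For example, break_lines("これは「テスト」です。", 4) never leaves "「" at
the end of a line and never puts "」" or "。" at the start of one.
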
BTW, what do you think about a code name for the multibyte/wide
character, or for the glyph code you mentioned? In jgroff, it seems it used
> - Hyphenation algorithm differs from language to language.
This is already implemented by .hla <language>, isn't it?
> - Almost CJK characters (ideographics, hiragana, katakana, hangul,
> and so on) have double width on tty. Since you won't use wchar_t,
> you cannot use wcwidth() to get the width for characters.
> The source code for Xterm multiwidth/combining character extention
> patch may help you, though it is based on unicode.
I think we could use the font description information for it.
jgroff provides a "fixedkanji" directive in the font description.
However, the font description loader depends on the EUC<->KuTen
mapping, which is not a good idea for i18n. I think it would be better
to provide a "wcharset" directive which supports code ranges. However,
code ranges could not be used with EUC or similar encodings, nor with
Unicode, because we cannot expect the character codes for a given
language to be contiguous.
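
As a rough illustration of what a code-range lookup could look like,
here is a small Python sketch. The table layout, the names, and the
choice of Unicode code points are assumptions, not anything jgroff or
groff defines. Note that even this tiny table needs several disjoint
ranges, which is the concern raised above.

```python
import bisect

# Hypothetical range table: (first_code, last_code, width_in_columns).
# The ranges are Unicode blocks, listed in ascending order.
RANGES = [
    (0x3040, 0x309F, 2),  # Hiragana
    (0x30A0, 0x30FF, 2),  # Katakana
    (0x4E00, 0x9FFF, 2),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3, 2),  # Hangul Syllables
]
STARTS = [first for first, _, _ in RANGES]


def char_width(code, default=1):
    """Look up the width for a character code by binary search over
    the range table; codes outside every range get `default`."""
    i = bisect.bisect_right(STARTS, code) - 1
    if i >= 0 and code <= RANGES[i][1]:
        return RANGES[i][2]
    return default
```
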
Anyway, jgroff provides new fonts "M" and "G", which are "Mincho" and
"Gothic" respectively, for wide characters. What is the right way
to add i18n support to font descriptions in groff?
> - Latin-1 people may use 0xa9 for '\(co'. However, this character
> cannot be read in other encodings. The current Groff convert
> '\(co' to 0xa9 in latin1 device and to '(C)' in ascii device.
> How it works for future Groff? Use u+00a9? The postprocessor
> (see below) cannot convert u+00a9 to '(C)' because the width is
> different and typesetting is broken. It is very difficult to
> design to avoid this problem...
Couldn't this be implemented with the character translation mechanism
in groff? I think the right way is to select the translation according
to the output device, that is, the -T<device> option. For tty devices
it might also check the $TERM environment variable, because a Japanese
terminal emulator wouldn't display u+00a9 correctly, so some
translation is required.
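
A minimal sketch of the idea in Python, with made-up table and function
names: the translation is selected by device name before the width is
computed, so the width used for line filling matches what is actually
printed. The two entries are the ones mentioned in the quoted text, and
each output character is assumed to take one column.

```python
# Hypothetical per-device translation tables, keyed by the special
# character name (the "co" in \(co).
TRANSLATIONS = {
    "latin1": {"co": "\u00a9"},
    "ascii": {"co": "(C)"},
}


def translate(name, device):
    """Return the output string for a special character on a device."""
    return TRANSLATIONS[device][name]


def output_width(name, device):
    """Width in columns of the translated text, assuming one column
    per output character."""
    return len(translate(name, device))
```

With these tables, output_width("co", "ascii") is 3 and
output_width("co", "latin1") is 1, so the typesetting would use the
width of the text that is really emitted.
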
> > . Another approach to influence text processing (not input encoding)
> > is the OpenType way: --script, --language, --feature.
> Though I don't know them, I hope these can handle Asian languages.
> > . Command lines shall be able to override input encoding
> > (--input-encoding).
How about creating a new request (.encoding "encoding-name") in roff
> > > One compromise is that:
> > > - to use UCS-4 for internal processing, not wchar_t.
> > The internal processing within groff should use glyph IDs resp. glyph
> > names.
Could you explain how glyph IDs are mapped to output codes (or point
us to the relevant source code in groff), please?
For the tty output device, UTF-8 plus a postprocessor may be possible,
but how about X, ps, or dvi?