Re: [Groff] Re: groff: radical re-implementation
At Tue, 17 Oct 2000 09:20:37 +0200 (CEST),
Werner LEMBERG <firstname.lastname@example.org> wrote:
> Well, I insist that GNU troff doesn't support multibyte encodings at
> all :-) troff itself should work on a glyph basis only. It has to
> work with *glyph names*, be it CJK entities or whatever. Currently,
> the conversion from input encoding to glyph entities and the further
> processing of glyphs is not clearly separated. From a modular point
> of view it makes sense if troff itself is restricted to a single input
> encoding (UTF-8) which is basically only meant as a wrapper to glyph
> names (cf. \U'xxxx' to enter Unicode encoded characters). Everything
> else should be moved to a preprocessor.
This paragraph says two things:
- GNU troff will support UTF-8 only. Thus, multibyte encodings
  will not be supported. [Though UTF-8 is multibyte :-p ]
- Groff handles glyphs, not characters.
I don't understand the relationship between these two. UTF-8 is an
encoding for characters, not glyphs. ISO 8859-1 and EUC-JP are also
encodings for characters. There is no difference among UTF-8,
ISO 8859-1, and EUC-JP in this respect.
However, I won't insist on wchar_t or UCS-4 for the internal code,
though I have no idea about your '31-bit glyph code'. (Maybe I
have to study Omega...)
> > The other merit of wchar_t is user-friendliness. Once a user set
> > LANG variable, every softwares work under the specified encoding.
> > If not, you have to specify encodings for every software. We don't
> > want to have ~/.groffrc, ~/.greprc, ~/.bashrc, ~/.xtermrc, and so on
> > so on to specify 'encoding=ISO8859-1' or 'encoding=UTF-8'.
> I must admit that I've never worked with wchar_t and friends. I see
> the benefits, though. And with glibc, we have a nice ready-to-use
> library which enormously simplifies the implementation.
Ok, this merit can be achieved by your idea of a preprocessor.
Please note that glibc is not the only system which supports locales.
I suppose that almost all commercial UNIX systems support locales,
and I suppose the BSD variants (Free-, Net-, and Open-) do too,
though I don't know them well.
You know, Sun Microsystems recently offered an internationalization
technology to the open source community. (http://www.sun.com/solaris/global/)
This means Solaris supports locales.
> > I suppose you don't know about 'ascii8' device.
> I know it :-) I've studied the Japanese patch in detail.
Wonderful! However, though it is true that I wrote the
'ascii8' patch, that patch has no relation to the Japanese patch.
My patch consists of two parts:
- `cp -r devlatin1 devascii8`, with 'char***' entries added for
  codes above 0x80.
- the hard-coded soft-hyphenation character (which is valid only
  for Latin-1) is changed (this seems to have been fixed in Groff 1.16).
> > > Maybe on systems with a recent glibc, iconv() and friends can be
> > > used to do more, but generally I prefer an iconv-preprocessor so
> > > that groff itself has not to deal with encoding conversions.
> > I think this works well. However, who invokes iconv-preprocessor?
> groff (the program).
> > A user or wrapper-software?
> groff *is* the wrapper of troff.
I understand. This works, since it is not likely that a user invokes
troff directly.
> > What determines the command option for iconv?
> I suggest various possibilities.
First, let me say that I agree with your idea for the most part.
> . By default, locales should be used. A --locale option might be a
> good idea also since this has implications on text processing;
> example: Portuguese doesn't use fi and fl ligatures. The problem
> is that a locale influences both input encoding and text
> processing which might not always be the right choice.
The name '--locale' is confusing since it has no relation to
locale, i.e., the term which refers to a certain standard technology.
Though you may already know this, please note that:
- Japanese and Chinese text contains few whitespace characters.
  (Japanese and Chinese words are not separated by whitespace.)
  Therefore, a different line-breaking algorithm should be used.
  (A hyphen character is not used when a word is broken across lines.)
  (Modern Korean does put whitespace between words --- though not
  exactly words, strictly speaking.)
- The hyphenation algorithm differs from language to language.
- Almost all CJK characters (ideographs, hiragana, katakana, hangul,
  and so on) have double width on a tty. Since you won't use wchar_t,
  you cannot use wcwidth() to get the width of a character.
  The source code of the xterm multiwidth/combining character
  extension patch may help you, though it is based on Unicode.
- Latin-1 people may use 0xa9 for '\(co'. However, this character
  cannot be read in other encodings. The current Groff converts
  '\(co' to 0xa9 on the latin1 device and to '(C)' on the ascii
  device. How will this work in a future Groff? Use U+00A9? The
  postprocessor (see below) cannot convert U+00A9 to '(C)' because
  the widths differ and the typesetting would be broken. It is very
  difficult to design around this problem...
> . Another approach to influence text processing (not input encoding)
> is the OpenType way: --script, --language, --feature.
Though I don't know them, I hope these can handle Asian languages.
> . Command lines shall be able to override input encoding
> . Finally, we need to divide the -T option into a --device and
What is the default encoding for a tty? I suggest this should be
locale-sensitive. (Or, it could be UTF-8, and Groff could invoke a
converter.)
> Believe me, most professional UNIX users in Germany don't have LANG
> set correctly (including me). For example, I don't like to see German
> error messages since I'm used to the English ones. In fact, I never
> got used to the German way of handling computers. The German keyboard
> is awkward for programmers...
I think all of you will need the LANG variable when UTF-8 becomes
popular. This is because UTF-8 is multibyte. There are already many
programs which are internationalized using the locale mechanism.
These programs support UTF-8 if you set the LANG variable.
If you like English messages, you can set 'LANG=en_US.UTF-8'.
> > One compromise is that:
> > - to use UCS-4 for internal processing, not wchar_t.
> The internal processing within groff should use glyph IDs resp. glyph
As I wrote above, I won't insist on a particular internal code,
though I have no idea what a glyph code is.
> > - a small part of input and output to be encoding-sensible.
> ??? Please explain.
Sorry, I meant:
    A small part of the source code of Groff related to I/O
    has to be encoding-sensitive.
as I wrote in the previous mail. You have already commented
on this point. Since I agree with your idea of a preprocessor,
the discussion of this idea is finished.
> > - a compile-time option I18N to be introduced.
> Yes. The `iconv' preprocessor would then do some trivial, hard-coded
You mean the preprocessor is iconv(1)?
The preprocessor, with the provisional name 'gpreconv', will be
designed as follows:
- it includes hard-coded converters for latin1, ebcdic, and utf8.
- it uses iconv(3) if possible (when compiled on an internationalized OS).
- it parses an --input-encoding option.
- the default input is latin1 if compiled on a non-internationalized OS.
- the default input is locale-sensitive if compiled on an
  internationalized OS.
I think it is easy to implement 'gpreconv'.
> > - when I18N is off, default input is latin-1 and default output
> > is also latin-1.
> Sounds reasonable.
Thus I designed the above 'gpreconv'. Oh, I have to design
'gpostconv' too.
> > - when I18N is on, default input and default output are sensible
> > to LC_CTYPE locale.
> Default output? Really? Please explain.
I mean the default output encoding for the tty device. This can be
achieved by 'gpostconv'.
> I'll start with that after I've finished my reimplementation of the
> mdoc package. Don't expect this happen too quickly :-(
I see. All of us are volunteers and have no deadline.
> . Write the input encoding preprocessor. If you like, you can
> immediately start with it since it is completely independent from
> troff itself.
Ok, I think I can write that in a few weeks. It is easy, though
the source code of Groff itself is difficult for me :-)
Tomohiro KUBOTA <email@example.com>