
Re: [Groff] Re: groff: radical re-implementation


At Sat, 21 Oct 2000 10:46:51 +0200 (CEST),
Werner LEMBERG <wl@gnu.org> wrote:

> In general.  I want to define terms completely independent on any
> particular program.  We have
>   character set
>   character encoding
>   glyph set
>   glyph encoding

I understand.  Since we are discussing the preprocessor, let's
concentrate on characters, not glyphs.  I think you will now agree to
specify the 'character set/encoding' by a single name such as
'EUC-JP' instead of a pair such as 'JIS-X-0208' and 'EUC'.
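
To illustrate why a single name is enough, here is a minimal Python
sketch (an illustration only, not the preprocessor's code).  It assumes
Python's built-in codec registry; the helper name to_utf8 is hypothetical.
The one name 'EUC-JP' selects both the character set and the byte
encoding in a single lookup.

```python
def to_utf8(data: bytes, encoding_name: str) -> bytes:
    """Decode bytes using one combined charset/encoding name, re-encode as UTF-8."""
    return data.decode(encoding_name).encode("utf-8")


# "nihongo" (the word for the Japanese language), written in kanji.
text = "\u65e5\u672c\u8a9e"
sample = text.encode("euc_jp")

# No separate 'JIS-X-0208' + 'EUC' pair is needed, only the single name.
utf8 = to_utf8(sample, "EUC-JP")
print(len(sample), len(utf8))
```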

BTW, I am implementing the preprocessor.  It currently has these features:
 - input from standard input (stdin)
 - output to standard output (stdout)
 - an I18N directive to support locale-sensitive mode
 - hard-coded converters from Latin1, EBCDIC, and UTF-8 to UTF-8
 - a locale-sensitive converter from any encoding supported by the OS
   to UTF-8 (note: UTF-8 has to be supported by iconv(3))
 - the input encoding is determined by a command-line option or the default
 - the default is 'latin1' when compiled without I18N, or
   locale-sensitive when compiled with I18N
However, I still have to implement the following:
 - the encoding also has to be determined by a '-*- ... -*-' directive in
   the roff source
 - (I18N mode) it has to be possible to specify the encoding by MIME-style
   and Emacs-style names
 - efficiency of memory and CPU usage is not considered yet
 - input from files besides stdin
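
The encoding selection described in the two lists above could be
sketched roughly as follows, in Python.  This is an illustration under
stated assumptions, not the actual preprocessor.  The 'coding:' key
follows the Emacs file-variable convention and is only one possible
content for the '-*- ... -*-' line.  The precedence (option, then
directive, then default) is also an assumption, and all function names
are hypothetical.

```python
import re
import sys

# Assumed directive form on the first line, e.g.:  .\" -*- coding: euc-jp -*-
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([A-Za-z0-9_.-]+).*?-\*-")


def detect_encoding(data, option=None, default="latin-1"):
    """Pick the input encoding: option first, then directive, then default."""
    if option:
        return option
    first_line = data.split(b"\n", 1)[0]
    match = CODING_RE.search(first_line)
    if match:
        return match.group(1).decode("ascii")
    return default


def preprocess(data, option=None):
    """Convert roff source bytes in the detected encoding to UTF-8 bytes."""
    return data.decode(detect_encoding(data, option)).encode("utf-8")


def main():
    # Filter mode: stdin to stdout, as in the feature list above.
    sys.stdout.buffer.write(preprocess(sys.stdin.buffer.read()))
```

Calling main() turns this into a stdin-to-stdout filter.  Python's codec
lookup accepts several spellings of one encoding (for example 'EUC-JP'
and 'euc_jp'), which stands in here for the MIME-style and Emacs-style
name handling that is still to be implemented.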

I will send the source soon.

> >    roff source in any encoding like '\(co'     (character)
> >           |
> >           |  preprocessor
> >           V
> >    UTF-8 stream like u+00a9                    (character)
> >           |
> >           |  troff
> >           V
> >    glyph expression like 'co'                  (glyph)
> >           |
> >           |  troff (continuing)
> >           V
> Here is missing a step:
>      typeset output                              (glyph)
>             |
>             |  grotty
>             V
> >    UTF-8 stream like u+00a9 or '(C)'           (character)
> >           |
> >           |  postprocessor
> >           V
> >    formatted text in any encoding              (character)
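
As a rough illustration of the last steps in the diagram above, here is
a small Python sketch.  It is a simplification: it merges the choice
between u+00a9 and '(C)' (which the diagram places before the
postprocessor) with the final conversion to the output encoding.  The
fallback table and the function name are hypothetical.

```python
# Hypothetical ASCII fallbacks for characters the target encoding lacks.
ASCII_FALLBACKS = {"\u00a9": "(C)", "\u00ae": "(R)"}


def postprocess(text, encoding):
    """Encode text, substituting ASCII fallbacks for unencodable characters."""
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)  # probe only: can the target represent it?
            pieces.append(ch)
        except UnicodeEncodeError:
            pieces.append(ASCII_FALLBACKS.get(ch, "?"))
    # Encode once at the end so stateful encodings are handled as a whole.
    return "".join(pieces).encode(encoding)
```

With 'latin-1' as the target the copyright sign is kept as a single
byte; with 'ascii' it becomes '(C)'.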

I understand well.  Thank you for your explanation.
BTW, besides TTY output, HTML output will also need postprocessing from
glyphs to characters, as 'grotty' does in tty mode, since HTML is a text
file.  I think the encoding for HTML can always be UTF-8.  We can add the
following line between <HEAD> and </HEAD>:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

(I found code in grohtml.cc that writes this line without charset

Tomohiro KUBOTA <kubota@debian.org>
