
Re: [Groff] Re: groff: radical re-implementation


At Thu, 19 Oct 2000 22:12:07 +0200 (CEST),
Werner LEMBERG <wl@gnu.org> wrote:

> This is not true.  Encoding does *not* imply the character set.
> You are talking about charset/encoding tags.

Hmm, I cannot understand your idea...

In Emacs, charsets such as ISO8859-1, JISX0208.1990, and BIG5 are
defined.  Using these charsets, encodings such as euc-japan,
iso-2022-jp, and iso-2022-7bit are defined.  A user can select
these encodings with, for example, M-x set-buffer-file-coding-system
or M-x set-terminal-coding-system.  (A 'coding-system' name can be
followed by '-unix', '-dos', or '-mac', which specifies the
line-ending convention.)  I can also specify the encoding of a file
by writing '-*-coding: euc-jp;-*-' in the first line of the file.
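
A minimal sketch of how a preprocessor might pick up such a tag from the
first line of a file.  The function name `coding_tag` and the regular
expression are assumptions made for illustration; this is not how Emacs
itself parses file variables.

```python
import re

# Matches a "coding: NAME" entry inside a -*- ... -*- line.
# The pattern is a simplification chosen for this sketch.
CODING_RE = re.compile(r"-\*-.*?\bcoding:\s*([A-Za-z0-9._-]+).*?-\*-")

def coding_tag(first_line: str):
    """Return the coding name from an Emacs-style tag, or None if absent."""
    match = CODING_RE.search(first_line)
    return match.group(1) if match else None

tag = coding_tag('.\\" -*-coding: euc-jp;-*-')
```

Note that the tag carries a single name (here 'euc-jp'); no separate
charset is given.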

You said that encoding names can be specified by MIME charset tag
names.  I write mails in 'charset=us-ascii' or in 'charset=iso-2022-jp'
and web pages in 'charset=us-ascii', 'charset=iso-2022-jp',
'charset=euc-jp', 'charset=utf-8', and so on.  I never specify
encoding and charset separately.  Nor do I write 'charset=euc'.

ISO-2022 is an encoding which includes many charsets.  Using ISO-2022,
I can write multilingual text including US-ASCII, ISO 646-*, ISO 8859-*,
JIS X * (Japanese), CNS 11643 (traditional Chinese), GB 2312 (simplified
Chinese), TIS620 (Thai), and so on.  GL, GR, G0, G1, G2, and G3 can
be used for these charsets with clearly defined escape sequences and
other control codes.  Since the escape sequences and control codes are
clearly defined, we don't need 'charset=' information to read ISO-2022
text.  The preprocessor can work without it (though I won't implement
an ISO-2022 converter; I will implement only four converters to UTF-8:
from Latin-1, EBCDIC, UTF-8 (no conversion), and the locale encoding
(iconv(3))).
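
A rough sketch of the four input converters, assuming Python's built-in
codecs stand in for the real conversion code.  The name `to_utf8`, the
source labels, and the choice of 'cp037' as the EBCDIC code page are all
assumptions for illustration; the message does not say which EBCDIC
variant is meant.

```python
import locale

# Python codec names standing in for three of the four converters.
# "cp037" is only one of several EBCDIC code pages (an assumption here).
CODECS = {
    "latin1": "latin-1",
    "ebcdic": "cp037",
    "locale": locale.getpreferredencoding(False),
}

def to_utf8(data: bytes, source: str) -> bytes:
    """Convert input bytes in the named source encoding to UTF-8."""
    if source == "utf8":
        return data  # already UTF-8: pass through without conversion
    return data.decode(CODECS[source]).encode("utf-8")

latin1_out = to_utf8(b"caf\xe9", "latin1")
ebcdic_out = to_utf8(b"\xc1\xc2", "ebcdic")
```

In each case a single encoding name is enough to select the converter.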

# Note that conversion from Unicode variants to ISO-2022 (not ISO-2022-JP,
# ISO-2022-CN, and so on) is problematic and almost impossible.
# However, we are now discussing reading the roff source, not writing it.

Indicating JIS-X-0208 and EUC is insufficient to specify an encoding.
Likewise, naming JIS-X-0208 and ISO-2022 lacks information.  In the former
case, EUC can handle four character sets for GL, GR, SS2, and SS3.
EUC-JP is: ASCII for GL and JIS-X-0208 for GR.  ISO-2022-JP is more
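
To illustrate the point about EUC-JP, here is a small sketch using
Python's 'euc_jp' and 'iso2022_jp' codecs (an assumption; any conforming
converter would do).  It encodes one JIS X 0208 character both ways.  The
helper name `strip_high_bit` is hypothetical.

```python
# Escape sequences used by ISO-2022-JP to switch G0 (invoked into GL).
ESC_JISX0208 = b"\x1b$B"   # designate JIS X 0208 to G0
ESC_ASCII = b"\x1b(B"      # designate ASCII to G0

def strip_high_bit(euc: bytes) -> bytes:
    """Map EUC-JP GR bytes back to the 7-bit JIS X 0208 code."""
    return bytes(b & 0x7F for b in euc)

ch = "\u3042"                   # HIRAGANA LETTER A, in JIS X 0208
euc = ch.encode("euc_jp")       # two bytes in GR (high bit set)
iso = ch.encode("iso2022_jp")   # escape, two bytes in GL, escape back
jis_code = strip_high_bit(euc)  # the underlying JIS X 0208 code
```

The same JIS X 0208 code appears in both byte strings; only the way it
is placed in GL or GR differs.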

From a practical point of view: under your idea, a user can specify
'charset: KOI8-R; encoding: EUC', which cannot be specified under my
idea.  However, I don't think this is a reason to consider your idea
superior.  Rather, IMHO, such usage is harmful.

By these terms I mean
 - character set: CCS (Coded Character Set) in RFC 2130
 - encoding: CES (Character Encoding Scheme) in RFC 2130

I don't understand in what context you say 'EUC' is an encoding.

And, most importantly I think, what is the merit of your idea?

Tomohiro KUBOTA <kubota@debian.org>
