
Re: charsets (7) manpage



Hi,

At Mon, 7 May 2001 03:04:48 -0500,
David Starner <dstarner98@aasaa.ofe.org> wrote:

> I noticed that charsets (7) from the manpages package had a number
> of inaccuracies, so I did some editing. Before I send it back to
> the maintainer, I thought I might send to here, so that any errors
> I added could get fixed, and hopefully other errors could be found.
> Please note that most of my fixes were in the Unicode and ISO-8859
> areas. Someone else is going to have to pick apart the ISO-2022 
> section. My version of the manpage, and the diff between it and
> the old version, are attached.

Good work!
May I make some requests?


> .SH ASCII
...
> Various ASCII variants replacing the dollar sign with other currency
> symbols and replacing punctuation with non-English alphabetic characters 
> to cover German, French, Spanish and others in 7 bits exist. All are 
> deprecated; GNU libc doesn't support locales whose character sets aren't
> true supersets of ASCII.

I think these ASCII variants are the national versions of ISO 646.
For example, please check the chapter "Domestic ISO 646 Character
Tables" at
http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html
I'd like you to mention the name "ISO 646" (though it may not be
very important, because it is not widely used and Linux does not
support it).
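
To illustrate what such a variant looks like, here is a minimal Python
sketch of the German national version (DIN 66003), which reuses eight
ASCII punctuation positions for German letters.  The helper name
decode_iso646_de is hypothetical and written only for this mail; it is
not an existing library function.

```python
# Positions where the German ISO 646 variant (DIN 66003) differs from
# US ASCII. All other 7-bit positions are identical to ASCII.
ISO646_DE = {
    0x40: "§", 0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß",
}


def decode_iso646_de(data: bytes) -> str:
    """Decode 7-bit bytes as the German ISO 646 variant (hypothetical helper)."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError("ISO 646 is a 7-bit code")
        out.append(ISO646_DE.get(b, chr(b)))
    return "".join(out)


# The same bytes read as "Gr}~e" in US ASCII.
text = decode_iso646_de(b"Gr}~e")
print(text)
```

This is why such variants are not true supersets of ASCII: the bytes
for "}" and "~" mean something else.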


> .SH KOI8-R
> KOI8-R is a non-ISO character set popular in Russia.  The lower half
> is US ASCII; the upper is a Cyrillic character set somewhat better
> designed than ISO 8859-5. KOI8-U is a common character set, based off
> KOI8-R, that has better support for Ukranian.
> .LP
> Console support for KOI8-R is available under Linux through user-mode
> utilities that modify keyboard bindings and the EGA graphics table,
> and employ the "user mapping" font table in the console driver.

It could also be mentioned that KOI8-R uses the 0x80-0x9f code space
for visible characters (and so is not ISO 2022-compliant).
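
A quick way to see this, sketched in Python under the assumption that
the standard "koi8_r" and "latin_1" codecs are available: decode the
bytes 0x80-0x9f with both and count how many come out as control
characters (Unicode category "Cc").

```python
import unicodedata

# The byte range that ISO 2022 reserves for C1 control characters.
c1_range = bytes(range(0x80, 0xA0))

koi8r_chars = c1_range.decode("koi8_r")
latin1_chars = c1_range.decode("latin_1")  # ISO 8859-1, for comparison

# Count the characters that Unicode classifies as controls.
koi8r_controls = [c for c in koi8r_chars if unicodedata.category(c) == "Cc"]
latin1_controls = [c for c in latin1_chars if unicodedata.category(c) == "Cc"]

print("KOI8-R controls in 0x80-0x9f:", len(koi8r_controls))
print("ISO 8859-1 controls in 0x80-0x9f:", len(latin1_controls))
```

In KOI8-R this range holds box-drawing and mathematical symbols, while
in ISO 8859-1 all 32 positions are controls.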

Furthermore, I think this document could mention several other major
non-ISO character sets which are national standards.  I have in mind
JIS X 0208 (Japanese), KS X 1001 (Korean), GB 2312 (simplified
Chinese), Big5 (traditional Chinese), and TIS 620 (Thai).  These
character sets are widely used in their respective countries, and
Debian has software which supports them (besides Mule/Emacs).

The following is an example of how these character sets could be
explained.  Mentioning East Asian character sets has the advantage
that the concept of multibyte characters and the distinction between
a character set and an encoding can be explained.


.SH JIS X 0208
JIS X 0208 is a Japanese national standard character set.  Though
there are other national standard character sets (such as JIS X
0201, JIS X 0212, and JIS X 0213), this is the most important one
for Japanese.  Characters are mapped into a 94x94 two-byte matrix,
in which each byte takes a value in the range 0x21-0x7e.  Note that
JIS X 0208 is a character set, not an encoding.  This means that
JIS X 0208 by itself is not used for expressing text data.  Instead,
it is used as a component to construct encodings such as EUC-JP,
Shift_JIS, and ISO-2022-JP.  EUC-JP is the most important encoding
for Linux and includes US ASCII and JIS X 0208.  In EUC-JP, a
JIS X 0208 character is expressed in two bytes, each of which is the
corresponding byte of the JIS X 0208 code plus 0x80.
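
As an aside (not part of the proposed manpage text), the "plus 0x80"
rule can be sketched in Python.  The function jis_to_euc_jp is a
hypothetical helper written for this mail; the result is compared
against Python's own "euc_jp" codec, which I assume is available.

```python
def jis_to_euc_jp(jis_code: int) -> bytes:
    """Turn a two-byte JIS X 0208 code into its EUC-JP byte pair."""
    hi, lo = jis_code >> 8, jis_code & 0xFF
    # Both bytes of a JIS X 0208 code lie in the 94x94 matrix.
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("not a JIS X 0208 code: %#06x" % jis_code)
    # EUC-JP sets the high bit of each byte, i.e. adds 0x80.
    return bytes([hi + 0x80, lo + 0x80])


# Hiragana "a" has the JIS X 0208 code 0x2422.
euc = jis_to_euc_jp(0x2422)
print(euc.hex(), euc.decode("euc_jp"))

# Shift_JIS encodes the same JIS X 0208 character with different bytes.
print("あ".encode("shift_jis").hex())
```

The same character from the same character set gets different byte
sequences in EUC-JP and Shift_JIS, which is the character set versus
encoding distinction in a small example.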

.SH KS X 1001
KS X 1001 is a Korean national standard character set.  Just as in
JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
KS X 1001 is not used as an encoding by itself; instead, it is used
as a component to construct encodings such as EUC-KR, Johab, and
ISO-2022-KR.  EUC-KR is the most important encoding for Linux and
includes US ASCII and KS X 1001.  KS C 5601 is the previous name of
KS X 1001.
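
As an aside (again not part of the proposed manpage text), the 94x94
matrix can be recovered from EUC-KR bytes.  This Python sketch assumes
that EUC-KR follows the same "plus 0x80" convention as the other EUC
encodings; euc_kr_to_row_cell is a hypothetical helper.

```python
def euc_kr_to_row_cell(pair: bytes) -> tuple[int, int]:
    """Map a two-byte EUC-KR sequence to its KS X 1001 (row, cell), 1-94 each."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    b1, b2 = pair
    if not (0xA1 <= b1 <= 0xFE and 0xA1 <= b2 <= 0xFE):
        raise ValueError("not a KS X 1001 character in EUC-KR")
    # Byte 0xA1 is 0x21 + 0x80, which is row (or cell) number 1.
    return b1 - 0xA0, b2 - 0xA0


encoded = "한".encode("euc_kr")
row, cell = euc_kr_to_row_cell(encoded)
print(encoded.hex(), row, cell)
```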

.SH GB 2312
GB 2312 is a mainland Chinese national standard character set
for expressing simplified Chinese.  Just as in JIS X 0208, characters
are mapped into a 94x94 two-byte matrix.  GB 2312 is not used as an
encoding by itself; instead, it is used as a component to construct
encodings such as EUC-CN.  EUC-CN is the most important encoding for
Linux and includes US ASCII and GB 2312.  Note that EUC-CN is often
called GB, GB 2312, or CN-GB, so be careful not to confuse the
character set with the encoding.
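
The naming confusion is easy to demonstrate (an aside, not proposed
manpage text).  Python's codec named "gb2312" actually implements the
EUC-CN encoding, and its "hz" codec is a different, 7-bit encoding of
the same GB 2312 character set.  The sketch below assumes both codecs
are present in the Python installation.

```python
ch = "中"

# Despite its name, this codec produces EUC-CN bytes (high bit set).
euc_cn = ch.encode("gb2312")

# Strip the high bit to recover the underlying GB 2312 code.
gb_code = bytes(b - 0x80 for b in euc_cn)

# HZ carries the same GB 2312 code in 7 bits between ~{ and ~} markers.
hz = ch.encode("hz")

print("EUC-CN bytes:", euc_cn.hex())
print("GB 2312 code:", gb_code.hex())
print("HZ bytes:    ", hz)
```

One character set, two encodings: the GB 2312 code is the same in
both, only the byte-level wrapping differs.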

.SH Big5
Big5 is a popular character set in Taiwan for expressing traditional
Chinese.  It is a superset of US ASCII.  Non-ASCII characters
are expressed in two bytes.  Bytes in the upper half (0xa1-0xfe) are
used as the leading byte of two-byte characters.  Big5 and its
extensions are widely used in Taiwan and Hong Kong.  It is not
ISO 2022-compliant.

.SH TIS 620
TIS 620 is a Thai national standard character set and a superset
of US ASCII.  As in the ISO 8859 series, the non-ASCII (Thai)
characters are mapped into the range 0xa1-0xfe.
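
As an aside (not proposed manpage text), the mapping is simple enough
to sketch in Python: a Thai byte corresponds to the Unicode Thai block
at a fixed offset.  tis620_to_unicode is a hypothetical helper, and it
does not check for the few unassigned positions inside 0xa1-0xfe.

```python
def tis620_to_unicode(b: int) -> str:
    """Convert one TIS 620 byte to a Unicode character."""
    if b < 0x80:
        return chr(b)  # lower half is US ASCII
    if 0xA1 <= b <= 0xFE:
        # Thai characters sit at a fixed offset from U+0E00.
        # Assumption: unassigned positions in this range are not rejected.
        return chr(b - 0xA0 + 0x0E00)
    raise ValueError("byte %#04x is outside TIS 620" % b)


raw = "ไทย".encode("tis_620")
decoded = "".join(tis620_to_unicode(b) for b in raw)
print(raw.hex(), decoded)
```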

What do you think about including explanations like these?

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/


