Re: A small question
On Thu, Jul 01, 1999 at 06:13:24PM -0400, Adam Di Carlo wrote:
> Well, I don't know for sure, but:
>
> (a) it's obviously not us-ascii (7bit)
Agree. ⌣
> (b) it could be interpreted as UTF8, but if that was the case, why
> would Russian/Japanese not use Unicode too?
It cannot be <emphasis>interpreted</emphasis> as UTF8 unless only 7bit
characters are there.
> (c) it uses the stock SGML entities we supply, which are
> Unicode/ISOLat1.
Hmm... I would disagree on using `stock' here. On this later.
> Your distinction between Unicode (UTF8) and ISOLat1 is basically
> irrelevant, since, as far as I know, for all the character entities
> that debiandoc-sgml uses, they are represented identically in both
> representations.
Let me explain. There is the Unicode standard, which says `these symbols have
these codes; if we use these codes, they should be drawn this way'. There are
several methods for representing these codes: UCS-2, UCS-4, UTF-7, UTF-8, and,
I believe, others. For example, UCS-2 uses 2 bytes to represent each Unicode
code. In this case, symbols from ISOLat1 are basically the same thing as in
iso-8859-1, since the first byte will always be zero. But in the case of UTF-8
(which we seem to be discussing), the situation is different. According to
RFC2279, UCS-2 codes should be converted to UTF-8 using these rules:
0000-007F 0xxxxxxx
0080-07FF 110xxxxx 10xxxxxx
0800-FFFF 1110xxxx 10xxxxxx 10xxxxxx
That's why I repeat: if we have ISOLat1 characters to output, these should be
encoded as 2-byte sequences in case of UTF-8. Thus, the output files we have
at the moment <emphasis>cannot</emphasis> be interpreted as UTF-8, since they
are not.
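
To make this concrete, here is a small Python sketch of the table above. The
function names utf8_encode and is_valid_utf8 are mine, chosen only for
illustration; the sketch handles UCS-2 codes only and does not treat
surrogates specially.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS-2 code (0x0000-0xFFFF) following the RFC 2279 table."""
    if code_point < 0 or code_point > 0xFFFF:
        raise ValueError("this sketch only handles UCS-2 codes")
    if code_point <= 0x7F:
        # 0000-007F: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 0080-07FF: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


COPYRIGHT = 0xA9                          # the copyright sign in ISOLat1
latin1_bytes = bytes([COPYRIGHT])         # one byte, as in our output files
utf8_bytes = utf8_encode(COPYRIGHT)       # two bytes: b'\xc2\xa9'
```

So a file containing the single byte 0xA9 fails is_valid_utf8, while the
two-byte sequence passes.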
> I don't understand why you are saying that you *can't* capture the
> character '©' but you can capture '[copy ]'? Isn't that the simplest
> way to solve things, rather than breaking the rest of the roman
> character set languages?
OK. This is due to the way nsgmls processes CDATA and SDATA. The output of
nsgmls is a file in a special format where all CDATA is immediately converted
to the appropriate character (I believe this depends on all those SP_ variables
that control the character sets), while for SDATA a special construct will
be created. If, for example, we have
<!ENTITY copy SDATA "blah blah blah">
and input is
hello &copy; world
in output we get
-hello \|blah blah blah\| world
while in case of
<!ENTITY copy CDATA "&#169;">
the output will be
-hello © world
You see, the construct \|...\| can be easily caught since it's a special thing
(`\' in input will be escaped with \ giving \\ in output). Well, in the case
of SDATA entities, I see how to make use of them.
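
Roughly, catching the \|...\| construct could look like the Python sketch
below. The function name split_data_line is hypothetical, and the sketch only
interprets the \| and \\ escapes described above; any other escape in the
line is passed through untouched.

```python
def split_data_line(line: str):
    """Split one nsgmls data line (starting with '-') into
    ("cdata", text) and ("sdata", text) pieces."""
    if not line.startswith("-"):
        raise ValueError("not a data line")
    parts = []
    buf = []
    in_sdata = False
    i = 1  # skip the leading '-'
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            if nxt == "|":
                # \| opens or closes an SDATA entity: flush what we have
                if buf:
                    parts.append(("sdata" if in_sdata else "cdata",
                                  "".join(buf)))
                    buf = []
                in_sdata = not in_sdata
            elif nxt == "\\":
                # \\ is a literal backslash from the input
                buf.append("\\")
            else:
                # other escapes are left as they are in this sketch
                buf.append(ch + nxt)
            i += 2
        else:
            buf.append(ch)
            i += 1
    if buf:
        parts.append(("sdata" if in_sdata else "cdata", "".join(buf)))
    return parts
```

For the line from the example above this gives three pieces: "hello " as
cdata, "blah blah blah" as sdata, and " world" as cdata. A backend can then
map each sdata piece to whatever its output format needs.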
> >One more issue (I just took a more thorough look at the entities supplied by
> >sgml-data): why do some files provide Unicode equivalents for entities and
> >some proprietary SDATA? Is this by design?
>
> There are none that use SDATA AFAIK. You might be mixing up sgml-data
> with some other packages which put stuff in /usr/lib/sgml/entities.
I am sorry to say that the sgml-data package, freshly downloaded and unpacked
in a separate directory, has ISO* files that define SDATA entities.
Now, returning to the `stock' SGML entities: copy and certain other
entities (nbsp, for example) are from ISOnum, while in the sgml-data package
they are defined in both of them (and they are different, BTW).
As for working out this problem, there are two possibilities: to make use of
SDATA entities in all programs that come with Debian, or to use some Unicode
encoding for intermediate/output files.
--
Mike