
Re: A small question



On Thu, Jul 01, 1999 at 06:13:24PM -0400, Adam Di Carlo wrote:
> Well, I don't know for sure, but:
> 
>   (a) it's obviously not us-ascii (7bit)
Agree. ⌣

>   (b) it could be interpreted as UTF8, but if that was the case, why
> would Russian/Japanese not use Unicode too?
It cannot be *interpreted* as UTF8 unless it contains only 7bit
characters.

>   (c) it uses the stock SGML entities we supply, which are
> Unicode/ISOLat1.
Hmm...  I would disagree on using `stock' here.  On this later.

> Your distinction between Unicode (UTF8) and ISOLat1 is basically
> irrelevant, since, as far as I know, for all the character entities
> that debiandoc-sgml uses, they are represented identically in both
> representations.
Let me explain.  The Unicode standard says `these symbols have these
codes; if we use these codes, they should be drawn this way'.  There are
several methods for representing these codes: UCS-2, UCS-4, UTF-7, UTF-8, and,
I believe, others.  For example, UCS-2 uses 2 bytes to represent each Unicode
code.  In this case, symbols from ISOLat1 are basically the same as in
iso-8859-1, since the high byte will always be zero.  But in the case of UTF-8
(which we seem to be discussing), the situation is different.  According to
RFC2279, UCS-2 codes should be converted to UTF-8 using these rules:

   0000-007F   0xxxxxxx
   0080-07FF   110xxxxx 10xxxxxx
   0800-FFFF   1110xxxx 10xxxxxx 10xxxxxx

That's why I repeat: if we have ISOLat1 characters above 7F to output, each of
them should be encoded as a 2-byte sequence in UTF-8.  Thus, the output files
we have at the moment *cannot* be interpreted as UTF-8, since they carry such
characters as single bytes.
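
To make the difference concrete, here is a minimal sketch in Python (the
language and the helper names utf8_encode_ucs2 and is_valid_utf8 are my own
choices for illustration; nothing in debiandoc-sgml uses them).  It applies
the three patterns from the table by hand and compares the result with the
single iso-8859-1 byte for the copyright sign, code 00A9:

```python
def utf8_encode_ucs2(code):
    """Encode one UCS-2 code (0x0000-0xFFFF) with the three patterns above."""
    if code < 0 or code > 0xFFFF:
        raise ValueError("this sketch only handles UCS-2 codes")
    if code <= 0x7F:
        # 0000-007F   0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:
        # 0080-07FF   110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 0800-FFFF   1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_copy = "\u00a9".encode("iso-8859-1")  # one byte:  A9
utf8_copy = utf8_encode_ucs2(0xA9)           # two bytes: C2 A9

print(latin1_copy.hex(), utf8_copy.hex())
print(is_valid_utf8(b"hello " + latin1_copy + b" world"))
print(is_valid_utf8(b"hello " + utf8_copy + b" world"))
```

A lone A9 byte has the form 10xxxxxx, which in UTF-8 may only continue a
multi-byte sequence, so a file holding it as a single byte is rejected by a
UTF-8 decoder.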

> I don't understand why you are saying that you *can't* capture the
> character '©' but you can capture '[copy  ]'?  Isn't that the simplest
> way to solve things, rather than breaking the rest of the roman
> character set languages?
OK.  This is due to the way nsgmls processes CDATA and SDATA.  The output of
nsgmls is a file in a special format where all CDATA is immediately converted
to the appropriate character (I believe this depends on all those SP_ variables
that control the character sets), while for SDATA a special construct is
created.  If, for example, we have

    <!ENTITY copy SDATA "blah blah blah">

and input is

    hello &copy; world

in output we get

-hello \|blah blah blah\| world

while in case of

    <!ENTITY copy CDATA "&#169;">

the output will be

-hello © world

You see, the construct \|...\| can be easily caught since it is a special thing
(a `\' in the input is escaped with another \, giving \\ in the output).  So,
in the case of SDATA entities, I see how to make use of them.
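
As a rough illustration of how such a construct could be caught, here is a
sketch in Python (the language and the function name split_data_line are
hypothetical choices of mine, not part of any existing tool).  It assumes only
what is described above: a data line starts with `-', \| brackets SDATA text,
and \\ stands for a literal backslash.  Any other escape is passed through
untouched.

```python
def split_data_line(line):
    """Split an nsgmls data line into ("cdata", text) / ("sdata", text) runs."""
    if not line.startswith("-"):
        raise ValueError("not a data line")
    runs = []
    buf = []
    in_sdata = False
    i = 1  # skip the leading "-"
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            if nxt == "|":
                # \| switches between ordinary data and SDATA entity text
                if buf:
                    runs.append(("sdata" if in_sdata else "cdata", "".join(buf)))
                    buf = []
                in_sdata = not in_sdata
                i += 2
                continue
            if nxt == "\\":
                # \\ is an escaped backslash from the input
                buf.append("\\")
                i += 2
                continue
        # everything else (including other escapes) is copied as-is
        buf.append(ch)
        i += 1
    if buf:
        runs.append(("sdata" if in_sdata else "cdata", "".join(buf)))
    return runs


for kind, text in split_data_line("-hello \\|blah blah blah\\| world"):
    print(kind, repr(text))
```

A back end could then map each "sdata" run (for example `[copy  ]') to
whatever its output format needs, while leaving the "cdata" runs alone.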

> >One more issue (I just took a more thorough look at the entities supplied by
> >sgml-data):  why do some files provide Unicode equivalents for entities and
> >some proprietary SDATA?  Is this by design?
> 
> There are none that use SDATA AFAIK.  You might be mixing up sgml-data
> with some other packages which put stuff in /usr/lib/sgml/entities.

I am sorry to say that the sgml-data package, freshly downloaded and unpacked
in a separate directory, has ISO* files that define SDATA entities.

Now, returning to the `stock' SGML entities: copy and certain other
entities (nbsp, for example) come from ISOnum, while in the sgml-data package
they are defined in both of them (and the definitions differ, BTW).

As for solving this problem, there are two possibilities: to make use of
SDATA entities in all programs that come with Debian, or to use some Unicode
encoding for intermediate/output files.

--
Mike

