
Re: A small question



>Unicode character is fine if the output file is in Unicode.  Is the output of
>debiandoc2* are in Unicode?

Well, actually, the character in question, ©, is an ISO Latin 1
character.

Yes, I think we *can* assume (without further evidence) that the
output is in ISO Latin 1, that is, the 8-bit extension of ASCII with
256 code positions.

In fact, ISO Latin 1 coincides with the first 256 code points of
Unicode (though its bytes above 127 are not valid UTF-8 as they stand).
However, I think we'd do best just to deal with ISO Latin 1 at this
point, as a reasonable default.
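
To make the relationship between the encodings concrete, here is a
small Python sketch (my own illustration, not anything debiandoc-sgml
runs). It shows that the copyright sign has the number 169 both in
ISO Latin 1 and in Unicode, while its UTF-8 form is two bytes. The
variable names are just for this example.

```python
# Illustration only: how the copyright sign is represented in each encoding.
copyright_sign = "\u00a9"

# The Unicode code point has the same number as the ISO Latin 1 byte value.
code_point = ord(copyright_sign)

# ISO Latin 1 stores it as the single byte 0xA9 (decimal 169).
latin1_bytes = copyright_sign.encode("latin-1")

# UTF-8 needs two bytes for anything above 127.
utf8_bytes = copyright_sign.encode("utf-8")

print(code_point, latin1_bytes, utf8_bytes)
```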

>I believe the problem *you* encountered with
>processing Russian translation has origins in fact that the output files are
>not in Unicode.

Actually, I think the situation is that they are charset koi8-r, but
with some ISO Latin 1 characters mixed in.

> For example, if we translate dselect-beginner.ru.sgml into
>HTML format, we get a plain text file that has `Content-Type; text/html;
>charset=koi8-r' at the very beginning.  All &copy; in source file will appear
>as 8-bit characters since we have
>
>    <!ENTITY copy CDATA "&#169">

>For all charset that have (C) symbol for code 169, the output will look fine.
>Then, when you try to process the latex output from debiandoc2latex, you get a
>lot of errors since in cyrillic font there is no symbol with code 169.

I think your analysis is probably correct, though I am not sure.  I
can assure you that the current system works fine for American and
European languages (German, French, English, and probably many more),
including PDF, HTML, and all other outputs, in all the applications
(xpdf, acroread, netscape, lynx, w3-el) I bothered to check.

>So the question is: what to do?

Well, first off, I think moving from ISOLat1 (8 bit chars) to SDATA in
sgml-data will break a lot more than it's going to fix.  For instance,
all the stuff working above would probably break (maybe you could test
that?)

Secondly, I would suppose that you need to check for 8-bit ISO Latin 1
characters in your input stream, and convert them to whatever
character set you are using.  And probably do this pretty early in the
chain, i.e., before we start branching out into TeX, HTML, etc.

>The one we use for making the documentation from DebianDoc DTD is Unicode
>aware?  And do we really supply it with Unicode file?

Well, see above.  It is producing proper PDF files.

>Does nsgmls have to be compiled in multi-byte mode for being Unicode aware or
>not?

Yes.

>If yes, is it as of sp 1.3.3-1.2.1-7?

Yes -- has been for a long time.

>Why?  I believe (I have not checked that yet) this should break sgml-tools
>package (yes, yes sgml-tools v1).  It makes use of SDATA entities for
>producing proper output.

Well, sgml-tools (v1) does a lot of things in what I would consider a
messed up way.  The package is effectively orphaned, both upstream and
in Debian.  Moreover, it ships and uses its own weird set of ISO
entity sets.  Note there is no FPI for ISOLat1 etc. in sgml-tools, but
instead, in /etc/sgml.catalog:

  -- outdated and shared entities --
  ENTITY %common            "../sgml/dtd/common"
  ENTITY %isoent            "../sgml/dtd/isoent"

Point two: it would break a *lot* of things to change the default
encoding of ISO entities from ISOLat1/Unicode to SDATA.

Point three: AFAIK, the SDATA encodings are not standard but
proprietary; that is, I don't know of any standard that says that
&copy; should be "[copy  ]".

>Actually, I have only a practical aim in mind: to make Russian documents
>correct.  So how to make &copy; look (C) in all versions of
>dselect-beginner.ru? &smile;

I suggest you solve the problem of converting from the ISO Latin 1
(implied) encoding to koi8-r.  I believe there are tools to do this.
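
For what it's worth, here is a minimal Python sketch of such a
conversion, assuming the input really is pure ISO Latin 1.  The
function name `latin1_to_koi8r` is hypothetical.  Characters that
koi8-r does not have (accented Latin letters, for instance) come out
as "?", which is a choice made for this example rather than a
recommendation.

```python
def latin1_to_koi8r(data: bytes) -> bytes:
    """Convert ISO Latin 1 bytes to koi8-r bytes.

    Characters with no koi8-r equivalent are replaced by '?'.
    """
    # Every byte value is a valid ISO Latin 1 character, so this cannot fail.
    text = data.decode("latin-1")
    # koi8-r has the copyright sign, but at a different byte value.
    return text.encode("koi8_r", errors="replace")


if __name__ == "__main__":
    print(latin1_to_koi8r(b"\xa9 1999"))
    print(latin1_to_koi8r(b"caf\xe9"))
```

The copyright sign survives because koi8-r does include it; only its
byte value changes, which is exactly why copying the Latin 1 byte
through unchanged gives the wrong glyph.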

We should probably make the fact that debiandoc-sgml is encoded in
ISO Latin 1 by default (or UTF8, if we wanna go there) more explicit
in the documentation or elsewhere.  Perhaps this should be filed as a
bug on debiandoc-sgml.

--
.....Adam Di Carlo....adam@onShore.com.....<URL:http://www.onShore.com/>



