Re: A small question
On Thu, Jul 01, 1999 at 03:51:59PM -0400, Adam Di Carlo wrote:
> Yes, I think we *can* assume (without further evidence) that the
> output is in ISO Latin 1, that is, the extended ASCII 255 character
> set.
Sorry, I do not understand. How can we assume the output is in ISO Latin 1 when
I see the contrary? :)
> In fact, ISOLat1 can be considered a degenerate case of UTF8.
I doubt that; it cannot be. For the upper half of ISOLat1, UTF8 would use
two bytes per character, no?
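To illustrate the two-byte point, here is a minimal sketch in Python (my choice
of language for illustration, nothing in the toolchain uses it). The helper
name utf8_length is made up for this example. It only checks how many bytes
UTF8 needs for each code point that ISOLat1 stores in a single byte.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one Unicode code point."""
    return len(chr(code_point).encode("utf-8"))


# Lower half (0x00-0x7F): plain ASCII, one byte in both encodings.
ascii_lengths = {utf8_length(cp) for cp in range(0x00, 0x80)}

# Upper half (0x80-0xFF): one byte in ISO Latin 1, but two bytes in UTF-8.
upper_lengths = {utf8_length(cp) for cp in range(0x80, 0x100)}

# The copyright sign is the case discussed in this thread.
copyright_latin1 = "\u00a9".encode("latin-1")  # the single byte 169
copyright_utf8 = "\u00a9".encode("utf-8")      # two bytes

print("ASCII byte lengths:", ascii_lengths)
print("upper-half byte lengths:", upper_lengths)
print("(C) in Latin 1:", list(copyright_latin1))
print("(C) in UTF-8:  ", list(copyright_utf8))
```

So only the ASCII half of an ISOLat1 file is byte-for-byte valid UTF8; anything
above 127 would have to be re-encoded.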
> >I believe the problem *you* encountered with
> >processing Russian translation has origins in fact that the output files are
> >not in Unicode.
>
> Actually, I think the situation is that they are charset koi8-r, but
> with some ISO Latin 1 characters mixed in.
Yes, but the latter cannot easily be distinguished.
> >For all charset that have (C) symbol for code 169, the output will look fine.
> >Then, when you try to process the latex output from debiandoc2latex, you get a
> >lot of errors since in cyrillic font there is no symbol with code 169.
>
> I can assure you that the current system works for American/European
> language fine (german, french, english, probably much more), including
> PDF, HTML, and all other outputs in all the applications (xpdf,
> acroread, netscape, lynx, w3-el) I bothered to check.
Yes, since they all use ISOLat1, where (C) has code 169.
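A small sketch of what I mean, again in Python and only as an illustration:
the same byte 169 is read under two charsets. The function name decode_byte
is hypothetical; the codec names "latin-1" and "koi8-r" are the ones Python's
standard library ships.

```python
BYTE_169 = bytes([169])  # the code that ISO Latin 1 assigns to the (C) sign


def decode_byte(raw: bytes, charset: str) -> str:
    """Interpret the raw bytes under the given charset name."""
    return raw.decode(charset)


as_latin1 = decode_byte(BYTE_169, "latin-1")
as_koi8r = decode_byte(BYTE_169, "koi8-r")

# Print code points rather than glyphs so the result does not depend
# on the terminal's own charset.
print("latin-1: U+%04X" % ord(as_latin1))
print("koi8-r:  U+%04X" % ord(as_koi8r))
print("same character?", as_latin1 == as_koi8r)
```

The byte is only a copyright sign for a reader who already assumes ISOLat1.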
> >So the question is: what to do?
>
> Well, first off, I think moving from ISOLat1 (8 bit chars) to SDATA in
> sgml-data will break a lot more than it's going to fix. For instance,
> all the stuff working above would probably break (maybe you could test
> that?)
Certainly, since the debiandoc2* scripts make no attempt to process system data
entities.
> Secondly, I would suppose that you need to check for 8-bit ISOLat1
> character in your input stream, and convert them to whatever character
> set you are using. And probably do this pretty early in the chain,
> i.e., before we start branching out into TeX, HTML, etc etc.
So you propose replacing every © with whatever is needed in koi8-r in the case
of the Russian translation? Well, I just cannot do that: the KOI8-R charset does
not have a (C) character. If you meant something else, please clarify your
proposal.
> >The one we use for making the documentation from DebianDoc DTD is Unicode
> >aware? And do we really supply it with Unicode file?
>
> Well, see above. It is producing proper PDF files.
Does your reference to PDF mean that in the case of PDF we must use Unicode in
some way?
[ stuff about sgml-tools v1 skipped ]
OK. I just wanted to give an example of a system where SDATA entities are used
successfully. :)
> We should probably make the fact that debiandoc-sgml is default
> encoded in ISOLat1 (or UTF8, if we wanna go there) more explicit in
> the documentation or elsewhere. Perhaps this should be filed as a
> bug on debiandoc-sgml.
I do not quite understand why you think that debiandoc-sgml's output is encoded
in ISOLat1. The output of debiandoc-sgml is just a plain 8-bit stream, and I
believe nobody can tell from the bytes alone which charset it is in.
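To make the "plain 8-bit stream" point concrete, a hedged sketch in Python:
one byte string, produced here from a Russian word purely as sample data, is
decoded under several 8-bit charsets. The candidate list is my own assumption,
not something debiandoc-sgml knows about.

```python
# Sample data: a Russian word written out as KOI8-R bytes.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"
raw = original.encode("koi8-r")

# Charsets a reader might guess; the list is an assumption for illustration.
candidates = ["latin-1", "koi8-r", "cp1251"]

# Every candidate decodes the stream without raising an error.
readings = {charset: raw.decode(charset) for charset in candidates}

for charset, text in readings.items():
    code_points = " ".join("U+%04X" % ord(ch) for ch in text)
    print("%-8s %s" % (charset, code_points))
```

All three decodings succeed and all three differ, so nothing in the bytes
themselves says which reading is the intended one.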
One more issue (I just took a more thorough look at the entities supplied by
sgml-data): why do some files provide Unicode equivalents for entities and some
proprietary SDATA? Is this by design?
--
Mike