
Re: A small question



On Thu, Jul 01, 1999 at 03:51:59PM -0400, Adam Di Carlo wrote:
> Yes, I think we *can* assume (without further evidence) that the
> output is in ISO Latin 1, that is, the extended ASCII 255 character
> set.
Sorry, I do not understand.  How can we assume the output is in ISO Latin 1
when I see evidence to the contrary? :)

> In fact, ISOLat1 can be considered a degenerate case of UTF8.
I doubt that; it cannot.  I believe each character in the upper half of
ISOLat1 (codes 128-255) takes two bytes in UTF8, no?
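
To make the byte counts concrete, here is a minimal Python sketch (Python is
my choice for illustration; nothing in debiandoc-sgml uses it).  It encodes
every ISO Latin 1 code point as UTF-8 and collects the resulting lengths.  The
variable names are hypothetical.

```python
# Encode each ISO Latin 1 code point as UTF-8 and record the byte lengths.
lower = [chr(c) for c in range(0x00, 0x80)]    # the 7-bit ASCII half
upper = [chr(c) for c in range(0x80, 0x100)]   # the 8-bit "upper half"

lower_sizes = {len(ch.encode("utf-8")) for ch in lower}
upper_sizes = {len(ch.encode("utf-8")) for ch in upper}

# The copyright sign is code 169 in ISO Latin 1.
copyright_latin1 = "\u00a9".encode("latin-1")  # a single byte, 0xA9
copyright_utf8 = "\u00a9".encode("utf-8")      # two bytes, 0xC2 0xA9

print("lower half sizes:", lower_sizes)
print("upper half sizes:", upper_sizes)
print("(C) in Latin 1:", list(copyright_latin1))
print("(C) in UTF-8:  ", list(copyright_utf8))
```

So only the ASCII half is byte-for-byte identical in the two encodings; a
Latin 1 file containing any upper-half byte is not valid UTF-8 as it stands.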

> >I believe the problem *you* encountered with
> >processing Russian translation has origins in fact that the output files are
> >not in Unicode.
> 
> Actually, I think the situation is that they are charset koi8-r, but
> with some ISO Latin 1 characters mixed in.
Yes, but the ISO Latin 1 characters cannot easily be distinguished from the
koi8-r ones.

> >For all charset that have (C) symbol for code 169, the output will look fine.
> >Then, when you try to process the latex output from debiandoc2latex, you get a
> >lot of errors since in cyrillic font there is no symbol with code 169.
> 
> I can assure you that the current system works for American/European
> language fine (german, french, english, probably much more), including
> PDF, HTML, and all other outputs in all the applications (xpdf,
> acroread, netscape, lynx, w3-el) I bothered to check.
Yes, since they all use ISOLat1, where (C) has code 169.
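
A small Python sketch of the point about code 169 (illustrative only; the
variable names are hypothetical and Python's built-in codecs stand in for
whatever the viewer or font actually does): the same byte decodes to the
copyright sign under ISO Latin 1 but to a different character under koi8-r.

```python
# One byte, two interpretations, depending on the charset assumed by the reader.
raw = bytes([169])

as_latin1 = raw.decode("latin-1")   # interpretation under ISO Latin 1
as_koi8r = raw.decode("koi8_r")     # interpretation under KOI8-R

print("byte 169 as ISO Latin 1: U+%04X" % ord(as_latin1))
print("byte 169 as KOI8-R:      U+%04X" % ord(as_koi8r))
print("same character?", as_latin1 == as_koi8r)
```

Nothing in the byte itself says which reading is intended, which is why the
output looks fine only where the charset in use happens to agree on code 169.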

> >So the question is: what to do?
> 
> Well, first off, I think moving from ISOLat1 (8 bit chars) to SDATA in
> sgml-data will break a lot more than it's going to fix.  For instance,
> all the stuff working above would probably break (maybe you could test
> that?)
Certainly, since the debiandoc2* scripts make no attempt to process system
data (SDATA) entities.

> Secondly, I would suppose that you need to check for 8-bit ISOLat1
> character in your input stream, and convert them to whatever character
> set you are using.  And probably do this pretty early in the chain,
> i.e., before we start branching out into TeX, HTML, etc etc.
So, you propose replacing every © with whatever is needed in koi8-r in the
case of the Russian translation?  Well, I just cannot do that: the KOI8-R
charset does not have a (C) character.  If you meant something else, please
clarify your proposal.
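
For discussion, here is a rough Python sketch of what "convert early in the
chain" could look like.  It assumes the piece of text being converted is known
to be ISO Latin 1 (for example, an entity's replacement text), which is exactly
what cannot be assumed for a mixed stream.  The function `transcode` and the
fallback table are hypothetical and not part of debiandoc-sgml; plain ASCII is
used as a stand-in target charset.

```python
def transcode(data: bytes, source: str, target: str, fallbacks: dict) -> bytes:
    """Re-encode data from source to target, substituting a fallback string
    for every character the target charset cannot represent."""
    text = data.decode(source)
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            # No such character in the target charset: use the replacement.
            # A missing entry raises KeyError so the gap is noticed, not hidden.
            out += fallbacks[ch].encode(target)
    return bytes(out)


# Hypothetical fallback table: spell out the copyright sign in ASCII.
FALLBACKS = {"\u00a9": "(C)"}

converted = transcode(b"\xa9 1999", "latin-1", "ascii", FALLBACKS)
print(converted)
```

Whether a character exists in the target charset is tested at run time here
rather than assumed, and anything without a listed fallback stops the
conversion instead of passing through silently.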

> >The one we use for making the documentation from DebianDoc DTD is Unicode
> >aware?  And do we really supply it with Unicode file?
> 
> Well, see above.  It is producing proper PDF files.
Does your reference to PDF mean that, in the case of PDF, we must use Unicode
in some way?

[ stuff about sgml-tools v1 skipped ]
OK.  I just wanted to give an example of a system where SDATA entities are
used successfully. :)

> We should probably make the fact that debiandoc-sgml is default
> encoded in ISOLat1 (or UTF8, if we wanna go there) more explicit in
> the documentation or elsewhere.  Perhaps this should be filed as a
> bug on debiandoc-sgml.
I do not quite understand why you think that debiandoc-sgml's output is encoded
in ISOLat1.  The output of debiandoc-sgml is just a plain 8-bit stream.  I
believe nobody can tell what charset it is in.

One more issue (I just took a more thorough look at the entities supplied by
sgml-data): why do some files provide Unicode equivalents for entities while
others provide proprietary SDATA?  Is this by design?

--
Mike

