Bug#227273: www.debian.org: Japanese DDTP files are provided with EUC-JP encoding.
From: Frank Lichtenheld <djpig@debian.org>
Subject: Bug#227273: www.debian.org: Japanese DDTP files are provided with EUC-JP encoding.
Date: Thu, 29 Jan 2004 01:25:26 +0100
> On Thu, Jan 29, 2004 at 12:16:19AM +0100, Frank Lichtenheld wrote:
> > I used the Perl module Text::Iconv which itself uses iconv(3).
> > This module seems to suck or I am too dumb to use it. If I convert the
> > raw Japanese Packages file with iconv(1) (which probably uses iconv(3),
> > too) all escape sequences seem to be generated correctly, if I use
> > Text::Iconv->convert, only the very first one is.
>
> Correction, it only forgets the very last escape sequence since this one is
> not generated by iconv(3). It "forgets" to clear the state at the end of
> the conversion, which I found out by comparison with iconv(1), which handles
> this case correctly. I prepared a patch and will file a bug against the
> package.
I tested on gluck (packages.debian.org):
(a)
$ echo -en '\xa4\xa2' | iconv -f EUC-JP -t ISO-2022-JP | od -t x1
0000000 1b 24 42 24 22 1b 28 42
0000010
The last three bytes (1b 28 42, i.e. ESC ( B) are the closing escape
sequence, which switches back to ASCII.
Thus iconv(1) works well. Next, I wrote the following script:
(b)
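#!/usr/bin/perl
# a.pl: read all of the input, convert it from EUC-JP to ISO-2022-JP
# in a single convert() call, and print the result.
use strict;
use warnings;
use Text::Iconv;
my $conv = Text::Iconv->new("EUC-JP", "ISO-2022-JP");
my $input = do { local $/; <> };    # slurp the whole input
my $output = $conv->convert($input);
print $output;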
Then
(c)
$ echo -ne '\xa4\xa2' | ./a.pl | od -t x1
0000000 1b 24 42 24 22
0000005
In this case, the closing escape sequence is missing. However, if the
source string has further characters after the JIS X 0208 Japanese
characters, as in the following two cases:
(d)
$ echo -e '\xa4\xa2' | ./a.pl |od -t x1
0000000 1b 24 42 24 22 1b 28 42 0a
0000011
(e)
$ echo -ne '\xa4\xa2\x41' | ./a.pl |od -t x1
0000000 1b 24 42 24 22 1b 28 42 41
0000011
Then the closing escape sequence is added.
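The difference between (a) and (c) can also be checked mechanically.
The following is only a rough sketch: ends_in_ascii is a hypothetical
helper (it is not part of iconv(1) or Text::Iconv), and it assumes that
the only escape sequence that returns to ASCII is ESC ( B (1b 28 42).
It treats any other escape sequence, including JIS-Roman, as "not ASCII".

```shell
# ends_in_ascii: hypothetical helper, not part of iconv(1) or Text::Iconv.
# Reads a byte stream on stdin and succeeds (exit 0) if the last escape
# sequence in it is ESC ( B (1b 28 42), or if it contains none at all.
ends_in_ascii() {
    od -An -v -t x1 | tr -s ' ' '\n' | awk '
        NF == 0 { next }            # skip the blank lines left by tr
        { byte[n++] = $1 }          # one hex byte per input line
        END {
            ascii = 1               # ISO-2022-JP text starts in ASCII
            for (i = 0; i + 2 < n; i++)
                if (byte[i] == "1b")
                    ascii = (byte[i+1] == "28" && byte[i+2] == "42")
            exit (ascii ? 0 : 1)
        }'
}

# The byte sequences from (a) and (c), written out by hand:
printf '\033$B$"\033(B' | ends_in_ascii && echo "(a): ends in ASCII"
printf '\033$B$"'       | ends_in_ascii || echo "(c): still in JIS X 0208"
```

Piping the output of ./a.pl into such a check would show whether a
given input leaves the stream in the Japanese state.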
Explanation:
In case (e), the closing escape sequence is clearly needed, because
an ASCII character follows the Japanese character. In case (d), it is
also needed, because ISO-2022-JP requires the "state" to be ASCII
whenever a Line Feed appears. In case (c), Text::Iconv does not know
whether the string that follows will be Japanese or ASCII. Adding the
closing escape sequence would be redundant if more Japanese text
followed. I imagine this is why Text::Iconv does not add the closing
escape sequence in this case.
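To illustrate the redundancy mentioned above, here is a small sketch
using iconv(1), which always closes the state at the end of its input.
It assumes an iconv(1) that supports EUC-JP and ISO-2022-JP (as on
gluck); hex is a hypothetical helper that only formats the od output.
The bytes a4 a2 and a4 a4 are two hiragana characters in EUC-JP.

```shell
# hex: hypothetical helper, prints stdin as space-separated hex bytes.
hex() { od -An -v -t x1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'; }

# Two chunks converted independently: each run of iconv(1) opens and
# closes the JIS X 0208 state on its own.
separate=$( { printf '\244\242' | iconv -f EUC-JP -t ISO-2022-JP
              printf '\244\244' | iconv -f EUC-JP -t ISO-2022-JP; } | hex )

# The same two characters converted in one go: one opening and one
# closing escape sequence are enough.
together=$(printf '\244\242\244\244' | iconv -f EUC-JP -t ISO-2022-JP | hex)

echo "separate: $separate"
echo "together: $together"
```

With an iconv(1) that behaves as in (a), the first line should contain
"1b 28 42 1b 24 42" in the middle (leave Japanese, then enter it again
at once), while the second line has no escape sequence between the two
characters.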
I think the safest way is to have Text::Iconv convert the whole
web page in a single call, or at least one whole logical line
(ending with a Line Feed code) at a time.
---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/