
Bug#227273: www.debian.org: Japanese DDTP files are provided with EUC-JP endoding.



From: Frank Lichtenheld <djpig@debian.org>
Subject: Bug#227273: www.debian.org: Japanese DDTP files are provided with EUC-JP endoding.
Date: Thu, 29 Jan 2004 01:25:26 +0100

> On Thu, Jan 29, 2004 at 12:16:19AM +0100, Frank Lichtenheld wrote:
> > I used the Perl module Text::Iconv which itself uses iconv(3)
> > This module seems to suck or I am too dumb to use it. If I convert the
> > raw Japanese Packages file with iconv(1) (which probably uses iconv(3), 
> > too) all escape sequences seem to be generated correctly, if I use
> > Text::Iconv->convert, only the very first one is.
> 
> Correction, it only forgets the very last escape sequence since this one is 
> not generated by iconv(3). It "forgets" to clear the state at the end of 
> the conversion which I found out in comparison with iconv(1) that handles 
> this case correctly. I prepared a patch and will file a bug against the 
> package.

I tested on gluck (packages.debian.org):

  (a)
  $ echo -en '\xa4\xa2' | iconv -f EUC-JP -t ISO-2022-JP | od -t x1
  0000000 1b 24 42 24 22 1b 28 42
  0000010

The last three bytes (1b 28 42, i.e. ESC ( B) are the closing escape
sequence, which switches back to ASCII.  Thus iconv(1) works correctly.
Next, I wrote the following script:

  (b)
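  #!/usr/bin/perl
  # Saved as a.pl: slurp all of stdin, then convert it in a single call.
  use strict;
  use warnings;
  use Text::Iconv;
  my $conv = Text::Iconv->new("EUC-JP", "ISO-2022-JP");
  my $in = ""; while (<>) { $in .= $_; }
  print $conv->convert($in);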

Then

  (c)
  $ echo -ne '\xa4\xa2' | ./a.pl | od -t x1
  0000000 1b 24 42 24 22
  0000005

In this case, the closing escape sequence is missing.  However, if the
source string has further characters after the JIS X 0208 Japanese
characters, as in:

  (d)
  $ echo -e '\xa4\xa2' | ./a.pl | od -t x1
  0000000 1b 24 42 24 22 1b 28 42 0a
  0000011

  (e)
  $ echo -ne '\xa4\xa2\x41' | ./a.pl | od -t x1
  0000000 1b 24 42 24 22 1b 28 42 41
  0000011

Then the closing escape sequence (1b 28 42) is added, ahead of the
trailing byte.

Explanation:
In case (e), the closing escape sequence is clearly needed, because
an ASCII character (0x41, "A") follows the Japanese character.  In
case (d), it is also needed, because ISO-2022-JP requires the "state"
to be ASCII whenever a Line Feed appears.  In case (c), Text::Iconv
does not know whether the string that follows will be Japanese or
ASCII.  Adding the closing escape sequence would be redundant if
Japanese followed.  I imagine this is why Text::Iconv does not add
the closing escape sequence in this case.
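
As a rough way to check which state a converted stream ends in, here
is a minimal sketch of a shell helper.  The name ends_in_ascii is
hypothetical, and it only treats ESC ( B as the return to ASCII (it
ignores JIS X 0201 Roman, ESC ( J):

```shell
# Hypothetical helper (not part of iconv or Text::Iconv).
# Reads an ISO-2022-JP byte stream on stdin and succeeds if the stream
# ends in the ASCII state, i.e. it contains no escape sequence at all or
# its last escape sequence is ESC ( B (1b 28 42).
ends_in_ascii() {
  # Dump stdin as space-separated hex bytes on a single line.
  hex=" $(od -An -v -t x1 | tr -s ' \n' '  ') "
  case "$hex" in
    *" 1b "*) rest=${hex##* 1b } ;;   # bytes after the last ESC
    *) return 0 ;;                    # no ESC: still in the initial ASCII state
  esac
  case "$rest" in
    "28 42"*) return 0 ;;             # last escape sequence is ESC ( B
    *) return 1 ;;                    # another character set is still selected
  esac
}

# Bytes from (a): closing escape sequence present.
printf '\033$B$"\033(B' | ends_in_ascii && echo "ends in ASCII"
# Bytes from (c): closing escape sequence missing.
printf '\033$B$"' | ends_in_ascii || echo "does not end in ASCII"
```

This relies on the fact that in ISO-2022-JP the byte 1b never occurs
inside a character, only at the start of an escape sequence.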

I think the safest way is to use Text::Iconv to convert the whole
web page in one call, or at least one whole logical line (ending
with a Line Feed) at a time.
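
The difference between piecewise and whole-input conversion can be
seen with iconv(1) alone.  This is only a sketch: it assumes an
iconv(1) that supports both encodings (as the one on gluck does), the
function names to_jis, piecewise and whole are hypothetical, and it
does not exercise Text::Iconv itself.

```shell
# Hypothetical wrapper around iconv(1); not part of any package.
to_jis() { iconv -f EUC-JP -t ISO-2022-JP; }

# Two Japanese characters converted separately: each run of iconv(1)
# closes its own output, so the stream returns to ASCII and then
# re-enters JIS X 0208 between the two characters.
piecewise() { printf '\244\242' | to_jis; printf '\244\244' | to_jis; }

# The same two characters converted in one call: one opening and one
# closing escape sequence.
whole() { printf '\244\242\244\244' | to_jis; }

piecewise | od -An -t x1   # 1b 24 42 24 22 1b 28 42 1b 24 42 24 24 1b 28 42
whole     | od -An -t x1   # 1b 24 42 24 22 24 24 1b 28 42
```

Both outputs are valid ISO-2022-JP; the piecewise one merely carries a
redundant pair of escape sequences in the middle.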

---
Tomohiro KUBOTA <kubota@debian.org>
http://www.debian.or.jp/~kubota/


