
Bug#418058: iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)



tags 418058 + unreproducible
thanks

Osamu Aoki wrote:
> Package: libc6
> Version: 2.3.6.ds1-13
> Severity: important
> 
> Problem: ~ ' \ conversion.
> 
> In short, iconv should not try to be smart and guess at the 7-bit
> section of the traditional encodings that are ASCII compatible.  That
> 7-bit section should stay the same in all of them.
> 
> Here we go....
> 
> Popular C/perl/shell/... programs originally written in latin-1,
> latin-2, ..., shift-jis, euc-jp, ... encodings will all break if iconv
> is used to convert them to UTF-8.  iconv does a half-smart job to
> satisfy some cosmetic concerns but forgets how these encodings were
> originally developed and used in real life, so it is harmful to the
> data.  (Of course, the funny 8-bit text is in the comments and the
> quoted text.)
> 
> In this sense, I could file a grave bug for breaking data, but
> considering the timing, I will stay with important.  (After etch, I may
> raise the severity of this bug.)
> 
> All these encodings (latin-1, latin-2, ..., shift-jis, euc-jp, ...)
> were developed so that non-ASCII characters can be expressed without
> breaking existing tools and code developed for ASCII.  That is why they
> are ASCII compatible.  All characters represented by 0x00-0x7f (7 bit)
> share the same positions.  (We do use an alternative font for ASCII
> 0x5c = backslash = '\' in Japan, which looks like the Japanese yen
> mark, but \ in ASCII and yen in shift-jis serve the same purpose in the
> programming world.  The C standard even mentions the dual nature of \.)
> So when changing the encoding of a file, we expect all of 0x00-0x7f
> (7 bit) to remain the same.
> 
> But iconv does many funny things.
> 
> The code 0x27 (single quote) is changed to something else (a long UTF-8
> sequence for a single quote) when converting from any of latin-1,
> latin-2, shift-jis, euc-jp, ... to UTF-8.  This is not expected.
> 
> For shift-jis, it is even worse.  iconv tries to map character 0x5c to
> the UTF-8 yen mark.  That mapping should be done for the yen mark code
> in the 16-bit (full-width character) section and not for this 7-bit
> one.  This is very bad for any program.  Another issue is 0x7e '~'.
> This is translated to an upper bar (overline).  Although some old
> Japanese PCs (pre-IBM-compatible NEC 98 machines, I think) had an
> upper-bar-shaped glyph for ~, converting this ~ in data to a UTF-8
> upper bar breaks URL data stored on shift-jis machines.
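
For reference, the two readings of the 7-bit range being contrasted here can be sketched as follows. This is only an illustration in Python, not iconv code: the helper name and the table are made up for the example, and it says nothing about which table a given iconv build actually uses for its shift-jis converter. The only facts relied on are that JIS X 0201 Roman differs from ASCII at 0x5c (YEN SIGN, U+00A5) and 0x7e (OVERLINE, U+203E).

```python
# Positions where JIS X 0201 Roman differs from ASCII.
# (Hypothetical table name, used only in this sketch.)
JIS_ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}  # YEN SIGN, OVERLINE


def decode_7bit(data: bytes, jis_roman: bool) -> str:
    """Decode bytes in 0x00-0x7f either as ASCII or as JIS X 0201 Roman."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError("only the 7-bit range is handled in this sketch")
        if jis_roman and b in JIS_ROMAN_OVERRIDES:
            out.append(JIS_ROMAN_OVERRIDES[b])
        else:
            out.append(chr(b))
    return "".join(out)


url = b"http://example.org/~user"
# ASCII-compatible reading: the URL survives a round trip to UTF-8.
print(decode_7bit(url, jis_roman=False).encode("utf-8") == url)
# Strict JIS X 0201 reading: '~' becomes U+203E and the bytes change.
print(decode_7bit(url, jis_roman=True).encode("utf-8") == url)
```

With the ASCII-compatible reading the bytes are unchanged; with the strict reading the tilde turns into a three-byte UTF-8 sequence, which is the kind of breakage the report describes.
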
> 
> The choice of conversion table should not be based on a superficial
> comparison of shapes but should take full account of actual usage and
> its implications.
> 
> iconv being a basic tool, it should not do these conversions on the
> 7-bit codes of these encodings.  Anyone who wants a typographically
> pretty conversion of UTF-8 text should rely on some other tool.  Then
> they can use opening and closing quotes if they wish, and we can keep C
> programs correct.  Many old C programs in each locale use these
> ASCII-compatible encodings, and all we want to do is convert the quoted
> text and comments to UTF-8.

All the diffs you provided are actually wrong. In all of those files, the
input character for ' is not 27 but E2 80 99, which is the UTF-8
sequence for U+2019 (right single quotation mark). iconv behaves
correctly here.
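
As a minimal sketch of the difference, using Python's built-in codecs as a stand-in for iconv (an assumption made for illustration; the function name is hypothetical):

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (stand-in for a latin1 -> UTF-8 run)."""
    return data.decode("latin-1").encode("utf-8")


# A real ASCII apostrophe (0x27) comes out unchanged, as does the
# whole 0x00-0x7f range.
print(latin1_to_utf8(b"'") == b"'")
print(latin1_to_utf8(bytes(range(128))) == bytes(range(128)))

# U+2019 RIGHT SINGLE QUOTATION MARK encodes to E2 80 99 in UTF-8.
print("\u2019".encode("utf-8").hex(" "))  # e2 80 99

# If those three bytes are fed in as if they were latin-1, each one is
# treated as its own character and re-encoded, giving a longer sequence.
print(latin1_to_utf8(b"\xe2\x80\x99").hex(" "))  # c3 a2 c2 80 c2 99
```

So a longer sequence in the output points to input that already contained E2 80 99, not to a remapping of 0x27.
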

Please provide us with a correct input file (check it with hexdump) that
exhibits the problem. I suggest gzipping it to avoid encoding
translation by your MUA.
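
If hexdump is not at hand, the same check can be approximated with a short Python sketch. The helper names and the sample bytes below are invented for illustration; only the standard `gzip` module is used.

```python
import gzip


def non_ascii_offsets(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte outside 0x00-0x7f."""
    return [(i, b) for i, b in enumerate(data) if b > 0x7F]


def pack_for_mail(data: bytes) -> bytes:
    """Compress the raw bytes so that nothing in transit re-encodes them."""
    return gzip.compress(data)


# Made-up sample: a C-like line with one latin-1 byte (0xE9) in a comment.
sample = b"char q = 0x27;  /* caf\xe9 */\n"

for offset, value in non_ascii_offsets(sample):
    print(f"offset {offset}: 0x{value:02x}")

# The compressed attachment decompresses to exactly the original bytes.
print(gzip.decompress(pack_for_mail(sample)) == sample)
```

A file whose quote characters are really 0x27 will show no entries at those offsets; entries of 0xe2 0x80 0x99 in a row indicate the file was already UTF-8.
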

-- 
  .''`.  Aurelien Jarno	            | GPG: 1024D/F1BCDB73
 : :' :  Debian developer           | Electrical Engineer
 `. `'   aurel32@debian.org         | aurelien@aurel32.net
   `-    people.debian.org/~aurel32 | www.aurel32.net


