Bug#418058: iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)
tags 418058 + unreproducible
Osamu Aoki wrote:
> Package: libc6
> Version: 2.3.6.ds1-13
> Severity: important
> Problem: ~ ' \ conversion.
> In short, iconv should not do smart guessing for the 7-bit section of
> each traditional encoding that was ASCII compatible. They should be the
> same in that 7-bit section.
> Here we go....
> Popular C/perl/shell/... programs originally written in latin-1,
> latin-2, ..., shift-jis, euc-jp, ... encodings will break if iconv is
> used to convert them to UTF-8. iconv does a half-smart job to please some
> cosmetic factors but forgets how these encodings were originally
> developed and used in real life, so it is harmful to the data. (Of
> course, those funny 8-bit texts are in the comments and the quoted text.)
> In this sense, I could file a grave bug for breaking data, but considering
> the timing, I stay with important. (After etch, I may raise this bug
> All these encodings (latin-1, latin-2, ..., shift-jis, euc-jp, ...)
> were developed so that non-ASCII characters could be expressed without
> breaking existing tools/code developed for ASCII. That is why they are
> ASCII compatible. All characters represented by 0x00-0x7f (7-bit) share
> the same position. (We do use an alternative font for ASCII 0x5c =
> backslash = '\' in Japan, which looks like the Japanese yen mark, but \
> in ASCII and yen in shift-jis serve the same purpose in the program
> world. The C standard even mentions the dual nature of \.) So when
> changing the encoding of a file, we expect all of 0x00-0x7f (7-bit) to
> remain the same.
> But iconv does many funny things.
> The code 0x27 (single-quote) is changed to something else (a long UTF-8
> sequence for single-quote) when converted from any of latin-1, latin-2,
> shift-jis, euc-jp, ... to UTF-8. This is not expected.
> For shift-jis, it is even worse. iconv tries to map character 0x5c to
> the UTF-8 YEN mark. That mapping should be done for the yen mark code in
> the 16-bit (full-width character) section and not for this 7-bit one. This
> is very bad for any program. Another issue is 0x7e '~'. This is
> translated to an upper bar. Although some old Japanese PCs (pre-IBM
> compatible, NEC 98 machines, I think) had an upper-bar-shaped glyph for ~,
> converting this ~ in data to the UTF-8 upper bar breaks URL data stored on
> shift-jis machines.
> The choice of conversion table should not be based on superficial shape
> comparison but should take full account of actual usage and
> iconv being a basic tool, it should not do these conversions on 7-bit codes
> for these. If anyone wants syntactical pretty-print conversion of UTF-8
> text, they should rely on some other tool. Then they can use opening and
> closing quotes if they wish. But we can keep C programs right. Many
> old C programs in each locale used these ASCII compatible
> encodings, and all we want to do is convert quoted text and comments to
All the diffs you provided are actually wrong. In all those files, the
input character for ' is not 27 but E2 80 99 (hex), which is a UTF-8
sequence. iconv behaves correctly here.
Please provide us with a correct input file (check it with hexdump) that
exhibits the problem. I suggest gzipping it to avoid encoding
translation by your MUA.
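
If hexdump is not at hand, the same check can be sketched in Python. This
is only an illustration under stated assumptions: the function names
(hex_line, find_non_ascii, pack_for_mail) are hypothetical, and the sample
bytes are made up for the example, not taken from the reported files.

```python
import gzip


def hex_line(data: bytes) -> str:
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


def find_non_ascii(data: bytes):
    """Return (offset, byte) pairs for every byte outside 0x00-0x7f."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]


def pack_for_mail(data: bytes) -> bytes:
    """Gzip the bytes so that a mail client cannot re-encode them."""
    return gzip.compress(data)


# A pure ASCII sample: the apostrophe is the single byte 0x27.
plain = b"it's"
print(hex_line(plain))        # 69 74 27 73
print(find_non_ascii(plain))  # []

# A sample where the apostrophe is really the UTF-8 sequence E2 80 99.
curly = b"it\xe2\x80\x99s"
print(hex_line(curly))        # 69 74 e2 80 99 73
print(find_non_ascii(curly))  # [(2, 226), (3, 128), (4, 153)]

# The compressed payload round-trips byte for byte.
assert gzip.decompress(pack_for_mail(curly)) == curly
```

An empty result from find_non_ascii means the input really is 7-bit only,
which is the property a valid test file for this report needs at the
positions in question.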
.''`. Aurelien Jarno | GPG: 1024D/F1BCDB73
: :' : Debian developer | Electrical Engineer
`. `' email@example.com | firstname.lastname@example.org
`- people.debian.org/~aurel32 | www.aurel32.net