
Bug#418058: iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)



Package: libc6
Version: 2.3.6.ds1-13
Severity: important

Problem: ~ ' \ conversion.

In short, iconv should not try to be smart about the 7-bit range of the
traditional encodings that are ASCII compatible.  Bytes in that 7-bit
range should stay the same.

Here we go....

Popular C/perl/shell/... programs originally written in latin-1,
latin-2, ..., shift-jis, euc-jp, ... encodings break if iconv is used to
convert them to UTF-8.  iconv does a half-smart job to please some
cosmetic concerns but forgets how these encodings were originally
developed and used in real life, so it is harmful to the data.  (Of
course, the 8-bit text in such programs is in the comments and the
quoted strings.)

In this sense, I could file this as a grave bug for breaking data but,
considering the timing, I am staying with important.  (After etch, I may
raise the severity of this bug.)

All these encodings (latin-1, latin-2, ..., shift-jis, euc-jp, ...)
were developed so that non-ASCII characters can be expressed without
breaking existing tools and code developed for ASCII.  That is why they
are ASCII compatible.  All characters represented by 0x00-0x7f (7 bit)
share the same positions.  (In Japan we do use an alternative font for
ASCII 0x5c = backslash = '\', which makes it look like the Japanese yen
mark, but \ in ASCII and yen in shift-jis serve the same purpose in the
programming world.  The C standard even mentions the dual nature of \.)
So when changing the encoding of a file, we expect all of 0x00-0x7f
(7 bit) to remain the same.
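
The expectation above can be written down as a small check.  This is
only a sketch: it uses Python's built-in codecs as a stand-in converter,
which is an assumption, since their tables are not guaranteed to match
the ones glibc iconv uses.  The helper name changed_7bit_bytes is made
up for this illustration.

```python
# Sketch: list the 7-bit byte values that do NOT survive a conversion to
# UTF-8 unchanged.  Python's codecs are used here as a stand-in (assumption);
# they are not the glibc iconv tables discussed in this report.

def changed_7bit_bytes(encoding):
    """Return the byte values in 0x00-0x7f whose UTF-8 form differs."""
    changed = []
    for value in range(0x80):
        original = bytes([value])
        converted = original.decode(encoding).encode("utf-8")
        if converted != original:
            changed.append(value)
    return changed


if __name__ == "__main__":
    for name in ("latin-1", "iso8859-2"):
        print(name, [hex(v) for v in changed_7bit_bytes(name)])
```

An empty list means the encoding behaves as an ASCII compatible encoding
should for that converter.  Other codec names can be passed in the same
way.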

But iconv does many funny things.

The code 0x27 (single quote) is changed to something else (a long UTF-8
sequence for a single quote) when converting from any of latin-1,
latin-2, shift-jis, euc-jp, ... to UTF-8.  This is not expected.
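
To make the "long UTF-8 sequence" concrete, here is a byte-level
comparison.  It assumes the replacement character is U+2019 (RIGHT
SINGLE QUOTATION MARK), which is what the attached diffs appear to show.
The helper name utf8_hex is made up for this illustration.

```python
# Sketch: compare the UTF-8 bytes of the ASCII apostrophe with those of
# the typographic quote.  That U+2019 is the character involved is an
# assumption based on the attached diffs.

APOSTROPHE = "\u0027"          # ASCII single quote, byte 0x27
RIGHT_SINGLE_QUOTE = "\u2019"  # typographic right single quotation mark


def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


if __name__ == "__main__":
    print("apostrophe:        ", utf8_hex(APOSTROPHE))
    print("right single quote:", utf8_hex(RIGHT_SINGLE_QUOTE))
```

A program that looks for the single byte 0x27 will not find it in the
three-byte sequence.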

For shift-jis, it is even worse.  iconv maps character 0x5c to the
UTF-8 YEN mark.  That mapping should be done for the yen mark code in
the 16-bit (full-width character) section and not for this 7-bit one.
This is very bad for any program.  Another issue is 0x7e '~', which is
translated to an upper bar.  Although some old Japanese PCs (the
pre-IBM-compatible NEC 98 machines, I think) had an upper-bar-shaped
glyph for ~, converting this ~ in data to the UTF-8 upper bar breaks
URL data stored on shift-jis machines.
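
Until the table is changed, already-converted text could be repaired
with something like the following.  This is a workaround sketch, not
iconv behaviour: it assumes the two characters produced are U+00A5 (YEN
SIGN) and U+203E (OVERLINE), and the names SJIS_7BIT_FIXUPS and
restore_ascii are made up for this illustration.

```python
# Sketch of a post-conversion fix-up (hypothetical helper, not part of iconv).
# Assumption: shift-jis 0x5c and 0x7e were converted to U+00A5 and U+203E.

SJIS_7BIT_FIXUPS = {
    0x00A5: 0x005C,  # YEN SIGN -> REVERSE SOLIDUS (backslash)
    0x203E: 0x007E,  # OVERLINE -> TILDE
}


def restore_ascii(text):
    """Map the half-width yen sign and overline back to '\\' and '~'."""
    return text.translate(SJIS_7BIT_FIXUPS)


if __name__ == "__main__":
    print(restore_ascii("printf(\"hello\u00a5n\"); http://example.org/\u203euser"))
```

The full-width yen sign (U+FFE5) is deliberately left alone, since that
is the character the 16-bit section should map to.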

The choice of conversion table should not be based on superficial shape
comparison but should take full account of actual usage and its
implications.

iconv being a basic tool, it should not do these conversions on the
7-bit codes of these encodings.  Anyone who wants a typographically
pretty conversion of UTF-8 text should rely on some other tool.  Then
they can use opening and closing quotes if they wish, and we can keep C
programs right.  Many old C programs in each locale used these ASCII
compatible encodings, and all we want to do is convert the quoted text
and comments to UTF-8.

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-mactel64
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages libc6 depends on:
ii  tzdata                        2007d-1    Time Zone and Daylight Saving Time

libc6 recommends no packages.

-- debconf-show failed

Conversion results are attached as diffs.
--- ascii.txt	2007-04-07 00:10:04.000000000 +0900
+++ eucj-utf8.txt	2007-04-07 00:10:26.000000000 +0900
@@ -39,7 +39,7 @@
        044   36    24    $    144   100   64    d
        045   37    25    %    145   101   65    e
        046   38    26    &    146   102   66    f
-       047   39    27    ’    147   103   67    g
+       047   39    27    €™    147   103   67    g
        050   40    28    (    150   104   68    h
        051   41    29    )    151   105   69    i
        052   42    2A    *    152   106   6A    j
--- ascii.txt	2007-04-07 00:10:04.000000000 +0900
+++ shiftj-utf8.txt	2007-04-07 00:10:45.000000000 +0900
@@ -28,7 +28,7 @@
        031   25    19    EM   131   89    59    Y
        032   26    1A    SUB  132   90    5A    Z
        033   27    1B    ESC  133   91    5B    [
-       034   28    1C    FS   134   92    5C    \
+       034   28    1C    FS   134   92    5C    \
        035   29    1D    GS   135   93    5D    ]
        036   30    1E    RS   136   94    5E    ^
        037   31    1F    US   137   95    5F    _
@@ -39,7 +39,7 @@
        044   36    24    $    144   100   64    d
        045   37    25    %    145   101   65    e
        046   38    26    &    146   102   66    f
-       047   39    27    ’    147   103   67    g
+       047   39    27    窶    147   103   67    g
        050   40    28    (    150   104   68    h
        051   41    29    )    151   105   69    i
        052   42    2A    *    152   106   6A    j
@@ -62,6 +62,6 @@
        073   59    3B    ;    173   123   7B    {
        074   60    3C    <    174   124   7C    |
        075   61    3D    =    175   125   7D    }
-       076   62    3E    >    176   126   7E    ~
+       076   62    3E    >    176   126   7E    ~
        077   63    3F    ?    177   127   7F    DEL
 
--- ascii.txt	2007-04-07 00:10:04.000000000 +0900
+++ l1-utf8.txt	2007-04-07 00:10:59.000000000 +0900
@@ -39,7 +39,7 @@
        044   36    24    $    144   100   64    d
        045   37    25    %    145   101   65    e
        046   38    26    &    146   102   66    f
-       047   39    27    ’    147   103   67    g
+       047   39    27    ’    147   103   67    g
        050   40    28    (    150   104   68    h
        051   41    29    )    151   105   69    i
        052   42    2A    *    152   106   6A    j
--- ascii.txt	2007-04-07 00:10:04.000000000 +0900
+++ l2-utf8.txt	2007-04-07 00:11:05.000000000 +0900
@@ -39,7 +39,7 @@
        044   36    24    $    144   100   64    d
        045   37    25    %    145   101   65    e
        046   38    26    &    146   102   66    f
-       047   39    27    ’    147   103   67    g
+       047   39    27    ’    147   103   67    g
        050   40    28    (    150   104   68    h
        051   41    29    )    151   105   69    i
        052   42    2A    *    152   106   6A    j
