Bug#418058: iconv: half-smart on ascii compatible code conversion (latin1, shift-jis, ...)
Package: libc6
Version: 2.3.6.ds1-13
Severity: important
Problem: ~ ' \ conversion.
In short, iconv should not try to be smart about the 7-bit section of
the traditional encodings that are ASCII compatible. That 7-bit section
should come out the same after conversion.
Here we go....
Popular C/perl/shell/... programs originally written in latin-1,
latin-2, ..., shift-jis, euc-jp, ... encodings will break if iconv is
used to convert them to UTF-8. iconv does a half-smart job to please
some cosmetic factors but forgets how these encodings were originally
developed and used in real life, so it is harmful to the data. (Of
course, the 8-bit text in such programs sits in the comments and the
quoted strings.)
In this sense, I could file this as a grave bug for breaking data, but
considering the timing, I am staying with important. (After etch, I may
raise the severity of this bug.)
All these encodings (latin-1, latin-2, ..., shift-jis, euc-jp, ...)
were developed so that non-ASCII characters can be expressed without
breaking existing tools and code developed for ASCII. That is why they
are ASCII compatible. All characters represented by 0x00-0x7f (7-bit)
share the same positions. (In Japan we do use an alternative font for
ASCII 0x5c = backslash = '\' which looks like the Japanese yen mark, but
\ in ASCII and yen in shift-jis serve the same purpose in the program
world. The C standard even mentions the dual nature of \.) So when
changing the encoding of a file, we expect all of 0x00-0x7f (7-bit) to
remain the same.
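The expectation above can be written down as a small check. The following
is only a sketch using Python's own codecs, which are not glibc's iconv
tables and may behave differently; the function name seven_bit_changes is
made up for this illustration. It converts each byte 0x00-0x7f to UTF-8
on its own and lists every byte that does not come back unchanged.

```python
def seven_bit_changes(encoding):
    """Return (byte_value, utf8_bytes) pairs for 7-bit bytes that change.

    Assumption: this uses Python's codec for `encoding`, not glibc iconv,
    so it only illustrates the expected behaviour.
    """
    changes = []
    for value in range(0x80):
        original = bytes([value])
        # Decode from the legacy encoding, then re-encode as UTF-8.
        converted = original.decode(encoding).encode("utf-8")
        if converted != original:
            changes.append((value, converted))
    return changes


if __name__ == "__main__":
    for name in ("latin-1", "iso8859-2"):
        # An empty list means the whole 7-bit section survived conversion.
        print(name, seven_bit_changes(name))
```

Other codec names can be passed in the same way; an empty result is what
this report argues every ASCII compatible encoding should give.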
But iconv does many funny things.
The code 0x27 (single quote) is changed to something else (a long UTF-8
sequence for a single quote) when converting from any of latin-1,
latin-2, shift-jis, euc-jp, ... to UTF-8. This is not expected.
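For reference, here is a short sketch of what "long sequence" means in
bytes. It assumes the long sequence in question is U+2019 RIGHT SINGLE
QUOTATION MARK; that is a reading of this report, not something it states
outright. The helper name utf8_hex is hypothetical.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as spaced hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


if __name__ == "__main__":
    # ASCII apostrophe: a single byte, identical to the 7-bit code 0x27.
    print("U+0027", utf8_hex("\u0027"))
    # Typographic quote (assumed candidate): three bytes in UTF-8.
    print("U+2019", utf8_hex("\u2019"))
```

A program source that gains the three-byte form in place of 0x27 no longer
contains a character the C, perl or shell parser accepts as a quote.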
For shift-jis, it is even worse. iconv tries to map character 0x5c to
the UTF-8 YEN mark. That mapping should be done for the yen mark code in
the 16-bit (full-width character) section and not for this 7-bit one.
This is very bad for any program. Another issue is 0x7e '~'. This is
translated to an upper bar. Although some old Japanese PCs
(pre-IBM-compatible NEC 98 machines, I think) had an upper-bar-shaped
glyph for ~, converting this ~ in data to the UTF-8 upper bar breaks URL
data stored on shift-jis machines.
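One way to see whether a converted file was hit by this is to scan the
UTF-8 output for the two replacement characters. This is a minimal
sketch; the names SUSPECTS and find_suspects are made up here, and the
sample text is invented for illustration, not taken from a real
conversion.

```python
# Characters that, per this report, may appear where 0x5c and 0x7e were.
SUSPECTS = {
    "\u00a5": "YEN SIGN where backslash 0x5c was expected",
    "\u203e": "OVERLINE where tilde 0x7e was expected",
}


def find_suspects(utf8_bytes):
    """Return (line, column, note) for each suspect character, 1-based."""
    text = utf8_bytes.decode("utf-8")
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((line_no, col, SUSPECTS[ch]))
    return hits


if __name__ == "__main__":
    # Invented sample: a C string escape and a URL after a bad conversion.
    sample = 'printf("a\u00a5n");\nhttp://example.org/\u203euser\n'
    for hit in find_suspects(sample.encode("utf-8")):
        print(hit)
```

A hit inside a string literal or a URL is the kind of breakage described
above: the escape sequence or the home directory path no longer works.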
The choice of conversion table should not be based on a superficial
shape comparison but should take full account of actual usage and its
implications.
iconv being a basic tool, it should not do these conversions on the
7-bit codes of these encodings. If anyone wants a typographically pretty
conversion of UTF-8 text, they should rely on some other tool. Then they
can use opening and closing quotes if they wish. But this way we keep C
programs correct. Many old C programs in each locale used these ASCII
compatible encodings, and all we want to do is convert the quoted text
and comments to UTF-8.
-- System Information:
Debian Release: lenny/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'testing')
Architecture: amd64 (x86_64)
Shell: /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-mactel64
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Versions of packages libc6 depends on:
ii tzdata 2007d-1 Time Zone and Daylight Saving Time
libc6 recommends no packages.
-- debconf-show failed
Conversion results are attached as diffs.
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ eucj-utf8.txt 2007-04-07 00:10:26.000000000 +0900
@@ -39,7 +39,7 @@
044 36 24 $ 144 100 64 d
045 37 25 % 145 101 65 e
046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 147 103 67 g
050 40 28 ( 150 104 68 h
051 41 29 ) 151 105 69 i
052 42 2A * 152 106 6A j
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ shiftj-utf8.txt 2007-04-07 00:10:45.000000000 +0900
@@ -28,7 +28,7 @@
031 25 19 EM 131 89 59 Y
032 26 1A SUB 132 90 5A Z
033 27 1B ESC 133 91 5B [
- 034 28 1C FS 134 92 5C \
+ 034 28 1C FS 134 92 5C \
035 29 1D GS 135 93 5D ]
036 30 1E RS 136 94 5E ^
037 31 1F US 137 95 5F _
@@ -39,7 +39,7 @@
044 36 24 $ 144 100 64 d
045 37 25 % 145 101 65 e
046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 窶 147 103 67 g
050 40 28 ( 150 104 68 h
051 41 29 ) 151 105 69 i
052 42 2A * 152 106 6A j
@@ -62,6 +62,6 @@
073 59 3B ; 173 123 7B {
074 60 3C < 174 124 7C |
075 61 3D = 175 125 7D }
- 076 62 3E > 176 126 7E ~
+ 076 62 3E > 176 126 7E ~
077 63 3F ? 177 127 7F DEL
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ l1-utf8.txt 2007-04-07 00:10:59.000000000 +0900
@@ -39,7 +39,7 @@
044 36 24 $ 144 100 64 d
045 37 25 % 145 101 65 e
046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 â 147 103 67 g
050 40 28 ( 150 104 68 h
051 41 29 ) 151 105 69 i
052 42 2A * 152 106 6A j
--- ascii.txt 2007-04-07 00:10:04.000000000 +0900
+++ l2-utf8.txt 2007-04-07 00:11:05.000000000 +0900
@@ -39,7 +39,7 @@
044 36 24 $ 144 100 64 d
045 37 25 % 145 101 65 e
046 38 26 & 146 102 66 f
- 047 39 27 ’ 147 103 67 g
+ 047 39 27 â 147 103 67 g
050 40 28 ( 150 104 68 h
051 41 29 ) 151 105 69 i
052 42 2A * 152 106 6A j