
Bug#418058: iconv: half-smart on ascii compatible code conversion (shift-jis)



tags 418058 - unreproducible
retitle 418058 iconv: half-smart on ascii compatible code conversion (shift-jis)
thanks

Let me update this report with better data, since the original report was
contaminated with other bugs such as groff's. (Thanks to Aurelien Jarno
for checking them.)

Bug: \ and ~ (ASCII 92 and 126) are not handled correctly by iconv when
converting from SHIFT-JIS (SJIS).

I tested iconv's own conversion of the printable 7-bit characters with
the attached script; its result is in diff.txt.
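
The attached script is not reproduced here. As a rough sketch of the same
kind of check in Python, assuming iconv can be stood in for by any decode
function: feed each printable 7-bit byte through the decoder and list the
bytes whose result differs from the ASCII code point. The names
non_ascii_mappings and described_sjis_decode are made up for this
illustration; the latter only imitates the two mappings reported in this
bug and does not call iconv.

```python
def non_ascii_mappings(decode):
    """Return {byte_value: decoded_text} for printable ASCII bytes
    (0x20-0x7e) that `decode` does not map to the identical code point."""
    diffs = {}
    for value in range(0x20, 0x7F):
        text = decode(bytes([value]))
        if text != chr(value):
            diffs[value] = text
    return diffs


def described_sjis_decode(data):
    """Hypothetical stand-in imitating the behaviour reported here:
    0x5c -> U+00A5 YEN SIGN, 0x7e -> U+203E OVERLINE, rest unchanged."""
    table = {0x5C: "\u00a5", 0x7E: "\u203e"}
    return "".join(table.get(b, chr(b)) for b in data)


if __name__ == "__main__":
    found = non_ascii_mappings(described_sjis_decode)
    for value, text in sorted(found.items()):
        print(f"0x{value:02x} {chr(value)!r} -> U+{ord(text):04X}")
```

To check a real converter, one would pass a function that runs the byte
through that converter in place of described_sjis_decode.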

The conversion error also occurs with GB for # and ~ (ASCII 35 and 126).
Please ask Chinese-speaking people about the GB situation, but I am
almost certain this is a bug too.

NB: As for Japanese, I remember EUC-JP used to have a similar problem.

Rationale:

iconv should not try to be smart and guess about the 7-bit section of
traditional encodings that are ASCII compatible.  Characters in that
7-bit section should convert to the same code points as in ASCII.

Popular C/perl/shell/... programs originally written in shift-jis should
not break when iconv is used to convert them to UTF-8.

Details:
For shift-jis, iconv maps character 0x5c to the UTF-8 YEN mark.  That
mapping to the UTF-8 YEN mark should be done from the yen mark code in the
16-bit (full-width character) section and not from this 7-bit one, which
is 0x5c.  This is very bad for any program, since \ in source code gets
changed.  Another issue is 0x7e '~', which is translated to the upper bar.
Some old Japanese PCs (pre-IBM-compatible NEC 98 and Sharp MZ machines,
which used to run an IBM-incompatible MS-DOS, I think) did have an upper
bar shaped glyph for ~ in their fonts and on their keyboards.  Still,
converting this ~ in data to the UTF-8 upper bar breaks URL data stored
on shift-jis machines.

These cosmetic differences were just font differences.  The code points
should not be moved because of them.
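
To illustrate the breakage, here is a small Python sketch that applies the
two mappings described above to 7-bit data. It imitates them with a lookup
table and does not call iconv; the function name described_sjis_to_utf8
and the sample URL and C line are made up for this example.

```python
# Mappings as described in this report (assumed, not taken from iconv itself).
REPORTED = {0x5C: "\u00a5", 0x7E: "\u203e"}  # \ -> YEN SIGN, ~ -> OVERLINE


def described_sjis_to_utf8(data: bytes) -> bytes:
    """Imitate the reported conversion for 7-bit input only."""
    if any(b >= 0x80 for b in data):
        raise ValueError("this sketch only handles 7-bit input")
    return "".join(REPORTED.get(b, chr(b)) for b in data).encode("utf-8")


url = b"http://example.org/~osamu/"   # hypothetical URL
c_line = b'printf("hello\\n");'       # C source line containing \n

# After conversion, neither a real ~ nor a real \ is left in the output.
print(described_sjis_to_utf8(url).decode("utf-8"))
print(described_sjis_to_utf8(c_line).decode("utf-8"))
```

With such a mapping the URL no longer contains ~ and the C string no
longer contains the \n escape, even though the input was plain ASCII.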

Osamu

Attachment: ascii.tar.gz
Description: Binary data

