Bug#418058: iconv: half-smart on ascii compatible code conversion (shift-jis)

To: Aurelien Jarno <aurelien@aurel32.net>, 418058@bugs.debian.org
Cc: control@bugs.debian.org
Subject: Bug#418058: iconv: half-smart on ascii compatible code conversion (shift-jis)
From: Osamu Aoki <osamu@debian.org>
Date: Sat, 16 Jun 2007 10:37:47 +0900
Message-id: <20070616013747.GA28504@snoopy.lan>
Reply-to: Osamu Aoki <osamu@debian.org>, 418058@bugs.debian.org
In-reply-to: <461E05B6.20707@aurel32.net>
References: <20070406160432.GA10083@snoopy.lan> <461E05B6.20707@aurel32.net>

tags 418058 - unreproducible
retitle 418058 iconv: half-smart on ascii compatible code conversion (shift-jis)
thanks

Let me update this report with better data, since the original report
was contaminated by other bugs, such as one in groff. (Thanks to
Aurelien Jarno for checking them.)

Bug: iconv does not handle \ and ~ (ASCII 92 and 126) correctly when
converting from SHIFT-JIS (SJIS).

I tested iconv's own conversion of every printable 7-bit character with
the attached script; the resulting differences are in diff.txt.
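
The attached script is not reproduced here.  As a rough illustration of
this kind of check, the following Python sketch decodes each printable
7-bit byte on its own and reports the ones that do not come back as the
same code point.  All function names are hypothetical;
reported_sjis_model() is a stand-in that only mimics the two
substitutions described in this report (it is not iconv), and
iconv_decoder() assumes an iconv command is available on PATH.

```python
import subprocess


def find_ascii_mismatches(decode):
    """Return (byte, expected_char, actual_text) for every printable
    ASCII byte that `decode` does not map to the same code point."""
    mismatches = []
    for byte in range(0x20, 0x7F):  # printable 7-bit characters
        expected = chr(byte)
        actual = decode(bytes([byte]))
        if actual != expected:
            mismatches.append((byte, expected, actual))
    return mismatches


def iconv_decoder(from_code):
    """Build a decoder that shells out to iconv(1).
    Assumption: an `iconv` binary is installed and on PATH."""
    def decode(data):
        result = subprocess.run(
            ["iconv", "-f", from_code, "-t", "UTF-8"],
            input=data, capture_output=True, check=True,
        )
        return result.stdout.decode("utf-8")
    return decode


def reported_sjis_model(data):
    """Hypothetical stand-in mimicking the behaviour reported here:
    0x5c -> U+00A5 (yen sign), 0x7e -> U+203E (overline)."""
    return data.decode("ascii").translate({0x5C: 0x00A5, 0x7E: 0x203E})


if __name__ == "__main__":
    for byte, expected, actual in find_ascii_mismatches(reported_sjis_model):
        print(f"0x{byte:02x} {expected!r} -> U+{ord(actual):04X}")
```

To run the same check against a real installation, pass
iconv_decoder("SHIFT-JIS") instead of reported_sjis_model; the result
then depends on the installed iconv.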

The same kind of conversion error occurs for GB with # and ~ (ASCII 35
and 126).  Please ask Chinese speakers about the GB situation, but I am
almost certain that this is a bug too.

NB: As for Japanese, I remember that EUC-JP used to have a similar
problem.

Rationale:

iconv should not try to be smart and guess within the 7-bit section of
traditional encodings that are ASCII compatible.  Characters in that
7-bit section should map to the same code points as ASCII.

Programs in any popular language (C, Perl, shell, ...) that were
originally written in shift-jis should not break when iconv is used to
convert them to UTF-8.

Details:
For shift-jis, iconv maps character 0x5c to the UTF-8 YEN mark.  That
mapping to the UTF-8 YEN mark should be done from the yen mark code in
the 16-bit (full-width character) section and not from this 7-bit one,
which is 0x5c.  This is very bad for any program.  Another issue is 0x7e
'~', which is translated to an upper bar.  Some old Japanese PCs
(pre-IBM-compatible NEC 98 and Sharp MZ machines, which used to run
IBM-incompatible MS-DOS, I think) did have an upper-bar-shaped glyph for
~ in their fonts and on their keyboards, but converting this ~ in data
to the UTF-8 upper bar breaks URL data stored on shift-jis machines.
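
To make the breakage concrete, here is a small Python sketch that
applies the two substitutions described above to a line of C source and
to a URL.  It is only a model of the reported behaviour, not iconv
itself: REPORTED_MAPPING, convert_as_reported(), the sample C line and
the example.org URL are all made up for illustration, and I am assuming
that the "YEN mark" and "upper bar" correspond to U+00A5 and U+203E.

```python
# Hypothetical model of the mapping described in this report:
# 0x5c '\' -> U+00A5 (yen sign), 0x7e '~' -> U+203E (overline).
REPORTED_MAPPING = {0x5C: 0x00A5, 0x7E: 0x203E}


def convert_as_reported(ascii_text):
    """Apply the reported substitutions to text that was pure ASCII."""
    return ascii_text.translate(REPORTED_MAPPING)


# A C statement whose escape sequence depends on a real backslash.
c_line = 'printf("hello\\n");'
# A made-up URL containing a tilde.
url = "http://example.org/~user/"

converted_c = convert_as_reported(c_line)
converted_url = convert_as_reported(url)

print(converted_c)                    # the escape sequence \n is gone
print(converted_url)                  # the path no longer contains ~
print(converted_url.encode("utf-8"))  # overline becomes three bytes
```

After the conversion the C string no longer contains an escape
sequence, and the URL path starts with a multi-byte character instead of
the single byte 0x7e, so neither means what it did in the original
file.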

These cosmetic differences were just font differences.  The code points
should not be moved because of them.

Osamu

Attachment: ascii.tar.gz
Description: Binary data
