
Bug#255224: xlibs-data: Patch for bug of ct_encoding sequence in zh_CN.gbk locale



On Thu, Jul 08, 2004 at 02:25:08PM -0500, Branden Robinson wrote:
> Thanks!  I have reviewed the patch, and while compound text encoding is a
> bit beyond me, I do appreciate the heads-up.  :)
  Thanks to you and all the DDs :)
> 
> I notice you are the author of this fix.  Because of problems with
> XFree86's recent change in licensing policy[1], I'd like to be certain I
> know what the provenance of your patch is.
> 
> Can you confirm the following statements?
    Yes, I can. :)
> 
>   * I am the author of this patch.
    Yes, I'm the only author of this patch.

    The bug report at http://bugs.xfree86.org/show_bug.cgi?id=1362
    describes in detail how I found and fixed this bug[2]. My project
    `mule-gbk', mentioned in the report, may become available on
    SourceForge later. A very old version of mule-gbk can be found at
    http://lists.debian.org/debian-chinese-big5/2002/04/msg00013.html. 
> 
>   * If any copyright attaches to this patch, I hereby place it under the
>     traditional MIT/X11 license[2].
    If any copyright attaches to this patch, I hereby place it under the
    traditional MIT/X11 license[1].

  [1] Here's a copy of the license text:

  Permission is hereby granted, free of charge, to any person
  obtaining a copy of this software and associated documentation
  files (the "Software"), to deal in the Software without
  restriction, including without limitation the rights to use,
  copy, modify, merge, publish, distribute, sublicense, and/or sell
  copies of the Software, and to permit persons to whom the
  Software is furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be
  included in all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
  EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
  OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
  HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
  WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
  FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
  OTHER DEALINGS IN THE SOFTWARE.


  [2] Here's a copy of the original bug report:

---------------8<---------------
GBK <-> COMPOUND_TEXT translation in XFree86 is incorrect.

A few years ago I started a project, `mule-gbk', which aims to add
Chinese GBK encoding support to Emacs 21.3/Mule (GBK support is
important to users in the People's Republic of China).
While enabling X selection transfer between Emacs 21 and other
applications on X11, I found this bug.

Normal X11 applications use the routines in Xlib to do
GBK <-> COMPOUND_TEXT translation when they exchange X selections
with each other (Inter-Client Communication). Emacs/Mule's
COMPOUND_TEXT translation, however, is implemented in Emacs Lisp.
The point is that if both Mule and Xlib encoded GBK into
COMPOUND_TEXT correctly, exchanging X selections between them would
cause no difficulty. But experiments show that Emacs/Mule can't
understand the ctext produced from GBK text by normal X11 apps such
as gedit, mozilla, crxvt, etc. When you paste GBK text from these
apps into Emacs, the broken sequence "...GBK-0..." appears.
Note that my locale is set to zh_CN.GBK by
  export LANG=zh_CN
  export LC_ALL=zh_CN.GBK
and the locale `zh_CN.GBK' has been generated on my Debian
GNU/Linux box by
  dpkg-reconfigure locales

The version of my XFree86 is 4.3.0.



Because this was so annoying, I started to analyze the messages from
normal X11 apps by inserting debugging statements into the
clipboard program `xclip'. I found that the ctext from normal X11
apps contains redundant sequences, which also make the value of the
character counter in the `extended segments' of the ctext wrong.

According to the document `Compound Text Encoding':
    http://www.xfree86.org/current/ctext.pdf
,----
| 6.  Non-Standard Character Set Encodings
| 
| Character set encodings that are not in the list of approved
| standard encodings can be included using ``extended seg-
| ments''.  An extended segment begins with one of the follow-
| ing sequences:
| 
|      01/11 02/05 02/15 03/00 M L   variable number of octets per character
|      01/11 02/05 02/15 03/01 M L   1 octet per character
|      01/11 02/05 02/15 03/02 M L   2 octets per character
|      01/11 02/05 02/15 03/03 M L   3 octets per character
|      01/11 02/05 02/15 03/04 M L   4 octets per character
| 
| [This uses the ``other coding system'' of ISO 2022, using
| private Final characters.]
| 
| The ``M'' and ``L'' octets represent a 14-bit unsigned value
| giving the number of octets that appear in the remainder of
| the segment.  The number is computed as ((M - 128) * 128) +
| (L - 128).  The most significant bit M and L are always set
| to one.  The remainder of the segment consists of two parts,
| the name of the character set encoding and the actual text.
| The name of the encoding comes first and is separated from
| the text by the octet 00/02 (STX, START OF TEXT).  Note that
| the length defined by M and L includes the encoding name and
| separator.
`----
the extended segment in ctext for GBK text begins with
    01/11 02/05 02/15 03/02 M L ,
because GBK is a non-standard character set with 2 octets 
per character.
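
As an illustration only, here is a minimal Python sketch of how an
extended segment for GBK text could be assembled from the rules quoted
above. The function names, and the choice of "GBK-0" as the encoding
name inside the segment, are my assumptions for this example; this is
not the Xlib implementation.

```python
# Sketch only: build an extended segment as described in section 6 of
# `Compound Text Encoding'. All names below are hypothetical.

ESC_EXT_2OCTET = b"\x1b\x25\x2f\x32"  # 01/11 02/05 02/15 03/02
STX = b"\x02"                          # 00/02, separates name from text


def encode_length(n):
    """Return the M and L octets for a 14-bit length n."""
    if not 0 <= n < (1 << 14):
        raise ValueError("length does not fit in 14 bits")
    # The most significant bit of both octets is always set.
    return bytes([128 + n // 128, 128 + n % 128])


def decode_length(m, l):
    """Inverse of encode_length: ((M - 128) * 128) + (L - 128)."""
    return (m - 128) * 128 + (l - 128)


def extended_segment(name, text):
    """Header, then M L, then name, STX and the text octets.

    The length covers the encoding name, the separator and the text.
    """
    body = name + STX + text
    return ESC_EXT_2OCTET + encode_length(len(body)) + body


# Four octets standing in for two GBK characters (assumed sample data).
sample_text = b"\xd6\xd0\xce\xc4"
seg = extended_segment(b"GBK-0", sample_text)
print(seg.hex(" "))
```

Note that M and L depend on the length of the text that follows, so
they have to be computed for each segment.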


I have found a simple way to solve this problem on my Debian GNU/Linux
sid system, by modifying one line in this XFree86 system file:
   /usr/X11R6/lib/X11/locale/zh_CN.gbk/XLC_LOCALE
The line:
 ct_encoding GBK-0:GLGR:\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30\x02
                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                           (These are   128,128+8,'G','B','K','-','0', 2,
                           where "GBK-0" may be the character set name
                           for GBK. It is strange that they appear here.
                           Maybe the author misunderstood
                           `Compound Text Encoding'?)
should be changed to (equivalently, remove these 8 octets):
 ct_encoding GBK-0:GLGR:\x1b\x25\x2f\x32
                        ~~~~~~~~~~~~~~~~
             (These are exactly the first 4 octets of the
             extended segment defined in `Compound Text Encoding'.)
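
To make the difference between the two values concrete, here is a
small Python sketch that parses the \x.. escapes of both ct_encoding
values and splits the extra octets into the M/L pair, the name and the
STX separator. The helper name parse_escapes is hypothetical, and the
parsing is only an approximation for this example, not how Xlib reads
XLC_LOCALE.

```python
import re

# The two ct_encoding escape strings from the XLC_LOCALE line above.
OLD = r"\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30\x02"
NEW = r"\x1b\x25\x2f\x32"


def parse_escapes(s):
    """Turn a string of \\xNN escapes into the octets it denotes."""
    return bytes(int(h, 16) for h in re.findall(r"\\x([0-9a-fA-F]{2})", s))


old = parse_escapes(OLD)
new = parse_escapes(NEW)

# The octets that the fix removes.
extra = old[len(new):]

# The first two extra octets have the form of the M and L octets.
m, l = extra[0], extra[1]
hardcoded_length = (m - 128) * 128 + (l - 128)

# The rest is the encoding name followed by STX.
name, sep = extra[2:-1], extra[-1:]

print(len(extra), hardcoded_length, name, sep)
```

The M/L pair in the old line decodes to a fixed length of 8, while the
name plus STX take up 6 octets.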

So far, this method has been used by many mule-gbk users in the P.R.C.

However, I don't know the exact meaning of this line; maybe an
Xpert can figure it out :(

I have downloaded
          xc/nls/XLC_LOCALE/zh_CN.gbk
from
http://cvsweb.xfree86.org/cvsweb/*checkout*/xc/nls/XLC_LOCALE/zh_CN.gbk?rev=HEAD&only_with_tag=xf-4_4_99_4&content-type=text/plain
(this file has been untouched for 3 years), and made a patch for it:

*** zh_CN.gbk.orig	2004-05-06 23:33:06.000000000 +0800
--- zh_CN.gbk	2004-05-06 23:34:31.000000000 +0800
***************
*** 62,68 ****
  	byte2		\x40,\x7e;\x80,\xfe
  
  	wc_encoding	\x00008000
! 	ct_encoding	GBK-0:GLGR:\x1b\x25\x2f\x32\x80\x88\x47\x42\x4b\x2d\x30\x02
  
  	mb_conversion	[\x8140,\xfefe]->\x0140
  	ct_conversion	[\x0140,\x7efe]->\x8140
--- 62,68 ----
  	byte2		\x40,\x7e;\x80,\xfe
  
  	wc_encoding	\x00008000
! 	ct_encoding	GBK-0:GLGR:\x1b\x25\x2f\x32
  
  	mb_conversion	[\x8140,\xfefe]->\x0140
  	ct_conversion	[\x0140,\x7efe]->\x8140

SU Yong
---------------8<---------------

-- 
SU Yong <yoyosu@ustc.edu.cn>
Proud Debian/GNU Linux User
PGP-Key-ID: 584F35F3
