
[Repost] AbiWord request for help from CJK hackers



Subject: request for help from CJK hackers
From: Vlad Harchev (hvv@hippo.ru)
Date: Wed Nov 08 2000 - 13:13:46 CST 


 Hi guys, 

 It seems AbiWord-0.7.12 will be released at the beginning of next week, so it
would be nice if all CJK issues were worked out by then.
 The major thing left to do to make AW seamlessly support CJK languages seems
to be importing RTF that contains CJK characters. Internally, cut and paste is
implemented by exporting the selected fragment of the document to RTF and then
reading that fragment back from RTF. So unless the RTF importer works with CJK
characters, AW won't be able to paste anything.
 Belcon Zhao <rainfall@yeah.net> has been working hard on this problem for a
week or more (on other problems too, of course, but thanks to Belcon this is
the only one left to solve). Contact Belcon for more information.
 So could you guys see what's wrong with it?

 I should say that the current code works fine with single-byte encodings (even
when the current encoding and the encoding used in the RTF file differ). So I
have no idea why it doesn't work for CJK.

 Here are my recommendations on how to research the problem:
* Type a few Chinese characters (you may surround them with "AbiWord" to quickly
identify them in the raw RTF).
* Save the document as RTF (not "RTF for old apps"). Check that it's really RTF
 (giving it an .rtf extension is not enough - the type should be selected from
 the popup).
* Try to import that RTF file. As I understand it, incorrect Chinese characters
are read.

 You can use just cut and paste - the same set of exporter and importer 
functions will get called. 

 The function that should be inspected: 
IE_Imp_RTF::ParseChar(UT_UCSChar ch,bool no_convert=0) in 
/src/wp/impexp/xp/ie_imp_RTF.cpp 

 The first parameter is the character that was read from the .rtf file. It
arrives in one of three forms: (1) raw; (2) converted from the form \'hh (e.g.
"\'a3" results in the call ParseChar(0xa3,0)); or (3) as a Unicode value
written as \uc0\uHHHH (e.g. \uc0\u3e9f results in the call
ParseChar(0x3e9f,1)).
 The second parameter tells whether the character should be converted from
the charset of the RTF file or whether it's already a Unicode character (case 3
above - the \uc0\uHHHH form).
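
 To make the mapping above concrete, here is a minimal sketch of which
ParseChar() calls each form would trigger. It is not the AbiWord importer:
ParseCharCall, readHex() and callsFor() are hypothetical names made up for
illustration, real RTF control words are ignored, and the hex notation after
\u simply follows the description in this message.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one ParseChar(ch, no_convert) call.
struct ParseCharCall {
    unsigned long ch;
    bool no_convert;
};

// Reads up to max_digits hex digits starting at pos and advances pos.
static unsigned long readHex(const std::string& s, std::size_t& pos, int max_digits)
{
    assert(max_digits > 0);
    unsigned long value = 0;
    for (int i = 0; i < max_digits && pos < s.size()
             && std::isxdigit(static_cast<unsigned char>(s[pos])); ++i, ++pos) {
        const unsigned char c = static_cast<unsigned char>(s[pos]);
        const int digit = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        value = value * 16 + static_cast<unsigned long>(digit);
    }
    return value;
}

// Hypothetical helper: lists the calls that the three input forms lead to.
std::vector<ParseCharCall> callsFor(const std::string& rtf)
{
    std::vector<ParseCharCall> calls;
    std::size_t i = 0;
    while (i < rtf.size()) {
        if (rtf.compare(i, 2, "\\'") == 0) {
            // Form 2: quote escape with two hex digits, still in the file's charset.
            i += 2;
            calls.push_back({readHex(rtf, i, 2), false});
        } else if (rtf.compare(i, 6, "\\uc0\\u") == 0) {
            // Form 3: Unicode escape, already a Unicode value, so no conversion.
            i += 6;
            calls.push_back({readHex(rtf, i, 4), true});
        } else {
            // Form 1: raw byte in the file's charset.
            calls.push_back({static_cast<unsigned char>(rtf[i]), false});
            ++i;
        }
    }
    return calls;
}
```

 Note that a two-byte CJK character written as two \'hh escapes produces two
separate calls with no_convert set to 0, so the bytes have to be reassembled
further down the line.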

 The following is done inside that function (only the important part is shown):

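UT_Bool IE_Imp_RTF::ParseChar(UT_UCSChar ch,bool no_convert)
{
        /* [...] ensure we are not in a chunk marked as "deleted" */
        if (no_convert==0 && ch<=0xff)
        {
                /* byte in the RTF file's charset: feed it to the converter */
                wchar_t wc;
                if (m_mbtowc.mbtowc(wc,(UT_Byte)ch))
                        return AddChar(wc);
        } else
                /* already a Unicode value: insert it unchanged */
                return AddChar(ch);
        /* [...] */
}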

 Here AddChar() inserts a Unicode character into the document (it works OK).
 m_mbtowc is of type UT_Mbtowc, defined in /src/af/util/xp/ut_mbtowc.cpp
- a wrapper around iconv that converts characters from the multibyte encoding
of the RTF file (it's properly set up) to Unicode. Instances of this wrapper
are used in a lot of places (e.g. when converting input from the keyboard or
importing plain text) and they work OK there. So I don't know why it doesn't
work here.
 The function 'int UT_Mbtowc::mbtowc(wchar_t &wc,char mb)' returns 1 if mb is
the final byte of an already-aggregated multibyte sequence (in this case it
stores the converted character in wc, the first parameter, which is passed by
reference).
 Belcon reports that m_mbtowc.mbtowc(wc,(UT_Byte)ch) returns 1.
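
 For anyone unfamiliar with this kind of wrapper, here is a rough sketch of
how a byte-at-a-time converter on top of iconv can behave. It is NOT the
UT_Mbtowc source: the class name ByteAccumulator, the 16-byte buffer, the
fixed "UCS-4LE" output and the std::uint32_t result type are all assumptions
made for illustration. Only iconv_open(), iconv() and iconv_close() are real
calls.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <iconv.h>

// Hypothetical stand-in for UT_Mbtowc; this is NOT the AbiWord class.
class ByteAccumulator {
public:
    explicit ByteAccumulator(const char* inCharset)
        : m_cd(iconv_open("UCS-4LE", inCharset)), m_len(0) {}
    ~ByteAccumulator() { if (valid()) iconv_close(m_cd); }
    ByteAccumulator(const ByteAccumulator&) = delete;
    ByteAccumulator& operator=(const ByteAccumulator&) = delete;

    // False when iconv_open() failed, i.e. the descriptor is (iconv_t)-1.
    bool valid() const { return m_cd != (iconv_t)-1; }

    // Feeds one byte. Returns 1 and stores the character in wc when the byte
    // completes a multibyte sequence, otherwise returns 0.
    int mbtowc(std::uint32_t& wc, char mb)
    {
        if (!valid())
            return 0;
        if (m_len == sizeof m_buf)      // overlong garbage: start over
            m_len = 0;
        assert(m_len < sizeof m_buf);
        m_buf[m_len++] = mb;

        char* in = m_buf;
        std::size_t inLeft = m_len;
        unsigned char out[4];
        char* outPtr = reinterpret_cast<char*>(out);
        std::size_t outLeft = sizeof out;

        if (iconv(m_cd, &in, &inLeft, &outPtr, &outLeft) == (std::size_t)-1) {
            if (errno != EINVAL) {      // invalid sequence: drop it and reset
                m_len = 0;
                iconv(m_cd, nullptr, nullptr, nullptr, nullptr);
            }
            return 0;                   // EINVAL: sequence is still incomplete
        }
        m_len = 0;
        if (outLeft != 0)               // nothing usable was produced
            return 0;
        // Assemble the little-endian UCS-4 bytes independently of host order.
        wc = std::uint32_t(out[0]) | std::uint32_t(out[1]) << 8
           | std::uint32_t(out[2]) << 16 | std::uint32_t(out[3]) << 24;
        return 1;
    }

private:
    iconv_t m_cd;
    char m_buf[16];
    std::size_t m_len;
};
```

 In this sketch an invalid descriptor makes every call return 0, so a wrapper
that still returns 1 with a bad descriptor must handle that case differently.
That is worth keeping in mind when comparing against the real class.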

 The most probable reason why it doesn't work is that the iconv_t member of
m_mbtowc is ((iconv_t)-1). Could you check that?
 The input charset for m_mbtowc is set twice - once at creation of IE_Imp_RTF
(it's set to the current locale's charset) and a second time when \ansicpg is
seen, in IE_Imp_RTF::TranslateKeyword:
        switch (*pKeyword) 
        { 
        case 'a': 
                if (strcmp((char*)pKeyword, "ansicpg") == 0) 
                { 
                        m_mbtowc.setInCharset(XAP_EncodingManager::instance-> 
                                charsetFromCodepage((UT_uint32)param)); 
                } 
                break; 
                /* [...] */ 
        } 
 So you should ensure that XAP_EncodingManager::instance->
charsetFromCodepage((UT_uint32)param) returns the name of a charset that libc
knows. If it returns a charset name unknown to glibc, just tell me what it
should return for which parameter (and what it actually returns) and I will
correct it properly (or do it yourself - in
/src/af/xap/xp/xap_EncodingManager.cpp). In the meantime, write a quick hack
so that you don't have to wait for the correct fix (from you or me), and test
it. Test cut and paste after this.
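
 A quick way to check whether a charset name is known to your iconv is to try
opening a converter for it and compare the result with (iconv_t)-1. The helper
below is only a sketch: iconvKnowsCharset() is a hypothetical name, and the
"UCS-4LE" target is an assumption picked because it is widely available.

```cpp
#include <cassert>
#include <iconv.h>

// Returns true if iconv can open a converter from the named charset.
// Hypothetical helper for testing; it is not part of AbiWord.
bool iconvKnowsCharset(const char* name)
{
    assert(name != nullptr);
    iconv_t cd = iconv_open("UCS-4LE", name);
    if (cd == (iconv_t)-1)
        return false;       // unknown name or unsupported conversion
    iconv_close(cd);
    return true;
}
```

 Feed it whatever string charsetFromCodepage() returns for the \ansicpg value
in your test file; if it reports false, that name is the one to fix.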

Also, please test (and fix :) the following: 
* cutting from AW and pasting to other apps 
* pasting to AW from other apps 

 It seems other things are OK. 
 But if you want to polish, add a correct header that will be written by AW
when exporting to LaTeX (see the function XAP_EncodingManager::getTexPrologue()
and the way TexPrologue is initialized in XAP_EncodingManager::initialize()).

 You can also check (and fix) that Word and other apps understand the RTF
generated by AW and that AW understands their RTF.

 Please report any problems. 
 Feel free to contact me directly if you have any trouble.

 PS: Latest news: Belcon reports that with the patch to xap_UnixFont.cpp that
was committed last night, AW shows characters in GB2312 without any problem.

 When testing, remember that your $LANG should contain the name of the 
encoding (as understood by your iconv implementation) - e.g. "zh_CN.GB2312" 
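
 If you are not sure whether your $LANG carries an encoding name, a small
check like the following can help. It is only a sketch: codesetOf() is a
hypothetical helper, and it assumes the usual language_TERRITORY.codeset@modifier
layout of locale names.

```cpp
#include <cassert>
#include <string>

// Returns the codeset part of a locale name such as "zh_CN.GB2312",
// or an empty string if the name contains no codeset.
std::string codesetOf(const std::string& lang)
{
    const std::string::size_type dot = lang.find('.');
    if (dot == std::string::npos)
        return std::string();
    // Strip an optional "@modifier" suffix after the codeset.
    const std::string::size_type at = lang.find('@', dot);
    const std::string::size_type len =
        (at == std::string::npos) ? std::string::npos : at - dot - 1;
    return lang.substr(dot + 1, len);
}
```

 An empty result for the value of $LANG means the encoding name is missing and
should be added before testing.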

 Let's make AW CJK-aware! 

 Thanks for your help in advance. 

 Best regards, 
  -Vlad 

-- 
| This message was re-posted from debian-chinese-gb@lists.debian.org
| and converted from gb2312 to big5 by an automatic gateway.


