
Re: How to convert Unicode numbers into proper utf8 text?



On Wed, Oct 18, 2006 at 05:11:11PM +0800, Jeff Zhang wrote:
> OOo 2.0.4 can export LaTeX file now, but East Asia text were converted
> into unicode numbers, like:
> ...
> \begin{document}
> [95EE?][7956?][5B97?][4E4B?][5FB7?][6CFD?][FF0C?][543E?][8EAB?][6240?][4EAB
> ?][8005?][FF0C?][662F?][5F53?][5FF5?][5176?][79EF?][7D2F?][4E4B?][96BE?][FF
> 1B?][95EE?][5B50?][5B59?][4E4B?][798F?][7949?][FF0C?][543E?][8EAB?][6240?][
> 8D3B?][8005?][FF0C?]
> [662F?][8981?][601D?][5176?][503E?][8986?][4E4B?][6613?][3002?]
> [3000?][3000?][FF0D?][FF0D?][300A?][83DC?][6839?][8C2D?][300B?]
> \end{document}
> 
> How to convert those unicode number(95EE, 7956, ...) into utf8 text?
> Thanks in advance!

If you are looking for a ready-made tool, I don't know of one.
If you are looking for the spec, I got the following from the Unicode 
Standard, version 3.0:

     Scalar value     UTF-16            1st byte 2nd byte 3rd byte 4th byte
     000000000xxxxxxx 000000000xxxxxxx  0xxxxxxx
     00000yyyyyxxxxxx 00000yyyyyxxxxxx  110yyyyy 10xxxxxx
     zzzzyyyyyyxxxxxx zzzzyyyyyyxxxxxx  1110zzzz 10yyyyyy 10xxxxxx
uuuuuzzzzyyyyyyxxxxxx 110110wwwwzzzzyy+ 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                      110111yyyyxxxxxx

where uuuuu = wwww+1 (to account for the addition of 10000 base 16 as in 
Section 3.7, surrogates)
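
The table above can be turned into code fairly directly. What follows is 
a minimal Python sketch, not a ready-made tool: the names utf8_encode and 
convert_tokens are made up for illustration, and the [XXXX?] token 
pattern is assumed from the sample in the quoted mail.

```python
import re


def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value by following the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Assumed token shape, taken from the quoted sample: [95EE?], [FF0C?], ...
TOKEN = re.compile(r"\[([0-9A-Fa-f]{4,6})\?\]")


def convert_tokens(text: str) -> str:
    """Replace every [XXXX?] token with the character it names."""
    return TOKEN.sub(
        lambda m: utf8_encode(int(m.group(1), 16)).decode("utf-8"),
        text,
    )
```

For example, convert_tokens("[95EE?][7956?]") yields the first two 
characters of the sample. A token that was split across a line break, as 
some are in the quoted mail, will not match until the line is rejoined.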

When converting a Unicode scalar value to UTF-8, the shortest form that 
can represent those values shall be used.  This practice preserves 
uniqueness of coding.  For example, the Unicode binary value 
<0000000000000001> is encoded as <00000001>, not as <11000000 10000001>.  
The latter is an example of an irregular UTF-8 byte sequence.  Irregular 
UTF-8 sequences shall not be used for encoding any other information.
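
To see the shortest-form rule in practice, here is a small Python check 
using the two byte sequences from the example above. It relies only on 
Python's built-in strict UTF-8 decoder; the variable names are 
illustrative.

```python
# The two-byte form <11000000 10000001> from the example above.
overlong = bytes([0b11000000, 0b10000001])

try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    # Python's strict decoder refuses the non-shortest form.
    rejected = True

# The one-byte form <00000001> decodes to U+0001 as expected.
shortest = bytes([0b00000001]).decode("utf-8")
```

After this runs, rejected is True and shortest is the single character 
U+0001.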

To which I add that Java, in particular, uses an irregular UTF-8 
sequence to encode the <0000000000000000> character, so that it can 
encode it unambiguously in an environment that would otherwise use an 
all-zero byte to indicate end-of-string.
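
A rough Python sketch of just that NUL handling follows. The function 
name encode_nul_java_style is hypothetical, and this is not a full 
implementation of Java's "modified UTF-8", which also treats characters 
above U+FFFF differently from standard UTF-8.

```python
def encode_nul_java_style(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two bytes C0 80.

    In valid UTF-8 a zero byte can only come from U+0000, so a plain
    byte-level replace is safe here. Covers only the NUL case.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = encode_nul_java_style("a\x00b")
# The result contains no all-zero byte, so it can sit inside a
# zero-terminated C string without being cut short.
```

Note that a strict UTF-8 decoder will reject the C0 80 pair, so the 
receiving side has to know about the convention.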

-- hendrik

