
Re: How to convert Unicode numbers into proper utf8 text?



hendrik@topoi.pooq.com wrote:
> 
> If you are looking for a ready-made tool, I don't know.
> If you are looking for the spec, I got the following from the Unicode 
> Standard, version 3.0:
> 
>      Scalar value     UTF-16            1st byte 2nd byte 3rd byte 4th byte
>      000000000xxxxxxx 000000000xxxxxxx  0xxxxxxx
>      00000yyyyyxxxxxx 00000yyyyyxxxxxx  110yyyyy 10xxxxxx
>      zzzzyyyyyyxxxxxx zzzzyyyyyyxxxxxx  1110zzzz 10yyyyyy 10xxxxxx
> uuuuuzzzzyyyyyyxxxxxx 110110wwwwzzzzyy+ 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
>                       110111yyyyxxxxxx
> 
> where uuuuu = wwww+1 (to account for the addition of 10000 base 16 as in 
> Section 3.7, surrogates)
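
For the archive, here is how I read the table above, as a minimal Python sketch. The function name encode_utf8 is my own and not from any spec or library; it just follows the quoted bit layout row by row and rejects surrogates and out-of-range values.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value using the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Because each branch is only taken when the shorter ones cannot hold the value, this always produces the shortest form. In practice Python already does this for you: chr(n).encode("utf-8") should give the same bytes for any valid scalar value.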
> 
> When converting a Unicode scalar value to UTF-8, the shortest form that 
> can represent those values shall be used.  This practice preserves 
> uniqueness of coding.  For example, the Unicode binary value 
> <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>.  
> The latter is an example of an irregular UTF-8 byte sequence.  Irregular 
> UTF-8 sequences shall not be used for encoding any other information.
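
A quick way to see the shortest-form rule in action, assuming a strict decoder such as the one built into Python 3. The helper name decodes_strictly is my own; it only reports whether the built-in decoder accepts the bytes.

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# <00000001>: the shortest form of U+0001
short_form = bytes([0b00000001])
# <11000000 10000001>: the two-byte form from the quoted example
long_form = bytes([0b11000000, 0b10000001])
```

Here decodes_strictly(short_form) is accepted while decodes_strictly(long_form) is refused, so the two-byte spelling of U+0001 never round-trips through a strict decoder.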
> 
> To which I add that Java, in particular, uses an irregular UTF-8 
> sequence to encode the <0000000000000000> character, so that it can 
> encode it unambiguously in an environment that would otherwise use an 
> all-zero byte to indicate end-of-string.
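
To illustrate the Java point, a rough sketch of just the zero-character difference. The function name encode_with_long_nul is made up for this post. It covers only the U+0000 case described above; as far as I know Java's format also treats characters beyond U+FFFF differently, and this sketch does not attempt that.

```python
def encode_with_long_nul(text: str) -> bytes:
    """Standard UTF-8, except U+0000 becomes the two-byte form C0 80."""
    # In standard UTF-8 a 0x00 byte can only come from U+0000 itself,
    # so replacing that byte after encoding touches nothing else.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")
```

The result contains no all-zero byte, so it can be stored in a zero-terminated string without being cut short.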
> 
> -- hendrik
> 

Thanks for the information!
I'm ready to read through it.


