Re: How to convert Unicode numbers into proper utf8 text?
hendrik@topoi.pooq.com wrote:
>
> If you are looking for a ready-made tool, I don't know of one.
> If you are looking for the spec, I got the following from the Unicode
> Standard, version 3.0:
>
> Scalar value           UTF-16             1st byte  2nd byte  3rd byte  4th byte
> 000000000xxxxxxx       000000000xxxxxxx   0xxxxxxx
> 00000yyyyyxxxxxx       00000yyyyyxxxxxx   110yyyyy  10xxxxxx
> zzzzyyyyyyxxxxxx       zzzzyyyyyyxxxxxx   1110zzzz  10yyyyyy  10xxxxxx
> uuuuuzzzzyyyyyyxxxxxx  110110wwwwzzzzyy+  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
>                        110111yyyyxxxxxx
>
> where uuuuu = wwww+1 (to account for the addition of 10000 base 16 as in
> Section 3.7, surrogates)
>
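For what it's worth, the table above can be turned into code fairly directly. This is only a minimal Python sketch, not a tested tool: the function name `utf8_encode` is my own, and it assumes the input is a single Unicode scalar value given as an integer. Each branch corresponds to one row of the table.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value using the bit layout from the table."""
    # Surrogates (D800..DFFF) and values above 10FFFF are not scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

With that, `b"".join(utf8_encode(n) for n in (0x48, 0x20AC))` gives `b'H\xe2\x82\xac'`. Because each branch is chosen by the size of the value, the result is always the shortest form. In practice Python's built-in `chr(n).encode("utf-8")` does the same job; the sketch is only meant to show the bit shuffling.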
> When converting a Unicode scalar value to UTF-8, the shortest form that
> can represent those values shall be used. This practice preserves
> uniqueness of coding. For example, the Unicode binary value
> <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>.
> The latter is an example of an irregular UTF-8 byte sequence. Irregular
> UTF-8 sequences shall not be used for encoding any other information.
>
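The shortest-form rule matters mostly on the reading side. As a rough Python sketch of how a decoder might enforce it (the name `decode_one` is made up for illustration, and it assumes it is handed exactly one complete sequence), each sequence length has a minimum value below which the sequence is rejected:

```python
def decode_one(seq: bytes) -> int:
    """Decode a single UTF-8 sequence, rejecting non-shortest forms."""
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    # Pick the length, the payload bits of the lead byte, and the smallest
    # value that genuinely needs this many bytes.
    if lead < 0x80:
        length, cp, minimum = 1, lead, 0x0
    elif 0xC0 <= lead < 0xE0:
        length, cp, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead < 0xF0:
        length, cp, minimum = 3, lead & 0x0F, 0x800
    elif 0xF0 <= lead < 0xF8:
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum:
        raise ValueError("not the shortest form")
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return cp
```

Here `decode_one(b"\x01")` returns 1, while the two-byte form from the quoted example, `decode_one(b"\xc0\x81")`, raises `ValueError`.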
> To which I add that Java, in particular, uses an irregular UTF-8
> sequence to encode the <0000000000000000> character, so that it can
> encode it unambiguously in an environment that would otherwise use an
> all-zero byte to indicate end-of-string.
>
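To make the Java point concrete: the two-byte pattern 110yyyyy 10xxxxxx with all payload bits zero is the byte pair C0 80. Below is a small Python sketch of just that NUL handling. The function names are my own, and this is not a full reimplementation of Java's format, only the trick described above.

```python
def encode_nul_safe(text: str) -> bytes:
    """UTF-8 encode text, writing U+0000 as the two-byte form C0 80."""
    # In regular UTF-8 a zero byte can only come from U+0000,
    # so a plain byte replacement is enough.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse encode_nul_safe: map C0 80 back to a zero byte, then decode."""
    # The byte C0 never occurs in regular UTF-8, so this cannot
    # clobber any other character.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")
```

The encoded bytes never contain a zero byte, so they can sit inside a zero-terminated string: `encode_nul_safe("a\x00b")` gives `b'a\xc0\x80b'`.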
> -- hendrik
>
Thanks for the information!
I'm ready to read through it.