Re: How to convert Unicode numbers into proper utf8 text?

To: debian-user@lists.debian.org
Subject: Re: How to convert Unicode numbers into proper utf8 text?
From: Jeff Zhang <idealbsd@gmail.com>
Date: Wed, 18 Oct 2006 23:39:09 +0800
Message-id: <45364A9D.6080109@gmail.com>
In-reply-to: <20061018135036.GA9355@topoi.pooq.com>
References: <4535EFAF.2030801@gmail.com> <20061018135036.GA9355@topoi.pooq.com>

hendrik@topoi.pooq.com wrote:
> 
> If you are looking for a ready-made tool, I don't know of one.
> If you are looking for the spec, I got the following from the Unicode 
> Standard, version 3.0:
> 
>      Scalar value     UTF-16            1st byte 2nd byte 3rd byte 4th byte
>      000000000xxxxxxx 000000000xxxxxxx  0xxxxxxx
>      00000yyyyyxxxxxx 00000yyyyyxxxxxx  110yyyyy 10xxxxxx
>      zzzzyyyyyyxxxxxx zzzzyyyyyyxxxxxx  1110zzzz 10yyyyyy 10xxxxxx
> uuuuuzzzzyyyyyyxxxxxx 110110wwwwzzzzyy+ 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
>                       110111yyyyxxxxxx
> 
> where uuuuu = wwww+1 (to account for the addition of 10000 base 16 as in 
> Section 3.7, surrogates)
> 
> When converting a Unicode scalar value to UTF-8, the shortest form that 
> can represent those values shall be used.  This practice preserves 
> uniqueness of coding.  For example, the Unicode binary value 
> <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>.  
> The latter is an example of an irregular UTF-8 byte sequence.  Irregular 
> UTF-8 sequences shall not be used for encoding any other information.
> 
> To which I add that Java, in particular, uses an irregular UTF-8 
> sequence to encode the <0000000000000000> character, so that it can 
> encode it unambiguously in an environment that would otherwise use an 
> all-zero byte to indicate end-of-string.
> 
> -- hendrik
> 

Thanks for the information!
I'll read through it.
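
For anyone who finds this thread later, here is a minimal Python sketch of the bit layout in the quoted table, turning one Unicode scalar value into UTF-8 bytes. The function name utf8_encode is made up for illustration, and rejecting surrogate code points (D800 to DFFF) is my assumption about what "scalar value" should mean here, not something stated in the quoted text.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8, using the shortest form.

    Hypothetical helper written for illustration; the branches mirror
    the four rows of the quoted table.
    """
    # Assumption: surrogates and out-of-range values are rejected.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    # Print the UTF-8 bytes of a few code points in hex.
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

Because each branch is only reached when the value does not fit in a shorter form, the result is always the shortest encoding. In everyday Python, chr(cp).encode("utf-8") does the same job; the sketch is only meant to make the table concrete.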

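The shortest-form rule and the Java exception from the quoted message can also be checked from Python. This is a rough sketch, not a full decoder: is_strict_utf8 and decode_java_style are hypothetical names, and decode_java_style only handles the two-byte encoding of the zero character mentioned above. It assumes the input contains no other Java-specific encodings.

```python
# Two-byte form of <0000000000000001> from the quoted example.
overlong_one = bytes([0b11000000, 0b10000001])
# Two-byte form of the all-zero character, as described for Java above.
java_style_nul = bytes([0xC0, 0x80])


def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_java_style(data: bytes) -> str:
    """Decode bytes in which U+0000 may appear as C0 80.

    Assumption: this is the only deviation from standard UTF-8 in the
    input. The byte C0 never occurs in well-formed UTF-8, so the
    replacement cannot touch any other character.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


if __name__ == "__main__":
    print(is_strict_utf8(b"\x01"))         # shortest form
    print(is_strict_utf8(overlong_one))    # non-shortest form
    print(is_strict_utf8(java_style_nul))  # Java-style zero character
    print(repr(decode_java_style(b"a" + java_style_nul + b"b")))
```

Python's built-in decoder refuses both two-byte sequences, which is why a separate step is needed before decoding data that uses the Java convention.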