
Re: utf



On Tue, 03 Apr 2018, Darac Marjal wrote:
> If these things matter to you, it's better to convert from UTF-8 to Unicode,

UTF-8 *is* Unicode :p  What you mean is either UCS-4 or UTF-32 (which
are just other encodings of Unicode).  But all of them are Unicode.

UTF-* are only used for Unicode encodings: one implies the other.  You
could encode generic binary data using a bit-packing scheme identical to
UTF-8, but it would have to be called something else.
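
To make the bit-packing concrete, here is a minimal sketch of a UTF-8
encoder in C.  The name utf8_encode is my own for illustration, not
something from a standard library, and it assumes the caller provides a
buffer of at least 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[] (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above 0x10FFFF).
 * Hypothetical helper, for illustration only. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

For example, U+20AC (the euro sign) comes out as the three bytes
E2 82 AC.  Drop the two range checks and the same packing would carry
arbitrary integers, at which point it is no longer UTF-8.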

> first. I tend to think of Unicode as an arbitrarily large code page. Each
> character maps to a number, but that number could be 1, 1000 or 500_000
> (Unicode seems to be growing without an end in sight). Internally, you

It won't go past 32 bits without becoming something other than Unicode,
and the real limit is lower (0x10FFFF, per RFC 3629).  What the Unicode
Consortium is doing is *filling in* this range.  There is still a lot of
unallocated/reserved space, but yes, we are perfectly capable of filling
it up with junk given a few decades.
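
In code, that upper bound is a one-line check.  A minimal sketch in C
(the function name is made up for this example); it also rejects the
surrogate range 0xD800-0xDFFF, which is reserved for UTF-16 and never
assigned to characters:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp is a value a Unicode character could ever have:
 * at most 0x10FFFF and not in the surrogate range.
 * Hypothetical helper, for illustration only. */
bool is_unicode_scalar(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

Note that this says nothing about whether the code point is currently
*allocated*; that needs the Unicode character database.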

> might store those code points as Integers or Quad Words or whatever you
> like. Only once you're ready to transfer the text to another process (print

"whatever you like" is a rather bad idea.

Use whatever is appropriate for the internal representation of Unicode
on whatever programming language you are dealing with, and fall back to
UCS-4 (unsigned 32-bit integer) if there isn't one.

For C, you'd use uint32_t to store the codepoints, and UCS-4 (or UTF-32)
to encode them (they are the same if you don't care about detecting
illegal code points and are not serializing that data directly to the
outside world).  There is wchar_t, but that thing is bad news if you
need to be portable _and_ handle anything outside of the Unicode BMP.

IMO, you will want to avoid UCS-2 and UTF-16 as much as you can.
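
To illustrate why 16-bit units get awkward outside the BMP, here is a
rough sketch in C of what UTF-16 has to do with a single code point.
The name utf16_encode is mine, purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 into out[] (room for 2 units).
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp
 * is a surrogate or above 0x10FFFF.
 * Hypothetical helper, for illustration only. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* reserved for surrogate pairs */
        out[0] = (uint16_t)cp;          /* BMP: one unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;

    cp -= 0x10000;                      /* 20 bits left to split */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

U+1F600 becomes the pair D83D DE00, so "one array element per
character" stops being true as soon as you leave the BMP.  With
uint32_t code points it stays true.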

> Basically, you consider UTF-8 to be a transfer-only format (like Base64). If
> you want to do anything non-trivial with it, decode it into Unicode.

You mean decode it to UCS-4/UTF-32, but yes, that's the idea.
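
A minimal sketch of that decoding step in C, one code point at a time.
utf8_decode is a name I made up for this example; a real program would
more likely use a library such as ICU or iconv.  It assumes C99 and
rejects overlong forms, surrogates and values above 0x10FFFF:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the first bytes of s[0..len).
 * On success stores it in *cp and returns the number of bytes used
 * (1 to 4).  Returns 0 on truncated or malformed input.
 * Hypothetical helper, for illustration only. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n;       /* expected sequence length */
    uint32_t v;     /* value being assembled */
    uint32_t min;   /* smallest value allowed for this length */

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; v = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; v = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; v = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;   /* stray continuation byte or invalid lead byte */
    }

    if (len < n)
        return 0;   /* truncated sequence */

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates and out-of-range values. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;

    *cp = v;
    return n;
}
```

Call it in a loop, advancing by the return value, and append each
result to a uint32_t array; that array is your UCS-4/UTF-32 text.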

-- 
  Henrique Holschuh

