
Re: how u mine 4 utf8 [was Re: Using .XCompose]



On Sat, 18 Jul 2020 davidson wrote:
On Thu, 16 Jul 2020 Ajith R wrote:
On Thursday 16 July 2020 4:54:09 AM IST davidson wrote:

[snip]
I recommend two references I found helpful, though I didn't read
either one especially carefully. I mostly just stared at the templates
(the UTF-8 fighters below), repeatedly asked myself "why did they do
it that way?" and then skimmed to confirm guesses.

First, utf-8(7).

$ man 7 utf-8

These two too:

RFC 3629 - UTF-8, a transformation format of ISO 10646 (14 pages)
https://tools.ietf.org/html/rfc3629

The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

In the present version (13.0.0) of the latter:

 * Chapter 2 (General Structure), pages 37-38: a brief feature summary
   of UTF-8.

 * Section 3.9, page 124: a chart that I think makes the mapping from
   Unicode code points to UTF-8 especially clear.

 * Section 3.9 "Unicode Encoding Forms": formal definitions, if one
   enjoys that kind of thing.

[snip]
So given a Unicode code point, do this:

1. discard its leading 0 bits,
2. count how many bits you have left (which is how many x you will
   need from the templates below), and
3. pick your UTF-8 fighter:

And pick the shortest one that works. Choosing a longer one gives an
overlong encoding, which is not valid UTF-8.


1 octet  (7 bits)  0xxx  xxxx
2 octets (11 bits) 110x  xxxx  10xx  xxxx
3 octets (16 bits) 1110  xxxx  10xx  xxxx  10xx  xxxx
4 octets (21 bits) 1111  0xxx  10xx  xxxx  10xx  xxxx  10xx  xxxx
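
For anyone who would rather read the recipe as code, here is a rough
Python sketch of the three steps and the four templates above. The
function name utf8_encode and the layout of the templates list are my
own invention for illustration, not taken from any of the references,
and the range check at the top is an extra that the steps above do not
mention (RFC 3629 excludes the surrogates and anything past U+10FFFF).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the shortest template that fits."""
    # Extra check, not part of the three steps: reject surrogates and
    # values outside the Unicode range.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")

    # Steps 1 and 2: bit_length() is the bit count once leading 0 bits
    # are discarded.
    nbits = cp.bit_length()

    # Step 3: (x capacity, lead-octet prefix, number of 10xxxxxx octets),
    # listed shortest first so the first match is the shortest that works.
    templates = [(7, 0x00, 0), (11, 0xC0, 1), (16, 0xE0, 2), (21, 0xF0, 3)]
    for capacity, prefix, ncont in templates:
        if nbits <= capacity:
            break

    # Fill the x positions from the right, six bits per continuation octet.
    tail = []
    for _ in range(ncont):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6

    # Whatever bits remain go into the x positions of the lead octet.
    return bytes([prefix | cp] + tail[::-1])


if __name__ == "__main__":
    for ch in "A\u00e9\u20ac\U0001F600":
        print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "))
```

A quick sanity check is to compare the result against Python's own
encoder, for example utf8_encode(ord(ch)) == ch.encode("utf-8").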
[snip]

--
What is important is seldom urgent
and what is urgent is seldom important
-- Dwight David Eisenhower

