Re: how u mine 4 utf8 [was Re: Using .XCompose]

To: debian-user@lists.debian.org
Subject: Re: how u mine 4 utf8 [was Re: Using .XCompose]
From: davidson <davidson@freevolt.org>
Date: Sat, 18 Jul 2020 07:13:51 +0000 (UTC)
Message-id: <alpine.DEB.2.21.2007180713420.6331@azone.org>
In-reply-to: <alpine.DEB.2.21.2007180117440.12973@azone.org>
References: <143551610.2618526.1594013699409.ref@mail.yahoo.com> <143551610.2618526.1594013699409@mail.yahoo.com> <alpine.DEB.2.21.2007061804280.26486@azone.org> <alpine.DEB.2.21.2007070628500.6440@azone.org> <885539452.3315628.1594115314827@mail.yahoo.com> <alpine.DEB.2.21.2007092139232.19897@azone.org> <908961137.4769747.1594370585219@mail.yahoo.com> <1104871163.4762905.1594370616864@mail.yahoo.com> <alpine.DEB.2.21.2007110839180.28476@azone.org> <alpine.DEB.2.21.2007111021500.7787@azone.org> <520110585.417671.1594565493490@mail.yahoo.com> <alpine.DEB.2.21.2007152324040.10414@azone.org> <1370813785.2444121.1594924286284@mail.yahoo.com> <alpine.DEB.2.21.2007180117440.12973@azone.org>

On Sat, 18 Jul 2020 davidson wrote:

On Thu, 16 Jul 2020 Ajith R wrote:

On Thursday 16 July 2020 4:54:09 AM IST davidson wrote:


[snip]

I recommend two references I found helpful, though I didn't read
either one especially carefully. I mostly just stared at the templates
(the UTF-8 fighters below), repeatedly asked myself "why did they do
it that way?" and then skimmed to confirm guesses.

First, utf-8(7).

$ man 7 utf-8


These two too:

RFC 3629 - UTF-8, a transformation format of ISO 10646 (14 pages)
https://tools.ietf.org/html/rfc3629

The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

In the present version (13.0.0) of the latter:

 * Pages 37-38 (in chapter 2, General Structure) give a brief feature
   summary of UTF-8.

 * Page 124 (in section 3.9) has a chart that I think makes the
   mapping from Unicode code points to UTF-8 especially clear.

 * Section 3.9, "Unicode Encoding Forms", gives the formal
   definitions, if one enjoys that kind of thing.

[snip]

So given a Unicode code point, do this:

1. discard its leading 0 bits,
2. count how many bits you have left (which is how many x you will
   need from the templates below), and
3. pick your UTF-8 fighter:


And pick the shortest one that works. Choosing a longer one gives an
invalid result: a so-called "overlong" encoding, which RFC 3629 forbids
and conforming decoders reject.


1 octet  (7 bits)  0xxx  xxxx
2 octets (11 bits) 110x  xxxx  10xx  xxxx
3 octets (16 bits) 1110  xxxx  10xx  xxxx  10xx  xxxx
4 octets (21 bits) 1111  0xxx  10xx  xxxx  10xx  xxxx  10xx  xxxx
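
For anyone who would rather see the recipe above as code, here is a
minimal Python sketch of it. The function name utf8_encode is my own
invention, not anything from the references, and the range and
surrogate checks are taken from RFC 3629 rather than from the steps
above. Treat it as an illustration of the templates, not as a
replacement for a real encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the shortest template that fits.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")

    # Steps 1 and 2: bit_length() ignores leading zero bits, so it is
    # exactly the number of x slots the code point needs.
    bits = code_point.bit_length()

    # Step 3: pick the shortest template with enough x slots.
    if bits <= 7:
        return bytes([code_point])          # 0xxx xxxx
    if bits <= 11:
        n_cont, lead = 1, 0b11000000        # 110x xxxx
    elif bits <= 16:
        n_cont, lead = 2, 0b11100000        # 1110 xxxx
    else:
        n_cont, lead = 3, 0b11110000        # 1111 0xxx

    # Fill the continuation octets (10xx xxxx) from the right, six bits
    # at a time; whatever remains goes into the lead octet.
    octets = []
    for _ in range(n_cont):
        octets.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    octets.append(lead | code_point)
    return bytes(reversed(octets))


if __name__ == "__main__":
    # U+20AC has 14 significant bits, so it takes the 3-octet template.
    print(utf8_encode(0x20AC).hex(" "))     # e2 82 ac

    # "A" (U+0041) forced into the 2-octet template is overlong, and
    # Python's own decoder refuses it.
    try:
        bytes([0xC1, 0x81]).decode("utf-8")
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

Comparing the result against Python's built-in chr(cp).encode("utf-8")
for a handful of code points is a quick way to convince yourself the
templates were filled in correctly.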

[snip]

--
Ce qui est important est rarement urgent
et ce qui est urgent est rarement important
-- Dwight David Eisenhower
