
how u mine 4 utf8 [was Re: Using .XCompose]



On Thu, 16 Jul 2020 Ajith R wrote:
On Thursday 16 July 2020 4:54:09 AM IST davidson wrote:

[snip]
   $ sed 'y/\xc2\xa0/%/' somefile

An off-topic question: of sed, awk and perl, if I am to choose one to
learn, which would you suggest? I wanted to do some substitutions. I
read about them and decided on Perl because, from what I understood,
it has better support for regular expressions and can do almost
everything that sed and awk can do. Have I made the right
decision?

[interesting question. postponing my answer]

(However, one advantage of using the C-style byte-constants (\xHH)
instead is that it is easy for everyone to see what they are, the
web archive won't replace them with normal spaces, etc.)

Using the Unicode sequence also gives the same advantages, doesn't
it?

Sure. The unicode code points for characters are a slightly more
abstract value for "what they are" than whatever bytes represent them
in some file, it seems to me. Depending on the purpose, this could be
a good difference.

I find it difficult to get the "translation" between the unicode
code values and the hexadecimal/octal representation.

In this case "the hexadecimal/octal representation" is the UTF-8
representation of unicode code points.
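
To see that correspondence for a given code point, here is a small
Python sketch. Python is my choice for illustration, not something
from this thread, and the helper name utf8_escapes is made up. It
leans on Python's built-in encoder and only formats the resulting
bytes as hex and octal escapes.

```python
def utf8_escapes(code_point):
    """Return (hex, octal) escape strings for the UTF-8 bytes of a code point.

    utf8_escapes is a hypothetical helper name, used here for illustration only.
    """
    encoded = chr(code_point).encode("utf-8")
    hex_form = "".join(f"\\x{b:02x}" for b in encoded)
    oct_form = "".join(f"\\{b:03o}" for b in encoded)
    return hex_form, oct_form


hex_form, oct_form = utf8_escapes(0x00A0)  # no-break space
print(hex_form, oct_form)  # prints: \xc2\xa0 \302\240
```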

My understanding of text handling has a lot of weaknesses and fuzzy
spots that these discussions have made more apparent to me. And the
map from unicode code points to UTF-8 is one of them.

Or at least it was, until yesterday.

I recommend two references I found helpful, though I didn't read
either one especially carefully. I mostly just stared at the templates
(the UTF-8 fighters below), repeatedly asked myself "why did they do
it that way?" and then skimmed to confirm guesses.

First, utf-8(7).

 $ man 7 utf-8

Second, Wikipedia has a not-bad article about UTF-8.

Anyways, here is my ELI5 explainer of how to represent a unicode code
point in UTF-8.


Mapping unicode to UTF-8

[If you don't/can't/won't display this with a fixed-width font, like
courier or something, you might prefer to stop reading here. The
references above are pretty good anyways.]

It helps me to notice that the UTF-8 encoding of unicode permits you
to know --whenever you grab some octet of bits-- where that byte
belongs in the following triage based on its Most Significant Bits
(MSBs).

The octet...

 1. represents an entire character on its own (MSB is 0)
 2. is a non-initial byte of a multi-byte character (MSBs are 10)
 3. is the initial byte of an N-byte character with N>1 (MSBs are N 1s trailed by a 0)

(Notice that you could rephrase (1-3) above in terms of the position
of the most significant zero bit.)
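
The triage above can be sketched in a few lines of Python. This is
only an illustration: the function name classify_octet and the labels
it returns are my own inventions, and it checks nothing beyond the
leading-bit pattern (a real decoder would also reject some bytes that
pass this test, such as 0xC0).

```python
def classify_octet(octet):
    """Classify one byte (0..255) by its most significant bits.

    classify_octet and its return labels are hypothetical, for illustration.
    """
    if (octet & 0b10000000) == 0:
        return "single"          # 0xxxxxxx: a whole character on its own
    if (octet & 0b11000000) == 0b10000000:
        return "continuation"    # 10xxxxxx: non-initial byte
    # Otherwise count the leading 1 bits to find N.
    n = 0
    mask = 0b10000000
    while octet & mask:
        n += 1
        mask >>= 1
    if 2 <= n <= 4:
        return f"lead-{n}"       # initial byte of an N-byte character
    return "invalid"             # 11111xxx matches no template


for octet in (0x41, 0xC2, 0xA0, 0xE0, 0xF0):
    print(f"{octet:08b} {classify_octet(octet)}")
```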

So given a unicode code point, do this:

 1. discard its leading 0 bits,
 2. count how many bits you have left (which is how many x you will
    need from the templates below), and
 3. pick your UTF-8 fighter:

1 octet  (7 bits)  0 x x x  x x x x
2 octets (11 bits) 1 1 0 x  x x x x  1 0 x x  x x x x
3 octets (16 bits) 1 1 1 0  x x x x  1 0 x x  x x x x  1 0 x x  x x x x
4 octets (21 bits) 1 1 1 1  0 x x x  1 0 x x  x x x x  1 0 x x  x x x x  1 0 x x  x x x x

Plug the unicode code point's bits into the rightmost slots marked
with x in the template, and fill any remaining, more significant
(leftward) x slots with zeros.
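
For anyone who would rather read the recipe as code, here is a
minimal Python sketch of it. The name encode_utf8 is made up, and
Python already does all of this via str.encode; the point is only to
show the template filling. It works from the right, peeling off six
bits per trailing octet, so the zero padding of the leading octet
falls out automatically. Rejecting surrogates (U+D800 to U+DFFF) is
an extra rule of UTF-8 that the templates alone don't express.

```python
def encode_utf8(code_point):
    """Encode one code point by filling the smallest template that fits.

    encode_utf8 is a hypothetical name; this is an illustration, not a
    replacement for str.encode("utf-8").
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")

    n_bits = code_point.bit_length()  # bits left after dropping leading zeros
    if n_bits <= 7:
        return bytes([code_point])    # 0xxxxxxx
    if n_bits <= 11:
        n_octets, lead = 2, 0b11000000
    elif n_bits <= 16:
        n_octets, lead = 3, 0b11100000
    else:
        n_octets, lead = 4, 0b11110000

    octets = []
    for _ in range(n_octets - 1):
        # Rightmost six bits go into a 10xxxxxx trailing octet.
        octets.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    octets.append(lead | code_point)  # whatever is left fits the lead octet
    return bytes(reversed(octets))


print(encode_utf8(0x00A0).hex())  # prints: c2a0
```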

Synopsis of worked examples below:
[unicode code point]
[unicode representation, stripped of leading zero bits]
[smallest appropriate UTF-8 template]
[UTF-8 representation]

Example 1. Nonbreaking space character

U+00A0

A        0
1 0 1 0  0 0 0 0

1 1 0 x  x x x x  1 0 x x  x x x x

C        2        A        0
1 1 0 0  0 0 1 0  1 0 1 0  0 0 0 0


Example 2. Don't know the name of this one. I tell myself it
represents a velar nasal stop-initial syllable, but I would appreciate
correction.

U+0D19

1 1 0 1  0 0 0 1  1 0 0 1

So, we need to fit the 12 bits above into one of the four UTF-8
templates.

The smallest two can only fit 7 and 11 bits respectively, so we'll
need the next larger template, the three-octet one:

1 1 1 0  x x x x  1 0 x x  x x x x  1 0 x x  x x x x

E        0        B        4        9        9
1 1 1 0  0 0 0 0  1 0 1 1  0 1 0 0  1 0 0 1  1 0 0 1


Example 3. A Malayalam virama (sp?)

U+0D4D

1 1 0 1  0 1 0 0  1 1 0 1


1 1 1 0  x x x x  1 0 x x  x x x x  1 0 x x  x x x x

E        0        B        5        8        D
1 1 1 0  0 0 0 0  1 0 1 1  0 1 0 1  1 0 0 0  1 1 0 1
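
If you want to check worked examples like these without doing the bit
shuffling by hand, Python's built-in encoder will do. A small sketch
(the function name describe is made up for illustration):

```python
def describe(code_point):
    """Return the code point, its UTF-8 bytes in hex, and the same bytes in binary.

    describe is a hypothetical helper name, used here for illustration only.
    """
    encoded = chr(code_point).encode("utf-8")
    hex_digits = encoded.hex().upper()
    bits = " ".join(f"{b:08b}" for b in encoded)
    return f"U+{code_point:04X}  {hex_digits}  {bits}"


for cp in (0x00A0, 0x0D19, 0x0D4D):
    print(describe(cp))
# prints:
# U+00A0  C2A0  11000010 10100000
# U+0D19  E0B499  11100000 10110100 10011001
# U+0D4D  E0B58D  11100000 10110101 10001101
```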

--
What is important is seldom urgent,
and what is urgent is seldom important
-- Dwight David Eisenhower
