
Re: Using .XCompose



On Mon, 13 Jul 2020 Greg Wooledge wrote:
On Sat, Jul 11, 2020 at 08:32:34PM +0000, davidson wrote:
'!' marks the spot of nonbreaking spaces that made it into OP's first
report of odd behavior, upon testing the white scissors XCompose rule:

  $ grep "WHITE SCISSORS"  d-u_xcompose_2020-07-08.nbsp | tr $'\xc2\xa0' \!
  <Multi_key> <s> <x>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! : "✄"!!!! U2704 # WHITE SCISSORS

Note that tr does not handle multi-character sequences.  If you pass
something like

tr abc xyz

It does *not* look for "abc" sequences and convert only those sequences.
Rather, it looks at single characters.  It converts 'a' to 'x', and 'b'
to 'y', and 'c' to 'z'.

The number of characters in the first pattern is supposed to match the
number of characters in the second pattern, so that there is a 1:1
mapping.
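
A minimal sketch of that per-character mapping, using the same two sets as
above (the input string is made up for illustration):

```shell
# Each character of the first set maps to the character at the same
# position in the second set: a->x, b->y, c->z. Other characters pass
# through unchanged, and "abc" is never treated as a sequence.
out=$(printf 'cab cabbage\n' | tr abc xyz)
printf '%s\n' "$out"    # prints: zxy zxyyxge
```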

This much, as it happens, I knew.

My intent was to make plain the nonbreaking spaces for the list, and
in particular for the list archives. (Because it appears that the
versions of messages posted in the web-archives do not preserve such
characters.)

GNU tr also does not handle multi-byte *characters* correctly (which
violates POSIX -- it's a known bug).

And *this* I did *not* know. It was my incorrect belief that the two
bytes of tr's first argument would be treated by tr as a single
character.

It wasn't a *firm* belief. I just did not think it through
carefully. Had I wanted its actual behavior instead, I might even have
expected it. Even a stopped clock is right once in a while!

I see now that "info tr" mentions this behavior up-front. (And
indicates that in future tr will support multi-byte characters.)

In the man page tr(1) I see nothing about all this. It simply talks
about "characters", and assumes the reader is some kind of K&R
obsessed mind-reader who knows "Well, of course a 'character' here
just denotes an octet of bits."

So, your tr command actually converts all c2 bytes into ! and all a0
bytes into ! as well.  Not *just* c2a0 pairs.
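
A small sketch of that byte-wise behavior. The sample input is invented: one
NBSP (bytes c2 a0) between "a" and "b", followed by "à" (bytes c3 a0), which
is not an NBSP but shares the a0 byte. LC_ALL=C is an assumption added here so
that any tr works on bytes; according to the thread, GNU tr behaves this way
in UTF-8 locales too.

```shell
# NBSP is U+00A0, which UTF-8 encodes as the two bytes c2 a0 (octal 302 240).
nbsp=$(printf '\302\240')

# Map both bytes to '!' explicitly (two-byte set1, two-character set2),
# then dump the result as hex with the whitespace stripped.
out=$(printf 'a\302\240b\303\240\n' | LC_ALL=C tr "$nbsp" '!!' |
      od -An -tx1 | tr -d ' \n')
printf '%s\n' "$out"    # prints: 61212162c3210a
```

The single NBSP comes out as two '!' bytes (21 21), which is why the tr output
line grows, and the a0 byte of "à" is replaced as well, leaving a stray c3
byte behind.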

Thank you for catching all this, and for concise and comprehensive
explanations.

tr's output line was significantly longer than the line it received as
input. I ought to have noticed.

Nevertheless, this is useful as a first pass approximation to say that,
hey, there *might* be a bunch of NBSPs here, and you should take a
closer look.

You are being charitable.

Since it does not (currently) know what characters are when they
aren't composed of single octets, it was the wrong tool for the job.

  sed 'y/\xc2\xa0/!/'

seems to do the right thing.
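
For comparison, a hedged alternative that does not rely on sed's handling of
\x escapes: substitute the literal two-byte sequence as a unit. This is only a
sketch on invented input, not the command used above.

```shell
# NBSP as its literal UTF-8 bytes, c2 a0 (octal 302 240).
nbsp=$(printf '\302\240')

# s///g matches the c2 a0 pair as a whole, so the lone a0 byte inside
# "à" (c3 a0) is left untouched. Output is dumped as hex for inspection.
out=$(printf 'a\302\240b\303\240\n' | sed "s/$nbsp/!/g" |
      od -An -tx1 | tr -d ' \n')
printf '%s\n' "$out"    # prints: 612162c3a00a
```

Here the NBSP becomes a single '!' (21) and "à" survives intact (c3 a0).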

NBSPs most often result when someone gets lazy and pastes a line
from a web page or from a Microsoft Word/Excel document into a Unix
terminal or X11 application, instead of pasting just the characters
they actually want.  Web pages, especially *older* web pages, often
use NBSPs for primitive formatting.

Noted.

--
What is important is seldom urgent,
and what is urgent is seldom important.
-- Dwight David Eisenhower
