
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...



On Mon, Dec 11, 2023 at 02:00:49PM +0000, Albretch Mueller wrote:
>  Ach, yes! I forgot echo by default appends a new line character at
> the end of every string it spits out. In order to suppress it you need
> to use the "n" option: "echo -n ..."
> 
> _FL_TYPE="   abc  á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ ¡ §
> ASCII  ä ö ü ß Ä Ö Ü Text    "
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
> _FL_TYPE=$(echo "${_FL_TYPE}" | xargs)
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
> _FL_TYPE=$(echo -n "${_FL_TYPE}" |  tr --complement --squeeze-repeats
> '[A-Za-z0-9.]' '_');
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
> 
> // __ $_FL_TYPE: |   abc  á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here
> ¿ ¡ § ASCII  ä ö ü ß Ä Ö Ü Text    |
> // __ $_FL_TYPE: |abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ ¡
> § ASCII ä ö ü ß Ä Ö Ü Text|
> // __ $_FL_TYPE: |abc_123_birdie_here_ASCII_Text|

OK.  Tomas's analysis was better than mine in this case.  Looks like CR
was not the issue this time around.  I do have some comments, though.

1) Many implementations of echo will interpret parts of their argument(s),
   in addition to processing options like -n.  If you want to print a
   variable's contents to standard output without *any* interpretation,
   use printf.

    printf %s "$myvar"
    printf '%s\n' "$myvar"

2) As Tomas already told you, the square brackets in

    tr -c -s '[A-Za-z0-9.]' _

   are literal.  You're using a command which will keep left and right
   square brackets in the input, *not* replacing them with underscores.
   This may not be what you want.

3) In locales other than C or POSIX, ranges like A-Z are *not* necessarily
   synonyms for [:upper:].  As I've already mentioned, GNU tr is known to
   contain bugs, so you're getting lucky here.  The bugs in GNU tr happen
   to work the way you're expecting, so that A-Z is treated like [:upper:]
   when it should not be.  If at some point in the future GNU tr is fixed
   to conform to POSIX, your script may break.

   The correct tr command you should be using if you want to retain
   accented letters (as defined in your locale) is:

    tr -c -s '[:alnum:].' _

   If you want to discard accented letters, then either of these is OK
   (character classes follow LC_CTYPE and ranges follow LC_COLLATE, so
   LC_ALL=C covers both):

    LC_ALL=C tr -c -s '[:alnum:].' _
    LC_ALL=C tr -c -s 'A-Za-z0-9.' _

4) The xargs command, which you used above, uses quotation mark characters
   as well as whitespace to define input words.  Your example worked only
   because your input does not contain any single or double quotes.
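
To make points 1 and 2 concrete, here is a minimal sketch.  The helper
names (print_raw, keep_brackets, drop_brackets) and the sample values are
made up for illustration and are not from the original script.  LC_ALL=C is
set only so that the result does not depend on your locale.

```shell
# Hypothetical value that many echo implementations would treat as an option.
myvar='-n'

# printf with a fixed %s format prints the value without interpreting it.
print_raw() { printf '%s\n' "$1"; }

# With the brackets inside the set, [ and ] are members of the set,
# so the complement never touches them.
keep_brackets() { LC_ALL=C tr -c -s '[A-Za-z0-9.]' _; }

# Without the brackets, [ and ] are replaced like any other character.
drop_brackets() { LC_ALL=C tr -c -s 'A-Za-z0-9.' _; }

print_raw "$myvar"                                      # -n
printf '%s' 'a[b]c d' | keep_brackets; printf '\n'      # a[b]c_d
printf '%s' 'a[b]c d' | drop_brackets; printf '\n'      # a_b_c_d
```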

Here's a demonstration of A-Z not equating to [:upper:] using GNU sed,
which is behaving correctly:

    unicorn:~$ x='   abc  á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ '
    unicorn:~$ printf '%s\n' "$x" | sed 's/[A-Z]//g'
       abc  á é í ó ú ü ñ        123 birdie🐦here ¿ 
    unicorn:~$ printf '%s\n' "$x" | LC_COLLATE=C sed 's/[A-Z]//g'
       abc  á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ 

The meaning of [A-Z] in the sed command depends on the locale.  In my
locale, which is en_US.utf8, characters like Á are part of the A-Z range.
In the C locale, they aren't, as seen in the last command above.
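
If the goal is to stop depending on the caller's locale, one option is to
pin the locale on the tr command itself.  The sketch below assumes you want
ASCII letters, digits and dots only; sanitize is a hypothetical helper
name.  It uses LC_ALL rather than LC_COLLATE because a caller's LC_ALL, if
set, overrides the individual LC_* variables.

```shell
# Replace every run of bytes outside A-Z, a-z, 0-9 and "." with one "_".
# In the C locale tr sees the bytes of a multibyte character such as "á"
# as non-ASCII bytes, so they fall into the complement.
sanitize() { LC_ALL=C tr -c -s 'A-Za-z0-9.' _; }

printf '%s' 'abc á É 123' | sanitize; printf '\n'   # abc_123
```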

The use of [A-Z] in regular expressions and globs is a *very* heavily
debated topic, and I'm only scratching the surface here.  Honestly, you
really should avoid using it.  It's just too unpredictable.

Here's an example of xargs failing when its input contains a quote:

    unicorn:~$ echo 'foo "bar' | xargs
    xargs: unmatched double quote; by default quotes are special to xargs unless you use the -0 option
    foo

You can't use xargs to normalize whitespace safely.  In fact, the proper
way to normalize whitespace is...

    unicorn:~$ printf 'foo "bar \t\t \t  baz  \n' | tr -s ' \t' ' '
    foo "bar baz 
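
Note that tr only squeezes; a single leading or trailing space survives,
as in the output above.  If those need to go too, something along these
lines might do.  This is a sketch, and normalize_ws is a made-up name, not
a standard utility.

```shell
# Collapse runs of spaces and tabs, then trim the one space that may
# remain at each end.  Quotes in the input are left alone.
normalize_ws() {
    # s is a plain global variable; POSIX sh has no "local".
    s=$(printf '%s' "$1" | tr -s ' \t' ' ')
    s=${s# }    # drop one leading space, if present
    s=${s% }    # drop one trailing space, if present
    printf '%s\n' "$s"
}

normalize_ws "$(printf '  foo "bar \t\t \t  baz  ')"   # foo "bar baz
```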

Thus, we come full circle.

