
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...



 "tr --complement --squeeze-repeats ..." makes sure that the replacement
character appears only once in a row (it is never immediately repeated).
For example, a run such as "  " (two spaces) or "?$|" (three characters)
is replaced by a single underscore.
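
 A minimal sketch of that squeezing, assuming GNU tr and using the short
options -c and -s (equivalent to --complement and --squeeze-repeats).
The sample string is made up for illustration, and the set is written
without brackets:

```shell
# Hypothetical sample: two spaces, then the three characters ?$|
sample='a  b?$|c'

# Every run of characters outside A-Za-z0-9. becomes a single underscore.
squeezed=$(printf '%s' "$sample" | tr -cs 'A-Za-z0-9.' '_')

printf '%s\n' "$squeezed"    # prints: a_b_c
```

 printf '%s' is used here instead of echo so that no trailing newline is
fed into tr.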

In the case of: "ASCII text"
 what should come out of it is: "ASCII_text"
 not: "ASCII_text_"
 no underscore at the end. That is the question I have.
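
 One way to narrow this down, as a sketch only: it assumes GNU tr and that
the value is exactly "ASCII text" (the real content of ${ftype} is not
shown here). echo appends a newline to its output, and a newline is
outside the set, so it is a plausible source of the extra underscore.
Comparing echo with printf '%s' makes the difference visible:

```shell
# Assumed value; the actual ${ftype} may differ.
ftype='ASCII text'

# echo sends "ASCII text" plus a newline; the newline is not in the set.
with_echo=$(echo "${ftype}" | tr -cs 'A-Za-z0-9.' '_')

# printf '%s' sends the string only, with no trailing newline.
with_printf=$(printf '%s' "${ftype}" | tr -cs 'A-Za-z0-9.' '_')

printf '%s\n' "$with_echo"      # prints: ASCII_text_
printf '%s\n' "$with_printf"    # prints: ASCII_text
```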

 I use constructs such as "[A-Za-z0-9.]" to make explicit to myself
and other people what I mean. I work in corpus research, dealing with
text in various alphabets, not just ASCII, so I avoid any kind of
linguistic/cultural shortcuts and abbreviations.
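
 A small sketch of how the brackets behave, assuming GNU tr, where '[' and
']' in a set like this are ordinary characters (as the quoted reply below
points out). The input string is hypothetical. With the brackets in the
set, literal brackets in the input are kept; without them, they are
replaced like any other character outside the set:

```shell
# Hypothetical input containing literal brackets.
input='a[b]c'

# Brackets are part of the set here, so they survive the complement.
with_brackets=$(printf '%s' "$input" | tr -cs '[A-Za-z0-9.]' '_')

# Without brackets in the set, '[' and ']' are replaced.
without_brackets=$(printf '%s' "$input" | tr -cs 'A-Za-z0-9.' '_')

# The [:alnum:] class plus a literal dot, as an alternative spelling.
with_class=$(printf '%s' "$input" | tr -cs '[:alnum:].' '_')

printf '%s\n' "$with_brackets"       # prints: a[b]c
printf '%s\n' "$without_brackets"    # prints: a_b_c
printf '%s\n' "$with_class"          # prints: a_b_c
```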

 lbrtchx

On 12/11/23, tomas@tuxteam.de <tomas@tuxteam.de> wrote:
> On Mon, Dec 11, 2023 at 08:04:06AM +0000, Albretch Mueller wrote:
>> On 12/11/23, Greg Wooledge <greg@wooledge.org> wrote:
>> > Please tell us ...
>>
>>  OK, here is what I did as a t-table
>
> [...]
>
> Your style is confusing, to say the least. Why not play with minimal
> examples and work your way up from that?
>
>> the two strings are not the same length even though you are just
>> replacing ASCII characters, why did:
>> echo "${ftype}" | tr --complement --squeeze-repeats '[A-Za-z0-9.]' '_'
>> place a character at the end?
>
> A few things stick out:
>
>  1. with --squeeze-repeats you are challenging tr to output fewer
>    characters than the input has:
>
>    trotzki:~$ echo -n "this is a #   string ###" | tr -cs 'a-z' '_'
>    => this_is_a_string_
>
>    (I allowed myself to simplify things a bit) See? tr is squeezing
>    repeats (repeated matches), the space-plus-three-hashes at the
>    end gets squeezed to just one _, thus changing the length.
>    If your strings contain more than one consecutive non-alphanumeric
>    (something I don't feel like even trying a guess at), this is bound
>    to happen. You ordered it.
>
>  2. This is tr, not regexp, so '[A-Za-z0-9.]' isn't doing what you
>    think it does. It will match '[', 'A' to 'Z', 'a' to 'z', '0' to
>    '9', '.' and ']'. I guess you want to say 'A-Za-z0-9.'
>
>  3. As a convenience, tr has char classes. Perhaps [:alnum:] is for
>    you. No idea whether this is a GNU extension
>
>  4. In case of doubt, read the man page :)
>
> Cheers
> --
> t
>

