
Re: delimiters with more than one character? ...



On 2020-08-05 22:06, Greg Wooledge wrote:
On Wed, Aug 05, 2020 at 03:04:32PM +0900, John Crawley wrote:
This method cuts off the first part of the string, up to the delimiter, and
adds it to the array, then continues with what's left of the string until
there's none left:

Yes, that's a valid approach, although not one that most bash scripters
would choose, because it's hideously slow in bash, and doesn't "feel"
like it's in the "spirit" of shell scripting.

I likely don't have a sufficient grasp of that "spirit" to comment there :)
but as for "hideously slow", I remain to be convinced. Bash string modifications seem to be pretty fast. Would very long strings, requiring many runs of the loop, slow it down excessively? (See below for the string multiplied by 1024.) With the OP's provided string, on my system:

time bash ./multi-delim.sh
declare -a arr=([0]=" 34 + 45 " [1]=" abc " [2]=" 1 2 3 " [3]=" c" [4]="123abc ")

real	0m0.004s
user	0m0.000s
sys	0m0.000s

Exact time varies from 1ms to 5ms, but it seems faster than python, perl or a bash script calling sed or awk would be.

Your output doesn't match your input.  You've got an extra backslash
character in the [1] element.  Perhaps you tested with several different
inputs, and accidentally pasted the wrong output.

Thanks - that was indeed the case. Sorry.
(Tested with some other characters like * # ! and everything seems to be treated as a plain string, needing no escaping.)

I've used this myself, so am eager to hear of any hidden snags. :)

(One already: if the delimiter is a repeated character which might also be
the last in the last string fragment, then the loop never closes. Fairly
rare?)

OK... yeah, that would be a show-stopper, all right.  I was able
to reproduce the infinite loop...
(snip)
A different approach is needed.  The one that immediately springs to
mind for me is:

s='a || b || c ||| d |'
del='||'
arr=()
while [[ $s = *"$del"* ]]; do arr+=( "${s%%"$del"*}" ); s=${s#*"$del"}; done
arr+=( "$s" )

This is very similar to the flawed approach, but instead of appending
an extra delimiter to the input and looping until the input is empty,
we *check* for the existence of a delimiter in the input, and loop until
none is found.  Whatever is left over becomes the final array element.
This will be a little bit slower than the flawed code (probably), but I
believe it is free from infinite loops.
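
For reuse, the loop above could be wrapped in a function along these lines. This is only a sketch: the name split_on and the use of a global array arr are assumptions made for illustration, and the empty-delimiter guard is an addition that was not part of the code posted in the thread (an empty delimiter matches everywhere without shortening the string, so the loop would never end).

```bash
#!/usr/bin/env bash
# split_on: hypothetical helper name, not from the thread.
# Splits string $1 on the (possibly multi-character) delimiter $2
# and stores the pieces in the global array "arr".
split_on() {
    local s=$1 del=$2
    arr=()
    # Added guard (assumption): an empty delimiter always matches and
    # never shrinks the string, so bail out with the input unsplit.
    [[ -n $del ]] || { arr=( "$s" ); return 1; }
    while [[ $s = *"$del"* ]]; do
        arr+=( "${s%%"$del"*}" )   # text before the first delimiter
        s=${s#*"$del"}             # drop that text and the delimiter
    done
    arr+=( "$s" )                  # whatever is left is the final element
}

split_on 'a || b || c ||| d |' '||'
declare -p arr
```

Because the delimiter is quoted inside the expansions, characters such as * or ? in it are matched literally rather than as glob patterns.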

Nice! The run time is very similar.
Adding this:

for i in {1..10}
do
    _S+=$del$_S
done

to multiply the string by 1024 gave times of ~3.9s and ~4.0s.

I can't see any way the loop would fail to end. This will be my boilerplate if I ever need to do such a thing again...

Since we're not modifying the input by appending an extra delimiter,
any ambiguity in the input string was put there by the original input
source, not created by our code.  In my sample input, the ||| substring
between c and d can be parsed as either "delimiter followed by pipe",
or "pipe followed by delimiter".  It is impossible to tell which one
is correct, because we have no external knowledge of the input string.
Therefore, either result must be acceptable.  If the humans who
provided this input don't care for the results they get, well, it's
their problem and not ours.
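
To make the ambiguity concrete, here is a small sketch that splits the same sample input in both directions. The left-to-right loop is the one from the thread; the right-to-left variant and the array names left and right are assumptions added purely for illustration. The first treats ||| as "delimiter followed by pipe", the second as "pipe followed by delimiter".

```bash
#!/usr/bin/env bash
s='a || b || c ||| d |'
del='||'

# Left-to-right, as in the loop from the thread:
# the first "||" found in "|||" is taken as the delimiter.
left=()
t=$s
while [[ $t = *"$del"* ]]; do
    left+=( "${t%%"$del"*}" )
    t=${t#*"$del"}
done
left+=( "$t" )

# Right-to-left (illustrative variant, not from the thread):
# peel off the last field each time, so the last "||" in "|||" wins.
right=()
t=$s
while [[ $t = *"$del"* ]]; do
    right=( "${t##*"$del"}" "${right[@]}" )   # text after the last delimiter
    t=${t%"$del"*}                            # drop that text and the delimiter
done
right=( "$t" "${right[@]}" )

declare -p left right
```

Both loops produce four elements; they differ only in whether the stray pipe ends up at the start of the last element or at the end of the third.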

I guess there's little need to try to cope with delimiters that would be ambiguous in practice...

--
John

