
Re: sort (-g) [offtopic]



On Sun 18 Feb 2018 at 16:55:28 (+0100), Ionel Mugurel Ciobîcă wrote:
> 
> Anyone care to explain what exactly the -g option of sort means? The
> fine manual only says "general numerical", but I doubt that is true,
> because -g (and all other options I have tried, -n, -M, -h, -V) will
> all put Roman numeral 9 in between 4 and 5. See here:
> 
> # echo "III\nII\nI\nV\nIV\nVII\nVI\nVIII\nX\nIX" | sort -g | nl
> 
> What I expect is to put 9 in between 8 and 10.
> 
> As I wrote above, I have tried -n as well. I tried -M because in
> Romanian often the months are written with Roman numerals (I to XII),
> but that also failed. -h and -V were not useful here either.
> 
> How do I sort in a pipe those roman numerals? I have written two bash
> scripts roman_to_arab.sh and arab_to_roman.sh, but I do not know how
> to adapt it to use it in pipes. Also, it may be too cumbersome to make
> the conversion to arab digits, sort with -n and then convert back into
> roman numerals...

Any script that reads stdin and writes stdout can be used in a pipe.
That's one of the guiding principles of unix.

Many commands take input from stdin, either by specifying no input
file or by using - as the filename. The same goes for output. Some use
a mixture, e.g. diff:
cat file1 file2 | diff - file3 | less
compares file1+file2 with file3 and pipes to less.
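
To make that concrete for a conversion script, here is a minimal
sketch of turning a per-value routine into a filter. I haven't seen
roman_to_arab.sh, so roman_to_arabic below is only a hypothetical
stand-in that knows I to XII; the read loop around it is the part that
matters.

```shell
#!/bin/sh
# Hypothetical stand-in for a per-value conversion routine (I to XII only).
roman_to_arabic() {
    case $1 in
        I) echo 1 ;;   II) echo 2 ;;   III) echo 3 ;;   IV) echo 4 ;;
        V) echo 5 ;;   VI) echo 6 ;;   VII) echo 7 ;;   VIII) echo 8 ;;
        IX) echo 9 ;;  X) echo 10 ;;   XI) echo 11 ;;   XII) echo 12 ;;
        *) echo 0 ;;   # anything unrecognised becomes 0
    esac
}

# Wrapping the routine in a read loop makes it a filter:
# one line in from stdin, one line out to stdout.
roman_filter() {
    while IFS= read -r line; do
        roman_to_arabic "$line"
    done
}

printf '%s\n' III IX X | roman_filter
```

The last line prints 3, 9 and 10, one per line, and roman_filter can
sit anywhere in a longer pipeline.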

> Has anyone encountered this issue? Any ideas how to sort out this sort
> issue? Of course, the easier will be if, indeed, the sort -g would
> work as expected, e.g. if "_general_ numeric" will not be particular
> to exclude Roman numerals...

After they've done Roman numerals, they can settle down and do
yan tan tethera in all dialects.
https://en.wikipedia.org/wiki/Yan_Tan_Tethera

> At the moment I have to run this sort three times. First time to limit
> it before IX (with grep -v -e IX -e '^X'), second time just grep "IX",
> and third time to exclude all that starts with I and V: grep -v -e
> "^I" -e "^V", and then put all together, like this:
> 
> ( echo "III\nII\nI\nV\nIV\nVII\nVI\nVIII\nXI\nIX\nXII\nX" | sort -g | grep -v -e "IX" -e '^X' ; echo "III\nII\nI\nV\nIV\nVII\nVI\nVIII\nXI\nIX\nXII\nX" | grep -e "IX" ; echo "III\nII\nI\nV\nIV\nVII\nVI\nVIII\nXI\nIX\nXII\nX" | sort -g | grep -v -e "^I" -e "^V") | nl

You shouldn't sort like that. If you've got records to sort which have
an unsortable field like Roman months, then write something in sed,
say, that can do the conversion. Now read your records, say:
field1 field2 XII field3 field4
field1 field2 IV field3 field4
and prefix each record with the numeric representation:
12 field1 field2 XII field3 field4
04 field1 field2 IV field3 field4
Now sort that, then throw away the first field with cut. You should
never have to worry about converting things back!

Basically, that throwaway prefix (it could itself be several fields)
could be a function of any complexity: the order of seats in a
theatre, the value of chess pieces, a lookup table of the order of
precedence of church clergy, whatever turns unsortables into
sortables.
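
A rough sketch of the prefix, sort, cut sequence described above,
assuming the Roman month sits in field 3 and only I to XII occur. I
suggested sed; this version swaps in an awk lookup table because it is
shorter to show. sort_by_roman_field3 is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: sort records on a Roman-numeral month in field 3.
sort_by_roman_field3() {
    awk '
        BEGIN {
            # Lookup table for I..XII only; other values get key 00.
            n = split("I II III IV V VI VII VIII IX X XI XII", r, " ")
            for (i = 1; i <= n; i++) key[r[i]] = i
        }
        # Decorate: prefix each record with a zero-padded numeric key.
        { printf "%02d %s\n", key[$3], $0 }
    ' |
    sort -k1,1n |
    cut -d" " -f2-
    # sort orders on the throwaway prefix, cut then drops it again.
}

printf '%s\n' \
    'field1 field2 XII field3 field4' \
    'field1 field2 IV field3 field4' \
    'field1 field2 IX field3 field4' |
sort_by_roman_field3
```

The records come out in the order IV, IX, XII, and nothing ever has to
be converted back because the original field was never altered.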

> I exclude here larger numerals, because at the moment I do not need
> anything in that range...

No—handling Romanian month names and abbreviations might be more
useful. I once wrote an arabic→roman converter but that was just as
an exercise in returning variable length strings from OS/360 assembler
to Fortran IV.

> Using the Unicode glyphs also doesn't work:
> 
> # echo "Ⅲ\nⅡ\nⅠ\nⅣ\nⅤ\nⅨ\nⅥ\nⅦ\nⅧ\nⅫ\nⅪ\nⅩ" | sort -g | nl

Again, simpler with sed. And don't forget the lower case set just
along the way.
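
For the single-glyph case, a sed sketch along the same lines. It
assumes every line starts with one of the glyphs Ⅰ to Ⅻ or ⅰ to ⅻ
(lines that don't would lose their first field to cut), and
sort_roman_glyphs is a made-up name. Each rule lists the upper and
lower case glyph as plain alternatives rather than in a bracket
expression, so it should not depend on the locale understanding
multibyte characters.

```shell
#!/bin/sh
# Sketch: sort lines that begin with a Unicode Roman numeral glyph.
# Assumes every input line starts with one of the 24 glyphs below.
sort_roman_glyphs() {
    # Decorate: the first matching rule adds a numeric prefix; after that
    # the line starts with a digit, so no later rule can match.
    sed -E \
        -e 's/^(Ⅰ|ⅰ)/01 &/' \
        -e 's/^(Ⅱ|ⅱ)/02 &/' \
        -e 's/^(Ⅲ|ⅲ)/03 &/' \
        -e 's/^(Ⅳ|ⅳ)/04 &/' \
        -e 's/^(Ⅴ|ⅴ)/05 &/' \
        -e 's/^(Ⅵ|ⅵ)/06 &/' \
        -e 's/^(Ⅶ|ⅶ)/07 &/' \
        -e 's/^(Ⅷ|ⅷ)/08 &/' \
        -e 's/^(Ⅸ|ⅸ)/09 &/' \
        -e 's/^(Ⅹ|ⅹ)/10 &/' \
        -e 's/^(Ⅺ|ⅺ)/11 &/' \
        -e 's/^(Ⅻ|ⅻ)/12 &/' |
    sort -k1,1n |
    cut -d' ' -f2-
    # sort orders on the numeric prefix, cut then removes it.
}

printf '%s\n' 'Ⅲ' 'Ⅻ' 'ⅰ' 'Ⅸ' | sort_roman_glyphs
```

That prints ⅰ, Ⅲ, Ⅸ, Ⅻ in that order. Anything after the glyph on
the line is carried along untouched.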

Cheers,
David.

