
Re: Possible bug in 'sort -m'



Matus UHLAR - fantomas wrote:
On 13.03.07 15:34, Bob McGowan wrote:
  sort -n -o from_number from_number

  sort -n -o to_number to_number
[deleted]
  sort -m from_number to_number | uniq | wc -l
  122010

This is still almost 12000 too big (only 17 less than the 'uniq' on the separate files). So, I run this:

  sort -u from_number to_number | wc -l

And I get 110256, the same number as the SQL UNION gave me.

So, if both files are sorted and I then use 'sort -m' followed by 'uniq' and count the results, shouldn't I get the same thing as resorting the two (already sorted) files with sort's '-u' option and counting that output?
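As a small sanity check of that expectation, here is a sketch on made-up sample data (not the real from_number/to_number files; the names a and b are placeholders). When both inputs are sorted with the same options that the merge uses, the two counts should agree:

```shell
# Hypothetical sample data, plain (non-numeric) ordering throughout.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' 25 3 10 10 > "$tmp/a"
printf '%s\n' 10 7 3 100 > "$tmp/b"

# Sort each file in place with the SAME options the merge will use.
sort -o "$tmp/a" "$tmp/a"
sort -o "$tmp/b" "$tmp/b"

# Merge the pre-sorted files, then drop adjacent duplicates.
merged=$(sort -m "$tmp/a" "$tmp/b" | uniq | wc -l | tr -d ' ')
# Re-sort everything from scratch and drop duplicates in one step.
direct=$(sort -u "$tmp/a" "$tmp/b" | wc -l | tr -d ' ')

echo "merge+uniq: $merged  sort -u: $direct"
rm -rf "$tmp"
```

If the two numbers differ on real data, the first thing to suspect is that the files were not sorted under the ordering the merge assumes.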

Either do not sort those files numerically, or _always_ use '-n' with sort.
The latter may work; the former should work.

I did wonder if I needed to use '-n' with the '-m', but that didn't fix anything; in fact, I got a different count: 121995.

Am I missing something obvious, having to do with numbers and merging? Or is this a bug in sort?


Well, I thought I'd taken care of the "'-n' in all cases" question. Since that doesn't appear to be the case, let me repeat the sorting, being sure to use '-n' in all cases.

Here, I numerically sort and merge the two source files into a destination and count lines in all three. Then I get unique lines from the merged sort and count the result.
$ sort -n -m from_number to_number > xxx
$ wc -l from_number to_number xxx
  84919 from_number
  84919 to_number
 169838 xxx
 339676 total
$ uniq xxx|wc -l
121995
The merged file has the expected number of lines, twice that of each source file. All duplicated lines should be together, whether from the original individual files, or due to the merge. A 'uniq' should then report only the lines that are globally unique. It reports there are 121995 lines.
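The "duplicates should be together" assumption only holds if each input is already in the order that the merge expects, because 'sort -m' merges without re-sorting. One way to test that is 'sort -c' with the same options. The sketch below uses made-up files a and b (placeholders, not the real data), and it is not a claim about what went wrong here:

```shell
# Hypothetical sample data: a is in lexical order, b in numeric order.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' 10 9 > "$tmp/a"
printf '%s\n' 9 10 > "$tmp/b"

# 'sort -c' exits non-zero if the file is not sorted under the given options.
if sort -c -n "$tmp/a" 2>/dev/null; then a_ok=yes; else a_ok=no; fi
if sort -c -n "$tmp/b" 2>/dev/null; then b_ok=yes; else b_ok=no; fi
echo "a numerically sorted: $a_ok, b numerically sorted: $b_ok"

# Merging an unsorted input can leave equal lines apart, so 'uniq' keeps extras.
merged=$(sort -n -m "$tmp/a" "$tmp/b" | uniq | wc -l | tr -d ' ')
# A full sort does not depend on the input order.
direct=$(sort -n -u "$tmp/a" "$tmp/b" | wc -l | tr -d ' ')
echo "merge+uniq: $merged  full sort -u: $direct"
rm -rf "$tmp"
```

Running 'sort -c -n' on each input before the merge would show whether the precondition for '-m' is met.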

This time, I apply the unique operation to the sort itself.
$ sort -n -m -u from_number to_number > yyy
$ wc -l yyy
121995 yyy
And the result matches the result above, 121995.

This time, I apply the unique option when sorting the two individual files, then use sort with the numeric option to merge them.
$ sort -n -u from_number > aa
$ sort -n -u to_number > bb
$ sort -n -m aa bb > cc
$ wc -l cc
122027 cc
The number here is larger than the above number, because there are duplicated values between files. That is, each file contains only unique lines, but the combination does not.
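Taking the counts reported in this message at face value, 122027 - 110256 = 11771 values would appear in both files. A sketch of how that overlap could be counted directly, using made-up stand-ins for aa and bb (each already free of internal duplicates):

```shell
# Hypothetical sample data standing in for aa and bb.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 5 > "$tmp/aa"
printf '%s\n' 2 3 4   > "$tmp/bb"

# Total lines across both files.
total=$(cat "$tmp/aa" "$tmp/bb" | wc -l | tr -d ' ')
# 'uniq -d' prints one copy of each line that occurs more than once,
# which here means "present in both files".
shared=$(sort "$tmp/aa" "$tmp/bb" | uniq -d | wc -l | tr -d ' ')
# Distinct lines overall.
distinct=$(sort -u "$tmp/aa" "$tmp/bb" | wc -l | tr -d ' ')

echo "total=$total shared=$shared distinct=$distinct"
rm -rf "$tmp"
```

With inputs like these, total minus shared equals distinct, which is the same relationship as between the 122027 and 110256 counts.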

So if you run 'uniq' on this result, you get:
$ uniq cc | wc -l
110256

And, if you combine the merge, unique, numeric sort in one fell swoop, you get:
$ sort -n -m -u aa bb | wc -l
110256

These last two cases match the result from my SQL, but not the result from the two initial examples. But it seems to me that they should. After all, the sorts should be putting identical values together, the unique should remove all but one line (whether done with the sort or after it), and the results should be the same.
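Whether identical values really ended up together in a merged file can be checked by comparing 'uniq' (adjacent duplicates only) against 'sort -u' (all duplicates). The sketch below runs on a made-up file named merged, a placeholder for something like xxx:

```shell
# Hypothetical merged output in which equal lines are NOT adjacent.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' 9 10 9 10 > "$tmp/merged"

# 'uniq' only collapses neighbouring equal lines.
adjacent=$(uniq "$tmp/merged" | wc -l | tr -d ' ')
# 'sort -u' collapses every repeated line, wherever it sits.
global=$(sort -u "$tmp/merged" | wc -l | tr -d ' ')

if [ "$adjacent" -ne "$global" ]; then
  status="duplicates are not all adjacent"
else
  status="duplicates are adjacent"
fi
echo "uniq: $adjacent  sort -u: $global  ($status)"
rm -rf "$tmp"
```

A mismatch between the two counts means the file is not in an order that 'uniq' can rely on.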

But, they aren't.
