Re: Possible bug in 'sort -m'

To: debian-user@lists.debian.org
Subject: Re: Possible bug in 'sort -m'
From: Bob McGowan <bob_mcgowan@symantec.com>
Date: Wed, 14 Mar 2007 11:37:11 -0700
Message-id: <45F840D7.6020103@symantec.com>
In-reply-to: <20070314105737.GG18027@fantomas.sk>
References: <45F726F1.9020503@symantec.com> <20070314105737.GG18027@fantomas.sk>

Matus UHLAR - fantomas wrote:

On 13.03.07 15:34, Bob McGowan wrote:
  sort -n -o from_number from_number

  sort -n -o to_number to_number
[deleted]
  sort -m from_number to_number | uniq | wc -l
  122010
This is still almost 12000 too big (only 17 less than the 'uniq' on the separate files). So, I run this:
  sort -u from_number to_number | wc -l

And I get 110256, the same number as the SQL UNION gave me.
So, if both files are sorted and I then use 'sort -m' followed by 'uniq' and count the results, shouldn't I get the same thing as re-sorting the two (already sorted) files with sort's '-u' option and counting that output?
Either do not sort those files numerically, or _always_ use '-n' with sort.
The latter may work, the first should work.
I did wonder if I needed to use '-n' with the '-m', but that didn't fix anything; in fact, I got a different count: 121995.
Am I missing something obvious, having to do with numbers and merging? Or is this a bug in sort?

Well, I thought I'd taken care of the "'-n' in all cases" question. Since that doesn't appear to be the case, let me repeat the sorting, being sure to use '-n' in all cases.

Here, I numerically sort and merge the two source files into a destination and count lines in all three. Then I get unique lines from the merged sort and count the result.

$ sort -n -m from_number to_number > xxx
$ wc -l from_number to_number xxx
  84919 from_number
  84919 to_number
 169838 xxx
 339676 total
$ uniq xxx|wc -l
121995

The merged file has, as expected, twice as many lines as each source file. All duplicated lines should now be adjacent, whether they come from within one of the original files or arise from the merge. A 'uniq' should then report only the lines that are globally unique. It reports that there are 121995 lines.
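
Because 'sort -m' only interleaves its inputs and relies on each one already being in the requested order, one sanity check is to ask sort itself whether a file passes 'sort -n -c'. The following is a minimal sketch on throwaway data; the file names and values are made up for illustration and are not my real files.

```shell
# Throwaway demo files; names and contents are hypothetical.
dir=$(mktemp -d)
printf '%s\n' 2 10 33 > "$dir/sorted"      # numeric order
printf '%s\n' 10 2 33 > "$dir/unsorted"    # lexical order, not numeric

# 'sort -c' exits 0 when the file is already in the requested order and
# non-zero otherwise (the diagnostic goes to stderr, silenced here).
if sort -n -c "$dir/sorted" 2>/dev/null; then sorted_ok=yes; else sorted_ok=no; fi
if sort -n -c "$dir/unsorted" 2>/dev/null; then unsorted_ok=yes; else unsorted_ok=no; fi

echo "sorted: $sorted_ok, unsorted: $unsorted_ok"
rm -r "$dir"
```

If either real input failed this check, the merge output could not be trusted to keep duplicates adjacent, which is the property 'uniq' depends on.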


This time, I apply the unique operation to the sort itself.
$ sort -n -m -u from_number to_number > yyy
$ wc -l yyy
121995 yyy
And the result matches the result above, 121995.

This time, I apply '-u' when sorting the two individual files, then use sort with the numeric option to merge them.

$ sort -n -u from_number > aa
$ sort -n -u to_number > bb
$ sort -n -m aa bb > cc
$ wc -l cc
122027 cc

The number here is larger than the number above because there are values duplicated between the files. That is, each file is unique within itself, but the combination is not.
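
The arithmetic behind that can be checked on a toy case: when each input is unique on its own, the merged line count minus the globally unique count should equal the number of values the two files share (for the real data, 122027 - 110256 = 11771). A small sketch, with made-up file contents standing in for aa and bb:

```shell
# Hypothetical stand-ins for aa and bb: each is sorted and unique on its own.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 5 > "$dir/aa"
printf '%s\n' 2 3 4   > "$dir/bb"

# $(( ... )) strips any padding that some 'wc' implementations add.
merged=$(( $(sort -n -m "$dir/aa" "$dir/bb" | wc -l) ))
unique=$(( $(sort -n -m "$dir/aa" "$dir/bb" | uniq | wc -l) ))
# 'uniq -d' prints one copy of each line that occurs more than once.
shared=$(( $(sort -n -m "$dir/aa" "$dir/bb" | uniq -d | wc -l) ))

echo "merged=$merged unique=$unique shared=$shared"
rm -r "$dir"
```

Here the values 2 and 3 appear in both files, so the merged count exceeds the unique count by exactly two.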


So, run 'uniq' on this result and you get:
$ uniq cc | wc -l
110256

And, if you combine the merge, unique, numeric sort in one fell swoop, you get:

$ sort -n -m -u aa bb | wc -l
110256

These last two cases match the result from my SQL, but not the result from the two initial examples. But it seems to me that they should. After all, the sorts should be putting identical values together, the unique should remove all but one line (whether done with the sort or after it), and the results should be the same.
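
To make that expectation concrete, here is a minimal sketch on toy data where the four approaches do agree. The file names and contents are invented for the example; it only shows what I expect to see, not why my real files behave differently.

```shell
# Hypothetical inputs, already in numeric order, with duplicates both
# inside each file and across the two files.
dir=$(mktemp -d)
printf '%s\n' 1 1 2 3 3 10 > "$dir/from"
printf '%s\n' 2 3 4 10 10  > "$dir/to"

# 1) merge, then uniq
a=$(( $(sort -n -m "$dir/from" "$dir/to" | uniq | wc -l) ))
# 2) merge with -u
b=$(( $(sort -n -m -u "$dir/from" "$dir/to" | wc -l) ))
# 3) full re-sort with -u
c=$(( $(sort -n -u "$dir/from" "$dir/to" | wc -l) ))
# 4) unique each file first, then merge with -u
sort -n -u "$dir/from" > "$dir/aa"
sort -n -u "$dir/to"   > "$dir/bb"
d=$(( $(sort -n -m -u "$dir/aa" "$dir/bb" | wc -l) ))

echo "$a $b $c $d"
rm -r "$dir"
```

All four counts come out as 5 here (the distinct values 1, 2, 3, 4 and 10), which is the behaviour I assumed I would also get from the full-size files.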


But, they aren't.

