[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: random lines



On Mon, Jul 02, 2001 at 11:24:40PM +0200, Martin F. Krafft wrote:
> also sprach Joost Kooij (on Mon, 02 Jul 2001 11:15:38PM +0200):
> > perl -e 'print map {$_->[1]} sort {$a->[0]<=>$b->[0]} map {[rand,$_]} <>' file
> 
> that's perfect!
> 
> the other perl algorithm permitted did a scan through the file,
> halting on a given line according to a random number. i may well be
> wrong here, but doesn't that absolutely favor the entries up to in the
> file since they are considered sooooo many more times?

Actually, if you mean the one I posted from the perlfaq, you have made
several errors.

That algorithm did not halt on a random line.  It reads the whole file,
one line at a time, using almost no memory.  The perl line above reads
the whole file into memory, which is far less efficient.  The perlfaq
algorithm also has far fewer operations per line, so it will be faster
on that count, too.

I won't present a proof for the algorithm from the perlfaq, but one is
available.  What it does is make the currently-read line the "randomly
chosen" line based on a probability that changes as you go through the
file:

- The first line is guaranteed to be chosen in a single-line file.
- The second line is 50% likely to be chosen in a two-line file.
- The third line is 1/3 likely to be chosen in a three-line file.
  ...
- The eightieth line is 1/80 likely to be chosen in an eighty-line file.

Consider, the three line file case:

Line 1 had a 1/1 chance of a 1/2 chance of a 2/3 chance.  # == 1/3
Line 2 had a 1/2 chance of a 2/3 chance.                  # == 1/3
Line 3 had a 1/3 chance.                                  # == 1/3

So no matter when the file ends and the algorithm is complete, the
currently selected line is random, and was chosen with the proper
probability given the number of lines in the file.

That algorithm is one of the purest distillations of beauty I've ever
seen.  :-)  Remember, the perlfaq is far more peer-reviewed than almost
any other source of info.

Steve
--
Steven Smolinski => http://arbiter.ca/



Reply to: