[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: random lines



On Mon, Jul 02, 2001 at 01:46:29PM -0400, Steven Smolinski wrote:
> On Mon, Jul 02, 2001 at 11:24:40PM +0200, Martin F. Krafft wrote:
> > also sprach Joost Kooij (on Mon, 02 Jul 2001 11:15:38PM +0200):
> > > perl -e 'print map {$_->[1]} sort {$a->[0]<=>$b->[0]} map {[rand,$_]} <>' file
> > 
> > the other perl algorithm permitted did a scan through the file,
> > halting on a given line according to a random number. i may well be
> > wrong here, but doesn't that absolutely favor the entries up to in the
> > file since they are considered sooooo many more times?
> 
> Actually, if you mean the one I posted from the perlfaq, you have made
> several errors.
[perlfaq5 code:]

  rand($.) < 1 && ($line = $_) while <>

> That algorithm did not halt on a random line.

It is also a finest example of "jive programming" in perl slang.  :-)

>                                                It reads the whole file,
> one line at a time, using almost no memory.  The perl line above reads
> the whole file into memory, which is far less efficient.  The perlfaq
> algorithm also has far fewer operations per line, so it will be faster
> on that count, too.

But then they are entirely different operations.  One permutates all
lines of the file randomly, the other chooses one line randomly.

However, the fisher_yates_shuffle from perlfaq4 is more efficient at O(n),
mine has an extra factor log(n) due to the sort.  But then it has the
cool schwartzian transform and is also a nice example of using perl as
a sort of lisp, but with all the brackets replaced with "tty noise".

> I won't present a proof for the algorithm from the perlfaq, but one is
> available.  What it does is make the currently-read line the "randomly
> chosen" line based on a probability that changes as you go through the
> file:
> 
> - The first line is guaranteed to be chosen in a single-line file.
> - The second line is 50% likely to be chosen in a two-line file.
> - The third line is 1/3 likely to be chosen in a three-line file.
>   ...
> - The eightieth line is 1/80 likely to be chosen in an eighty-line file.
> 
> Consider, the three line file case:
> 
> Line 1 had a 1/1 chance of a 1/2 chance of a 2/3 chance.  # == 1/3
> Line 2 had a 1/2 chance of a 2/3 chance.                  # == 1/3
> Line 3 had a 1/3 chance.                                  # == 1/3
> 
> So no matter when the file ends and the algorithm is complete, the
> currently selected line is random, and was chosen with the proper
> probability given the number of lines in the file.

You almost gave the induction proof already.

Lets rewrite the program a little, into some easily digestible quiche.

 while ( $input_line = "the next line from standard input" ) {
   if ( "a random number in [0, $current_line_number)" < 1 ) {
     $remember_line = $input_line
   }
 }

statement:
 p(n) :: all n lines have an equal chance to be picked by the algorithm.

for n=0: not really interesting.  direct proof of p(0) is evident.
 
induction proof for n>0:
 p(1):
   there is only one $input_line, and any number in [0, 1) is smaller
   than 1, so "the line" is picked.
   => "all 1 lines" have equal chance 1 to be picked.

 p(n) => p(n+1)  (n>0):
   suppose n lines have been read so far.  assuming that p(n) is true,
   all n lines have equal chance 1/n to have been last picked.  If a
   new line n+1 is read, then this line is picked if "a random number in
   [0, n+1)" is smaller than 1.  That is a 1/(n+1) chance.  The chance
   for other lines to still be the last pick, is:
     "old chance" * "chance that line n+1 was not picked" 
         1/n      *     (1 - 1/(n+1))
         1/n      *        n/(n+1) 
	       1/(n+1)
   => all n+1 lines have equal chance 1/(n+1) to be picked.
 qed.

> That algorithm is one of the purest distillations of beauty I've ever
> seen.  :-)  Remember, the perlfaq is far more peer-reviewed than almost
> any other source of info.

Agreed.

Cheers,


Joost



Reply to: