Re: random lines
On Mon, Jul 02, 2001 at 01:46:29PM -0400, Steven Smolinski wrote:
> On Mon, Jul 02, 2001 at 11:24:40PM +0200, Martin F. Krafft wrote:
> > also sprach Joost Kooij (on Mon, 02 Jul 2001 11:15:38PM +0200):
> > > perl -e 'print map {$_->[1]} sort {$a->[0]<=>$b->[0]} map {[rand,$_]} <>' file
> >
> > the other perl algorithm permitted did a scan through the file,
> > halting on a given line according to a random number. i may well be
> > wrong here, but doesn't that absolutely favor the entries up to in the
> > file since they are considered sooooo many more times?
>
> Actually, if you mean the one I posted from the perlfaq, you have made
> several errors.
[perlfaq5 code:]
rand($.) < 1 && ($line = $_) while <>
> That algorithm did not halt on a random line.
It is also a finest example of "jive programming" in perl slang. :-)
> It reads the whole file,
> one line at a time, using almost no memory. The perl line above reads
> the whole file into memory, which is far less efficient. The perlfaq
> algorithm also has far fewer operations per line, so it will be faster
> on that count, too.
But then they are entirely different operations. One permutates all
lines of the file randomly, the other chooses one line randomly.
However, the fisher_yates_shuffle from perlfaq4 is more efficient at O(n),
mine has an extra factor log(n) due to the sort. But then it has the
cool schwartzian transform and is also a nice example of using perl as
a sort of lisp, but with all the brackets replaced with "tty noise".
> I won't present a proof for the algorithm from the perlfaq, but one is
> available. What it does is make the currently-read line the "randomly
> chosen" line based on a probability that changes as you go through the
> file:
>
> - The first line is guaranteed to be chosen in a single-line file.
> - The second line is 50% likely to be chosen in a two-line file.
> - The third line is 1/3 likely to be chosen in a three-line file.
> ...
> - The eightieth line is 1/80 likely to be chosen in an eighty-line file.
>
> Consider, the three line file case:
>
> Line 1 had a 1/1 chance of a 1/2 chance of a 2/3 chance. # == 1/3
> Line 2 had a 1/2 chance of a 2/3 chance. # == 1/3
> Line 3 had a 1/3 chance. # == 1/3
>
> So no matter when the file ends and the algorithm is complete, the
> currently selected line is random, and was chosen with the proper
> probability given the number of lines in the file.
You almost gave the induction proof already.
Lets rewrite the program a little, into some easily digestible quiche.
while ( $input_line = "the next line from standard input" ) {
if ( "a random number in [0, $current_line_number)" < 1 ) {
$remember_line = $input_line
}
}
statement:
p(n) :: all n lines have an equal chance to be picked by the algorithm.
for n=0: not really interesting. direct proof of p(0) is evident.
induction proof for n>0:
p(1):
there is only one $input_line, and any number in [0, 1) is smaller
than 1, so "the line" is picked.
=> "all 1 lines" have equal chance 1 to be picked.
p(n) => p(n+1) (n>0):
suppose n lines have been read so far. assuming that p(n) is true,
all n lines have equal chance 1/n to have been last picked. If a
new line n+1 is read, then this line is picked if "a random number in
[0, n+1)" is smaller than 1. That is a 1/(n+1) chance. The chance
for other lines to still be the last pick, is:
"old chance" * "chance that line n+1 was not picked"
1/n * (1 - 1/(n+1))
1/n * n/(n+1)
1/(n+1)
=> all n+1 lines have equal chance 1/(n+1) to be picked.
qed.
> That algorithm is one of the purest distillations of beauty I've ever
> seen. :-) Remember, the perlfaq is far more peer-reviewed than almost
> any other source of info.
Agreed.
Cheers,
Joost
Reply to: