Re: select N random lines in a file
On Sun, Aug 22, 2004 at 02:35:14PM -0500, Lance Hoffmeyer wrote:
> I would like to write a script that will select N number of
> random lines in a file. Any suggestions on how to do this?
This program has the advantage that it doesn't read the whole file into
memory, which is important if the file is large. Save it as an executable
file, randomlines, and then type "randomlines N file" to get N random
lines from file (without repetition). The lines will be in the same
order they were in the file. Type "randomlines -r N file" to get N random
lines in random order.
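#! /usr/bin/perl -s
# The -s switch turns a leading "-r" argument into the variable $r.
$N = shift;    # first arg is N
srand;
while (<>) {
    # keep line number $. with probability N/$. (always true for the first N lines)
    if (rand($.) < $N) {
        if (@lines == $N) {
            # already holding N lines: drop one random element
            splice @lines, int rand $N, 1;
        }
        if ($r) {
            # -r: insert at a random position, so the output order is random
            splice @lines, int rand @lines+1, 0, $_;
        }
        else {
            # default: append, so the output keeps the file's order
            push @lines, $_;
        }
    }
}
print $_ for @lines;
__END__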
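A quick empirical check of the sampling step, as a rough sketch rather than part of the script: the subroutine name sample_lines and the trial counts are made up for illustration, and it samples from a list instead of from <>. If the method is uniform, each of the 10 items should be picked in roughly 3 out of 10 trials.

```perl
use strict;
use warnings;

# Hypothetical helper: the same keep/drop logic as the script above,
# applied to a list, keeping the original order (no -r handling).
sub sample_lines {
    my ($n, @input) = @_;
    my @picked;
    my $seen = 0;                       # plays the role of $.
    for my $item (@input) {
        $seen++;
        next unless rand($seen) < $n;   # keep with probability n/seen
        # already holding n items: drop one at random to make room
        splice @picked, int rand($n), 1 if @picked == $n;
        push @picked, $item;
    }
    return @picked;
}

# Sample 3 of the numbers 1..10 many times and print how often each was picked.
my %count;
my $trials = 20_000;
for (1 .. $trials) {
    $count{$_}++ for sample_lines(3, 1 .. 10);
}
printf "%2d: %.3f\n", $_, ($count{$_} // 0) / $trials for 1 .. 10;
```

The printed frequencies vary from run to run; they should all land near 0.3.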
The algorithm can be proved correct by induction on the number of lines
in the file (see also the Knuth reference below).
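One way to sketch the induction step is shown here. This is my own reading of the argument, not a quotation from Knuth. The notation p_k(i), the probability that line i is held after k lines have been read, is introduced only for this sketch. It assumes the file has at least N lines and covers only the per-line probability, not the joint distribution of the whole sample.

```latex
% Claim: for every k >= N and every line i <= k,  p_k(i) = N/k.
\begin{align*}
\text{Base } (k = N):\quad
  & p_N(i) = 1 = \frac{N}{N}
  && \text{(the first $N$ lines are always kept)} \\
\text{New line } (i = k+1):\quad
  & p_{k+1}(k+1) = \frac{N}{k+1}
  && \text{(accepted with probability $N/(k+1)$)} \\
\text{Old line } (i \le k):\quad
  & p_{k+1}(i) = p_k(i)\left(1 - \frac{N}{k+1}\cdot\frac{1}{N}\right)
    = \frac{N}{k}\cdot\frac{k}{k+1} = \frac{N}{k+1}
  && \text{(evicted only if the new line is accepted and picks it)}
\end{align*}
```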
The script is based on a program in the Perl documentation that returns one
random line from a file, which I found by typing "perldoc -q 'random line'":
    How do I select a random line from a file?

    Here's an algorithm from the Camel Book:

        srand;
        rand($.) < 1 && ($line = $_) while <>;

    This has a significant advantage in space over reading the whole file
    in. You can find a proof of this method in The Art of Computer
    Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

    You can use the File::Random module which provides a function for that
    algorithm:

        use File::Random qw/random_line/;
        my $line = random_line($filename);

    Another way is to use the Tie::File module, which treats the entire
    file as an array. Simply access a random array element.
(END)
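The perldoc text gives no code for the Tie::File approach, so here is a minimal sketch of what it might look like. The subroutine name random_line_tied is hypothetical, not something Tie::File provides. Opening read-only through the mode option is my choice, so that a mistyped filename is not created as an empty file.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper (not part of Tie::File): return one random line of
# $filename, or undef if the file is empty.
sub random_line_tied {
    my ($filename) = @_;
    tie my @lines, 'Tie::File', $filename, mode => O_RDONLY
        or die "Cannot tie $filename: $!";
    # rand @lines is in [0, number of lines); the array index truncates it.
    my $line = @lines ? $lines[rand @lines] : undef;
    untie @lines;
    return $line;    # Tie::File strips the line ending by default
}
```

Counting the lines still means scanning the file once, but Tie::File does not hold the whole file's contents in memory at once, so this should stay reasonably cheap on large files.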
Winston Smith, x@y where x=winstonsmith, y=ispwest.com