Re: select N random lines in a file

To: debian-user@lists.debian.org
Cc: Lance Hoffmeyer <lance@augustmail.com>
Subject: Re: select N random lines in a file
From: Winston Smith <ga41h@yahoo.com>
Date: Mon, 23 Aug 2004 11:20:09 -0400
Message-id: <20040823152009.GA18475@localhost.localdomain>
In-reply-to: <4128F572.4040201@augustmail.com>
References: <4128F572.4040201@augustmail.com>

On Sun, Aug 22, 2004 at 02:35:14PM -0500, Lance Hoffmeyer wrote:
> I would like to write a script that will select N number of
> random lines in a file.  Any suggestions on how to do this?

The program below never holds more than N lines in memory, which is
important if the file is large. Save it as an executable file named
randomlines, then type "randomlines N file" to get N random lines from
file (without repetition). The lines come out in the same order they had
in the file. Type "randomlines -r N file" to get N random lines in random
order.

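#! /usr/bin/perl -s
# Usage: randomlines [-r] N file

use strict;
use warnings;

our $r;          # set by "perl -s" when -r is given on the command line
my $N = shift;   # first argument is N
my @lines;       # the current sample, never more than N lines

srand;
while (<>) {
    # Accept the current line with probability min(1, N/$.),
    # where $. is its line number.
    if (rand($.) < $N) {
        if (@lines == $N) {
            # sample is full: drop one random element
            splice @lines, int(rand($N)), 1;
        }
        if ($r) {
            # -r: insert at a random position, so the order is random
            splice @lines, int(rand(@lines + 1)), 0, $_;
        }
        else {
            # default: append, which preserves the file order
            push @lines, $_;
        }
    }
}

print $_ for @lines;

__END__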

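The proof that the algorithm is correct is by induction on the number of lines
read (see also the Knuth reference below). Once k >= N lines have been read,
each of them is in the sample with probability N/k. Line k+1 is then accepted
with probability N/(k+1), and an earlier line that was in the sample survives
with probability 1 - (N/(k+1))(1/N) = k/(k+1), so it remains in the sample
with probability (N/k)(k/(k+1)) = N/(k+1).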
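
The induction can also be checked empirically. The sketch below is not part
of the script above: it wraps the same selection loop in a made-up sub,
sample_lines, that takes N and an array reference instead of reading a file,
and a second made-up sub, inclusion_frequencies, that counts how often each
of 10 items lands in a sample of 3. If the argument holds, every frequency
should come out close to 3/10.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): the same selection
# loop, applied to an in-memory list instead of lines read from a file.
sub sample_lines {
    my ($n, $items) = @_;
    my @sample;
    my $seen = 0;                      # plays the role of $. (line number)
    for my $item (@$items) {
        $seen++;
        next unless rand($seen) < $n;  # accept with probability min(1, n/seen)
        # Sample is full: drop one random element to make room.
        splice @sample, int(rand($n)), 1 if @sample == $n;
        push @sample, $item;
    }
    return @sample;
}

# Hypothetical helper: fraction of trials in which each item was selected.
sub inclusion_frequencies {
    my ($n, $size, $trials) = @_;
    my @count = (0) x $size;
    for (1 .. $trials) {
        $count[$_]++ for sample_lines($n, [0 .. $size - 1]);
    }
    return map { $_ / $trials } @count;
}

my @freq = inclusion_frequencies(3, 10, 20_000);
printf "item %d: %.3f\n", $_, $freq[$_] for 0 .. $#freq;
```

The exact numbers vary from run to run, since the check is itself random.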

It is based on a program in the Perl documentation that returns one random
line from a file, which I found by typing "perldoc -q 'random line'":

  How do I select a random line from a file?

        Here's an algorithm from the Camel Book:

            srand;
            rand($.) < 1 && ($line = $_) while <>;

        This has a significant advantage in space over reading the whole file
        in.  You can find a proof of this method in The Art of Computer
        Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

        You can use the File::Random module which provides a function for that
        algorithm:

                use File::Random qw/random_line/;
                my $line = random_line($filename);

        Another way is to use the Tie::File module, which treats the entire
        file as an array.  Simply access a random array element.
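
For comparison, the Tie::File approach mentioned at the end of that entry
might look like the sketch below. The sub name random_line_tied is made up
for illustration, and the demo writes a throwaway file with File::Temp so
that it runs on its own. As far as I know Tie::File does not load the file
into memory, but it still has to scan the whole file to learn how many
lines there are.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical helper: return one random line of $filename via Tie::File,
# or undef if the file is empty.
sub random_line_tied {
    my ($filename) = @_;
    # Open read-only so that a mistyped name does not create a new file.
    tie my @lines, 'Tie::File', $filename, mode => O_RDONLY
        or die "cannot tie $filename: $!";
    # rand @lines is in [0, number of lines); the index is truncated.
    my $line = @lines ? $lines[ rand @lines ] : undef;
    untie @lines;
    return $line;    # Tie::File strips the trailing newline by default
}

# Demo on a small temporary file.
my ($fh, $tmpname) = tempfile(UNLINK => 1);
print {$fh} "line $_\n" for 1 .. 5;
close $fh or die "close failed: $!";

print random_line_tied($tmpname), "\n";
```

This picks a single line. For N lines without repetition, the randomlines
script above is still the simpler route.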

Winston Smith, x@y where x=winstonsmith, y=ispwest.com
