
Re: searching files for patterns



zdrysdal@diagnostic.co.nz wrote:
>is this the best/fastest way to search through 800,000 hl7 files?

Nope :)

>for each file i am grepping for 6 names... thus each file is
>scanned/grepped 6 times over.  Basically i am searching for 1 name in 4 1/2
>million files.  Even though the server is fast, it is still processing on
>average 2 files per second.
>
>here is my script...  any thoughts would be appreciated as we have a tight
>schedule.
>
>
>cd /backup/Loaders/Ld21/HL7FILES
>#for file in `find * -print`
>for file in `ls`
>do
>  while read name
>  do
>    search=`cut -d "|" -f 20 < $file | grep $name`	{extracts name from
>field 20 of hl7 file}
>    if [ "$search" > /dev/null ]
>    then
>      dir=`pwd`
>      echo "$name -> $dir/$file" >> /home/zane/found.list	{adds found names
> and relevant filenames}
>    fi
>  done < /home/zane/scripts/filelist	{file containing the six names to
>search for}
>done

This is certainly going to be slow: it starts a fresh cut and grep for
every name in every file. I'll offer you two solutions, one in shell
script, one in Perl.

Shell script (requires bash-2.00 or above because it uses $() instead
of backquotes; replace $(...) by `...` if you have an older version):

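cd /backup/Loaders/Ld21/HL7FILES
# One egrep alternation, e.g. "John Smith|Joe Brown|..."
pattern=$(cat /home/zane/scripts/filelist)
for file in `find . -type f -print`; do
  # Quote $pattern (the names contain spaces) and $file. sed uses | as
  # its delimiter because $file contains slashes (./name).
  cut -d'|' -f20 "$file" | egrep "$pattern" | sort | uniq \
      | sed "s|\$| -> $file|"
done > /home/zane/found.list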

What I do here is build up a single egrep pattern beforehand which
matches any one of the six names. You then only need to run egrep over
each file once, which will be faster. You'll need to make filelist read
something like:

John Smith|Joe Brown|Colin Watson|...
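
If the names are currently kept one per line, a sketch like the
following could build that single line for you. The temporary file and
its three sample names are made up for illustration; paste -s joins
the lines and -d'|' puts a | between them:

```shell
# Hypothetical sample list, one name per line (stand-in for the real file).
namelist=$(mktemp)
printf '%s\n' 'John Smith' 'Joe Brown' 'Colin Watson' > "$namelist"

# -s joins all lines into one; -d'|' separates them with | for egrep.
pattern=$(paste -s -d'|' "$namelist")
echo "$pattern"

rm -f "$namelist"
```

You could then redirect the output of paste into filelist, or assign
it to pattern directly as shown.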

Perl:

===== cut here =====
#! /usr/bin/perl -w
use diagnostics;
use strict;

my $names = shift;
open NAMES, $names or die "Couldn't open name list: $!";
my @names = map { chomp; $_ } <NAMES>;
close NAMES;

chdir '/backup/Loaders/Ld21/HL7FILES' or die "Couldn't chdir: $!";
opendir HL7FILES, '.' or die "Couldn't open directory: $!";
while (defined(my $file = readdir HL7FILES))
{
    next unless -f $file;
    open HL7FILE, $file or die "Couldn't open data file: $!";
    # A limit of 21 splits field 20 off from the rest of the line;
    # lines with fewer than 20 fields are skipped.
    my @hl7 = grep { defined }
              map { chomp; (split /\|/, $_, 21)[19] } <HL7FILE>;
    close HL7FILE;
    foreach my $name (@names)
    {
        print "$name -> $file\n" if scalar grep m/\Q$name/, @hl7;
    }
}
closedir HL7FILES;
===== cut here =====

Call this with ./search.pl namelist, or similar. The results go to
standard output, so redirect them to a file if you want to keep them.
The list of names should be one per line this time.

Caveat: if this is mission-critical, *test these first*. I've only done
fairly minimal testing, and these scripts may very well still contain
bugs.
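
One way to do that testing is to run the pipeline over a couple of
made-up files first. This is only a sketch: the file names, the names
in them and the make_record helper are invented for the test, and it
uses grep -E, which is the same thing as egrep:

```shell
# Throwaway directory with two made-up pipe-delimited records.
testdir=$(mktemp -d)

# Hypothetical helper: writes one record with $1 as field 20.
make_record() {
  printf '%s' 'PID'
  i=1
  # 18 separators here plus one below put the name in field 20.
  while [ "$i" -lt 19 ]; do printf '|'; i=$((i + 1)); done
  printf '|%s|tail\n' "$1"
}

make_record 'Joe Brown' > "$testdir/a.hl7"
make_record 'Jane Doe'  > "$testdir/b.hl7"

pattern='John Smith|Joe Brown'

# Same pipeline as above, run in a subshell so the cd does not stick.
result=$(cd "$testdir" && for file in *.hl7; do
  cut -d'|' -f20 "$file" | grep -E "$pattern" | sed "s|\$| -> $file|"
done)
echo "$result"

rm -rf "$testdir"
```

Only a.hl7 has a wanted name in field 20, so it should be the only
file reported.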

-- 
Colin Watson                                           [cjw44@cam.ac.uk]

