
Re: Slow Script



On Tue, Feb 03, 2009 at 12:14:48PM EST, Gorka wrote:
> Hi! I've got a Perl script with this for loop:
> 
>   for (my $j=0;$j<=$#fichero1;$j++)
>   {
>     if ($fichero1[$j] eq $valor1)
>     {
>       $token = 1;
>     }
>   }

> The problem is that fichero1 has 32 million records, and moreover
> I've got to repeat this several million times, so this way it
> would take years to finish.  Does anybody know a way to optimize this
> script? Is there any other Linux programming language I could make
> this run more quickly with?

Since I can't imagine you need this on your home machine, I would talk
to my boss ... recommend an IBM mainframe running z/OS and a consultant
who will charge you $5000.00 to write three lines of JCL and optionally
ten lines of assembler that will emulate the above logic. 

Contact me off-list if interested.

More seriously, when you are dealing with 32 million records, one major
avenue for optimization is to keep disk access to a minimum. Disk
access, IIRC, is measured in milliseconds, RAM access in nanoseconds.

Do the math:
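
To put rough numbers on it (ballpark figures I'm assuming here, not
anything measured on your hardware):

  ~8 ms per random disk access  vs.  ~100 ns per RAM access
  8,000,000 ns / 100 ns  =  a factor of ~80,000

so every record fetched from disk instead of RAM costs you roughly five
orders of magnitude.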

The way to look at it is to make sure any logical record is transferred
from disk to RAM _once only_ (rather than a million times), and that
each disk access transfers as many records to central memory as the
filesystem (or rather the "access method" in mainframe parlance) and
hardware architecture allow. For instance, if you arrange your file's
physical layout so that each block contains 5,000 records, and your
access method (driver?) lets you request 256 blocks in one disk access,
the same program will run orders of magnitude faster than if each block
contained one record and you read one block at a time: the data transfer
times would be comparable, but you would skip all the individual wait
times (head positioning, and waiting while some unrelated process keeps
the controller busy).
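
On Linux you can get a taste of the same effect by doing your own
chunked reads instead of one I/O per record. A rough sketch, assuming
newline-terminated records in a file I'm calling "fichero1.txt" and an
arbitrary 8 MB chunk size; note that Perl's ordinary <$fh> reads are
already buffered by the C library, so treat this as an illustration of
the principle rather than a guaranteed win:

  use strict;
  use warnings;

  open my $fh, '<', 'fichero1.txt' or die "open fichero1.txt: $!";

  my $chunk = 8 * 1024 * 1024;   # ask for 8 MB per disk request
  my $tail  = '';                # partial record carried between chunks
  my $buf;

  while (read($fh, $buf, $chunk)) {
      $buf = $tail . $buf;
      my $nl = rindex $buf, "\n";        # last complete record boundary
      if ($nl < 0) { $tail = $buf; next; }
      $tail = substr($buf, $nl + 1);
      for my $record (split /\n/, substr($buf, 0, $nl)) {
          # ... process one record, entirely in RAM ...
      }
  }
  # $tail may still hold a final record with no trailing newline.
  close $fh;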

If you have to stick with an "Intel Inside" machine running Linux, even
though neither the machine nor the OS was designed with this type of
work in mind, there are probably further ways to keep disk access to a
healthy minimum, but that's something I can't help you with.

Obviously, as others have suggested, this doesn't mean that you should
not _first_ look into why your logic dictates that you need to process
32 million records "several million times".
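
For what it's worth, if the loop really is just asking whether $valor1
occurs anywhere in @fichero1, the idiomatic Perl fix is to make one pass
over the file, load the records into a hash, and turn each of your
several million probes into a single in-memory lookup instead of a scan
of 32 million records. A sketch, assuming exact-match keys and my
guessed filename "fichero1.txt":

  use strict;
  use warnings;

  open my $fh, '<', 'fichero1.txt' or die "open fichero1.txt: $!";

  my %seen;
  while (my $record = <$fh>) {    # one pass over the 32M records
      chomp $record;
      $seen{$record} = 1;
  }
  close $fh;

  # hypothetical stand-ins for the "several million" values to test
  my @values_to_check = ('foo', 'bar');

  for my $valor1 (@values_to_check) {
      my $token = exists $seen{$valor1} ? 1 : 0;   # O(1) hash probe
      # ...
  }

The catch is memory: 32 million hash keys can easily run to a few
gigabytes, and if that doesn't fit in RAM you are back to sorting the
file once and binary-searching it, or handing the job to a real
database.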

HTH

CJ

