
Re: distributed batch processing




On 10 May 2005, at 1:05 am, Paul Brossier wrote:

> Hi all,
>
> I am looking at ways to distribute batch jobs on various hosts.
> Essentially, I have N different command lines, and M different
> hosts to run them on:
>
>         foo -i file1.data -p 0.1
>         foo -i file2.data -p 0.1
>         foo -i file3.data -p 0.1
>         ...
>         foo -i file1.data -p 0.2
>         ...
>
> I had a try with 'queue' [1], but it seems rather obsolete now.
> I am now seeking recent alternatives. I came across a few
> solutions, such as DQS [2] (non-free, unmaintained), OpenPBS [3]
> (non-free), and distribulator [4] (looks interesting).
>
> Now I feel like I have missed something obvious. Is there a tool
> out there that I could use as a drop-in replacement for queue?

This is not the right forum for this question.

However, I'll answer you anyway, since I know something about this. The two market leaders for this sort of processing are Sun GridEngine (which is free [as in beer, at least]) and Platform LSF, which is proprietary and costs $$$, but is very good at what it does.

Both products can do what you are asking. Personally, I use LSF in my day job on a ~1500-CPU cluster, running a mixture of Red Hat 8.0, Debian sarge (on newer x86 boxes), Tru64 5.1B (on Alphas) and SGI ProPack Linux (on our SGI Altixes), but I know SGE could run this as well.

In LSF, you'd submit that set of jobs (let's say your files are named file1.data - file100.data) as something like the following:

        bsub -J"set1[1-100]" -o 0.1.output.%I foo -i file\$LSB_JOBINDEX.data -p 0.1
        bsub -J"set2[1-100]" -o 0.2.output.%I foo -i file\$LSB_JOBINDEX.data -p 0.2

The standard output and standard error, as well as a job summary (CPU time and memory used, etc) would appear in output files named:

0.1.output.1
0.1.output.2
etc
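
To make the substitution concrete, here is a minimal sketch in plain shell that needs no batch system. It only prints which command line and output file each array element would end up with, mimicking how LSF fills in LSB_JOBINDEX in the command and %I in the -o argument. The function name expand_element is made up for illustration and is not part of LSF.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): print the command line and the
# output file that one job array element would use.
#   $1 = value passed to -p
#   $2 = array index (what LSF exposes as LSB_JOBINDEX and %I)
expand_element() {
    param=$1
    index=$2
    printf 'foo -i file%s.data -p %s -> %s.output.%s\n' \
        "$index" "$param" "$param" "$index"
}

# A few elements from the two arrays described above.
expand_element 0.1 1
expand_element 0.1 2
expand_element 0.2 100
```

The backslash in front of $LSB_JOBINDEX on the bsub command line matters: it stops the submitting shell from expanding the variable, so that it is evaluated only when each element actually runs.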

GridEngine would have its own methods for running these so-called "job arrays".
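
As a rough sketch of what that might look like, GridEngine's qsub accepts -t to request an array job and sets SGE_TASK_ID for each task. The script name sweep.sh, the helper task_input and the overall layout below are assumptions made for illustration and have not been tested on a real cluster.

```shell
#!/bin/sh
# sweep.sh: hypothetical GridEngine job script (name and layout are assumptions).
# Submit once per parameter value, for example:
#   qsub -t 1-100 -cwd -j y sweep.sh 0.1
#   qsub -t 1-100 -cwd -j y sweep.sh 0.2
# -t requests tasks 1..100, -cwd runs in the submission directory,
# -j y merges standard error into standard output.

# Map a task index to the input file name used in this thread.
task_input() {
    echo "file$1.data"
}

# GridEngine sets SGE_TASK_ID for array tasks ("undefined" for plain jobs),
# so only run foo when a real task index is available.
if [ -n "${SGE_TASK_ID:-}" ] && [ "$SGE_TASK_ID" != "undefined" ]; then
    foo -i "$(task_input "$SGE_TASK_ID")" -p "$1"
fi
```

Passing the parameter value as a script argument keeps a single script usable for every value of -p, in the same way that the two bsub lines differ only in that value.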

I looked at GNU queue a long time ago. Its mode of operation seemed (to me) to be largely based on how LSF works, but at the time it was pretty fundamentally broken (and it was removed from woody as a result). GridEngine is organised rather differently, but a lot of people swear by it.

Tim

--
Dr Tim Cutts
GPG: 1024/D FC81E159 5BA6 8CD4 2C57 9824 6638  C066 16E2 F4F5 FC81 E159


