Xiaoming Hu <xhu@ncsu.edu> writes:
I guess I need to do some research on how to submit the job through
mpirun. Initially I thought mpirun would know the machines in the cluster
since I listed them in /etc/hosts.
No, mpirun does not guess the machines that should be involved in your
parallel application. You can configure the default list of machines
used by mpirun in the file /etc/mpich/machines.LINUX (this only has to
be done on the head node, if you do not need to call mpirun on any other
node in your cluster).
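For illustration, a machines file for a two-node setup might look like
the sketch below. The hostnames laptop1 and laptop2 are hypothetical,
and the host:n form (n processes on that host) is assumed to be
supported by your MPICH build, as it is for the ch_p4 device. The
snippet writes to a local file called machines.example (also a made-up
name), which can then be copied to /etc/mpich/machines.LINUX as root.

```shell
# Write an example machines file: one host per line.
# "laptop1" and "laptop2" are placeholder hostnames; they must resolve
# (for example through /etc/hosts) on the node that runs mpirun.
cat > machines.example <<'EOF'
laptop1
laptop2:2
EOF
```

Here laptop2:2 asks for up to two processes on laptop2, which is useful
for a dual-CPU machine.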
However, in a typical cluster environment, a parallel job does not
always span the whole cluster, and one would probably like to be
able to run several parallel jobs at once using the available resources.
That is why one typically uses some kind of job queueing system (like
torque).
In such a system, you submit a job with certain criteria (attributes)
like the number of nodes (and CPUs per node) you would like to use for
your job.
When the job is executed (typically as a shell script), the job queueing
system tells the script which hosts it is allowed to use (based on the
attributes given to the job initially). Torque, for example, passes
this list in the file named by the PBS_NODEFILE environment variable.
And here is where the -machinefile argument is typically used.
You generate a temporary machines file for your job in the job script,
and run mpirun with the -machinefile argument to give it an explicit
list of hosts.
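As a rough sketch of such a job script, assuming torque (which exports
PBS_NODEFILE, a file with one hostname per allocated CPU slot) and an
MPICH that accepts host:count lines in its machines file: the helper
name make_machinefile and the program name ./my_mpi_program are made up
for this example.

```shell
#!/bin/sh
# Sketch of a torque job script body; not a definitive implementation.

# Collapse repeated hostnames from a node file into "host:count" lines.
# (make_machinefile is a hypothetical helper, not part of MPICH or torque.)
make_machinefile() {
    sort "$1" | uniq -c | awk '{ print $2 ":" $1 }'
}

# Only run when started by the queueing system, which sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ]; then
    machinefile=$(mktemp)                     # temporary per-job machines file
    make_machinefile "$PBS_NODEFILE" > "$machinefile"
    nprocs=$(grep -c . "$PBS_NODEFILE")       # one process per allocated slot
    mpirun -machinefile "$machinefile" -np "$nprocs" ./my_mpi_program
    rm -f "$machinefile"
fi
```

With ch_p4 MPICH it may be enough to pass "$PBS_NODEFILE" directly to
-machinefile; generating a separate file just makes the host list
explicit and easy to inspect.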
Another question: do I need a copy of hosts on each of the machines in
the cluster (basically 2 laptops in my case)?
You need a properly configured /etc/hosts on all of your cluster nodes
in order to have passwordless rsh (ssh) logins work properly. However,
you only need your generated machine file (or your default
/etc/mpich/machines.LINUX) on the node where you run mpirun.
Thanks very much
Xiaoming
Mario Lang wrote:
Xiaoming Hu <xhu@ncsu.edu> writes:
I have two laptops running Ubuntu. I got NFS working, and also ssh
without a password prompt between the two laptops. I also installed
MPICH, and my simple parallel code compiled successfully. But when I use
mpirun -np n xxx to submit my job, it doesn't work.
What error message do you get? Did you configure your default MPICH
hosts file, or create one for your job?
mpirun -machinefile some-file-name -np xxx ...