
I got the parallel code running on my two-computer-cluster



Hey,

Sorry, I figured out what the problem was; it was just bugs in my own code.

Thanks for the help!

But I have another problem. The two computers in my cluster are behind my router, so their IP addresses are dynamic, and every time they change I have to update every place in my setup that depends on the IP address.

How can I change the IP addresses to static?
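
Is it just a matter of adding a static stanza to /etc/network/interfaces on each laptop? My guess is something like the following (the interface name and all addresses below are only placeholders for my network):

    auto eth0
    iface eth0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        gateway 192.168.1.1

Or is it better to reserve fixed addresses for the two machines in the router's DHCP settings?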

Thanks again


Xiaoming

Xiaoming Hu wrote:
Hey

Thanks very much
After modifying machines.LINUX, my parallel job runs fine when submitted with mpirun. But there is a problem with the output from the client node. I have made sure that "root" on the client node can write to the shared NFS directory, but the client node does not seem to generate the file I ask for in my code (see the code below; only debug_server.txt is generated after running my job with mpirun -np 2 xx).

!       each rank opens its own debug file on unit 888
        if( my_rank .eq. 0) then
!         rank 0 (the server node) writes debug_server.txt
          open(888,file='debug_server.txt',
     $                IOSTAT=ierr)
        else if ( my_rank .eq. 1) then
!         rank 1 (the client node) should write debug_client.txt
          open(888,file='debug_client.txt',
     $                IOSTAT=ierr)
        endif
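
For what it is worth, I am also thinking of printing the IOSTAT value right after the open, roughly like this, to see whether the open on the client node fails at all:

        if (ierr .ne. 0) then
           print *, 'rank', my_rank, ': open failed, iostat =', ierr
        endif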

Should I keep looking at NFS, or is the problem somewhere else?

Thanks very much

Xiaoming

Mario Lang wrote:
Xiaoming Hu <xhu@ncsu.edu> writes:

I guess I need to do some research on how to submit the job through mpirun.

Initially I thought mpirun would know the machines in the cluster, since
I listed them in /etc/hosts.

No, mpirun does not guess the machines that should be involved in your
parallel application.  You can configure the default list of machines
used by mpirun in the file /etc/mpich/machines.LINUX (this only has to
be done on the head node if you do not need to call mpirun on any other
node in your cluster).
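
For your two-laptop cluster, a minimal machines.LINUX could simply list
both hosts, one per line (the hostnames below are only placeholders; if
I remember correctly you can also append :n to give the number of
processes to start on a host):

    laptop1
    laptop2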

However, in a typical cluster environment, a parallel job does not
always span the whole cluster, and one would probably like to be able
to run several parallel jobs at once using the available resources.
That is why one typically uses some kind of job queueing system (like
torque).  In such a system, you submit a job with certain criteria
(attributes), like the number of nodes (and CPUs per node) you would
like to use for your job.  When the job is executed (the job is
typically a shell script), the queueing system tells the script which
hosts it is allowed to use (based on the attributes given to the job
initially).
And here is where the -machinefile argument is typically used.
You generate a temporary machines file for your job in the job script,
and run mpirun with the -machinefile argument to tell it an explicit
list of hosts.
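
As a rough sketch (assuming torque, which puts the list of allocated
hosts into the file named by $PBS_NODEFILE), such a job script might
look like this; the executable name ./xx is just your example binary:

    #!/bin/sh
    #PBS -l nodes=2:ppn=1
    # change to the directory the job was submitted from
    cd $PBS_O_WORKDIR
    # torque wrote the list of allocated hosts into $PBS_NODEFILE
    mpirun -machinefile $PBS_NODEFILE -np 2 ./xx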

Another question: do I need a copy of hosts on each of the machines in
the cluster(basically 2 laptops in my case)?

You need a properly configured /etc/hosts on all of your cluster nodes
in order for rsh (or ssh) passwordless logins to work properly.  However,
you only need your generated machine file (or your default
/etc/mpich/machines.LINUX) on the node you run mpirun from.
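
For example, both laptops could carry the same two entries in
/etc/hosts (the addresses and hostnames are only placeholders):

    192.168.1.10   laptop1
    192.168.1.11   laptop2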

Thanks very much

Xiaoming

Mario Lang wrote:
Xiaoming Hu <xhu@ncsu.edu> writes:

I have two laptops running Ubuntu. I got NFS working, and also ssh
without a password prompt between the two laptops. I also installed
MPICH, and my simple parallel code compiles successfully. But when I
use mpirun -np n xxx to submit my job, it doesn't work.
What error message do you get? Did you configure your default MPICH hosts
file, or create one for your job?

mpirun -machinefile some-file-name -np xxx ...





