Subject: [OMPI users] mpirun fails on the host
From: Honest Guvnor (honestguvnor_at_[hidden])
Date: 2009-06-18 17:49:27
Open MPI 1.2.7 over Ethernet; fresh CentOS 5.3 i386 install on both the host and the nodes.
Despite ssh and pdsh working, mpirun hangs when launching a program
from the host to a node:
[cluster_at_hankel ~]$ ssh n06 hostname
n06
[cluster_at_hankel ~]$ pdsh -w n06 hostname
n06: n06
[cluster_at_hankel ~]$ mpirun -np 1 --host n06 hostname
[HANGS]
However, mpirun works fine in reverse:
[cluster_at_n06 ~]$ mpirun -np 1 --host hankel date
Thu Jun 18 22:53:27 CEST 2009
and from node to node. Paths to bin and lib seem OK:
[cluster_at_hankel ~]$ printenv PATH
/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin:/home/cluster/bin
[cluster_at_hankel ~]$ printenv LD_LIBRARY_PATH
:/usr/lib/openmpi/1.2.7-gcc/lib
[cluster_at_hankel ~]$ ssh n06 printenv PATH
/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin
[cluster_at_hankel ~]$ ssh n06 printenv LD_LIBRARY_PATH
:/usr/lib/openmpi/1.2.7-gcc/lib
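The eyeball check above can be made mechanical. The following is only a minimal sketch: `path_has_dir` and `report` are hypothetical helper names (not part of Open MPI), and the strings are copied from the printenv output above rather than fetched live. It confirms that the Open MPI bin and lib directories appear as exact components of each colon-separated list.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): succeed if directory $2 is
# an exact component of the colon-separated list $1.
path_has_dir() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print a one-line verdict for a labelled list.
report() {
    if path_has_dir "$2" "$3"; then
        echo "$1: ok"
    else
        echo "$1: MISSING $3"
    fi
}

# Values copied from the printenv output above.
host_path="/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin:/home/cluster/bin"
node_path="/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin"
lib_path=":/usr/lib/openmpi/1.2.7-gcc/lib"

report "host PATH" "$host_path" "/usr/lib/openmpi/1.2.7-gcc/bin"
report "node PATH" "$node_path" "/usr/lib/openmpi/1.2.7-gcc/bin"
report "LD_LIBRARY_PATH" "$lib_path" "/usr/lib/openmpi/1.2.7-gcc/lib"
```

On the cluster itself, the hard-coded strings could be replaced with the output of the same commands used above, for example "$(ssh n06 printenv PATH)", so that the check reflects what a non-interactive ssh session actually sees.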
We are new to Open MPI. We checked a few MCA parameters and turned on a
diagnostic flag or two, but without learning much. The nodes do not
have access to the host's external network, and mentions of it in the
output of the -d flag half convinced us that this was the problem.
Excluding that interface made no difference, however:
[cluster_at_hankel ~]$ mpirun --mca btl tcp,self --mca btl_tcp_if_exclude lo,eth0 \
    --mca oob_tcp_if_exclude lo,eth0 -np 1 --host n06 hostname
[STILL HANGS]
where eth0 is the external network.
Suggestions gratefully received on how to get Open MPI to report what
has failed, or on where to poke and prod further.