From: Bas van der Vlies (basv_at_[hidden])
Date: 2007-05-02 02:27:39


Ole Holm Nielsen wrote:
> We have built OpenMPI 1.2.1 with support for Torque 2.1.8 and its
> Task Manager interface. We use the PGI 6.2-4 compiler and the
> --with-tm option as described in
> http://www.open-mpi.org/faq/?category=building#build-rte-tm
> for building an OpenMPI RPM on a Pentium-4 machine running CentOS 4.4
> (RHEL4U4 clone). The TM interface seems to be available as it should:
>
> # ompi_info | grep tm
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.1)
> MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
> MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
>
> When we submit a Torque batch job running the example code in
> openmpi-1.2.1/examples/hello_c.c we get this error message:
>
> /usr/local/openmpi-1.2.1-pgi/bin/mpirun -np 2 -machinefile $PBS_NODEFILE hello_c
> [u126.dcsc.fysik.dtu.dk:11981] pls:tm: failed to poll for a spawned proc, return status = 17002
> [u126.dcsc.fysik.dtu.dk:11981] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
> [u126.dcsc.fysik.dtu.dk:11981] mpirun: spawn failed with errno=-11
>
Ole,

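  Inside a Torque job you must start the program without the
  -machinefile option. The TM support in Open MPI gets the node list
  from Torque itself: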
{{{
mpiexec -np 2 ./a.out

whello, i am 0 of 2
whello, i am 1 of 2
all is well that ends well

}}}

Adding -machinefile $PBS_NODEFILE to the same command reproduces your error:

{{{
$ mpiexec -np 2 -machinefile $PBS_NODEFILE ./a.out
[ib-r6n19.irc.sara.nl:04999] pls:tm: failed to poll for a spawned proc, return status = 17002
[ib-r6n19.irc.sara.nl:04999] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
[ib-r6n19.irc.sara.nl:04999] mpiexec: spawn failed with errno=-11
}}}
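
If the same script has to work both inside a Torque job and in an interactive shell, something like the sketch below might help. It is only an illustration, not a tested recipe for your site: `launch_cmd` is a hypothetical helper name, the `HOSTFILE` variable and its default `hostfile` are assumptions, and the function only prints the command it would run. It relies on Torque setting `PBS_JOBID` inside a batch job.

```shell
#!/bin/sh
# Sketch only: launch_cmd is a hypothetical helper, not part of Open MPI or Torque.
# It prints the launch command instead of running it, so it can be checked first.
launch_cmd() {
    nprocs=$1
    prog=$2
    if [ -n "${PBS_JOBID:-}" ]; then
        # Inside a Torque job: no -machinefile, the tm components supply the hosts.
        echo "mpiexec -np $nprocs $prog"
    else
        # Interactive shell: use an explicit host file (the default name is an assumption).
        echo "mpiexec -np $nprocs -machinefile ${HOSTFILE:-hostfile} $prog"
    fi
}

launch_cmd 2 ./a.out
```

To actually start the job, the printed line could be passed to `eval`, or the `echo` calls replaced by the commands themselves.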

> When we run the same code in an interactive (non-Torque) shell the
> hello_c code works correctly:
>
> # /usr/local/openmpi-1.2.1-pgi/bin/mpirun -np 2 -machinefile hostfile hello_c
> Hello, world, I am 0 of 2
> Hello, world, I am 1 of 2
>
> To prove that the Torque TM interface is working correctly we also ran
> this test within the Torque batch job using the Torque pbsdsh command:
>
> pbsdsh hostname
> u126.dcsc.fysik.dtu.dk
> u113.dcsc.fysik.dtu.dk
>
> So obviously something is broken between Torque 2.1.8 and OpenMPI 1.2.1
> with respect to the TM interface, whereas either one alone seems to work
> correctly. Can anyone suggest a solution to this problem?
>
> I wonder if this problem may be related to this list thread:
> http://www.open-mpi.org/community/lists/users/2007/04/3028.php
>
> Details of configuration:
> -------------------------
>
> We use the buildrpm.sh script from
> http://www.open-mpi.org/software/ompi/v1.2/srpm.php
> and change the following options in the script:
>
> prefix="/usr/local/openmpi-1.2.1-pgi"
>
> configure_options="--with-tm=/usr/local FC=pgf90 F77=pgf90 CC=pgcc
> CXX=pgCC CFLAGS=-Msignextend CXXFLAGS=-Msignextend
> --with-wrapper-cflags=-Msignextend --with-wrapper-cxxflags=-Msignextend
> FFLAGS=-Msignextend FCFLAGS=-Msignextend
> --with-wrapper-fflags=-Msignextend --with-wrapper-fcflags=-Msignextend"
> rpmbuild_options=${rpmbuild_options}" --define 'install_in_opt 0'
> --define 'install_shell_scripts 1' --define 'install_modulefile 0'"
> rpmbuild_options=${rpmbuild_options}" --define '_prefix ${prefix}'"
>
> build_single=yes
>

-- 
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv_at_[hidden]      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************