From: Ralph Castain (rhc_at_[hidden])
Date: 2007-05-01 12:32:03


The most likely problem is a path or library issue regarding the location of
the OpenMPI/OpenRTE executables when running in batch versus interactive mode.
We see this sometimes when the shell startup files differ between those two
modes.

You might try running printenv in both a batch job and an interactive shell
to see if any differences exist.
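
For example, something along these lines (file names are just placeholders)
would let you diff the two environments from the same account:

  # inside a minimal Torque batch script
  printenv | sort > $HOME/env-batch.txt

  # from an interactive shell on the same node
  printenv | sort > $HOME/env-interactive.txt
  diff $HOME/env-interactive.txt $HOME/env-batch.txt

Pay particular attention to PATH and LD_LIBRARY_PATH, since those determine
which mpirun and which OpenMPI libraries actually get picked up.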

As far as I know, there are no compatibility issues with Torque at this
time.
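
If the batch environment does turn out to be missing the OpenMPI paths, the
usual fixes are either to set PATH and LD_LIBRARY_PATH in a startup file that
non-interactive shells also read, or to point mpirun at the install explicitly.
A rough sketch of the latter (untested against your setup; adjust the prefix to
your install):

  /usr/local/openmpi-1.2.1-pgi/bin/mpirun --prefix /usr/local/openmpi-1.2.1-pgi \
      -np 2 -machinefile $PBS_NODEFILE hello_c

Configuring OpenMPI with --enable-mpirun-prefix-by-default has the same effect
without having to pass --prefix on every command line.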

Ralph

On 5/1/07 8:54 AM, "Ole Holm Nielsen" <Ole.H.Nielsen_at_[hidden]> wrote:

> We have built OpenMPI 1.2.1 with support for Torque 2.1.8 and its
> Task Manager interface. We use the PGI 6.2-4 compiler and the
> --with-tm option as described in
> http://www.open-mpi.org/faq/?category=building#build-rte-tm
> for building an OpenMPI RPM on a Pentium-4 machine running CentOS 4.4
> (RHEL4U4 clone). The TM interface appears to be available, as expected:
>
> # ompi_info | grep tm
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.1)
> MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
> MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
>
> When we submit a Torque batch job running the example code in
> openmpi-1.2.1/examples/hello_c.c we get this error message:
>
> /usr/local/openmpi-1.2.1-pgi/bin/mpirun -np 2 -machinefile $PBS_NODEFILE hello_c
> [u126.dcsc.fysik.dtu.dk:11981] pls:tm: failed to poll for a spawned proc, return status = 17002
> [u126.dcsc.fysik.dtu.dk:11981] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
> [u126.dcsc.fysik.dtu.dk:11981] mpirun: spawn failed with errno=-11
>
> When we run the same hello_c code in an interactive (non-Torque) shell,
> it works correctly:
>
> # /usr/local/openmpi-1.2.1-pgi/bin/mpirun -np 2 -machinefile hostfile hello_c
> Hello, world, I am 0 of 2
> Hello, world, I am 1 of 2
>
> To verify that the Torque TM interface itself is working, we also ran this
> test within the Torque batch job using the Torque pbsdsh command:
>
> pbsdsh hostname
> u126.dcsc.fysik.dtu.dk
> u113.dcsc.fysik.dtu.dk
>
> So obviously something is broken between Torque 2.1.8 and OpenMPI 1.2.1
> with respect to the TM interface, whereas either one alone seems to work
> correctly. Can anyone suggest a solution to this problem?
>
> I wonder if this problem may be related to this list thread:
> http://www.open-mpi.org/community/lists/users/2007/04/3028.php
>
> Details of configuration:
> -------------------------
>
> We use the buildrpm.sh script from
> http://www.open-mpi.org/software/ompi/v1.2/srpm.php
> and change the following options in the script:
>
> prefix="/usr/local/openmpi-1.2.1-pgi"
>
> configure_options="--with-tm=/usr/local FC=pgf90 F77=pgf90 CC=pgcc CXX=pgCC
> CFLAGS=-Msignextend CXXFLAGS=-Msignextend --with-wrapper-cflags=-Msignextend
> --with-wrapper-cxxflags=-Msignextend FFLAGS=-Msignextend FCFLAGS=-Msignextend
> --with-wrapper-fflags=-Msignextend --with-wrapper-fcflags=-Msignextend"
> rpmbuild_options=${rpmbuild_options}" --define 'install_in_opt 0' --define 'install_shell_scripts 1' --define 'install_modulefile 0'"
> rpmbuild_options=${rpmbuild_options}" --define '_prefix ${prefix}'"
>
> build_single=yes