Subject: [OMPI users] PBSPro/OpenMPI Errors
From: Robert Jackson (rjackson_at_[hidden])
Date: 2009-06-25 12:06:58
When I run NWChem standalone with OpenMPI (mpirun --byslot --mca btl
self,sm,tcp --mca btl_base_verbose 30 --mca btl_tcp_if_exclude lo,eth1
$NWCHEM h2o.nw >& h2o.nwo.$$), the job runs fine.
When I run the same job through the PBSPro scheduler, I get errors. The PBS
script is called nwrun and is submitted with the following command:
qsub -V -S /bin/bash ./nwrun
Nwrun listing:
#!/bin/tcsh
#PBS -N h2o
#PBS -l select=4:ncpus=4:mpiprocs=4
#PBS -l walltime=0:10:00
#PBS -e .
#PBS -j eo
#PBS -k eo
#
# set working directory
set echo
cd $PBS_O_WORKDIR
#
# make sure that the proper mpirun is installed
##module load hpc/openmpi-1.2.6-intel
#
# load NWChem
#module load hpc/nwchem-5.1
setenv LD_LIBRARY_PATH /share/apps/openmpi-1.2.6-intel/lib:/share/apps/intel/mkl/10.0.1.014/lib/em64t:/share/apps/intel/cce/10.1.015/lib:/share/apps/intel/fce/10.1.015/lib
setenv NWCHEM /share/apps/nwchem-5.1/bin/nwchem
setenv PERMANENT_DIR $PBS_O_WORKDIR
setenv SCRATCH_DIR $TMPDIR
#
setenv | grep LD_LIB
which mpirun
cat $PBS_NODEFILE
# run a parallel job
mpirun --byslot --mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,eth1 $NWCHEM h2o.nw >& h2o.nwo.$$
exit
Error listing from error file:
ARMCI configured for 4 cluster nodes. Network protocol is 'TCP/IP Sockets'.
1:trying connect to host=compute-1-4.local, port=35506 t=5 111
1:armci_CreateSocketAndConnect: connect failed: -1
trying to connect:: Connection refused
1:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 1:: Connection refused
[compute-1-4.local:04739] MPI_ABORT invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode -1
3:trying connect to host=compute-1-4.local, port=35508 t=5 111
trying to connect:: Connection refused
3:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 3:: Connection refused
3:armci_CreateSocketAndConnect: connect failed: -1
[compute-1-4.local:04741] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode -1
6:trying connect to host=compute-1-5.local, port=48920 t=5 111
10:trying connect to host=compute-1-6.local, port=36350 t=5 111
4:armci_CreateSocketAndConnect: connect failed: -1
4:trying connect to host=compute-1-5.local, port=48918 t=5 111
trying to connect:: Connection refused
4:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 4:: Connection refused
5:armci_CreateSocketAndConnect: connect failed: -1
5:trying connect to host=compute-1-5.local, port=48919 t=5 111
trying to connect:: Connection refused
5:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 5:: Connection refused
[compute-1-5.local:01175] MPI_ABORT invoked on rank 5 in communicator MPI_COMM_WORLD with errorcode -1
6:armci_CreateSocketAndConnect: connect failed: -1
trying to connect:: Connection refused
6:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 6:: Connection refused
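To see at a glance which nodes the failing ranks were trying to reach, the "trying connect" lines in the listing above can be tallied per target host. The following is only a rough sketch in Python: the function name attempts_by_host and the regular expression are my own illustration, not part of NWChem, ARMCI, or Open MPI, and the sample text is copied from three lines of the listing. The trailing 111 on those lines is consistent with the Linux errno for "Connection refused".

```python
import re
from collections import Counter

# Matches lines such as:
#   1:trying connect to host=compute-1-4.local, port=35506 t=5 111
# The pattern is an assumption based on the log lines shown in this post.
CONNECT_RE = re.compile(r"^(\d+):trying connect to host=([^,\s]+), port=(\d+)")

def attempts_by_host(log_text):
    """Count ARMCI connection attempts per target host (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CONNECT_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts

# Three lines copied from the error listing above.
sample = """\
1:trying connect to host=compute-1-4.local, port=35506 t=5 111
3:trying connect to host=compute-1-4.local, port=35508 t=5 111
6:trying connect to host=compute-1-5.local, port=48920 t=5 111
"""

for host, count in sorted(attempts_by_host(sample).items()):
    print(host, count)
```

Running this over the whole error file rather than the sample would show whether the refusals are concentrated on particular nodes or spread across all of them.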
Is anybody familiar with this error?
Robert C. Jackson
Software Systems Specialist III
The University of Texas - Pan American
1201 W. University Dr.
Edinburg Texas 78539
Academic Computing Department
ASB 2.162E
956-381-2455 office 956-381-2355 fax
email: rjackson_at_[hidden]