Subject: [OMPI users] PBSPro/OpenMPI Errors
From: Robert Jackson (rjackson_at_[hidden])
Date: 2009-06-25 12:06:58


When I run NWChem standalone under OpenMPI with the command

    mpirun --byslot --mca btl self,sm,tcp --mca btl_base_verbose 30 --mca btl_tcp_if_exclude lo,eth1 $NWCHEM h2o.nw >& h2o.nwo.$$

the job runs fine.

When I run the same job through the PBSPro scheduler, I get the errors
listed below. The PBS script is called nwrun and is submitted with:

    qsub -V -S /bin/bash ./nwrun

nwrun listing:

#!/bin/tcsh

#PBS -N h2o

#PBS -l select=4:ncpus=4:mpiprocs=4

#PBS -l walltime=0:10:00

#PBS -e .

#PBS -j eo

#PBS -k eo

#

# set working directory

set echo

cd $PBS_O_WORKDIR

#

# make sure that the proper mpirun is installed

##module load hpc/openmpi-1.2.6-intel

#

# load NWChem

#module load hpc/nwchem-5.1

setenv LD_LIBRARY_PATH /share/apps/openmpi-1.2.6-intel/lib:/share/apps/intel/mkl/10.0.1.014/lib/em64t:/share/apps/intel/cce/10.1.015/lib:/share/apps/intel/fce/10.1.015/lib

setenv NWCHEM /share/apps/nwchem-5.1/bin/nwchem

setenv PERMANENT_DIR $PBS_O_WORKDIR

setenv SCRATCH_DIR $TMPDIR

#

setenv | grep LD_LIB

which mpirun

cat $PBS_NODEFILE

# run a parallel job

mpirun --byslot --mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,eth1 $NWCHEM h2o.nw >& h2o.nwo.$$

exit

Error listing from error file:

ARMCI configured for 4 cluster nodes. Network protocol is 'TCP/IP Sockets'.

1:trying connect to host=compute-1-4.local, port=35506 t=5 111

1:armci_CreateSocketAndConnect: connect failed: -1

trying to connect:: Connection refused

1:armci_CreateSocketAndConnect: connect failed: -1

Last System Error Message from Task 1:: Connection refused

[compute-1-4.local:04739] MPI_ABORT invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode -1

3:trying connect to host=compute-1-4.local, port=35508 t=5 111

trying to connect:: Connection refused

3:armci_CreateSocketAndConnect: connect failed: -1

Last System Error Message from Task 3:: Connection refused

3:armci_CreateSocketAndConnect: connect failed: -1

[compute-1-4.local:04741] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode -1

6:trying connect to host=compute-1-5.local, port=48920 t=5 111

10:trying connect to host=compute-1-6.local, port=36350 t=5 111

4:armci_CreateSocketAndConnect: connect failed: -1

4:trying connect to host=compute-1-5.local, port=48918 t=5 111

trying to connect:: Connection refused

4:armci_CreateSocketAndConnect: connect failed: -1

Last System Error Message from Task 4:: Connection refused

5:armci_CreateSocketAndConnect: connect failed: -1

5:trying connect to host=compute-1-5.local, port=48919 t=5 111

trying to connect:: Connection refused

5:armci_CreateSocketAndConnect: connect failed: -1

Last System Error Message from Task 5:: Connection refused

[compute-1-5.local:01175] MPI_ABORT invoked on rank 5 in communicator MPI_COMM_WORLD with errorcode -1

6:armci_CreateSocketAndConnect: connect failed: -1

trying to connect:: Connection refused

6:armci_CreateSocketAndConnect: connect failed: -1

Last System Error Message from Task 6:: Connection refused
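
In case it helps with reading the log, here is a minimal sketch of how the
refused connection attempts could be tallied per target host. It assumes the
error file keeps the "N:trying connect to host=..., port=..." line shape shown
above. The function name summarize_refused and the file name passed to it are
made up for illustration and are not part of NWChem, ARMCI, or OpenMPI.

```shell
# Tally ARMCI connection attempts per target host.
# Assumption: attempts are logged as lines shaped like
#   1:trying connect to host=compute-1-4.local, port=35506 t=5 111
# summarize_refused is a hypothetical helper name.
summarize_refused() {
    # $1: path to the job's error file
    # sed keeps only the host name from matching lines; other lines are dropped.
    sed -n 's/^[0-9][0-9]*:trying connect to host=\([^,]*\), port=[0-9][0-9]*.*/\1/p' "$1" \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'   # print "host count"
}
```

Calling summarize_refused on the error file prints one "host count" line per
target host, which shows at a glance whether every node is refusing
connections or only some of them.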

Is anybody familiar with this error?

Robert C. Jackson

Software Systems Specialist III

The University of Texas - Pan American

1201 W. University Dr.

Edinburg Texas 78539

Academic Computing Department

ASB 2.162E

956-381-2455 office 956-381-2355 fax

email: rjackson_at_[hidden]