Subject: [OMPI users] PBSPro/OpenMPI Errors
From: Robert Jackson (rjackson_at_[hidden])
Date: 2009-06-25 12:06:58

When I run NWChem under Open MPI directly from the command line
(mpirun --byslot --mca btl self,sm,tcp --mca btl_base_verbose 30
--mca btl_tcp_if_exclude lo,eth1 $NWCHEM h2o.nw >& h2o.nwo.$$), the job
runs fine.


When I run the same job through the PBSPro scheduler, I get errors. The
PBS script is called nwrun and is submitted with the following command:
qsub -V -S /bin/bash ./nwrun


nwrun listing:


#PBS -N h2o
#PBS -l select=4:ncpus=4:mpiprocs=4
#PBS -l walltime=0:10:00
#PBS -e .
#PBS -j eo
#PBS -k eo

# set working directory
set echo

# make sure that the proper mpirun is installed
##module load hpc/openmpi-1.2.6-intel

# load NWChem
#module load hpc/nwchem-5.1

setenv NWCHEM /share/apps/nwchem-5.1/bin/nwchem

setenv | grep LD_LIB
which mpirun

# run a parallel job
mpirun --byslot --mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,eth1 \
    $NWCHEM h2o.nw >& h2o.nwo.$$
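
To compare what the scheduler hands to the job with the standalone run, a minimal sketch follows. It assumes PBS Pro exports PBS_NODEFILE inside the job and that the file lists one hostname per MPI slot; the function name summarize_nodefile is made up for illustration.

```shell
#!/bin/sh
# Summarize a PBS node file: one line per host with the number of slots
# (MPI ranks) the scheduler assigned to it.
# Assumption: the file lists one hostname per slot, as $PBS_NODEFILE should.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job the scheduler sets PBS_NODEFILE; outside a job, do nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

With select=4:ncpus=4:mpiprocs=4, I would expect four hosts with four slots each; anything else would point to a difference between the two environments.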



Error listing from error file:

ARMCI configured for 4 cluster nodes. Network protocol is 'TCP/IP
1:trying connect to host=compute-1-4.local, port=35506 t=5 111
1:armci_CreateSocketAndConnect: connect failed: -1
trying to connect:: Connection refused
1:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 1:: Connection refused
[compute-1-4.local:04739] MPI_ABORT invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode -1
3:trying connect to host=compute-1-4.local, port=35508 t=5 111
trying to connect:: Connection refused
3:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 3:: Connection refused
3:armci_CreateSocketAndConnect: connect failed: -1
[compute-1-4.local:04741] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode -1
6:trying connect to host=compute-1-5.local, port=48920 t=5 111
10:trying connect to host=compute-1-6.local, port=36350 t=5 111
4:armci_CreateSocketAndConnect: connect failed: -1
4:trying connect to host=compute-1-5.local, port=48918 t=5 111
trying to connect:: Connection refused
4:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 4:: Connection refused
5:armci_CreateSocketAndConnect: connect failed: -1
5:trying connect to host=compute-1-5.local, port=48919 t=5 111
trying to connect:: Connection refused
5:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 5:: Connection refused
[compute-1-5.local:01175] MPI_ABORT invoked on rank 5 in communicator MPI_COMM_WORLD with errorcode -1
6:armci_CreateSocketAndConnect: connect failed: -1
trying to connect:: Connection refused
6:armci_CreateSocketAndConnect: connect failed: -1
Last System Error Message from Task 6:: Connection refused
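
To make the listing easier to scan, here is a rough sketch that pulls the distinct host:port targets out of the "trying connect" lines. It assumes the error file keeps exactly the line format shown above; the function name list_connect_targets is hypothetical.

```shell
#!/bin/sh
# Print each distinct host:port found in a "trying connect" line.
# Assumption: lines look like
#   1:trying connect to host=compute-1-4.local, port=35506 t=5 111
list_connect_targets() {
    sed -n 's/.*trying connect to host=\([^,]*\), port=\([0-9]*\).*/\1:\2/p' "$1" \
        | sort -u
}
```

Calling list_connect_targets on the error file gives one line per endpoint. The trailing 111 on those lines matches the Linux errno value for "Connection refused", which agrees with the messages that follow them.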


Is anybody familiar with this error?


Robert C. Jackson
Software Systems Specialist III
The University of Texas - Pan American
1201 W. University Dr.
Edinburg Texas 78539
Academic Computing Department
ASB 2.162E
956-381-2455 office 956-381-2355 fax
email: rjackson_at_[hidden]