From: Nayden D Kambouchev (nayden_at_[hidden])
Date: 2007-05-26 16:56:35


Hi,

I am unable to run batch jobs with my installation of OpenMPI and SLURM. I am
not sure whether this is an OpenMPI issue or a SLURM issue, but here is what
happens on my small cluster (3 nodes: one login node and 2 backend nodes with
2 dual-core CPUs each). If I run

salloc -n 8 mpirun -np 8 myprog

I get both backend nodes allocated (with their total of 8 cores) and myprog runs.

If I run

sbatch -n 8 zrun.sh

where zrun.sh contains

#!/bin/bash
mpirun -np 8 myprog

again both backend nodes get allocated, but the job does not run. In top I see
one mpirun and two srun processes on the first backend node, but they just sit
there doing nothing. On the other backend node I see no mpirun, no srun, and
nothing else that might have been started as a result of the batch job.
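
For what it's worth, this is roughly how I have been looking at the job from
the SLURM side (82 is the job id mentioned further down; squeue and scontrol
run on the login node, ps on each backend node):

squeue                           # the job is listed, but nothing progresses
scontrol show job 82             # detailed state/allocation for the job
ps -ef | egrep 'mpirun|srun'     # look for launcher processes on a backend node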

Is this the correct way to initiate SLURM batch jobs with OpenMPI?
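
In case the explicit counts are part of the problem: a variant I have been
considering (just a sketch, assuming that sbatch honors #SBATCH lines in the
script and that mpirun can take the process count from the SLURM allocation
when -np is omitted) would be

sbatch zrun.sh

with zrun.sh containing

#!/bin/bash
#SBATCH -n 8
# let mpirun size itself from the allocated slots instead of hardcoding -np 8
mpirun myprog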

I also see the following error in the SLURM log of the second backend node:
May 26 16:15:21 localhost slurmd[2665]: launch task 82.0 request from 1001.1001_at_127.0.0.1 (port 21721)
May 26 16:15:21 localhost slurmstepd[2747]: jobacct NONE plugin loaded
May 26 16:15:21 localhost slurmstepd[2747]: error: connect io: Connection refused
May 26 16:15:21 localhost slurmd[node21][2747]: error: IO setup failed: Connection refused
May 26 16:15:21 localhost slurmd[node21][2747]: error: job_manager exiting abnormally, rc = 4020
May 26 16:15:21 localhost slurmd[node21][2747]: done with job

The job number assigned by SLURM at submission was 82.
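
One thing in that log looks suspicious to me: the launch request is reported
as coming from 1001.1001_at_127.0.0.1, i.e. from the loopback address, and the
"connect io: Connection refused" would fit slurmstepd trying to open the I/O
connection back to 127.0.0.1, where no srun is listening. Could this be a
hostname resolution problem? This is roughly how I would check (a sketch;
node21 is the name from the log, and <first-backend-node> is a placeholder
for the node where mpirun and srun actually run):

hostname                         # on each backend node
getent hosts $(hostname)         # should resolve to a real interface, not 127.0.0.1
getent hosts node21              # how this node's name resolves cluster-wide
ping -c 1 <first-backend-node>   # basic reachability back to the srun node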

What am I doing incorrectly? Is it possible that something in my environment is
not set up correctly?

Thanks,
Nayden Kambouchev