From: Ole Holm Nielsen (Ole.H.Nielsen_at_[hidden])
Date: 2007-05-02 06:55:11


Ralph Castain wrote:
> We would consider it a "feature" that OpenMPI is integrated with Torque. We
> actually read the PBS_NODEFILE internally ourselves. I believe the problem
> here is that specifying the "machinefile" prevents us from using that
> Torque-integrated code and forces us down a different code path that doesn't
> correctly interpret the PBS_NODEFILE format.
>
> We probably should consider your observation a "bug" - frankly, it wasn't
> something anyone anticipated a user ever doing, so nobody I know of ever
> tested it. I'd have to dig into the internals to understand how you wound up
> in that particular error mode.

I'd say that this behavior of mpirun under Torque TM should be considered a
bug. Ideally, users should not have to write their scripts differently
depending on whether the sysadmin built Open MPI with TM support or not.
Also, for interactive tests one doesn't have TM. I think that mpirun just
ought to work no matter what...
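
To illustrate the kind of per-site branching that job scripts currently
need, here is a minimal sketch in Python. It is not taken from Open MPI;
the helper names has_tm_support and build_mpirun_args are hypothetical, and
the check assumes that a TM-enabled build lists a "tm" component in the
ompi_info output (lines of the form "MCA ras: tm (...)").

```python
import os
import re


def has_tm_support(ompi_info_output):
    """Return True if the ompi_info text lists a 'tm' MCA component.

    Assumption: TM-enabled builds print lines such as
    "MCA ras: tm (MCA v1.0, ...)"; other builds do not.
    """
    pattern = r"^\s*MCA\s+\w+:\s+tm\b"
    return re.search(pattern, ompi_info_output, re.MULTILINE) is not None


def build_mpirun_args(program, nprocs, tm_enabled, env=None):
    """Build an mpirun argument list that avoids -machinefile under TM."""
    env = os.environ if env is None else env
    args = ["mpirun", "-np", str(nprocs)]
    nodefile = env.get("PBS_NODEFILE")
    # Pass the Torque node file only when mpirun cannot get it through TM.
    if nodefile and not tm_enabled:
        args += ["-machinefile", nodefile]
    args.append(program)
    return args
```

The ompi_info text could be captured with, for example,
subprocess.run(["ompi_info"], capture_output=True, text=True).stdout and the
resulting list handed to subprocess.run. If mpirun simply ignored
-machinefile under TM, none of this branching would be needed.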

So I'd strongly propose that "-machinefile" should at least be tolerated
when mpirun executes under TM. You might issue a warning about -machinefile
being ignored under TM, but the code should never bomb out, IMHO.
Such behavior would be much easier for users (and sysadmins :-) to
understand than the present situation.

Thanks again,
Ole