From: Ralph Castain (rhc_at_[hidden])
Date: 2007-05-02 12:39:51


On 5/2/07 7:57 AM, "Ole Holm Nielsen" <Ole.H.Nielsen_at_[hidden]> wrote:

<snip>
>
> What I'm saying is that users should be able to run the same script in different
> environments, they being Torque or non-Torque, without having to change
> the arguments to the mpirun command. Maybe they submit batch jobs to
> our Linux/Torque cluster, or maybe they run their scripts on their own
> non-Torque workstation. The sysadmin may also reserve a set of nodes in the
> Linux cluster and log in interactively (without using Torque) for test
> purposes, and in this case the very same mpirun executable file will not
> use the TM interface.
>
> IMHO, it is highly desirable that the mpirun command is robust when being run
> in different ways, i.e., mpirun should accept both -np and -machinefile
> under all circumstances (but preferably print a warning message if it chooses
> to ignore -machinefile).

No disagreement - we are just trying to understand why you are seeing a
problem, and trying to get enough info to see where to start debugging.

<snip>

>
> Indeed, except that the above error message is totally unintelligible.
> There is no conflict in this job between "-np 2" which refers to 2 specific
> nodes allocated by Torque, and "-machinefile $PBS_NODEFILE" which refers
> to the very same 2 nodes allocated by Torque. It is beyond me why the
> redundant but consistent mpirun node information (in the case of being run
> under TM control) would cause mpirun to fail as shown above.
>

Just to be clear: "-np 2" does *not* indicate "run on two nodes allocated by
Torque". It only instructs us to run two processes on whatever allocation we
can find.

The machinefile option instructs us to use the nodes found in that file.
There is a potential conflict here with the nodes we might find in the
environment - we are aware of that conflict. We recently had a lengthy
telecon to discuss the wide variety of conflicting requests we have received
for how to resolve the problem of both a machinefile and an allocation.
Believe it or not, there is no consistent definition for that behavior.

We have arrived at a tentative resolution for that problem, but it won't
be implemented in the 1.2 code family (will wait for 1.3).
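
Until then, one possible stopgap for scripts that must run both inside and
outside Torque is to pass -machinefile only when no Torque allocation is
present. The following is a minimal sketch, not a tested recipe:
build_mpirun_args is a hypothetical helper (not part of Open MPI), the
hostfile path and program name are placeholders, and it assumes mpirun was
built with TM support so that it finds the Torque allocation on its own
whenever PBS_NODEFILE is set.

```shell
#!/bin/sh
# Sketch: choose mpirun arguments depending on whether we are in a Torque job.
# build_mpirun_args is a hypothetical helper, not part of Open MPI.
build_mpirun_args() {
    nprocs="$1"      # number of processes to start
    hostfile="$2"    # machinefile to use outside Torque (may be empty)
    if [ -n "${PBS_NODEFILE:-}" ]; then
        # Inside a Torque job: rely on the allocation, omit -machinefile.
        printf '%s\n' "-np $nprocs"
    elif [ -n "$hostfile" ]; then
        # Outside Torque: name the nodes explicitly.
        printf '%s\n' "-np $nprocs -machinefile $hostfile"
    else
        # No allocation and no hostfile: let mpirun use its defaults.
        printf '%s\n' "-np $nprocs"
    fi
}

# Example use (commented out so the sketch has no side effects):
# mpirun $(build_mpirun_args 2 "$HOME/my_hosts") ./my_mpi_program
```

The user's script then stays the same in both environments; only the helper
decides whether the machinefile is handed to mpirun.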

In the meantime, I think we have enough info to chase down why you are
encountering this message. I'm not entirely sure we will resolve it the way
you would like, as that would conflict with how others want the two combined
options to behave (and we aren't smart enough to decide who is "right"), but
we should at least be able to generate a more intelligible error message.

Ralph