Subject: [OMPI users] oob-tcp problem, unreachable in orted_comm
From: Åke Sandgren (ake.sandgren_at_[hidden])
Date: 2009-06-06 12:18:27
I just got this in a user job.
Any idea why it complains like this?
The original error was the infamous "RETRY EXCEEDED ERROR", but instead
of killing the job, it printed the messages below and the job never died.
I have never seen this happen before.
Open MPI 1.3.2, built with Intel 10.1.
This binary is used a lot (50%+ of the system walltime). It has never
shown this specific problem before, and it rarely shows the
"Retry exceeded error" either.
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded. Can not communicate with peer
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded. Can not communicate with peer
--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake_at_[hidden] Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se