Subject: Re: [OMPI users] oob-tcp problem, unreachable in orted_comm
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-06-06 12:25:13
Yeah, I've started seeing this on clusters where the TCP stack is a
little congested. We default to trying 60 times to send a message, but
the retries happen in rapid succession, so they don't leave much time
for the congestion to clear.
Try setting -mca oob_tcp_peer_retries 1000 (or some number much bigger
than 60). This has always fixed the problem so far.
If it works, you might want to put it in the system default MCA
parameter file.
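
A minimal sketch of making the setting persistent, assuming the usual per-user MCA parameter file at $HOME/.openmpi/mca-params.conf (the system-wide file normally lives at <prefix>/etc/openmpi-mca-params.conf, where <prefix> is your install location). The value 1000 is just the one suggested above, and ./my_app is a placeholder.

```shell
# Per-user MCA parameter file (assumed default location; adjust the path
# to <prefix>/etc/openmpi-mca-params.conf for a system-wide default).
conf="$HOME/.openmpi/mca-params.conf"
mkdir -p "$(dirname "$conf")"

# Append the setting only if the parameter is not already set in the file.
if ! grep -q '^oob_tcp_peer_retries' "$conf" 2>/dev/null; then
    echo 'oob_tcp_peer_retries = 1000' >> "$conf"
fi

# One-off alternatives for a single run (./my_app is a placeholder):
#   mpirun -mca oob_tcp_peer_retries 1000 -np 4 ./my_app
#   export OMPI_MCA_oob_tcp_peer_retries=1000
```

Trying the command-line form first and only then moving the value into a parameter file keeps the change easy to back out if it does not help.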
On Jun 6, 2009, at 10:18 AM, Åke Sandgren wrote:
> Just got this in a user job.
> Any idea why it complains like this?
> The original error was the infamous "RETRY EXCEEDED ERROR" but instead
> of killing the job it showed this and never died.
> I have never seen this happen before.
>
> openmpi 1.3.2, built with intel 10.1
> This binary is used a lot (more than 50% of the system walltime) and
> has never shown this specific problem, and only rarely the "Retry
> exceeded error".
>
> [p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
> [p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
> [p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp: Communication retries exceeded. Can not communicate with peer
>
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: ake_at_[hidden] Phone: +46 90 7866134 Fax: +46 90 7866126
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users