From: Boris Bierbaum (boris_at_[hidden])
Date: 2007-05-09 10:16:44


I thought about it again: There's probably no call to dat_ep_query()
*because* it returns wrong port numbers, so the port numbers saved by
the uDAPL BTL code itself are used instead.

I'll leave the debugging to those who know the code ... ;-)

Boris

Andrew Friedley wrote:
> OK, strange but good. Yeah I wouldn't be surprised if something has
> been changed, though I wouldn't know what, and I don't have time right
> now to go digging :( Maybe Don Kerr knows something?
>
> Andrew
>
>
> Boris Bierbaum wrote:
>> I've run the whole IMB Benchmark Suite on 2, 3, and 4 nodes with 2
>> processes per node and --mca btl udapl,self. I didn't encounter any problems.
>>
>> The comment above line 197 says that dat_ep_query() returns wrong port
>> numbers (which it does indeed), but I can't find any call to
>> dat_ep_query() in the uDAPL BTL code. Maybe the comment is out of date?
>>
>> Boris
>>
>>
>> Andrew Friedley wrote:
>>> You say that fixes the problem; does it work even when running more than
>>> one MPI process per node? (That is the case the hack fixes.) Simply
>>> running mpirun with a -np parameter higher than the number of nodes you
>>> have set up should trigger this case, provided you use '-mca btl
>>> udapl,self' (i.e. not SM or anything else).
>>>
>>> Andrew
>>>
>>> Boris Bierbaum wrote:
>>>> It has been explained in a different thread on [ofa-general] that the
>>>> problem lies in a combination of the OpenIB-cma provider not setting the
>>>> local and remote port numbers on endpoints correctly and Open MPI
>>>> stepping over the IA to save the port number to circumvent this problem,
>>>> thereby confusing the provider.
>>>>
>>>> I commented out line 197 in ompi/mca/btl/udapl/btl_udapl.c (Open MPI
>>>> 1.2.1 release) and this fixes the problem. As the problem in the
>>>> provider is currently being fixed, the whole saving of the port number
>>>> in the uDAPL BTL code will be unnecessary in the future.
>>>>
>>>> Steve Wise wrote:
>>>>>>> Can the UDAPL OFED wizards shed any light on the error messages that
>>>>>>> are listed below? In particular, these seem to be worrisome:
>>>>>>>
>>>>>>>> setup_listener Permission denied
>>>>>>> setup_listener Address already in use
>>>>>> These failures are from rdma_cm_bind indicating the port is already
>>>>>> bound to this IA address. How are you creating the service point?
>>>>>> dat_psp_create or dat_psp_create_any? If it is psp_create_any then you
>>>>>> will see some failures until it gets to a free port. That is normal.
>>>>>> Just make sure your create call returns DAT_SUCCESS.
>>>>>>
>>>>> Arlin, why doesn't dapl_psp_create_any() just pass a port of zero down
>>>>> and let the rdma-cma pick an available port number?
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> general mailing list
>>>>> general_at_[hidden]
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>
>>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>
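For reference, the idea quoted above (pass a port of zero and let the
lower layer pick a free one, rather than probing ports and getting
"Address already in use") can be sketched with ordinary TCP sockets.
This is only an analogy: it does not use uDAPL or rdma_cm at all, and
the helper names bind_any_port and bind_fixed_port are made up for
illustration.

```python
import errno
import socket


def bind_any_port(host="127.0.0.1"):
    """Bind to port 0 so the OS picks a free port; return (socket, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen(1)
    return sock, sock.getsockname()[1]


def bind_fixed_port(port, host="127.0.0.1"):
    """Try one specific port; return the socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # Plain-socket counterpart of "Address already in use".
            return None
        raise
    sock.listen(1)
    return sock


# The first listener gets whatever port the OS hands out.
first, port = bind_any_port()
# Asking for that same port again fails, as in the setup_listener message.
second = bind_fixed_port(port)
# Asking for "any port" again succeeds without probing.
third, other_port = bind_any_port()
```

With port zero the caller never sees a bind failure for a busy port,
which is why the probing approach produces the (harmless) error lines
and the port-zero approach would not.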

-- 
|  _  RWTH | Boris Bierbaum
|_|_`_     | Lehrstuhl fuer Betriebssysteme
   | |_) _  | RWTH Aachen D-52056 Aachen
     |_)(_` | Tel: +49-241-80-27805
        ._) | Fax: +49-241-80-22339