From: Boris Bierbaum (boris_at_[hidden])
Date: 2007-05-08 05:37:24


Hi,

we (my colleague Andreas and I) are still trying to solve this problem.
I have compiled some additional information; maybe somebody has an idea
about what's going on.

OS: Debian GNU/Linux 4.0, Kernel 2.6.18, x86, 32-Bit
IB software: OFED 1.1
SM: OpenSM from OFED 1.1
uDAPL: DAPL reference implementation version gamma 3.02 (using DAPL from
OFED 1.1 doesn't change anything; I suppose it's roughly the same code)
Test program: Intel MPI Benchmarks Version 2.3
OpenMPI version: 1.2.1

Running OpenMPI directly over IB verbs (mpirun --mca btl self,sm,openib
...) works. Here's the output of ibv_devinfo and ifconfig for the two
nodes on which I tried to run the benchmark (ulimit -l is unlimited on
both machines):

------------ 1st node -------------------------------

boris_at_pd-04:/work/boris/IMB_2.3/src$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
        fw_ver: 1.2.0
        node_guid: 0002:c902:0020:b528
        sys_image_guid: 0002:c902:0020:b52b
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_0230000001
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 1
                        port_lid: 9
                        port_lmc: 0x00

boris_at_pd-04:/work/boris/IMB_2.3/src$ /sbin/ifconfig

...

ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.14 Bcast:192.168.0.255 Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b529/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:3752 (3.6 KiB) TX bytes:968 (968.0 b)

...

------------ 2nd node -------------------------------

boris_at_pd-05:~$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
        fw_ver: 1.2.0
        node_guid: 0002:c902:0020:b4f4
        sys_image_guid: 0002:c902:0020:b4f7
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_0230000001
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 1
                        port_lid: 10
                        port_lmc: 0x00

boris_at_pd-05:~$ /sbin/ifconfig

...

ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.15 Bcast:192.168.0.255 Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b4f5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
          RX packets:67 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:3752 (3.6 KiB) TX bytes:1088 (1.0 KiB)

...

-------------------------------------------------------------------------
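
As a sanity check on the addressing, the IPv6 link-local addresses shown
by ifconfig can be compared with the "GID ... id" value that the DAPL
open_hca debug lines report for each node. The Python sketch below does
that conversion. It assumes that the IPoIB interface ID is the port GUID
with the universal/local bit flipped (modified EUI-64 style), and the
function name ipoib_link_local_to_port_guid is my own, not part of any
OFED or DAPL tool.

```python
import ipaddress


def ipoib_link_local_to_port_guid(addr: str) -> str:
    """Return the 64-bit port GUID (16 hex digits) implied by an
    IPoIB link-local IPv6 address such as 'fe80::202:c902:20:b529/64'."""
    # Drop an optional prefix length before parsing.
    ip = ipaddress.IPv6Address(addr.split("/")[0])
    # The interface ID is the lower 64 bits of the address.
    interface_id = int(ip) & 0xFFFFFFFFFFFFFFFF
    # Assumption: the interface ID is the GUID with the universal/local
    # bit (0x02 in the first byte) flipped, so flip it back.
    guid = interface_id ^ (0x02 << 56)
    return f"{guid:016x}"


# Addresses taken from the ifconfig output of the two nodes.
for host, addr in (("pd-04", "fe80::202:c902:20:b529/64"),
                   ("pd-05", "fe80::202:c902:20:b4f5/64")):
    print(host, ipoib_link_local_to_port_guid(addr))
```

If the assumption holds, the printed values should equal the ids in the
open_hca lines of the debug output (ending in b529 and b4f5), which are
one higher than the node_guid values reported by ibv_devinfo.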

Here's the output from the failed run, with all DAT and DAPL debug
output enabled:

boris_at_pd-04:/work/boris/IMB_2.3/src$ mpirun -np 2 -x DAT_DBG_TYPE -x
DAPL_DBG_TYPE -x DAT_OVERRIDE --mca btl self,sm,udapl --host pd-04,pd-05
/work/boris/IMB_2.3/src/IMB-MPI1 pingpong
DAT Registry: Started (dat_init)
DAT Registry: static registry file
</home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>

DAT Registry: token
 type string
 value <OpenIB-cma>

DAT Registry: token
 type string
 value <u1.2>

DAT Registry: token
 type string
 value <nonthreadsafe>

DAT Registry: token
 type string
 value <default>

DAT Registry: token
 type string
 value
</home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>

DAT Registry: token
 type string
 value <mv_dapl.1.2>

DAT Registry: token
 type string
 value <ib0 0>

DAT Registry: token
 type string
 value <>

DAT Registry: token
 type eor
 value <>

DAT Registry: entry
 ia_name OpenIB-cma
 api_version
     type 0x0
     major.minor 1.2
 is_thread_safe 0
 is_default 1
 lib_path
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
 provider_version
     id mv_dapl
     major.minor 1.2
 ia_params ib0 0

DAT Registry: loading provider for OpenIB-cma

DAT Registry: token
 type eof
 value <>

DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
DAPL: NOT Setting Loopback
 dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
 open_hca: ib0 - 0x807cf28
 ib_thread_init(17919)
 ib_thread_init: waiting for ib_thread
 ib_thread(17919,0xa7b08bb0): ENTER: pipe 8 ucma 12
DAT Registry: Started (dat_init)
DAT Registry: static registry file
</home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>

DAT Registry: token
 type string
 value <OpenIB-cma>

DAT Registry: token
 type string
 value <u1.2>

DAT Registry: token
 type string
 value <nonthreadsafe>

DAT Registry: token
 type string
 value <default>

DAT Registry: token
 type string
 value
</home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>

DAT Registry: token
 type string
 value <mv_dapl.1.2>

DAT Registry: token
 type string
 value <ib0 0>

DAT Registry: token
 type string
 value <>

DAT Registry: token
 type eor
 value <>

DAT Registry: entry
 ia_name OpenIB-cma
 api_version
     type 0x0
     major.minor 1.2
 is_thread_safe 0
 is_default 1
 lib_path
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
 provider_version
     id mv_dapl
     major.minor 1.2
 ia_params ib0 0

DAT Registry: loading provider for OpenIB-cma

DAT Registry: token
 type eof
 value <>

DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
 ib_thread_init(17919) exit
DAPL: NOT Setting Loopback
 dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
 open_hca: ib0 - 0x807cf18
 ib_thread_init(12326)
 ib_thread_init: waiting for ib_thread
 ib_thread(12326,0xa7b75bb0): ENTER: pipe 8 ucma 12
 ib_thread_init(12326) exit
 getipaddr: family 2 port 0 addr 192.168.0.14
 open_hca: ctx=0x809ecd0 port=1 GID subnet fe80000000000000 id
0002c9020020b529
 open_hca: ib0, AF_INET 192.168.0.14 INLINE_MAX=128
 ib_thread(17919) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0
 ib_thread(17919) poll_fd: hca[134729592]=0xb, async=8 pipe=12 cm=13 cq=d
 query_hca: ib0 AF_INET 192.168.0.14
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
 query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
 setup_async_cb: ia 0x80a1648 type 0 hdl (nil) cb 0xa7b1ec6c ctx 0x80a16d0
 setup_async_cb: ia 0x80a1648 type 1 hdl (nil) cb 0xa7b1e9c0 ctx 0x80a16d0
 setup_async_cb: ia 0x80a1648 type 3 hdl (nil) cb 0xa7b1eb50 ctx 0x80a1648
dat_set_handle 0x80a1648 to 1
dat_get_ia_handle from 1 to 0x80a1648
 pd_alloc: pd_handle=0x80a1928
dat_get_ia_handle from 1 to 0x80a1648
 query_hca: ib0 AF_INET 192.168.0.14
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
 query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1648
 cq_object_create: (0x80a1958,0x80a1a44)
dapls_ib_cq_alloc: evd 0x80a1958 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a68 cqlen=63
 setup_async_cb: ia 0x80a1648 type 2 hdl 0x80a1958 cb 0xa7b1f174 ctx
0x80a1958
dat_get_ia_handle from 1 to 0x80a1648
dat_get_ia_handle from 1 to 0x80a1648
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Address already in use
 listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
 listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1648
 mr_register: ia=0x80a1648, lmr=0x80a3718 va=0x80ae000 ln=266240 pv=0x0
 mr_register: mr=0x80a37c8 h 4 pd 0x80a1928 ctx 0x809ecd0
lkey=0x72002700 rkey=0x72002700 priv=41000
dat_get_ia_handle from 1 to 0x80a1648
 mr_register: ia=0x80a1648, lmr=0x80a7f18 va=0x80ef000 ln=528384 pv=0x0
 mr_register: mr=0x80a7fc8 h 5 pd 0x80a1928 ctx 0x809ecd0
lkey=0xf2002800 rkey=0xf2002800 priv=81000
 getipaddr: family 2 port 0 addr 192.168.0.15
 open_hca: ctx=0x809ecc0 port=1 GID subnet fe80000000000000 id
0002c9020020b4f5
 open_hca: ib0, AF_INET 192.168.0.15 INLINE_MAX=128
 ib_thread(12326) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0
 ib_thread(12326) poll_fd: hca[134729576]=0xb, async=8 pipe=12 cm=13 cq=d
 query_hca: ib0 AF_INET 192.168.0.15
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
 query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
 setup_async_cb: ia 0x80a1638 type 0 hdl (nil) cb 0xa7b8bc6c ctx 0x80a16c0
 setup_async_cb: ia 0x80a1638 type 1 hdl (nil) cb 0xa7b8b9c0 ctx 0x80a16c0
 setup_async_cb: ia 0x80a1638 type 3 hdl (nil) cb 0xa7b8bb50 ctx 0x80a1638
dat_set_handle 0x80a1638 to 1
dat_get_ia_handle from 1 to 0x80a1638
 pd_alloc: pd_handle=0x80a1918
dat_get_ia_handle from 1 to 0x80a1638
 query_hca: ib0 AF_INET 192.168.0.15
 query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
 query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1638
 cq_object_create: (0x80a1948,0x80a1a34)
dapls_ib_cq_alloc: evd 0x80a1948 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a58 cqlen=63
 setup_async_cb: ia 0x80a1638 type 2 hdl 0x80a1948 cb 0xa7b8c174 ctx
0x80a1948
dat_get_ia_handle from 1 to 0x80a1638
dat_get_ia_handle from 1 to 0x80a1638
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Permission denied
 listen(ia_ptr 0x80a1638 SID 1024 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
 listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1638
 mr_register: ia=0x80a1638, lmr=0x80a3708 va=0x80ae000 ln=266240 pv=0x0
 mr_register: mr=0x80a37b8 h 1 pd 0x80a1918 ctx 0x809ecc0
lkey=0x60002400 rkey=0x60002400 priv=41000
dat_get_ia_handle from 1 to 0x80a1638
 mr_register: ia=0x80a1638, lmr=0x80a7ee8 va=0x80ef000 ln=528384 pv=0x0
 mr_register: mr=0x80a7f98 h 2 pd 0x80a1918 ctx 0x809ecc0
lkey=0x60002500 rkey=0x60002500 priv=81000
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date : Tue May 8 11:16:58 2007
# Machine : i686# System : Linux
# Release : 2.6.18
# Version : #1 SMP Tue Nov 14 18:02:03 CET 2006

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 16777216
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
dat_get_ia_handle from 1 to 0x80a1638
 query_hca: MAX msg 2147483648 dto 16384 iov 30 rdma i4,o4
 qp_alloc: ia_ptr 0x80a1638 ep_ptr 0x81741f8 ep_ctx_ptr 0x81741f8
 create_qp Address already in use

-------------------------------------------------------------------------

The job hangs at this point. From the output of another simple test
program I assume that it hangs inside a receive operation. Of course,
I have noticed the "Permission denied" messages, but I don't think that
the problem is there. These messages seem to come from RDMA CM when
things are set up, but execution continues from there and I have
seen these messages on successful DAPL runs, too. I'm not very familiar
with RDMA CM, though, so I don't know the cause of these messages.
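
To make the listener setup easier to compare between runs, the debug
output can be condensed with a small script. The Python sketch below
counts the setup_listener error messages and collects the SID from each
"listen(ia_ptr ...)" line. The function name summarize_listener_setup is
my own, and the sample is a shortened excerpt (two "Permission denied"
lines instead of the 24 that each process prints in the log above). One
possible reading, which is only a guess and not checked against the DAPL
source, is that the provider probes ports below 1024 that an unprivileged
user may not bind, and then settles on SID 1024 or 1025.

```python
import re
from collections import Counter


def summarize_listener_setup(log_text: str):
    """Count setup_listener error messages and collect the SIDs that
    the 'listen(ia_ptr ...)' lines report."""
    errors = Counter()
    sids = []
    for raw in log_text.splitlines():
        line = raw.strip()
        m = re.match(r"setup_listener\s+(.+)$", line)
        if m:
            errors[m.group(1)] += 1
            continue
        m = re.match(r"listen\(ia_ptr\s+\S+\s+SID\s+(\d+)", line)
        if m:
            sids.append(int(m.group(1)))
    return errors, sids


# Shortened excerpt in the format of the debug output above.
sample = """\
 setup_listener Permission denied
 setup_listener Permission denied
 setup_listener Address already in use
 listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
 listen(conn=0x80a7a70 cm_id=134904736)
"""

errors, sids = summarize_listener_setup(sample)
print(dict(errors), sids)
```

Running the same function over the output of a successful DAPL run would
show whether the error counts and SIDs differ from the failing case.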

That's a lot of information, I know, but it would be great if someone
could have a look at it.

Thanks in advance
Boris

Donald Kerr wrote:
> I have not tried Open MPI uDAPL on Linux nor do I have access to a Linux
> box so I am having a difficult time trying to find a way to help you
> debug this issue.
>
> -DON
>
> Andreas Kuntze wrote:
>
>> On Linux you needn't initialise the dat registry. Your program prints:
>> "provider 1: OpenIB-cma". I successfully tested Intel MPI and mvapich2
>> with uDAPL.
>>
>> Andreas
>>
>> Donald Kerr wrote:
>>
>>
>>> Andreas,
>>>
>>> I am going to guess at a minimum the interfaces are up and you can
>>> ping them. On Solaris there is an additional step required and that
>>> is initializing the dat registry. If "/usr/sbin/datadm -v" does not
>>> show some driver output then you would need to run "/usr/sbin/datadm
>>> -a /usr/share/dat/SUNWudaplt.conf". I don't know if there is an
>>> equivalent on Linux.
>>>
>>> Attached is a simple udapl program which will check if the interfaces
>>> are available in the dat registry.
>>>
>>> -DON
>>>
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>

-- 
|  _  RWTH | Boris Bierbaum
|_|_`_     | Lehrstuhl fuer Betriebssysteme
   | |_) _  | RWTH Aachen D-52056 Aachen
     |_)(_` | Tel: +49-241-80-27805
        ._) | Fax: +49-241-80-22339