From: Boris Bierbaum (boris_at_[hidden])
Date: 2007-05-08 05:37:24
Hi,
We (my colleague Andreas and I) are still trying to solve this problem.
I have compiled some additional information; maybe somebody has an idea
about what's going on.
OS: Debian GNU/Linux 4.0, Kernel 2.6.18, x86, 32-Bit
IB software: OFED 1.1
SM: OpenSM from OFED 1.1
uDAPL: DAPL reference implementation version gamma 3.02 (using DAPL from
OFED 1.1 doesn't change anything; I suppose it's roughly the same code)
Test program: Intel MPI Benchmarks Version 2.3
OpenMPI version: 1.2.1
Running OpenMPI directly over IB verbs (mpirun --mca btl self,sm,openib
...) works. Here's the output of ibv_devinfo and ifconfig for the two
nodes on which we tried to run the benchmark (ulimit -l is unlimited on
both machines):
------------ 1st node -------------------------------
boris_at_pd-04:/work/boris/IMB_2.3/src$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
fw_ver: 1.2.0
node_guid: 0002:c902:0020:b528
sys_image_guid: 0002:c902:0020:b52b
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_0230000001
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 9
port_lmc: 0x00
boris_at_pd-04:/work/boris/IMB_2.3/src$ /sbin/ifconfig
...
ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.0.14 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::202:c902:20:b529/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:67 errors:0 dropped:0 overruns:0 frame:0
TX packets:16 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:3752 (3.6 KiB) TX bytes:968 (968.0 b)
...
------------ 2nd node -------------------------------
boris_at_pd-05:~$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
fw_ver: 1.2.0
node_guid: 0002:c902:0020:b4f4
sys_image_guid: 0002:c902:0020:b4f7
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_0230000001
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 10
port_lmc: 0x00
boris_at_pd-05:~$ /sbin/ifconfig
...
ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.0.15 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::202:c902:20:b4f5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:67 errors:0 dropped:0 overruns:0 frame:0
TX packets:18 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:3752 (3.6 KiB) TX bytes:1088 (1.0 KiB)
...
-------------------------------------------------------------------------
Here's the output from the failed run, with all DAT and DAPL debug
output enabled:
boris_at_pd-04:/work/boris/IMB_2.3/src$ mpirun -np 2 -x DAT_DBG_TYPE -x
DAPL_DBG_TYPE -x DAT_OVERRIDE --mca btl self,sm,udapl --host pd-04,pd-05
/work/boris/IMB_2.3/src/IMB-MPI1 pingpong
DAT Registry: Started (dat_init)
DAT Registry: static registry file
</home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
DAT Registry: token
type string
value <OpenIB-cma>
DAT Registry: token
type string
value <u1.2>
DAT Registry: token
type string
value <nonthreadsafe>
DAT Registry: token
type string
value <default>
DAT Registry: token
type string
value
</home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>
DAT Registry: token
type string
value <mv_dapl.1.2>
DAT Registry: token
type string
value <ib0 0>
DAT Registry: token
type string
value <>
DAT Registry: token
type eor
value <>
DAT Registry: entry
ia_name OpenIB-cma
api_version
type 0x0
major.minor 1.2
is_thread_safe 0
is_default 1
lib_path
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
provider_version
id mv_dapl
major.minor 1.2
ia_params ib0 0
DAT Registry: loading provider for OpenIB-cma
DAT Registry: token
type eof
value <>
DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
DAPL: NOT Setting Loopback
dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
open_hca: ib0 - 0x807cf28
ib_thread_init(17919)
ib_thread_init: waiting for ib_thread
ib_thread(17919,0xa7b08bb0): ENTER: pipe 8 ucma 12
DAT Registry: Started (dat_init)
DAT Registry: static registry file
</home/boris/dapl_on_dope_gamma3.2/doc/dat.conf>
DAT Registry: token
type string
value <OpenIB-cma>
DAT Registry: token
type string
value <u1.2>
DAT Registry: token
type string
value <nonthreadsafe>
DAT Registry: token
type string
value <default>
DAT Registry: token
type string
value
</home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so>
DAT Registry: token
type string
value <mv_dapl.1.2>
DAT Registry: token
type string
value <ib0 0>
DAT Registry: token
type string
value <>
DAT Registry: token
type eor
value <>
DAT Registry: entry
ia_name OpenIB-cma
api_version
type 0x0
major.minor 1.2
is_thread_safe 0
is_default 1
lib_path
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
provider_version
id mv_dapl
major.minor 1.2
ia_params ib0 0
DAT Registry: loading provider for OpenIB-cma
DAT Registry: token
type eof
value <>
DAT Registry: dat_registry_list_providers () called
DAT Registry: dat_ia_openv (OpenIB-cma,1:2,0) called
DAT Registry: IA OpenIB-cma, trying to load library
/home/boris/dapl_on_dope_gamma3.2/dapl/udapl/Target/i686/libdapl_openib_cma.so
ib_thread_init(17919) exit
DAPL: NOT Setting Loopback
dapl_ib_init:
DAT Registry: dat_registry_add_provider (OpenIB-cma,1:2,0)
open_hca: ib0 - 0x807cf18
ib_thread_init(12326)
ib_thread_init: waiting for ib_thread
ib_thread(12326,0xa7b75bb0): ENTER: pipe 8 ucma 12
ib_thread_init(12326) exit
getipaddr: family 2 port 0 addr 192.168.0.14
open_hca: ctx=0x809ecd0 port=1 GID subnet fe80000000000000 id
0002c9020020b529
open_hca: ib0, AF_INET 192.168.0.14 INLINE_MAX=128
ib_thread(17919) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0
ib_thread(17919) poll_fd: hca[134729592]=0xb, async=8 pipe=12 cm=13 cq=d
query_hca: ib0 AF_INET 192.168.0.14
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
setup_async_cb: ia 0x80a1648 type 0 hdl (nil) cb 0xa7b1ec6c ctx 0x80a16d0
setup_async_cb: ia 0x80a1648 type 1 hdl (nil) cb 0xa7b1e9c0 ctx 0x80a16d0
setup_async_cb: ia 0x80a1648 type 3 hdl (nil) cb 0xa7b1eb50 ctx 0x80a1648
dat_set_handle 0x80a1648 to 1
dat_get_ia_handle from 1 to 0x80a1648
pd_alloc: pd_handle=0x80a1928
dat_get_ia_handle from 1 to 0x80a1648
query_hca: ib0 AF_INET 192.168.0.14
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1648
cq_object_create: (0x80a1958,0x80a1a44)
dapls_ib_cq_alloc: evd 0x80a1958 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a68 cqlen=63
setup_async_cb: ia 0x80a1648 type 2 hdl 0x80a1958 cb 0xa7b1f174 ctx
0x80a1958
dat_get_ia_handle from 1 to 0x80a1648
dat_get_ia_handle from 1 to 0x80a1648
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Address already in use
listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1648
mr_register: ia=0x80a1648, lmr=0x80a3718 va=0x80ae000 ln=266240 pv=0x0
mr_register: mr=0x80a37c8 h 4 pd 0x80a1928 ctx 0x809ecd0
lkey=0x72002700 rkey=0x72002700 priv=41000
dat_get_ia_handle from 1 to 0x80a1648
mr_register: ia=0x80a1648, lmr=0x80a7f18 va=0x80ef000 ln=528384 pv=0x0
mr_register: mr=0x80a7fc8 h 5 pd 0x80a1928 ctx 0x809ecd0
lkey=0xf2002800 rkey=0xf2002800 priv=81000
getipaddr: family 2 port 0 addr 192.168.0.15
open_hca: ctx=0x809ecc0 port=1 GID subnet fe80000000000000 id
0002c9020020b4f5
open_hca: ib0, AF_INET 192.168.0.15 INLINE_MAX=128
ib_thread(12326) poll_event: async=0x1 pipe=0x1 cm=0x0 cq=0x0
ib_thread(12326) poll_fd: hca[134729576]=0xb, async=8 pipe=12 cm=13 cq=d
query_hca: ib0 AF_INET 192.168.0.15
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
setup_async_cb: ia 0x80a1638 type 0 hdl (nil) cb 0xa7b8bc6c ctx 0x80a16c0
setup_async_cb: ia 0x80a1638 type 1 hdl (nil) cb 0xa7b8b9c0 ctx 0x80a16c0
setup_async_cb: ia 0x80a1638 type 3 hdl (nil) cb 0xa7b8bb50 ctx 0x80a1638
dat_set_handle 0x80a1638 to 1
dat_get_ia_handle from 1 to 0x80a1638
pd_alloc: pd_handle=0x80a1918
dat_get_ia_handle from 1 to 0x80a1638
query_hca: ib0 AF_INET 192.168.0.15
query_hca: (ver=a0) ep 64512 ep_q 16384 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 30 lmr 131056 rmr 0 rd_io 4
dat_get_ia_handle from 1 to 0x80a1638
cq_object_create: (0x80a1948,0x80a1a34)
dapls_ib_cq_alloc: evd 0x80a1948 cqlen=32
dapls_ib_cq_alloc: new_cq 0x80a1a58 cqlen=63
setup_async_cb: ia 0x80a1638 type 2 hdl 0x80a1948 cb 0xa7b8c174 ctx
0x80a1948
dat_get_ia_handle from 1 to 0x80a1638
dat_get_ia_handle from 1 to 0x80a1638
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
setup_listener Permission denied
listen(ia_ptr 0x80a1638 SID 1024 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
listen(conn=0x80a7a70 cm_id=134904736)
dat_get_ia_handle from 1 to 0x80a1638
mr_register: ia=0x80a1638, lmr=0x80a3708 va=0x80ae000 ln=266240 pv=0x0
mr_register: mr=0x80a37b8 h 1 pd 0x80a1918 ctx 0x809ecc0
lkey=0x60002400 rkey=0x60002400 priv=41000
dat_get_ia_handle from 1 to 0x80a1638
mr_register: ia=0x80a1638, lmr=0x80a7ee8 va=0x80ef000 ln=528384 pv=0x0
mr_register: mr=0x80a7f98 h 2 pd 0x80a1918 ctx 0x809ecc0
lkey=0x60002500 rkey=0x60002500 priv=81000
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date : Tue May 8 11:16:58 2007
# Machine : i686
# System : Linux
# Release : 2.6.18
# Version : #1 SMP Tue Nov 14 18:02:03 CET 2006
#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 16777216
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
dat_get_ia_handle from 1 to 0x80a1638
query_hca: MAX msg 2147483648 dto 16384 iov 30 rdma i4,o4
qp_alloc: ia_ptr 0x80a1638 ep_ptr 0x81741f8 ep_ctx_ptr 0x81741f8
create_qp Address already in use
-------------------------------------------------------------------------
The job hangs at this point. From the output of another simple test
program I assume that it hangs inside a receive operation. Of course,
I have noticed the "Permission denied" messages, but I don't think that
the problem is there. These messages seem to come from RDMA CM while
things are being set up, but execution continues from there on, and I
have seen these messages on successful DAPL runs, too. I'm not very
familiar with RDMA CM, though, so I don't know the cause of these messages.
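To make that observation easier to check on other runs, here is a minimal Python sketch that tallies the setup_listener messages and picks out the SID each process ends up listening on. The function name and the regular expressions are just my own illustration, matched against the line format shown in the output above; they are not part of DAPL or Open MPI. In the run above each process logs 24 "Permission denied" lines before listening on SID 1024 or 1025, which would fit a listener that probes service IDs below 1024 as an unprivileged user, but that is only a guess on my part.

```python
import re
from collections import Counter

def summarize_listener_setup(log_text):
    """Count setup_listener errors and collect SIDs from 'listen(ia_ptr ...)' lines.

    Hypothetical helper: the patterns only assume the line format seen in
    the debug output quoted in this mail.
    """
    errors = Counter()
    sids = []
    for line in log_text.splitlines():
        line = line.strip()
        m = re.match(r"setup_listener\s+(.+)", line)
        if m:
            errors[m.group(1)] += 1
            continue
        m = re.match(r"listen\(ia_ptr\s+\S+\s+SID\s+(\d+)", line)
        if m:
            sids.append(int(m.group(1)))
    return errors, sids

# Abbreviated sample in the format of the debug output above
sample = """\
setup_listener Permission denied
setup_listener Permission denied
setup_listener Address already in use
listen(ia_ptr 0x80a1648 SID 1025 sp 0x80a7a00 conn 0x80a7a70 id 134904736)
listen(conn=0x80a7a70 cm_id=134904736)
"""

errors, sids = summarize_listener_setup(sample)
print(dict(errors), sids)
```

Feeding it the complete mpirun output instead of the abbreviated sample gives one SID per process, so it is easy to see whether every rank got a listener despite the error messages.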
That's a lot of information, I know, but it would be great if someone
could have a look at it.
Thanks in advance
Boris
Donald Kerr wrote:
> I have not tried Open MPI uDAPL on Linux nor do I have access to a Linux
> box so I am having a difficult time trying to find a way to help you
> debug this issue.
>
> -DON
>
> Andreas Kuntze wrote:
>
>> On Linux you needn't initialise the dat registry. Your program prints:
>> "provider 1: OpenIB-cma". I successfully tested Intel MPI and mvapich2
>> with uDAPL.
>>
>> Andreas
>>
>> Donald Kerr wrote:
>>
>>
>>> Andreas,
>>>
>>> I am going to guess at a minimum the interfaces are up and you can
>>> ping them. On Solaris there is an additional step required and that
>>> is initializing the dat registry. If "/usr/sbin/datadm -v" does not
>>> show some driver output then you would need to run "/usr/sbin/datadm
>>> -a /usr/share/dat/SUNWudaplt.conf". I don't know if there is an
>>> equivalent on Linux.
>>>
>>> Attached is a simple udapl program which will check if the interfaces
>>> are available in the dat registry.
>>>
>>> -DON
>>>
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
--
Boris Bierbaum
Lehrstuhl fuer Betriebssysteme
RWTH Aachen D-52056 Aachen
Tel: +49-241-80-27805
Fax: +49-241-80-22339