Subject: Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads
From: Scott Atchley (atchley_at_[hidden])
Date: 2009-06-11 09:35:25
Francois,
For threads, the FAQ has:
http://www.open-mpi.org/faq/?category=supported-systems#thread-support
It mentions that thread support is designed in, but lightly tested. It
is also possible that the FAQ is out of date and MPI_THREAD_MULTIPLE
is fully supported.
The stack trace below shows the following calls, innermost frame first:
opal_free()
opal_progress()
MPI_Recv()
I do not know this code, but the problem may lie in the higher-level code
that calls the BTLs and/or MTLs. That would be the place to check whether
it handles the TCP BTL differently from the MX BTL/MTL.
MX is thread-safe, with the caveat that two threads may not try to
complete the same request at the same time. This includes calls to
mx_test(), mx_wait(), mx_test_any() and/or mx_wait_any(); the latter
two take match bits and a match mask, so they could complete a request
that another thread is testing or waiting on.
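To make the caveat above concrete, here is a minimal C sketch. It is not MX or Open MPI code: the `sketch_*` names and the request struct are hypothetical, and the matching rule (compare only the bits selected by the mask) is an assumption about how match bits and a match mask typically behave. It shows how a wildcard "test_any" issued by one thread can pick up a request that another thread posted with its own tag, and how an atomic claim lets only one caller complete each request. Completing a request twice is the kind of double completion that the `req_pml_complete` assertion in the quoted backtrace appears to check for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request record for illustration only; this is NOT the
 * MX or Open MPI request structure. */
typedef struct {
    uint64_t match_info;    /* e.g. the MPI tag of the owning thread */
    atomic_bool completed;  /* becomes true exactly once */
} sketch_request;

void sketch_request_init(sketch_request *req, uint64_t match_info)
{
    req->match_info = match_info;
    atomic_init(&req->completed, false);
}

/* Assumed mask-based matching: only the bits selected by mask are
 * compared, so a mask of 0 acts as a wildcard matching every request. */
bool sketch_matches(uint64_t match_info, uint64_t bits, uint64_t mask)
{
    return (match_info & mask) == (bits & mask);
}

/* Atomically claim the right to complete a request. Only the first
 * caller (from any thread) gets true; every later caller gets false. */
bool sketch_claim(sketch_request *req)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&req->completed, &expected, true);
}

/* "test_any"-style scan: claims and returns the first matching request
 * that nobody has completed yet, or NULL if there is none. */
sketch_request *sketch_test_any(sketch_request *reqs, size_t n,
                                uint64_t bits, uint64_t mask)
{
    for (size_t i = 0; i < n; i++) {
        if (sketch_matches(reqs[i].match_info, bits, mask) &&
            sketch_claim(&reqs[i]))
            return &reqs[i];
    }
    return NULL;
}
```

In the scenario from the original report, thread #1 scanning with a zero mask would complete the request thread #2 posted with tag 2; with the claim in place, thread #2's own completion attempt simply fails instead of completing the request a second time. Using an exact mask (all bits set, e.g. `UINT64_MAX`) keeps each thread to its own tag.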
Scott
On Jun 11, 2009, at 6:00 AM, François Trahay wrote:
> Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php),
> threads are supported in OpenMPI.
> The program I am trying to run works with the TCP stack, and the MX driver
> is thread-safe, so I guess the problem comes from the MX BTL or MTL.
>
> Francois
>
>
> Scott Atchley wrote:
>> Hi Francois,
>>
>> I am not familiar with the internals of the OMPI code. Are you
>> sure, however, that threads are fully supported yet? I was under
>> the impression that thread support was still partial.
>>
>> Can anyone else comment?
>>
>> Scott
>>
>> On Jun 8, 2009, at 8:43 AM, François Trahay wrote:
>>
>>> Hi,
>>> I'm encountering some issues when running a multithreaded program with
>>> OpenMPI (trunk rev. 21380, configured with --enable-mpi-threads).
>>> My program (included in the tar.bz2) uses several pthreads that perform
>>> ping-pongs concurrently (thread #1 uses tag #1, thread #2 uses tag #2, etc.).
>>> This program crashes over MX (either the BTL or the MTL) with the following
>>> backtrace:
>>>
>>> concurrent_ping_v2: pml_cm_recvreq.c:53: mca_pml_cm_recv_request_completion: Assertion `0 == ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete' failed.
>>> [joe0:01709] *** Process received signal ***
>>> [joe0:01709] *** Process received signal ***
>>> [joe0:01709] Signal: Segmentation fault (11)
>>> [joe0:01709] Signal code: Address not mapped (1)
>>> [joe0:01709] Failing at address: 0x1238949c4
>>> [joe0:01709] Signal: Aborted (6)
>>> [joe0:01709] Signal code: (-6)
>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>> [joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
>>> [joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
>>> [joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>> [joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
>>> [joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
>>> [joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>> [joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
>>> [joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
>>> [joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>> [joe0:01709] *** End of error message ***
>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208120bb]
>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b80a]
>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081147a]
>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>> [joe0:01709] [10] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>> [joe0:01709] [11] ./concurrent_ping_v2(client+0x123) [0x401404]
>>> [joe0:01709] [12] /lib/libpthread.so.0 [0x7f57240b6faa]
>>> [joe0:01709] [13] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>> [joe0:01709] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 1709 on node joe0 exited on
>>> signal 6 (Aborted).
>>> --------------------------------------------------------------------------
>>>
>>>
>>> Any ideas?
>>>
>>> Francois Trahay
>>>
>>> <bug-report.tar.bz2>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>