Subject: Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads
From: François Trahay (francois.trahay_at_[hidden])
Date: 2009-06-11 14:20:15
The stack trace is from the MX MTL (I attach the backtraces I get with
both MX MTL and MX BTL)
Here is the program that I use. It is quite simple: it runs ping-pongs
concurrently (first with one thread per node, then with two threads per
node, and so on).
The error occurs when two threads run concurrently.
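
In case it is useful, here is a minimal sketch of what the test does,
assuming two ranks and MPI_THREAD_MULTIPLE; it is not the attached
concurrent_ping_v2.c itself, and the thread count, tags and message size
below are only illustrative:

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2      /* concurrent ping-pong threads per node */
#define NITER    1000   /* ping-pongs per thread */

static int rank;

/* Each thread runs its own ping-pong and uses its thread id as the MPI
 * tag, so no two threads ever touch the same message. */
static void *pingpong(void *arg)
{
    int tag = (int)(long)arg;
    int peer = (rank == 0) ? 1 : 0;
    char buf[1024];

    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, peer, tag, MPI_COMM_WORLD);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t th[NTHREADS];

    /* Several threads call MPI concurrently, so MPI_THREAD_MULTIPLE is needed. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The crash shows up as soon as two such threads run at the same time. */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, pingpong, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    MPI_Finalize();
    return 0;
}

(Built with mpicc and run with 2 processes over MX.)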
Francois
Scott Atchley wrote:
> Brian and George,
>
> I do not know if the stack trace is complete, but I do not see any
> mx_* functions called, which would have indicated a crash inside MX due
> to multiple threads trying to complete the same request. It does show a
> failed assertion.
>
> Francois, is the stack trace from the MX MTL or BTL? Can you send a
> small program that reproduces this abort?
>
> Scott
>
>
> On Jun 11, 2009, at 12:25 PM, Brian Barrett wrote:
>
>> Neither the CM PML nor the MX MTL has been looked at for thread
>> safety. There's not much code in the CM PML to cause problems. The
>> MX MTL would likely need some work to ensure that the restrictions Scott
>> mentioned are met (currently, there is no such guarantee in the MX MTL).
>>
>> Brian
>>
>> On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:
>>
>>> The comment in the FAQ (and on the other thread) is only true for
>>> some BTLs (TCP, SM, and MX). I don't have the resources to test the
>>> other BTLs; it is their developers' responsibility to make the
>>> required modifications to make them thread safe.
>>>
>>> In addition, I have to confess that I never tested the MTL for
>>> thread safety. It is a completely different implementation of
>>> message passing, supposed to map directly onto the underlying
>>> network capabilities. However, there are clearly a few places where
>>> thread safety should be enforced in the MTL layer, and I don't know
>>> whether this is the case.
>>>
>>> george.
>>>
>>> On Jun 11, 2009, at 09:35 , Scott Atchley wrote:
>>>
>>>> Francois,
>>>>
>>>> For threads, the FAQ has:
>>>>
>>>> http://www.open-mpi.org/faq/?category=supported-systems#thread-support
>>>>
>>>> It mentions that thread support is designed in, but lightly tested.
>>>> It is also possible that the FAQ is out of date and
>>>> MPI_THREAD_MULTIPLE is fully supported.
>>>>
>>>> The stack trace below shows:
>>>>
>>>> opal_free()
>>>> opal_progress()
>>>> MPI_Recv()
>>>>
>>>> I do not know this code, but the problem may be in the higher-level
>>>> code that calls the BTLs and/or MTLs; that would be a place to check
>>>> whether it handles the TCP BTL differently than the MX BTL/MTL.
>>>>
>>>> MX is thread safe, with the caveat that two threads may not try to
>>>> complete the same request at the same time. This includes calling
>>>> mx_test(), mx_wait(), mx_test_any() and/or mx_wait_any(), where the
>>>> latter two take match bits and a match mask that could complete a
>>>> request being tested or waited on by another thread.
>>>>
>>>> Scott
>>>>
>>>> On Jun 11, 2009, at 6:00 AM, François Trahay wrote:
>>>>
>>>>> Well, according to George Bosilca
>>>>> (http://www.open-mpi.org/community/lists/users/2005/02/0005.php),
>>>>> threads are supported in OpenMPI.
>>>>> The program I am trying to run works with the TCP stack, and the MX
>>>>> driver is thread-safe, so I guess the problem comes from the MX BTL
>>>>> or MTL.
>>>>>
>>>>> Francois
>>>>>
>>>>>
>>>>> Scott Atchley wrote:
>>>>>> Hi Francois,
>>>>>>
>>>>>> I am not familiar with the internals of the OMPI code. Are you
>>>>>> sure, however, that threads are fully supported yet? I was under
>>>>>> the impression that thread support was still partial.
>>>>>>
>>>>>> Can anyone else comment?
>>>>>>
>>>>>> Scott
>>>>>>
>>>>>> On Jun 8, 2009, at 8:43 AM, François Trahay wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I'm encountering some issues when running a multithreaded program
>>>>>>> with OpenMPI (trunk rev. 21380, configured with
>>>>>>> --enable-mpi-threads).
>>>>>>> My program (included in the tar.bz2) uses several pthreads that
>>>>>>> perform ping-pongs concurrently (thread #1 uses tag #1, thread #2
>>>>>>> uses tag #2, etc.).
>>>>>>> This program crashes over MX (either BTL or MTL) with the following
>>>>>>> backtrace:
>>>>>>>
>>>>>>> concurrent_ping_v2: pml_cm_recvreq.c:53: mca_pml_cm_recv_request_completion:
>>>>>>> Assertion `0 == ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete' failed.
>>>>>>> [joe0:01709] *** Process received signal ***
>>>>>>> [joe0:01709] *** Process received signal ***
>>>>>>> [joe0:01709] Signal: Segmentation fault (11)
>>>>>>> [joe0:01709] Signal code: Address not mapped (1)
>>>>>>> [joe0:01709] Failing at address: 0x1238949c4
>>>>>>> [joe0:01709] Signal: Aborted (6)
>>>>>>> [joe0:01709] Signal code: (-6)
>>>>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>>>>> [joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
>>>>>>> [joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
>>>>>>> [joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
>>>>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>>>>> [joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
>>>>>>> [joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
>>>>>>> [joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
>>>>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
>>>>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
>>>>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>>>>> [joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>>>>> [joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>>>>> [joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>>>>> [joe0:01709] *** End of error message ***
>>>>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208120bb]
>>>>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b80a]
>>>>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081147a]
>>>>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>>>>> [joe0:01709] [10] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>>>>> [joe0:01709] [11] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>>>>> [joe0:01709] [12] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>>>>> [joe0:01709] [13] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>>>>> [joe0:01709] *** End of error message ***
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 1 with PID 1709 on node joe0 exited on
>>>>>>> signal 6 (Aborted).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Any idea?
>>>>>>>
>>>>>>> Francois Trahay
>>>>>>>
>>>>>>> <bug-report.tar.bz2>