From: Marcin Skoczylas (Marcin.Skoczylas_at_[hidden])
Date: 2007-05-31 10:39:41
Sorry, I completely forgot to mention: Open MPI 1.2.2, on Intel x86.
The segmentation fault is still there, but I'm waiting for my admin to fix
the routing so I can check whether that was the problem. However, on my
stand-alone Linux machine with correct routing it does not appear - but
that's not proof.
The segmentation fault in MPI_Barrier occurs only when I run my program
on the head node alone, e.g.:

mpirun -np 4 ./myproggy

If I also use the worker nodes:

mpirun -hostfile ./../hosts -np 50 ./myproggy

I get an error in MPI_Init instead:
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort (...)
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
So with the hostfile and worker nodes the behaviour is correct - it fails
right at the beginning...

But I wouldn't be too enthusiastic, as I have a lot of strange and
complicated code before the MPI_Barrier, which I updated recently - it
could be my own mistake that corrupted some memory used by Open MPI
(valgrind does not complain). Who knows... I'll wait for my admin to fix
the routing and post more information as soon as possible. Also, since I
recently upgraded Open MPI from a 1.x.x version, maybe some libraries got
mixed up, or other things are wrong.
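
In the meantime, to separate my own code from Open MPI itself, I can run
a minimal barrier-only test (a sketch in plain MPI C, nothing from my
actual program) - if this also segfaults on the head node, the problem
is not in my memory handling:

/* barrier_test.c - minimal MPI_Barrier test (sketch).
 * Build: mpicc barrier_test.c -o barrier_test
 * Run:   mpirun -np 4 ./barrier_test */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* No user code runs before the barrier, so a crash here would
     * point at Open MPI / the node setup, not at the application. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d of %d passed the barrier\n", rank, size);
    MPI_Finalize();
    return 0;
}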
greets, Marcin
Jelena Pjesivac-Grbovic wrote:
> Hi Marcin,
>
> What version of Open MPI did you use?
> Is it still occurring?
> It is also possible that the connection went down during execution...
> although a segfault really should not occur.
>
> Thanks,
> Jelena
>
> On Tue, 29 May 2007, Marcin Skoczylas wrote:
>
>
>> Hello,
>>
>> Recently my administrator made some changes on our cluster, and now I
>> get a crash during MPI_Barrier:
>>
>> [our-host:12566] *** Process received signal ***
>> [our-host:12566] Signal: Segmentation fault (11)
>> [our-host:12566] Signal code: Address not mapped (1)
>> [our-host:12566] Failing at address: 0x4
>> [our-host:12566] [ 0] /lib/tls/libpthread.so.0 [0xa22f80]
>> [our-host:12566] [ 1]
>> /usr/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x68f)
>> [0xcd86d7]
>> [our-host:12566] [ 2]
>> /usr/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x32) [0xcb7e3a]
>> [our-host:12566] [ 3] /usr/lib/libopen-pal.so.0(opal_progress+0xed)
>> [0xc2b221]
>> [our-host:12566] [ 4] /usr/lib/libmpi.so.0 [0x3aecc5]
>> [our-host:12566] [ 5] /usr/lib/libmpi.so.0(ompi_request_wait_all+0xec)
>> [0x3ae784]
>> [our-host:12566] [ 6]
>> /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x77)
>> [0xd025bb]
>> [our-host:12566] [ 7]
>> /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xde)
>> [0xd05e3a]
>> [our-host:12566] [ 8]
>> /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x44)
>> [0xd027d8]
>> [our-host:12566] [ 9] /usr/lib/libmpi.so.0(PMPI_Barrier+0x176) [0x3c0cea]
>>
>> Actually, I did a small investigation and realised that:
>>
>> [user_at_our-host]$ ssh our-host
>> ssh(12704) ssh: connect to host our-host port 22: No route to host
>>
>> That could be the cause; I'm going to talk with my admin soon about this
>> routing change. However, if it really is this problem, shouldn't it be
>> recognised during startup, e.g. in MPI_Init? Actually, I'm not sure...
>> Your comments?
>>
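>> (For reference, this is roughly how I'd check MPI_Init's result
>> explicitly - just a sketch, and note that the default
>> MPI_ERRORS_ARE_FATAL handler usually aborts before such a check is
>> ever reached:)
>>
>> /* init_check.c - sketch: check MPI_Init's return code explicitly. */
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     char msg[MPI_MAX_ERROR_STRING];
>>     int rc, len;
>>
>>     rc = MPI_Init(&argc, &argv);
>>     if (rc != MPI_SUCCESS) {
>>         /* With the default error handler Open MPI aborts first,
>>          * so this branch may never run. */
>>         MPI_Error_string(rc, msg, &len);
>>         fprintf(stderr, "MPI_Init failed: %s\n", msg);
>>         return EXIT_FAILURE;
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>>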
>> greetings, Marcin
>>
>
> --
> Jelena Pjesivac-Grbovic, Pjesa
> Graduate Research Assistant
> Innovative Computing Laboratory
> Computer Science Department, UTK
> Claxton Complex 350
> (865) 974 - 6722
> (865) 974 - 6321
> jpjesiva_at_[hidden]
>
> Murphy's Law of Research:
> Enough research will tend to support your theory.