Subject: Re: [OMPI users] After upgrading to 1.3.2 some nodes hang on MPI-Applications
From: jody (jody.xha_at_[hidden])
Date: 2009-06-11 06:56:38
More info:
I checked and found that not all nodes are equal:
the ones that don't work have mpi-threads *and* progress-threads enabled,
whereas the ones that work have only mpi-threads enabled.
Is there a problem when both thread types are enabled?
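(In case someone wants to reproduce the check: whether a build has progress
threads compiled in can be read straight from ompi_info - running
"ompi_info | grep -i thread" should print a line along the lines of
"Thread support: posix (mpi: yes, progress: yes)", assuming the ompi_info
from the 1.3.2 install is the one picked up in the PATH.)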
Jody
On Thu, Jun 11, 2009 at 12:19 PM, jody<jody.xha_at_[hidden]> wrote:
> Hi
>
> After updating all my nodes to Open MPI 1.3.2 (with --enable-mpi-threads),
> some of them fail to execute a simple MPI test program - they seem to hang.
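> (MPITest itself is nothing special - just the usual init / rank / size /
> print / finalize pattern. A minimal sketch of that kind of test, matching
> the "I am #x/y" lines in the output below - not my actual source, just the
> shape of it - would be:)
>
>   #include <stdio.h>
>   #include <unistd.h>
>   #include <mpi.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, size;
>       char host[256];
>
>       /* sketch of a minimal test along the lines of MPITest */
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
>       MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
>       gethostname(host, sizeof(host));
>       printf("[%s]I am #%d/%d\n", host, rank, size);
>       MPI_Finalize();
>       return 0;
>   }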
> With --debug-daemons the application seems to execute (two lines of
> output appear) but hangs before returning:
>
> [jody_at_aplankton neander]$ mpirun -np 2 --host nano_06 --debug-daemons ./MPITest
> Daemon was launched on nano_06 - beginning to initialize
> Daemon [[44301,0],1] checking in as pid 5166 on host nano_06
> Daemon [[44301,0],1] not using static ports
> [nano_06:05166] [[44301,0],1] orted: up and running - waiting for commands!
> [plankton:23859] [[44301,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23859] [[44301,0],0] node[1].name nano_06 daemon 1 arch ffca0200
> [plankton:23859] [[44301,0],0] orted_cmd: received add_local_procs
> [nano_06:05166] [[44301,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_06:05166] [[44301,0],1] node[1].name nano_06 daemon 1 arch ffca0200
> [nano_06:05166] [[44301,0],1] orted_cmd: received add_local_procs
> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from
> local proc [[44301,1],0]
> [nano_06:05166] [[44301,0],1] orted_recv: received sync+nidmap from
> local proc [[44301,1],1]
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06]I am #0/2
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06]I am #1/2
> [plankton:23859] [[44301,0],0] orted_cmd: received collective data cmd
> [plankton:23859] [[44301,0],0] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_cmd: received collective data cmd
> [nano_06:05166] [[44301,0],1] orted_cmd: received message_local_procs
> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local
> proc [[44301,1],1]
> [nano_06:05166] [[44301,0],1] orted_recv: received sync from local
> proc [[44301,1],0]
> (Here it hangs)
>
> Some don't even get to execute:
> [jody_at_plankton neander]$ mpirun -np 2 --host nano_01 --debug-daemons ./MPITest
> Daemon was launched on nano_01 - beginning to initialize
> Daemon [[44293,0],1] checking in as pid 5044 on host nano_01
> Daemon [[44293,0],1] not using static ports
> [nano_01:05044] [[44293,0],1] orted: up and running - waiting for commands!
> [plankton:23867] [[44293,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23867] [[44293,0],0] node[1].name nano_01 daemon 1 arch ffca0200
> [plankton:23867] [[44293,0],0] orted_cmd: received add_local_procs
> [nano_01:05044] [[44293,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_01:05044] [[44293,0],1] node[1].name nano_01 daemon 1 arch ffca0200
> [nano_01:05044] [[44293,0],1] orted_cmd: received add_local_procs
> [nano_01:05044] [[44293,0],1] orted_recv: received sync+nidmap from
> local proc [[44293,1],0]
> [nano_01:05044] [[44293,0],1] orted_cmd: received collective data cmd
> (Here it hangs)
>
> When I run on one of the bad nodes with only 1 processor and --debug-daemons,
> it works fine (output & clean completion), but without --debug-daemons it hangs.
> Sometimes there is even a crash (not always reproducible):
>
> [jody_at_plankton neander]$ mpirun -np 1 --host nano_04 --debug-daemons ./MPITest
> Daemon was launched on nano_04 - beginning to initialize
> Daemon [[44431,0],1] checking in as pid 5333 on host nano_04
> Daemon [[44431,0],1] not using static ports
> [plankton:23985] [[44431,0],0] node[0].name plankton daemon 0 arch ffca0200
> [plankton:23985] [[44431,0],0] node[1].name nano_04 daemon 1 arch ffca0200
> [plankton:23985] [[44431,0],0] orted_cmd: received add_local_procs
> [nano_04:05333] [[44431,0],1] orted: up and running - waiting for commands!
> [nano_04:05333] [[44431,0],1] node[0].name plankton daemon 0 arch ffca0200
> [nano_04:05333] [[44431,0],1] node[1].name nano_04 daemon 1 arch ffca0200
> [nano_04:05333] [[44431,0],1] orted_cmd: received add_local_procs
> [nano_04:05333] [[44431,0],1] orted_recv: received sync+nidmap from
> local proc [[44431,1],0]
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received collective data cmd
> [nano_04]I am #0/1
> [plankton:23985] [[44431,0],0] orted_cmd: received collective data cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_cmd: received message_local_procs
> [nano_04:05333] [[44431,0],1] orted_recv: received sync from local
> proc [[44431,1],0]
> [nano_04:05333] [[44431,0],1] orted_cmd: received iof_complete cmd
> [nano_04:05333] [[44431,0],1] orted_cmd: received waitpid_fired cmd
> [plankton:23985] [[44431,0],0] orted_cmd: received exit
> [nano_04:05333] [[44431,0],1] orted_cmd: received exit
> [nano_04:05333] [[44431,0],1] orted: finalizing
> [nano_04:05333] *** Process received signal ***
> [nano_04:05333] Signal: Segmentation fault (11)
> [nano_04:05333] Signal code: Address not mapped (1)
> [nano_04:05333] Failing at address: 0xb7493e20
> [nano_04:05333] [ 0] [0xffffe40c]
> [nano_04:05333] [ 1]
> /opt/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xb7e65417]
> [nano_04:05333] [ 2]
> /opt/openmpi/lib/libopen-pal.so.0(opal_event_dispatch+0x1e)
> [0xb7e6543e]
> [nano_04:05333] [ 3]
> /opt/openmpi/lib/libopen-rte.so.0(orte_daemon+0x761) [0xb7ed3d71]
> [nano_04:05333] [ 4] orted [0x80487b4]
> [nano_04:05333] [ 5] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7cc060c]
> [nano_04:05333] [ 6] orted [0x8048691]
> [nano_04:05333] *** End of error message ***
>
> Is that perhaps a consequence of configuring with --enable-mpi-threads
> and --enable-progress-threads?
>
> Thank You
> Jody
>