From: Luis Kornblueh (luis.kornblueh_at_[hidden])
Date: 2007-05-08 03:27:02


Hi everybody,

we've got some problems on our cluster with Open MPI versions 1.2 and
upward.

The following setup does work:

openmpi-1.2b3: SLES 9 SP3 with gcc/g++ 4.1.1 and PGI f95 6.1-1

The following two setups give a SIGSEGV in mpiexec (stack trace below):

openmpi-1.2: SLES 9 SP3 with gcc/g++ 4.1.1 and PGI f95 6.1-1
openmpi-1.2.1: SLES 9 SP3 with gcc/g++ 4.1.1 and PGI f95 6.1-1

All have been compiled with

export F77=pgf95
export FC=pgf95
 
./configure --prefix=/sw/sles9-x64/voltaire/openmpi-1.2b3-pgi \
            --enable-pretty-print-stacktrace \
            --with-libnuma=/usr \
            --with-mvapi=/usr \
            --with-mvapi-libdir=/usr/lib64

(with the prefix changed per version, of course; see the sketch below)
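
For illustration, the installations differ only in the prefix, roughly
like this (the 1.2.1 path matches the libraries in the trace below; the
1.2 path is assumed to follow the same naming scheme):

  ./configure --prefix=/sw/sles9-x64/voltaire/openmpi-1.2-pgi   ...   # for 1.2
  ./configure --prefix=/sw/sles9-x64/voltaire/openmpi-1.2.1-pgi ...   # for 1.2.1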

The stack trace:

Starting program: /scratch/work/system/sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/bin/mpiexec -host tornado1 --prefix=$MPIROOT -v -np 8 `pwd`/osu_bw
[Thread debugging using libthread_db enabled]
[New Thread 182906198784 (LWP 30805)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182906198784 (LWP 30805)]
0x0000002a957f1b5b in _int_free () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
(gdb) where
#0 0x0000002a957f1b5b in _int_free () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#1 0x0000002a957f1e7d in free () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#2 0x0000002a95563b72 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#3 0x0000002a95fb51ec in __libc_dl_error_tsd () from /lib64/tls/libc.so.6
#4 0x0000002a95dba6ec in __pthread_initialize_minimal_internal () from /lib64/tls/libpthread.so.0
#5 0x0000002a95dba419 in call_initialize_minimal () from /lib64/tls/libpthread.so.0
#6 0x0000002a95ec9000 in ?? ()
#7 0x0000002a95db9fe9 in _init () from /lib64/tls/libpthread.so.0
#8 0x0000007fbfffe7c0 in ?? ()
#9 0x0000002a9556168d in call_init () from /lib64/ld-linux-x86-64.so.2
#10 0x0000002a9556179b in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#11 0x0000002a95fb39ac in dl_open_worker () from /lib64/tls/libc.so.6
#12 0x0000002a955612de in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#13 0x0000002a95fb3160 in _dl_open () from /lib64/tls/libc.so.6
#14 0x0000002a959413b5 in dlopen_doit () from /lib64/libdl.so.2
#15 0x0000002a955612de in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#16 0x0000002a959416fa in _dlerror_run () from /lib64/libdl.so.2
#17 0x0000002a95941362 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#18 0x0000002a957db2ee in vm_open () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#19 0x0000002a957d9645 in tryall_dlopen () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#20 0x0000002a957d981e in tryall_dlopen_module () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#21 0x0000002a957daab1 in try_dlopen () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#22 0x0000002a957dacd6 in lt_dlopenext () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#23 0x0000002a957e04f5 in open_component () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#24 0x0000002a957e0f60 in mca_base_component_find () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#25 0x0000002a957e189c in mca_base_components_open () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#26 0x0000002a956a6119 in orte_rds_base_open () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#27 0x0000002a95681d18 in orte_init_stage1 () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#28 0x0000002a95684eba in orte_system_init () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#29 0x0000002a9568179d in orte_init () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#30 0x0000000000402a3a in orterun (argc=8, argv=0x7fbfffe778) at orterun.c:374
#31 0x00000000004028d3 in main (argc=8, argv=0x7fbfffe778) at main.c:13
(gdb) quit
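
For reference, the backtrace above was produced by running mpiexec under
gdb, roughly like this (exact gdb invocation may differ; the mpiexec path
and arguments are the ones shown in the trace):

  gdb --args /scratch/work/system/sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/bin/mpiexec \
      -host tornado1 --prefix=$MPIROOT -v -np 8 `pwd`/osu_bw
  (gdb) run
  (gdb) where    # prints the stack after the SIGSEGV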

If access to our cluster would help, we would be happy to
provide an account.

Cheerio,
Luis

-- 
                             \\\\\\
                             (-0^0-)
--------------------------oOO--(_)--OOo-----------------------------
 Luis Kornblueh                           Tel. : +49-40-41173289
 Max-Planck-Institute for Meteorology     Fax. : +49-40-41173298
 Bundesstr. 53              
 D-20146 Hamburg                   Email: luis.kornblueh_at_[hidden]
 Federal Republic of Germany