Subject: [OMPI users] Intermittent corruption
From: Nick Collier (nick.collier_at_[hidden])
Date: 2009-06-11 17:02:44


Hi,

I'm developing on OS X 10.5.7 with Open MPI 1.3.2 and am running
into intermittent corruption when sending and receiving a user-defined
datatype. When running with fewer than four processes (i.e. mpirun -np 2
or -np 3), the data is fine; when running with four or more, the
received data is intermittently corrupted. By corrupted, I mean that
struct fields that should hold small integer values hold very large
ones, as if the memory hadn't been assigned properly. Some runs are
fine and others are not, and the bad runs lead to crashes like:

[belafonte:30191] *** Process received signal ***
[belafonte:30191] Signal: Bus error (10)
[belafonte:30191] Signal code: (2)
[belafonte:30191] Failing at address: 0x9
[belafonte:30191] [ 0] 2 libSystem.B.dylib 0x945af2bb _sigtramp + 43
[belafonte:30191] [ 1] 3 ??? 0xffffffff 0x0 + 4294967295
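
With a struct sent as a user-defined datatype, one thing that may be
worth ruling out is a mismatch between the displacements given to the
datatype and the compiler's real struct layout. This is only an
assumption about a possible cause, not a diagnosis. The sketch below is
not the code from this report: the Item struct and its fields are
hypothetical, and Python's ctypes is used only because it follows the
platform C compiler's layout rules. It compares the real member offsets
with the offsets you would get by simply summing member sizes.

```python
import ctypes


# Hypothetical struct for illustration only; not the type from the report.
class Item(ctypes.Structure):
    _fields_ = [
        ("tag", ctypes.c_char),
        ("count", ctypes.c_int),
        ("value", ctypes.c_double),
    ]


def actual_offsets(struct_type):
    """Return (name, offset) pairs as laid out by the platform C ABI."""
    return [(name, getattr(struct_type, name).offset)
            for name, _ctype in struct_type._fields_]


def packed_offsets(struct_type):
    """Return (name, offset) pairs assuming no padding between members."""
    offsets, position = [], 0
    for name, ctype in struct_type._fields_:
        offsets.append((name, position))
        position += ctypes.sizeof(ctype)
    return offsets


if __name__ == "__main__":
    pairs = zip(actual_offsets(Item), packed_offsets(Item))
    for (name, real), (_name, naive) in pairs:
        print(f"{name}: actual offset {real}, packed offset {naive}")
    print("sizeof(Item) =", ctypes.sizeof(Item))
```

If the two columns differ, displacements computed by hand from member
sizes would not match the struct in memory. In C or C++ the actual
offsets can be taken from offsetof or MPI_Get_address before they are
passed to MPI_Type_create_struct.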

I'm not sure how to proceed or what might be wrong. The closest thing
I could find on Google was
http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=614
where someone reports issues with ScaLAPACK in combination with
Open MPI and OS X's stock gcc 4.0.1 that were fixed by switching to
gcc 4.3.1.

At any rate, any suggestions on how to move forward would be
appreciated.

thanks,

Nick