Subject: [OMPI users] Bug in 1.3.2?: sm btl and isend is serializes
From: Mark Bolstad (the.render.dude_at_[hidden])
Date: 2009-06-19 10:18:19
I have a small test code that reproduces the results from a larger code. In
essence, using the sm btl with Isend, the communication ends up completely
serialized, i.e., all the calls from process 1 complete, then all from
process 2, and so on.
This is version 1.3.2, vanilla compile. I get the same results on my RHEL5
Nehalem box and on an OS X laptop.
Here's an example of the output (note: there is a usleep in the code to
mimic my computation loop and to ensure that this is not a simple I/O
sequencing issue):
---- Ignore the "next" lines in the output below; they were from a broadcast test.
mpirun -np 5 ./mpi_split_test
Master [id = 0 of 5] is running on bolstadm-lm1
[0] next = 10
Server [id = 3, 2, 1 of 5] is running on bolstadm-lm1
Compositor [id = 1, 0 of 5] is running on bolstadm-lm1
[1] next = 10
Sending buffer 0 from 1
Server [id = 2, 1, 0 of 5] is running on bolstadm-lm1
[2] next = 10
Sending buffer 0 from 2
[3] next = 10
Server [id = 4, 3, 2 of 5] is running on bolstadm-lm1
[4] next = 10
Sending buffer 0 from 3
Sending buffer 1 from 1
Sending buffer 1 from 2
Sending buffer 1 from 3
Sending buffer 2 from 1
Sending buffer 2 from 2
Sending buffer 2 from 3
Sending buffer 3 from 1
Sending buffer 3 from 2
Sending buffer 4 from 1
Receiving buffer from 1, buffer = hello from 1 for the 0 time
Receiving buffer from 1, buffer = hello from 1 for the 1 time
Sending buffer 4 from 2
Sending buffer 4 from 3
Sending buffer 5 from 1
Receiving buffer from 1, buffer = hello from 1 for the 2 time
Sending buffer 6 from 1
Receiving buffer from 1, buffer = hello from 1 for the 3 time
-----At this point, processes 2 & 3 are stuck in an MPI_Wait
...
Sending buffer 9 from 1
Receiving buffer from 1, buffer = hello from 1 for the 6 time
Receiving buffer from 1, buffer = hello from 1 for the 7 time
Receiving buffer from 1, buffer = hello from 1 for the 8 time
Receiving buffer from 1, buffer = hello from 1 for the 9 time
Receiving buffer from 2, buffer = hello from 2 for the 0 time
Receiving buffer from 2, buffer = hello from 2 for the 1 time
Receiving buffer from 2, buffer = hello from 2 for the 2 time
Sending buffer 5 from 2
Sending buffer 6 from 2
Receiving buffer from 2, buffer = hello from 2 for the 3 time
---- Now process 2 is running, 1 is in a barrier, 3 is still in MPI_Wait
....
Sending buffer 9 from 2
Receiving buffer from 2, buffer = hello from 2 for the 6 time
Receiving buffer from 2, buffer = hello from 2 for the 7 time
Receiving buffer from 2, buffer = hello from 2 for the 8 time
Receiving buffer from 2, buffer = hello from 2 for the 9 time
Receiving buffer from 3, buffer = hello from 3 for the 0 time
Sending buffer 5 from 3
Receiving buffer from 3, buffer = hello from 3 for the 1 time
Receiving buffer from 3, buffer = hello from 3 for the 2 time
---- And now process 3 goes
...
Receiving buffer from 3, buffer = hello from 3 for the 8 time
Receiving buffer from 3, buffer = hello from 3 for the 9 time
Now running under TCP:
mpirun --mca btl tcp,self -np 5 ./mpi_split_test
Compositor [id = 1, 0 of 5] is running on bolstadm-lm1
Master [id = 0 of 5] is running on bolstadm-lm1
[0] next = 10
Server [id = 2, 1, 0 of 5] is running on bolstadm-lm1
Server [id = 3, 2, 1 of 5] is running on bolstadm-lm1
Server [id = 4, 3, 2 of 5] is running on bolstadm-lm1
[4] next = 10
Sending buffer 0 from 3
Sending buffer 0 from 1
[2] next = 10
[1] next = 10
Sending buffer 0 from 2
[3] next = 10
Receiving buffer from 1, buffer = hello from 1 for the 0 time
Receiving buffer from 3, buffer = hello from 3 for the 0 time
Receiving buffer from 2, buffer = hello from 2 for the 0 time
Sending buffer 1 from 3
Sending buffer 1 from 1
Sending buffer 1 from 2
Receiving buffer from 1, buffer = hello from 1 for the 1 time
Receiving buffer from 2, buffer = hello from 2 for the 1 time
Receiving buffer from 3, buffer = hello from 3 for the 1 time
Sending buffer 2 from 3
Sending buffer 2 from 2
Sending buffer 2 from 1
Receiving buffer from 1, buffer = hello from 1 for the 2 time
Receiving buffer from 2, buffer = hello from 2 for the 2 time
Receiving buffer from 3, buffer = hello from 3 for the 2 time
...
So, has this been reported before? I've seen some messages on the developer
list about hangs with the sm btl.
I'll post the test code if requested (this email is already long enough).
Mark