Subject: Re: [OMPI users] Bug in 1.3.2?: sm btl and isend is serializes
From: Mark Bolstad (the.render.dude_at_[hidden])
Date: 2009-06-20 09:18:37


Thanks, that at least explains what is going on. Because I have an
unbalanced workload (at least for now), I assume I'll need to poll. If I
replace the compositor loop with the code below, it appears to prevent the
serialization/starvation and to service the servers equally. I can think of
edge cases where it isn't very efficient, so I'll explore different options
(perhaps, instead of looping over every source, I can probe one rank higher
than the last receive and increment from there).
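
Something along the lines of this untested sketch (reusing the same
variables as the loop below):

     /* Untested alternative: try the rank just after the one serviced
        last; if it has nothing pending, fall back to a blocking wildcard
        probe. */
     int next = ( last % ( size - 1 ) ) + 1;
     int flag;

     MPI_Iprobe( next, MPI_ANY_TAG, comp_comm, &flag, &status );
     if ( !flag )
        MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, comp_comm, &status );
     MPI_Recv( buffer, BUFLEN, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG,
               comp_comm, &status );
     last = status.MPI_SOURCE;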

Thanks again.

Here's the new output:
...
Sending buffer 3 from 3
Sending buffer 3 from 2
Sending buffer 4 from 1
Receiving buffer from 1, buffer = hello from 1 for the 0 time
 -- Probing for 2
 -- Found a message
Sending buffer 4 from 3
Sending buffer 4 from 2
Receiving buffer from 2, buffer = hello from 2 for the 0 time
 -- Probing for 3
 -- Found a message
Receiving buffer from 3, buffer = hello from 3 for the 0 time
 -- Probing for 1
 -- Found a message
Sending buffer 5 from 1
Receiving buffer from 1, buffer = hello from 1 for the 1 time
 -- Probing for 2
 -- Found a message
Sending buffer 5 from 2
Sending buffer 5 from 3
Receiving buffer from 2, buffer = hello from 2 for the 1 time
 -- Probing for 3
 -- Found a message
Receiving buffer from 3, buffer = hello from 3 for the 1 time
...
and the replacement code:

     int last = 0;

     for (i = 0; i < LOOPS * ( size - 1 ); i++)
     {
        int which_source, which_tag, flag;

        /* Block until at least one message is pending from any server. */
        MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, comp_comm, &status );
        which_source = status.MPI_SOURCE;
        which_tag = status.MPI_TAG;

        /* If the wildcard probe matched the rank we serviced last (or an
           earlier one), scan the servers round-robin starting just past
           'last' and take the first pending message found instead. */
        if ( which_source <= last )
        {
           MPI_Status probe_status;

           for (j = 0; j < size - 1; j++)
           {
              int probe_id = ( ( last + j ) % ( size - 1 ) ) + 1;

              printf( " -- Probing for %d\n", probe_id );

              MPI_Iprobe( probe_id, MPI_ANY_TAG, comp_comm, &flag,
                          &probe_status );
              if ( flag )
              {
                 printf( " -- Found a message\n" );
                 which_source = probe_status.MPI_SOURCE;
                 which_tag = probe_status.MPI_TAG;
                 break;
              }
           }
        }

        printf( "Receiving buffer from %d, buffer = ", which_source );
        MPI_Recv( buffer, BUFLEN, MPI_CHAR, which_source, which_tag,
                  comp_comm, &status );
        printf( "%s\n", buffer );
        last = which_source;
     }
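
(The fairness here comes from the round-robin scan starting just past the
last rank serviced; on its own, the blocking wildcard probe would keep
matching the lowest-ranked server with data pending, which is exactly the
starvation seen before.)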

Mark

On Fri, Jun 19, 2009 at 5:33 PM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:

> George Bosilca wrote:
>
>> MPI does not impose any global order on the messages. The only
>> requirement is that between two peers on the same communicator the
>> messages (or at least the part required for the matching) are delivered
>> in order. This makes both execution traces you sent with your original
>> email (shared memory and TCP) valid from the MPI perspective.
>>
>> Moreover, MPI doesn't impose any order on the matching when ANY_SOURCE
>> is used. In Open MPI we do the matching _ALWAYS_ starting from rank 0 to
>> n in the specified communicator. BEWARE: The remainder of this paragraph
>> is deep black magic of MPI implementation internals. The main difference
>> between the behavior of SM and TCP here directly reflects their eager
>> size, 4K for SM and 64K for TCP. Therefore, for your example, for TCP
>> all your messages are eager messages (i.e. they are completely
>> transferred to the destination process in one go), while for SM they all
>> require a rendezvous. This directly impacts the ordering of the messages
>> on the receiver, and therefore the order of the matching. However, I
>> have to insist on this: this behavior is correct based on the MPI
>> standard specifications.
>>
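
A small two-rank program can make the eager/rendezvous difference visible
(a sketch only; the buffer sizes, the sleep-based timing, and the exact
completion behavior are assumptions that depend on the transport's eager
limit):

     /* Sketch: observe eager vs. rendezvous completion with two ranks.
        A send below the eager limit typically completes before the
        matching receive is posted; one above it does not. The sizes here
        are only a guess at where the eager limit sits. */
     #include <mpi.h>
     #include <stdio.h>
     #include <unistd.h>

     int main( int argc, char **argv )
     {
        static char small_buf[1024], big_buf[1024 * 1024];
        int rank, small_done, big_done;
        MPI_Request reqs[2];

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        if ( rank == 0 )
        {
           MPI_Isend( small_buf, sizeof(small_buf), MPI_CHAR, 1, 0,
                      MPI_COMM_WORLD, &reqs[0] );
           MPI_Isend( big_buf, sizeof(big_buf), MPI_CHAR, 1, 1,
                      MPI_COMM_WORLD, &reqs[1] );
           sleep( 1 );  /* receiver has not posted its receives yet */
           MPI_Test( &reqs[0], &small_done, MPI_STATUS_IGNORE );
           MPI_Test( &reqs[1], &big_done, MPI_STATUS_IGNORE );
           printf( "small done: %d, big done: %d\n", small_done, big_done );
           MPI_Waitall( 2, reqs, MPI_STATUSES_IGNORE );
        }
        else if ( rank == 1 )
        {
           sleep( 2 );  /* delay so the sends cannot match immediately */
           MPI_Recv( small_buf, sizeof(small_buf), MPI_CHAR, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE );
           MPI_Recv( big_buf, sizeof(big_buf), MPI_CHAR, 0, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        }
        MPI_Finalize();
        return 0;
     }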
>
> I'm going to try a technical explanation of what's going on inside OMPI
> and then some words of advice for Mark.
>
> First, the technical explanation. As George says, what's going on is
> legal. The "servers" all queue up sends to the "compositor". These are
> long, rendezvous sends (at least when they're on-node), so none of these
> sends completes immediately. The compositor looks for an incoming
> message. It gets the header of the message and sends back an
> acknowledgement that the rest of the message can be sent. The "server"
> gets the acknowledgement and starts sending more of the message. The
> compositor, in order to get to the remainder of that message, keeps
> draining all the other stuff the servers are sending it. Once the first
> message is completely received, the compositor looks for the next
> message to process and happens to pick up the first server again. It
> won't go to anyone else until server 1 is exhausted. Legal, but from
> Mark's point of view not desirable. The compositor is busy all the time;
> Mark just wants it to employ a different order.
>
> The receives are "serialized". Of course they must be, since the
> receiver is a single process. But Mark's performance issue is that the
> servers aren't being serviced equally. So, they back up while server 1
> unfairly gets all the attention.
>
> Mark, your test code has a set of buffers it cycles through on each server.
> Could you do something similar on the compositor side? Have a set of
> resources for each server. If you want the compositor to service all
> servers equally/fairly, you're going to have to prescribe this behavior in
> your MPI code. The MPI implementation can't be relied on to do this for
> you.
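>

One possible shape for that advice (a sketch only, reusing the variables
from Mark's code; MAX_SERVERS and the 'received' counter are invented here)
is to pre-post one nonblocking receive per server and service whichever
completes:

     /* Sketch: one outstanding MPI_Irecv per server, serviced via
        MPI_Waitany and re-posted after each completion. With one posted
        receive per server, matching is driven by these receives rather
        than by the wildcard matching order that caused the starvation. */
     char bufs[MAX_SERVERS][BUFLEN];       /* MAX_SERVERS >= size - 1 */
     MPI_Request reqs[MAX_SERVERS];
     MPI_Status recv_status;
     int received[MAX_SERVERS] = { 0 };
     int n = size - 1, which;

     for (j = 0; j < n; j++)
        MPI_Irecv( bufs[j], BUFLEN, MPI_CHAR, j + 1, MPI_ANY_TAG,
                   comp_comm, &reqs[j] );

     for (i = 0; i < LOOPS * n; i++)
     {
        MPI_Waitany( n, reqs, &which, &recv_status );
        printf( "Receiving buffer from %d, buffer = %s\n",
                recv_status.MPI_SOURCE, bufs[which] );
        if ( ++received[which] < LOOPS )  /* each server sends LOOPS msgs */
           MPI_Irecv( bufs[which], BUFLEN, MPI_CHAR, which + 1, MPI_ANY_TAG,
                      comp_comm, &reqs[which] );
     }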
>
> If this doesn't make sense, let me know and I'll try to sketch it out
> more explicitly.
>