Subject: [OMPI users] Problem with qlogic cards InfiniPath_QLE7240 and AlltoAll call
From: D'Auria, Raffaella (dauria_at_[hidden])
Date: 2009-06-25 13:29:39


Dear All,

I have been encountering a fatal error of the type "error polling LP CQ with status RETRY EXCEEDED ERROR status number 12" whenever I try to run a simple MPI code (see below) that performs an AlltoAll call.

We are running the Open MPI 1.3.2 stack on top of the OFED 1.4.1 stack. Our cluster is composed mostly of Mellanox HCAs (MT_03B0140001), plus some QLogic (InfiniPath_QLE7240) cards.

The problem manifests itself as soon as the size of the vector whose components are being exchanged between processes with the all-to-all call is equal to or larger than 68 MB.

Please note that I have this problem only when at least one of the computational nodes in the host list of mpiexec is a node with a QLogic InfiniPath_QLE7240 card.

The code runs with no problem if all of the hosts in the host list of mpiexec have Mellanox HCAs (MT_03B0140001).

Please note that I can run the OSU MPI tests and the example codes in the Open MPI distribution across the nodes of our heterogeneous IB fabric with no problem. So far, the only problem we have encountered is with the AlltoAll call, when the vector whose components are exchanged across nodes is at least 68 MB (as stated above).

Please note that when I query the nodes with ibstat or ibv_devinfo I see that the links are up. This is the ibv_devinfo output from one of the QLogic nodes:

-----------------------------------------------
hca_id: ipath0
        fw_ver: 0.0.0
        node_guid: 0011:7500:00ff:7530
        sys_image_guid: 0011:7500:00ff:7530
        vendor_id: 0x1077
        vendor_part_id: 29216
        hw_ver: 0x2
        board_id: InfiniPath_QLE7240
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 4096 (5)
                        active_mtu: 2048 (4)
                        sm_lid: 2
                        port_lid: 329
                        port_lmc: 0x00
-----------------------------------------------

This is the ibv_devinfo output from one of the Mellanox nodes:

-----------------------------------------------
hca_id: mthca0
        fw_ver: 1.2.936
        node_guid: 0002:c902:0027:c650
        sys_image_guid: 0002:c902:0027:c653
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_03B0140001
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 2
                        port_lid: 209
                        port_lmc: 0x00
 
-----------------------------------------------

I was wondering whether this might be a bug in the Open MPI stack.

Here is the code that causes the problem:

-----------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <iostream>
using namespace std;

void a2a(double *a, double *b, int n1, int n2);

int main(int argc, char *argv[])
{
  const int n1 = 4096, n2 = 4096, numIT = 100;
  int rank, nproc;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // allocate the send and receive buffers: n1*n2 doubles each
  double *a = (double *)malloc(sizeof(*a)*n1*n2);
  double *b = (double *)malloc(sizeof(*a)*n1*n2);

  if (rank==0)
    {
      cout << "Number of processes = " << nproc << endl;
      cout << "Alltoall data size per process = "
           << sizeof(*a)*n1*n2/(1024*1024) << " MB\n";
    }

  // fill the send buffer with random values in [0,1)
  for (int i=0; i<n1*n2; ++i)
    a[i] = (double) rand() / (RAND_MAX + 1.0);

  for (int i=0; i<numIT; ++i)
    {
      double t1 = MPI_Wtime();
      a2a(a,b,n1,n2);
      double t2 = MPI_Wtime();
      if (rank==0)
        printf("iter %4d wall-clock seconds = %.5e\n",i,t2-t1);
    }

  free(a); free(b);
  MPI_Finalize();
  return 0;
}

// each rank exchanges n1*n2/nproc doubles with every rank via MPI_Alltoall
void a2a(double *a, double *b, const int n1, const int n2)
{
  int nproc;
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  int cnt = n1*n2/nproc;
  MPI_Alltoall(a, cnt, MPI_DOUBLE,
               b, cnt, MPI_DOUBLE, MPI_COMM_WORLD);
  return;
}
-----------------------------------------------
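
For reference, with n1 = n2 = 4096 each buffer holds 4096*4096 doubles, i.e. 8*4096*4096/(1024*1024) = 128 MB per process, so with 16 processes every MPI_Alltoall call moves 128/16 = 8 MB between each pair of ranks.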

I compile the above program with:

mpic++ -o a2a a2a.cpp

Please note that, in order to avoid some benign warnings coming from the QLogic nodes, I had to set the following MCA parameters in the configuration file in Open MPI's etc directory:

btl_openib_max_inline_data = 0
btl_openib_ib_timeout = 30
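
For completeness, I believe the same settings can also be given per run with mpiexec's --mca option, e.g.:

mpiexec --mca btl_openib_max_inline_data 0 --mca btl_openib_ib_timeout 30 -n 16 -hostfile mach1 ./a2a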

I run the code with:

mpiexec -n 16 -hostfile mach1 ./a2a

and mach1 is:

-----------------------------------------------
n92 slots=6
n147 slots=4
n243 slots=6
-----------------------------------------------
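
(The three hosts provide 6 + 4 + 6 = 16 slots, matching the 16 processes requested; n243 is one of the nodes with a QLogic InfiniPath_QLE7240 card, as can be seen from the ipath0 device reported in the error output below.)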

And here is the output I get when at least one of the nodes on which the program is executed is a QLogic InfiniPath_QLE7240 node:

-----------------------------------------------

Number of processes = 16
Alltoall data size per process = 128 MB
[[11772,1],11][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2929:handle_wc] from n243 to: n92 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 231193640 opcode 2 vendor error 0 qp_idx 3
[[11772,1],12][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2929:handle_wc] from n243 to: n147 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 141569064 opcode 2 vendor error 0 qp_idx 3
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10). The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host: n243
  Local device: ipath0
  Peer host: n147

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 12 with PID 27613 on
node n243 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[i01:10608] 1 more process has sent help message help-mpi-btl-openib.txt / pp retry exceeded
[i01:10608] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

-----------------------------------------------
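
If I plug our settings into the formula quoted above, the default btl_openib_ib_timeout of 10 corresponds to a local ACK timeout of 4.096 microseconds * 2^10, i.e. roughly 4.2 ms, while our value of 30 corresponds to 4.096 microseconds * 2^30, i.e. well over an hour, so it does not look as if the timeout value itself is too small.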

Has anybody encountered a similar problem? Does anyone have an idea
how to fix it?

Thanks a lot,

Raffaella.

N.B.: I am attaching a compressed file with config.log (in the openmpi dir) and the output of ompi_info --all.