Subject: Re: [OMPI users] CPU user time vs. system time
From: Qiming He (qiming.he_at_[hidden])
Date: 2009-06-28 17:57:15
I tried a couple of things, including your suggestion. I also found that this
has been reported before,
http://www.open-mpi.org/community/lists/users/2007/03/2904.php
but there seems to be no clear solution so far.
Here is what I observe:
I keep the problem size fixed and run 24 processes, on two nodes with 8 cores
each and on two nodes with 2 cores each.
1. When it is oversubscribed (12 processes/processor), the system time relative
to user time is much higher than when less subscribed (1.5 processes/processor).
The wall clock time hardly improves :-(
2. I tried the following options, individually and together; they made no difference:
mpirun --mca mpi_yield_when_idle 1 --mca btl tcp,sm,self --mca
coll_hierarch_priority 100 ...
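For reference, a full invocation combining these MCA options with the process-binding flag suggested in the quoted reply below might look like this (a sketch only; the hostfile name, process count, and program name are placeholders, and binding is generally only helpful when not oversubscribing):

```shell
# Sketch: the hostfile, -np count, and program name are placeholders.
mpirun -np 24 --hostfile my-hosts \
    --mca mpi_yield_when_idle 1 \
    --mca mpi_paffinity_alone 1 \
    --mca btl tcp,sm,self \
    --mca coll_hierarch_priority 100 \
    ./my-program
```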
3. The older Open MPI version (1.3) seems to perform better than the newer
version (1.3.2), but not significantly.
By the way, I am working on Amazon EC2 (VM hosts). Could that make any
difference?
Please advise.
Thanks
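As a sanity check on how the user/system split is being measured, a process can report its own user vs. system CPU time with the standard library's os.times() (a minimal sketch; the compute-bound workload here is purely illustrative):

```python
import os

def cpu_split(work):
    """Run work() and return (user, system) CPU seconds it consumed."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system

# A compute-bound workload should be dominated by user time,
# with near-zero system time.
user, system = cpu_split(lambda: sum(i * i for i in range(2_000_000)))
print(f"user={user:.2f}s system={system:.2f}s")
```

If a real MPI rank shows the opposite pattern (system time dominating), the kernel is doing the work instead: TCP traffic, paging, or possibly virtualization-related overhead on EC2.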
On Fri, Jun 26, 2009 at 11:28 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> If you are running fewer processes on your nodes than they have processors,
> then you can improve performance by adding
>
> -mca mpi_paffinity_alone 1
>
> to your cmd line. This will bind your processes to individual cores, which
> helps with latency. If your program involves collectives, then you can try
> setting
>
> -mca coll_hierarch_priority 100
>
> This will activate the hierarchical collectives, which utilize shared
> memory for messages between procs on the same node.
>
> Ralph
>
>
>
> On Jun 26, 2009, at 9:09 PM, Qiming He wrote:
>
>> Hi all,
>>
>> I am new to OpenMPI, and have an urgent run-time question. I have
>> openmpi-1.3.2 compiled with Intel Fortran compiler v.11 simply by
>>
>> ./configure --prefix=<my-dir> F77=ifort FC=ifort
>> then I set my LD_LIBRARY_PATH to include <openmpi-lib> and <intel-lib>
>> and compile my Fortran program properly. No compilation error.
>>
>> I run my program on a single node and everything looks OK. However, when I
>> run it on multiple nodes with
>> mpirun -np <num> --hostfile <my-hosts> <my-program>
>> the performance is much worse than on a single node with the same size of
>> problem to solve (MPICH2 shows a 50% improvement).
>>
>> Using top and saidar, I find that user time (CPU user) is much lower than
>> system time (CPU system), i.e.,
>> only a small portion of CPU time is used by the user application, while the
>> rest is spent in the system.
>> No wonder I get bad performance. I am assuming "CPU system" is used for
>> MPI communication.
>> I notice the total traffic (on eth0) is not that big (~5 Mb/sec). What is
>> keeping CPU system busy?
>>
>> Can anyone help? Anything I need to tune?
>>
>> Thanks in advance
>>
>> -Qiming
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>