From: Steven Truong (midair77_at_[hidden])
Date: 2007-05-18 19:38:44


Hi, all. Once again, I am very frustrated with what I have run into so far.

My system is CentOS 4.4 x86_64, ifort 9.1.043, torque, maui.
I configured Open MPI 1.2.1 with this command:
./configure --prefix=/usr/local/openmpi-1.2.1 --with-tm=/usr/local/pbs --enable-static

I then tried to run a test command like the one in the FAQ, and it did not work:

[struong_at_neptune 4cpu4npar10nsim]$ mpirun --mca btl tcp,self -np 1 --host node07 hostname
bash: orted: command not found
[neptune.myhost.com:01403] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[neptune.myhost.com:01403] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1165
[neptune.myhost.com:01403] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[neptune.myhost.com:01403] ERROR: A daemon on node node07 failed to start as expected.
[neptune.myhost.com:01403] ERROR: There may be more information available from
[neptune.myhost.com:01403] ERROR: the remote shell (see above).
[neptune.myhost.com:01403] ERROR: The daemon exited unexpectedly with status 127.
[neptune.myhost.com:01403] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[neptune.myhost.com:01403] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1197
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

Here is my environment:
[struong_at_neptune 4cpu4npar10nsim]$ printenv
HOSTNAME=neptune.myhost.com
TERM=xterm
SHELL=/bin/bash
HISTSIZE=1000
SSH_CLIENT=::ffff:192.168.0.185 37304 22
INSTALL_DIR=/usr/local/rrdtool-1.2.12
SSH_TTY=/dev/pts/1
USER=struong
LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
RSHCOMMAND=/usr/bin/ssh
KDEDIR=/usr
MAIL=/var/spool/mail/struong
PATH=/usr/local/NWChem/ecce-v4.0.1/apps/scripts:/opt/intel/fce/9.1.043/bin:/usr/local/openmpi-1.2.1/bin:/opt/c3-4:/opt/bin:/usr/local/torque/bin:/usr/local/torque/sbin:/usr/local/maui/bin:/usr/local/maui/sbin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/usr/local/rrdtool-1.2.12/bin:/home/struong/bin
F90=/opt/intel/fce/9.1.043/bin/ifort
INPUTRC=/etc/inputrc
PWD=/home/struong/Set3Bench/GAM/4cpu4npar10nsim
LANG=en_US.UTF-8
F77=/opt/intel/fce/9.1.043/bin/ifort
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
SHLVL=1
HOME=/home/struong
PBS_DEFAULT=neptune
FC=/opt/intel/fce/9.1.043/bin/ifort
ECCE_HOME=/usr/local/NWChem/ecce-v4.0.1/apps
BASH_ENV=/home/struong/.bashrc
LOGNAME=struong
BUILD_DIR=/tmp/rrdbuil
SSH_CONNECTION=::ffff:192.168.0.185 37304 ::ffff:192.168.0.182 22
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
_=/usr/bin/printenv
OLDPWD=/home/struong

I have not modified any MCA parameters anywhere.
It appears that something is not right with the pls (process launch
subsystem) and related components such as ssh, but I have set things up
so that I can ssh to all the nodes without a password. Could somebody
tell me why orted is not found?

[struong_at_neptune 4cpu4npar10nsim]$ ssh node07
Last login: Fri May 18 15:23:37 2007 from neptune.myhost.com
[struong_at_node07 ~]$ which orted
/usr/local/openmpi-1.2.1/bin/orted
[root_at_node07 ~]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/ofed/lib64
/usr/local/lib
/opt/intel/fce/9.1.043/lib
/opt/intel_mkl/8.1/lib/em64t
/opt/mpich/lib
/usr/local/pbs/lib
/usr/local/maui/lib
/opt/acml3.6.0/ifort64/lib
/opt/acml3.6.0/ifort64_mp/lib
/usr/local/openmpi-1.2.1/lib

[struong_at_neptune 4cpu4npar10nsim]$ ompi_info
                Open MPI: 1.2.1
   Open MPI SVN revision: r14481
                Open RTE: 1.2.1
   Open RTE SVN revision: r14481
                    OPAL: 1.2.1
       OPAL SVN revision: r14481
                  Prefix: /usr/local/openmpi-1.2.1
 Configured architecture: x86_64-unknown-linux-gnu
           Configured by: root
           Configured on: Thu May 17 18:22:20 PDT 2007
          Configure host: neptune.myhost.com
                Built by: root
                Built on: Thu May 17 18:33:47 PDT 2007
              Built host: neptune.myhost.com
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: /opt/intel/fce/9.1.043/bin/ifort
  Fortran77 compiler abs: /opt/intel/fce/9.1.043/bin/ifort
      Fortran90 compiler: /opt/intel/fce/9.1.043/bin/ifort
  Fortran90 compiler abs: /opt/intel/fce/9.1.043/bin/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.1)
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.1)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.1)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.1)
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.1)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.1)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.1)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.1)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.1)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.2.1)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.1)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.2.1)
               MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.1)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.1)
              MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.1)
                 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.1)
                 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.1)
              MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.1)
              MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.1)
              MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.1)
                  MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.1)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.1)
               MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.1)
                MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.1)
                MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.1)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.1)

[struong_at_neptune 4cpu4npar10nsim]$ ompi_info --param pls all
                 MCA pls: parameter "pls_base_reuse_daemons" (current value:
                          "0")
                          If nonzero, reuse daemons to launch dynamically
                          spawned processes. If zero, do not reuse daemons
                          (default)
                 MCA pls: parameter "pls" (current value: <none>)
                          Default selection set of components for the pls
                          framework (<none> means "use all components that can
                          be found")
                 MCA pls: parameter "pls_base_verbose" (current value: "0")
                          Verbosity level for the pls framework (0 = no
                          verbosity)
                 MCA pls: parameter "pls_proxy_priority" (current value: "0")
                 MCA pls: parameter "pls_gridengine_debug" (current value: "0")
                          Enable debugging of gridengine pls component
                 MCA pls: parameter "pls_gridengine_verbose" (current value:
                          "0")
                          Enable verbose output of the gridengine qrsh -inherit
                          command
                 MCA pls: parameter "pls_gridengine_priority" (current value:
                          "100")
                          Priority of the gridengine pls component
                 MCA pls: parameter "pls_gridengine_orted" (current value:
                          "orted")
                          The command name that the gridengine pls component
                          will invoke for the ORTE daemon
                 MCA pls: parameter "pls_rsh_debug" (current value: "0")
                          Whether or not to enable debugging output for the rsh
                          pls component (0 or 1)
                 MCA pls: parameter "pls_rsh_num_concurrent" (current value:
                          "128")
                          How many pls_rsh_agent instances to invoke
                          concurrently (must be > 0)
                 MCA pls: parameter "pls_rsh_force_rsh" (current value: "0")
                          Force the launcher to always use rsh, even for local
                          daemons
                 MCA pls: parameter "pls_rsh_orted" (current value: "orted")
                          The command name that the rsh pls component will
                          invoke for the ORTE daemon
                 MCA pls: parameter "pls_rsh_priority" (current value: "10")
                          Priority of the rsh pls component
                 MCA pls: parameter "pls_rsh_delay" (current value: "1")
                          Delay (in seconds) between invocations of the remote
                          agent, but only used when the "debug" MCA parameter
                          is true, or the top-level MCA debugging is enabled
                          (otherwise this value is ignored)
                 MCA pls: parameter "pls_rsh_reap" (current value: "1")
                          If set to 1, wait for all the processes to complete
                          before exiting. Otherwise, quit immediately --
                          without waiting for confirmation that all other
                          processes in the job have completed.
                 MCA pls: parameter "pls_rsh_assume_same_shell" (current value:
                          "1")
                          If set to 1, assume that the shell on the remote node
                          is the same as the shell on the local node.
                          Otherwise, probe for what the remote shell.
                 MCA pls: parameter "pls_rsh_agent" (current value: "ssh :
                          rsh")
                          The command used to launch executables on remote
                          nodes (typically either "ssh" or "rsh")
                 MCA pls: parameter "pls_slurm_debug" (current value: "0")
                          Enable debugging of slurm pls
                 MCA pls: parameter "pls_slurm_priority" (current value: "75")
                          Default selection priority
                 MCA pls: parameter "pls_slurm_orted" (current value: "orted")
                          Command to use to start proxy orted
                 MCA pls: parameter "pls_slurm_args" (current value: <none>)
                          Custom arguments to srun
                 MCA pls: parameter "pls_tm_debug" (current value: "0")
                          Enable debugging of the TM pls
                 MCA pls: parameter "pls_tm_verbose" (current value: "0")
                          Enable verbose output of the TM pls
                 MCA pls: parameter "pls_tm_priority" (current value: "75")
                          Default selection priority
                 MCA pls: parameter "pls_tm_orted" (current value: "orted")
                          Command to use to start proxy orted
                 MCA pls: parameter "pls_tm_want_path_check" (current value:
                          "1")
                          Whether the launching process should check for the
                          pls_tm_orted executable in the PATH before launching
                          (the TM API does not give an idication of failure;
                          this is a somewhat-lame workaround; non-zero values
                          enable this check)

Thank you.