Subject: Re: [OMPI users] vfs_write returned -14
From: Kritiraj Sajadah (ksajadah_at_[hidden])
Date: 2009-06-18 19:33:15
Hello Josh,
Thank you again for your response. I tried checkpointing a simple C program using BLCR and got the same error, i.e.:
- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address
This is how I installed everything and how I run MPI programs for checkpointing:
1) configure and install blcr
tar zxf blcr-<VERSION>.tar.gz
cd blcr-<VERSION>
mkdir builddir
cd builddir
../configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes --enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
make
make install
2) configure and install openmpi
./configure --prefix=/usr/local/ --enable-picky --enable-debug --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries --enable-trace --enable-static=yes --enable-debug --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib --enable-mpi-threads=yes
make all install
3) Compile and run mpi program as follows:
raj> mpicc helloworld.c -o helloworld
raj> mpirun -am ft-enable-cr helloworld
4) To checkpoint the running program,
raj> ompi-checkpoint [any option] pid
for example: ompi-checkpoint -v 11527
5) To restart from a checkpoint, locate the checkpoint file and type the following on the command line:
raj> ompi-restart ompi_global_snapshot_XXXX.ckpt
I then did another test with BLCR alone.
I tried checkpointing my C application from the /tmp directory instead of my $HOME directory, and it checkpointed fine.
So it looks like the problem is with my $HOME directory.
I have "drwx" rights on my $HOME directory, which looks fine to me.
Then I tried it with Open MPI. However, with Open MPI the checkpoint file automatically gets saved in the $HOME directory.
Is there a way to have the file saved in a different location? I saw that LAM/MPI has a command line option for this:
$ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
Is there a similar option for Open MPI?
Thanks a lot
regards,
Raj
--- On Wed, 6/17/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
> From: Josh Hursey <jjhursey_at_[hidden]>
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" <users_at_[hidden]>
> Date: Wednesday, June 17, 2009, 1:42 AM
> Did you try checkpointing a non-MPI application with BLCR on the
> cluster? If that does not work then I would suspect that BLCR is not
> working properly on the system.
>
> However, if a non-MPI application can be checkpointed and restarted
> correctly on this machine then it may be something odd with the Open
> MPI installation or runtime environment. To help debug here I would
> need to know how Open MPI was configured and how the application was
> run on the machine (command line arguments, environment variables, ...).
>
> I should note that for the program that you sent it is important that
> you compile Open MPI with the Fault Tolerance Thread enabled to ensure
> a timely checkpoint. Otherwise the checkpoint will be delayed until
> the MPI program enters the MPI_Finalize function.
>
> Let me know what you find out.
>
> Josh
>
> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
>
> >
> > Hi Josh,
> >
> > Thanks for the email. I have installed BLCR 0.8.1 and openmpi 1.3 on
> > my laptop with Ubuntu 8.04 on it. It works fine.
> >
> > I have now tried the installation on the cluster at my university
> > (on one machine for now). The administrator installed it, and I am
> > not sure whether he followed the steps I gave him.
> >
> > I am checkpointing a simple MPI application which looks as follows:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <stdlib.h>  /* declares system() */
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, size;
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("bye \n");
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > Do you think it's better to reinstall BLCR?
> >
> >
> > Thanks
> >
> > Raj
> > --- On Tue, 6/16/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
> >
> >> From: Josh Hursey <jjhursey_at_[hidden]>
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" <users_at_[hidden]>
> >> Date: Tuesday, June 16, 2009, 6:42 PM
> >>
> >> These are errors from BLCR. It may be a problem with your
> >> BLCR installation and/or your application. Are you able to
> >> checkpoint/restart a non-MPI application with BLCR on these
> >> machines?
> >>
> >> What kind of MPI application are you trying to checkpoint?
> >> Some of the MPI interfaces are not fully supported at the
> >> moment (outlined in the FT User Document that I mentioned in
> >> a previous email).
> >>
> >> -- Josh
> >>
> >> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
> >>
> >>>
> >>> Dear All,
> >>> I have installed openmpi 1.3 and blcr 0.8.1 on a Linux machine
> >>> (Ubuntu). However, when I try checkpointing an MPI application,
> >>> I get the following error:
> >>>
> >>> - vfs_write returned -14
> >>> - file_header: write returned -14
> >>>
> >>> Can someone help please.
> >>>
> >>> Regards,
> >>>
> >>> Raj
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users