Subject: Re: [OMPI users] vfs_write returned -14
From: Kritiraj Sajadah (ksajadah_at_[hidden])
Date: 2009-06-20 13:48:57
Hi Josh,
Thank you for the email. I can now checkpoint the application on the cluster using Open MPI, but I am now facing another problem.
When I tried restarting from the checkpoint, nothing happened. I copied the checkpoint file to the $HOME directory and tried restarting it there, and got the following error:
- open('/var/cache/nscd/passwd', 0x0) failed: -13
- mmap failed: /var/cache/nscd/passwd
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
Restart failed: Permission denied
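For reference, I ran the restart with Open MPI's restart tool, roughly
like this (the snapshot name here is illustrative):

shell$ cd $HOME
shell$ ompi-restart ompi_global_snapshot_1234.ckpt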
On my laptop it works fine, so I am assuming it's again something to do with my $HOME directory.
Is it possible to restart the checkpoint from the /tmp directory itself, without having to copy it back to the $HOME directory?
Is there another way to compile and build Open MPI so that everything happens in the /tmp directory instead of the $HOME directory?
Thank you
Raj
--- On Fri, 6/19/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
> From: Josh Hursey <jjhursey_at_[hidden]>
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" <users_at_[hidden]>
> Date: Friday, June 19, 2009, 2:48 PM
>
> On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:
>
> >
> > Hello Josh,
> > Thank you again for your response. I tried checkpointing a
> > simple C program using BLCR and got the same error, i.e.:
> >
> > - vfs_write returned -14
> > - file_header: write returned -14
> > Checkpoint failed: Bad address
>
> So I would look at how your NFS file system is set up, and work with
> your sysadmin (and maybe the BLCR list) to resolve this before
> experimenting too much with checkpointing with Open MPI.
>
> >
> > This is how I installed and ran MPI programs for checkpointing:
> >
> > 1) configure and install BLCR
> > 2) configure and install Open MPI
> > 3) Compile and run the MPI program as follows:
> > 4) To checkpoint the running program,
> > 5) To restart your checkpoint, locate the checkpoint file and type
> > the following from the command line:
> >
>
> This all looks ok to me.
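> For completeness, the usual sequence is roughly as follows (program
> name, PID, and snapshot name are illustrative):
>
> shell$ mpirun -np 2 -am ft-enable-cr my_app
> shell$ ompi-checkpoint <PID of mpirun>
> shell$ ompi-restart ompi_global_snapshot_<PID>.ckpt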
>
> > I then did another test with BLCR, however:
> >
> > I tried checkpointing my C application from the /tmp directory
> > instead of my $HOME directory, and it checkpointed fine.
> >
> > So it looks like the problem is with my $HOME directory.
> >
> > I have "drwx" rights on my $HOME directory, which seems fine to me.
> >
> > Then I tried it with Open MPI. However, with Open MPI the
> > checkpoint file automatically gets saved in the $HOME directory.
> >
> > Is there a way to have the file saved in a different location? I
> > checked that LAM/MPI has some command-line options:
> >
> > $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
> >
> > Do we have a similar option for Open MPI?
>
> By default Open MPI places the global snapshot in the $HOME directory,
> but you can also specify a different directory for the global snapshot
> using the following MCA option:
>
>   -mca snapc_base_global_snapshot_dir /somewhere/else
>
> For the best results you will likely want to set this in the MCA
> params file in your home directory:
> shell$ cat ~/.openmpi/mca-params.conf
> snapc_base_global_snapshot_dir=/somewhere/else
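> For example, the same option on the command line (program name
> illustrative):
>
> shell$ mpirun -np 2 -am ft-enable-cr \
>     -mca snapc_base_global_snapshot_dir /somewhere/else a.out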
>
> You can also stage the file to local disk, then have Open MPI transfer
> the checkpoints back to a (logically) central storage device (both can
> be /tmp on a local disk if you like). For more details on this and the
> above option you will want to read through the FT Users Guide attached
> to the wiki page at the link below:
> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
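> As a sketch of that staging setup (parameter names as documented in
> the FT Users Guide; directories illustrative):
>
> shell$ cat ~/.openmpi/mca-params.conf
> # write each local checkpoint to local disk first ...
> crs_base_snapshot_dir=/tmp/ckpt_local
> # ... then stage it back to the global snapshot directory
> snapc_base_store_in_place=0
> snapc_base_global_snapshot_dir=/somewhere/else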
>
> -- Josh
>
> >
> >
> > Thanks a lot
> >
> > regards,
> >
> > Raj
> >
> > --- On Wed, 6/17/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
> >
> >> From: Josh Hursey <jjhursey_at_[hidden]>
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" <users_at_[hidden]>
> >> Date: Wednesday, June 17, 2009, 1:42 AM
> >> Did you try checkpointing a non-MPI application with BLCR on the
> >> cluster? If that does not work then I would suspect that BLCR is
> >> not working properly on the system.
> >>
> >> However, if a non-MPI application can be checkpointed and restarted
> >> correctly on this machine then it may be something odd with the Open
> >> MPI installation or runtime environment. To help debug here I would
> >> need to know how Open MPI was configured and how the application was
> >> run on the machine (command line arguments, environment
> >> variables, ...).
> >>
> >> I should note that for the program that you sent it is important
> >> that you compile Open MPI with the Fault Tolerance Thread enabled to
> >> ensure a timely checkpoint. Otherwise the checkpoint will be delayed
> >> until the MPI program enters the MPI_Finalize function.
> >>
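> >> As an illustration, the Fault Tolerance Thread is enabled when Open
> >> MPI is configured with something like the following (BLCR install
> >> path illustrative):
> >>
> >> shell$ ./configure --with-ft=cr --enable-mpi-threads \
> >>     --enable-ft-thread --with-blcr=/usr/local/blcr
> >>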
> >> Let me know what you find out.
> >>
> >> Josh
> >>
> >> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
> >>
> >>>
> >>> Hi Josh,
> >>>
> >>> Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3
> >>> on my laptop with Ubuntu 8.04 on it. It works fine.
> >>>
> >>> I have now tried the installation on the cluster (on one machine
> >>> for now) at my university (the administrator installed it); I am
> >>> not sure if he followed the steps I gave him.
> >>>
> >>> I am checkpointing a simple MPI application which looks as follows:
> >>>
> >>> #include <mpi.h>
> >>> #include <stdio.h>
> >>> #include <stdlib.h>  /* for system() */
> >>>
> >>> int main(int argc, char **argv)
> >>> {
> >>>   int rank, size;
> >>>   MPI_Init(&argc, &argv);
> >>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>   /* The sleeps leave windows in which to checkpoint the program. */
> >>>   printf("I am processor no %d of a total of %d procs \n", rank, size);
> >>>   system("sleep 30");
> >>>   printf("I am processor no %d of a total of %d procs \n", rank, size);
> >>>   system("sleep 30");
> >>>   printf("I am processor no %d of a total of %d procs \n", rank, size);
> >>>   system("sleep 30");
> >>>   printf("bye \n");
> >>>   MPI_Finalize();
> >>>   return 0;
> >>> }
> >>>
> >>> Do you think it's better to reinstall BLCR?
> >>>
> >>>
> >>> Thanks
> >>>
> >>> Raj
> >>> --- On Tue, 6/16/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
> >>>
> >>>> From: Josh Hursey <jjhursey_at_[hidden]>
> >>>> Subject: Re: [OMPI users] vfs_write
> returned -14
> >>>> To: "Open MPI Users" <users_at_[hidden]>
> >>>> Date: Tuesday, June 16, 2009, 6:42 PM
> >>>>
> >>>> These are errors from BLCR. It may be a problem with your BLCR
> >>>> installation and/or your application. Are you able to
> >>>> checkpoint/restart a non-MPI application with BLCR on these
> >>>> machines?
> >>>>
> >>>> What kind of MPI application are you trying to checkpoint? Some
> >>>> of the MPI interfaces are not fully supported at the moment
> >>>> (outlined in the FT User Document that I mentioned in a previous
> >>>> email).
> >>>>
> >>>> -- Josh
> >>>>
> >>>> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
> >>>>
> >>>>>
> >>>>> Dear All,
> >>>>>
> >>>>> I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine
> >>>>> (Ubuntu). However, when I try checkpointing an MPI application,
> >>>>> I get the following error:
> >>>>>
> >>>>> - vfs_write returned -14
> >>>>> - file_header: write returned -14
> >>>>>
> >>>>> Can someone help, please?
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Raj
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>