Subject: Re: [OMPI users] vfs_write returned -14
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-06-20 15:39:39
On Jun 20, 2009, at 1:48 PM, Kritiraj Sajadah wrote:
>
> Hi Josh,
> Thank you for the email. I can now checkpoint the application on the
> cluster using Open MPI, but I am now facing another problem.
>
> When I tried restarting the checkpoint, nothing happened. I copied
> the checkpoint file to the $HOME directory and tried restarting it
> there and got the following error:
>
> - open('/var/cache/nscd/passwd', 0x0) failed: -13
> - mmap failed: /var/cache/nscd/passwd
> - thaw_threads returned error, aborting. -13
> - thaw_threads returned error, aborting. -13
> - thaw_threads returned error, aborting. -13
> Restart failed: Permission denied
>
> On my laptop it works fine. So, I am assuming it's again something to
> do with my $HOME directory.
This issue is documented in the BLCR FAQ:
http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#eperm
I would follow the directions there to resolve this issue.
>
> Is it possible to restart the checkpoint from the /tmp directory
> itself, without having to copy it back to the $HOME directory?
The '--preload' or '-p' option to ompi-restart will let you restart a
parallel job without a shared file system. I believe that the FT
User's Guide outlines this option as well (if it does not, let me know
and I'll add some text for it).
>
> Is there another way to compile and build Open MPI so that everything
> happens in the /tmp directory instead of the $HOME directory?
There are no compile time options for this, just the runtime options
that I previously mentioned.
Best,
Josh
>
>
> Thank you
>
> Raj
>
> --- On Fri, 6/19/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
>
>> From: Josh Hursey <jjhursey_at_[hidden]>
>> Subject: Re: [OMPI users] vfs_write returned -14
>> To: "Open MPI Users" <users_at_[hidden]>
>> Date: Friday, June 19, 2009, 2:48 PM
>>
>> On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:
>>
>>>
>>> Hello Josh,
>>> Thank you again for your response. I tried checkpointing a
>>> simple C program using BLCR and got the same error, i.e.:
>>>
>>> - vfs_write returned -14
>>> - file_header: write returned -14
>>> Checkpoint failed: Bad address
>>
>> So I would look at how your NFS file system is set up, and work with
>> your sysadmin (and maybe the BLCR list) to resolve this before
>> experimenting too much with checkpointing with Open MPI.
>>
>>>
>>> This is how I installed and ran MPI programs for checkpointing:
>>>
>>> 1) configure and install blcr
>>> 2) configure and install openmpi
>>> 3) Compile and run mpi program as follows:
>>> 4) To checkpoint the running program,
>>> 5) To restart your checkpoint, locate the checkpoint file and type
>>> the following from the command line:
>>>
>>
>> This all looks ok to me.
>>
>>> I then did another test with BLCR, however.
>>>
>>> I tried checkpointing my C application from the /tmp directory
>>> instead of my $HOME directory and it checkpointed fine.
>>>
>>> So, it looks like the problem is with my $HOME directory.
>>>
>>> I have "drwx" rights on my $HOME directory, which seems fine to me.
>>>
>>> Then I tried it with Open MPI. However, with Open MPI the
>>> checkpoint file automatically gets saved in the $HOME directory.
>>>
>>> Is there a way to have the file saved in a different location? I
>>> checked that LAM/MPI has some command line options:
>>>
>>> $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
>>>
>>> Do we have a similar option for open mpi?
>>
>> By default Open MPI places the global snapshot in the $HOME directory.
>> But you can also specify a different directory for the global snapshot
>> using the following MCA option:
>>   -mca snapc_base_global_snapshot_dir /somewhere/else
>>
>> For the best results you will likely want to set this in the MCA
>> params file in your home directory:
>> shell$ cat ~/.openmpi/mca-params.conf
>> snapc_base_global_snapshot_dir=/somewhere/else
>>
>> You can also stage the file to local disk, then have Open MPI transfer
>> the checkpoints back to a (logically) central storage device (both can
>> be /tmp on a local disk if you like). For more details on this and the
>> above option you will want to read through the FT User's Guide attached
>> to the wiki page at the link below:
>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>>
>> -- Josh
>>
>>>
>>>
>>> Thanks a lot
>>>
>>> regards,
>>>
>>> Raj
>>>
>>> --- On Wed, 6/17/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
>>>
>>>> From: Josh Hursey <jjhursey_at_[hidden]>
>>>> Subject: Re: [OMPI users] vfs_write returned -14
>>>> To: "Open MPI Users" <users_at_[hidden]>
>>>> Date: Wednesday, June 17, 2009, 1:42 AM
>>>> Did you try checkpointing a non-MPI application with BLCR on the
>>>> cluster? If that does not work then I would suspect that BLCR is not
>>>> working properly on the system.
>>>>
>>>> However, if a non-MPI application can be checkpointed and restarted
>>>> correctly on this machine then it may be something odd with the Open
>>>> MPI installation or runtime environment. To help debug here I would
>>>> need to know how Open MPI was configured and how the application was
>>>> run on the machine (command line arguments, environment variables, ...).
>>>>
>>>> I should note that for the program that you sent it is important that
>>>> you compile Open MPI with the Fault Tolerance Thread enabled to ensure
>>>> a timely checkpoint. Otherwise the checkpoint will be delayed until
>>>> the MPI program enters the MPI_Finalize function.
>>>>
>>>> Let me know what you find out.
>>>>
>>>> Josh
>>>>
>>>> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
>>>>
>>>>>
>>>>> Hi Josh,
>>>>>
>>>>> Thanks for the email. I have installed BLCR 0.8.1 and openmpi 1.3 on
>>>>> my laptop with Ubuntu 8.04 on it. It works fine.
>>>>>
>>>>> I have now tried the installation on the cluster (on one machine for
>>>>> now) at my university. The administrator installed it, and I am not
>>>>> sure if he followed the steps I gave him.
>>>>>
>>>>> I am checkpointing a simple MPI application which looks as follows:
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>   /* declares system() */
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int rank, size;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>     /* Print, then sleep outside the MPI library, three times over. */
>>>>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>>>>>     system("sleep 30");
>>>>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>>>>>     system("sleep 30");
>>>>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>>>>>     system("sleep 30");
>>>>>     printf("bye \n");
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> Do you think it's better to reinstall BLCR?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Raj
>>>>> --- On Tue, 6/16/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
>>>>>
>>>>>> From: Josh Hursey <jjhursey_at_[hidden]>
>>>>>> Subject: Re: [OMPI users] vfs_write returned -14
>>>>>> To: "Open MPI Users" <users_at_[hidden]>
>>>>>> Date: Tuesday, June 16, 2009, 6:42 PM
>>>>>>
>>>>>> These are errors from BLCR. It may be a problem with your
>>>>>> BLCR installation and/or your application. Are you able to
>>>>>> checkpoint/restart a non-MPI application with BLCR on these
>>>>>> machines?
>>>>>>
>>>>>> What kind of MPI application are you trying to checkpoint?
>>>>>> Some of the MPI interfaces are not fully supported at the
>>>>>> moment (outlined in the FT User Document that I mentioned in
>>>>>> a previous email).
>>>>>>
>>>>>> -- Josh
>>>>>>
>>>>>> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
>>>>>>
>>>>>>>
>>>>>>> Dear All,
>>>>>>>
>>>>>>> I have installed openmpi 1.3 and blcr 0.8.1 on a Linux machine
>>>>>>> (Ubuntu). However, when I try checkpointing an MPI application,
>>>>>>> I get the following error:
>>>>>>>
>>>>>>> - vfs_write returned -14
>>>>>>> - file_header: write returned -14
>>>>>>>
>>>>>>> Can someone help please.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Raj
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users