Subject: Re: [OMPI users] vfs_write returned -14
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-06-19 09:48:26


On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:

>
> Hello Josh,
> Thank you again for your response. I tried checkpointing a
> simple C program using BLCR and got the same error, i.e.:
>
> - vfs_write returned -14
> - file_header: write returned -14
> Checkpoint failed: Bad address

So I would look at how your NFS file system is set up, and work with
your sysadmin (and maybe the BLCR list) to resolve this before
experimenting too much with checkpointing with Open MPI.
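
One quick thing to check is which filesystem type backs each
directory. A rough sketch, assuming a Linux node with GNU df and awk
(fs_type is just a helper name made up for this example):

```shell
# Hypothetical helper: print the filesystem type that backs a directory.
# Assumes GNU df (-T adds a Type column, -P keeps each entry on one line).
fs_type() {
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Compare the directory that fails with one that may be on local disk.
for dir in "$HOME" /tmp; do
    echo "$dir: $(fs_type "$dir")"
done
```

If $HOME reports nfs (or nfs4) while /tmp reports a local type, that
would be consistent with the write failing only on the NFS mount.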

>
> This is how I installed and ran MPI programs for checkpointing:
>
> 1) Configure and install BLCR
> 2) Configure and install Open MPI
> 3) Compile and run the MPI program as follows:
> 4) To checkpoint the running program,
> 5) To restart your checkpoint, locate the checkpoint file and type
> the following from the command line:
>

This all looks ok to me.

> I then did another test with BLCR, however:
>
> I tried checkpointing my C application from the /tmp directory
> instead of my $HOME directory and it checkpointed fine.
>
> So, it looks like the problem is with my $HOME directory.
>
> I have "drwx" rights on my $HOME directory, which seems fine to me.
>
> Then I tried it with Open MPI. However, with Open MPI the
> checkpoint file automatically gets saved in the $HOME directory.
>
> Is there a way to have the file saved in a different location? I
> checked that LAM/MPI has a command line option for this:
>
> $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
>
> Do we have a similar option for Open MPI?

By default Open MPI places the global snapshot in the $HOME directory.
But you can also specify a different directory for the global snapshot
using the following MCA option:
   -mca snapc_base_global_snapshot_dir /somewhere/else

For the best results you will likely want to set this in the MCA
params file in your home directory:
  shell$ cat ~/.openmpi/mca-params.conf
  snapc_base_global_snapshot_dir=/somewhere/else
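
If you want to script that change, or only want the setting for one
session, something like the following should work. This is a sketch,
not something I have run on your system: /tmp/ckpt is a placeholder
directory, and the last line relies on Open MPI reading MCA parameters
from environment variables named OMPI_MCA_<parameter>.

```shell
# Assumption: /tmp/ckpt stands in for any local, writable directory.
snapshot_dir=/tmp/ckpt
params_file="$HOME/.openmpi/mca-params.conf"

# Make sure both the snapshot directory and the params directory exist.
mkdir -p "$snapshot_dir" "$(dirname "$params_file")"

# Append the setting only if the file does not already set this parameter.
if ! grep -q '^snapc_base_global_snapshot_dir' "$params_file" 2>/dev/null; then
    echo "snapc_base_global_snapshot_dir=$snapshot_dir" >> "$params_file"
fi

# Per-session alternative: set the same parameter through the environment.
export OMPI_MCA_snapc_base_global_snapshot_dir="$snapshot_dir"
```

The params file applies to every later mpirun from your account, while
the exported variable only affects the current shell.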

You can also stage the file to local disk, then have Open MPI transfer
the checkpoints back to a {logically} central storage device (both can
be /tmp on a local disk if you like). For more details on this and the
above option you will want to read through the FT Users Guide attached
to the wiki page at the link below:
   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

-- Josh

>
>
> Thanks a lot
>
> regards,
>
> Raj
>
> --- On Wed, 6/17/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
>
>> From: Josh Hursey <jjhursey_at_[hidden]>
>> Subject: Re: [OMPI users] vfs_write returned -14
>> To: "Open MPI Users" <users_at_[hidden]>
>> Date: Wednesday, June 17, 2009, 1:42 AM
>> Did you try checkpointing a non-MPI application with BLCR on the
>> cluster? If that does not work then I would suspect that BLCR is not
>> working properly on the system.
>>
>> However if a non-MPI application can be checkpointed and restarted
>> correctly on this machine then it may be something odd with the Open
>> MPI installation or runtime environment. To help debug here I would
>> need to know how Open MPI was configured and how the application was
>> run on the machine (command line arguments, environment variables, ...).
>>
>> I should note that for the program that you sent it is important that
>> you compile Open MPI with the Fault Tolerance Thread enabled to ensure
>> a timely checkpoint. Otherwise the checkpoint will be delayed until
>> the MPI program enters the MPI_Finalize function.
>>
>> Let me know what you find out.
>>
>> Josh
>>
>> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
>>
>>>
>>> Hi Josh,
>>>
>>> Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on
>>> my laptop with Ubuntu 8.04 on it. It works fine.
>>>
>>> I have now tried the installation on the cluster (on one machine for
>>> now) at my university. The administrator installed it, and I am not
>>> sure whether he followed the steps I gave him.
>>>
>>> I am checkpointing a simple MPI application which looks as follows:
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>   /* declares system() */
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank, size;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>>>     system("sleep 30");
>>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>>>     system("sleep 30");
>>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>>>     system("sleep 30");
>>>     printf("bye \n");
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>> Do you think it's better to reinstall BLCR?
>>>
>>>
>>> Thanks
>>>
>>> Raj
>>> --- On Tue, 6/16/09, Josh Hursey <jjhursey_at_[hidden]> wrote:
>>>
>>>> From: Josh Hursey <jjhursey_at_[hidden]>
>>>> Subject: Re: [OMPI users] vfs_write returned -14
>>>> To: "Open MPI Users" <users_at_[hidden]>
>>>> Date: Tuesday, June 16, 2009, 6:42 PM
>>>>
>>>> These are errors from BLCR. It may be a problem with your
>>>> BLCR installation and/or your application. Are you able to
>>>> checkpoint/restart a non-MPI application with BLCR on these
>>>> machines?
>>>>
>>>> What kind of MPI application are you trying to checkpoint?
>>>> Some of the MPI interfaces are not fully supported at the
>>>> moment (outlined in the FT User Document that I mentioned in
>>>> a previous email).
>>>>
>>>> -- Josh
>>>>
>>>> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
>>>>
>>>>>
>>>>> Dear All,
>>>>> I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine
>>>>> (Ubuntu). However, when I try checkpointing an MPI application, I get
>>>>> the following error:
>>>>>
>>>>> - vfs_write returned -14
>>>>> - file_header: write returned -14
>>>>>
>>>>> Can someone help, please?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Raj
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users