OPAL_CRS - Open PAL MCA Checkpoint/Restart Service (CRS):
Overview of Open PAL’s CRS framework, and selected modules. Open MPI 1.5.5.
Open PAL can involuntarily checkpoint and restart sequential
programs. Doing so requires that Open PAL was compiled with thread support
and that the back-end checkpointing systems are available at run-time.
Open PAL defines three phases for checkpoint / restart
support in a process:
- Checkpoint
- When the checkpoint request arrives,
the process is notified of the request before the checkpoint is taken.
- Continue
- After a checkpoint has successfully completed, the process that was
checkpointed is notified that it is continuing execution.
- Restart
- After a checkpoint has successfully completed, a new / restarted
process is notified of its successful restart.
The Continue and Restart
phases are identical except for the process in which they are invoked. The
Continue phase is invoked in the same process in which the Checkpoint phase
was invoked. The Restart phase is only invoked in newly restarted processes.
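The relationship between the three phases can be sketched in C as follows. This is only an illustrative model: the crs_phase enum and the two helper functions are hypothetical names and are not part of the Open PAL API.

```c
#include <stdbool.h>

/* Hypothetical labels for the three phases described above. */
enum crs_phase {
    CRS_PHASE_CHECKPOINT,
    CRS_PHASE_CONTINUE,
    CRS_PHASE_RESTART
};

/*
 * Return the phase that follows a successful checkpoint.
 * The process that took the checkpoint sees Continue; a process
 * started from the snapshot sees Restart.
 */
enum crs_phase phase_after_checkpoint(bool is_restarted_process)
{
    return is_restarted_process ? CRS_PHASE_RESTART : CRS_PHASE_CONTINUE;
}

/* Human-readable name for a phase, e.g. for log messages. */
const char *crs_phase_name(enum crs_phase phase)
{
    switch (phase) {
    case CRS_PHASE_CHECKPOINT: return "Checkpoint";
    case CRS_PHASE_CONTINUE:   return "Continue";
    case CRS_PHASE_RESTART:    return "Restart";
    }
    return "Unknown";
}
```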
In order for a process to use the Open
PAL CRS components it must adhere to a few programmatic requirements.
First,
the program must call OPAL_INIT early in its execution. It should be called
only once, and the process cannot be checkpointed until it has called
this function.
Second, the program must call OPAL_FINALIZE
before termination. This does a significant amount of cleanup. If it is
not called, remnants are very likely to be left in the filesystem.
Finally, to checkpoint and restart a process you must use the Open PAL
tools. Using the back-end checkpointer’s own checkpoint and restart tools
leads to undefined behavior. To checkpoint a process use opal_checkpoint
(opal_checkpoint(1)). To restart a process use opal_restart (opal_restart(1)).
Open PAL ships with two CRS components: self and
blcr.
The following MCA parameters apply to all components:
- crs_base_verbose
- Set the verbosity level for all components. Default is 0, or silent except
on error.
- crs_base_snapshot_dir
- The directory to store the checkpoint snapshots.
Default is /tmp.
The self component invokes user-defined
functions to save and restore checkpoints. It is simply a mechanism for
user-defined functions to be invoked at Open PAL’s Checkpoint, Continue,
and Restart phases. Hence, the only data that is saved during the checkpoint
is what is written in the user’s checkpoint function. No library state is
saved at all.
As such, the model for the self component is slightly different
than for other components. Specifically, the Restart function is not invoked
in the process image that was checkpointed. Instead, the Restart
phase is invoked during OPAL_INIT of the new instance of the application
(i.e., it starts over from main()).
The self component has the following
MCA parameters:
- crs_self_prefix
- Specify a string prefix for the names of
the checkpoint, continue, and restart functions that Open PAL will invoke
during the respective phases. For example, specifying "-mca crs_self_prefix
foo" means that Open PAL expects to find three functions at run-time:
int foo_checkpoint()
int foo_continue()
int foo_restart()
By default, the prefix is set to "opal_crs_self_user".
- crs_self_priority
- Set the self component’s default priority.
- crs_self_verbose
- Set the verbosity
level. Default is 0, or silent except on error.
- crs_self_do_restart
- This
parameter is mostly used internally. A general user should never need to set
it. It is set to a non-zero value when the new process should invoke the
restart callback in OPAL_INIT. Default is 0, or normal execution.
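A minimal sketch of the three callbacks for the prefix foo is shown below. The state variable foo_iteration, the file name foo_state.txt, and the convention of returning 0 on success are assumptions made for illustration; they are not taken from Open PAL.

```c
#include <stdio.h>

/* Hypothetical application state that must survive a restart. */
long foo_iteration = 0;

/* Hypothetical file in which the application keeps its own snapshot. */
static const char *foo_state_file = "foo_state.txt";

/* Checkpoint phase: write out everything the application needs later. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(foo_state_file, "w");
    if (fp == NULL) {
        return -1;
    }
    int ok = fprintf(fp, "%ld\n", foo_iteration) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Continue phase: same process as the checkpoint, so state is intact. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: a new process, so reload the state saved above. */
int foo_restart(void)
{
    FILE *fp = fopen(foo_state_file, "r");
    if (fp == NULL) {
        return -1;
    }
    int matched = fscanf(fp, "%ld", &foo_iteration);
    fclose(fp);
    return matched == 1 ? 0 : -1;
}
```

Because the self component saves no library state, anything not written by foo_checkpoint() in this sketch would be lost when the application starts over from main().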
The
Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpointer is a
software system developed at Lawrence Berkeley National Laboratory. See the
project website for more details:
http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
The blcr component has the following MCA parameters:
- crs_blcr_priority
- Set the blcr component’s default priority.
- crs_blcr_verbose
- Set the verbosity
level. Default is 0, or silent except on error.
The
none component simply selects no CRS component. All of the CRS function
calls return immediately with OPAL_SUCCESS.
This component is the last
component to be selected by default. This means that if another component
is available, and the none component was not explicitly requested, then Open
PAL will attempt to activate all of the available components before falling
back to this component.
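The fallback behavior can be pictured with a small priority-based selection sketch. The crs_component struct, the select_crs_component() function, and any priority values passed to it are hypothetical; the actual MCA selection logic is more involved and is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one CRS component. */
struct crs_component {
    const char *name;
    int priority;    /* higher wins; values are illustrative only */
    bool available;  /* e.g., back-end checkpointer usable at run-time */
};

/*
 * Pick the available component with the highest priority.
 * If "none" is given the lowest priority, it is chosen only when
 * no other component is available. Returns NULL if nothing is available.
 */
const struct crs_component *
select_crs_component(const struct crs_component *components, size_t count)
{
    const struct crs_component *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (!components[i].available) {
            continue;
        }
        if (best == NULL || components[i].priority > best->priority) {
            best = &components[i];
        }
    }
    return best;
}
```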
opal_checkpoint(1), opal_restart(1)