ORTE_SNAPC - Open RTE MCA Snapshot Coordination (SnapC) Framework: Overview of Open RTE’s SnapC framework and selected modules.
Open MPI 1.6.4
Open RTE can coordinate the generation of
a global snapshot of a parallel job from many distributed local snapshots.
The components in this framework determine how to initiate the checkpoint
of the parallel application, gather the many distributed local
snapshots, and provide the user with a global snapshot handle reference
that can be used to restart the parallel application.
For a process to use the Open RTE SnapC components,
it must adhere to a few programmatic requirements.
First, the program must call ORTE_INIT early in its execution. This
function should be called only once, and a process cannot be checkpointed
until it has called it.
Second, the program must call ORTE_FINALIZE before termination.
A user may initiate a checkpoint of a parallel application with the
orte-checkpoint(1) command and restart it from the resulting global
snapshot with the orte-restart(1) command.
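As a rough illustration of the checkpoint and restart commands, the sketch below builds the two command lines as argument lists without running them. It assumes that orte-checkpoint takes the process ID of the running mpirun and that orte-restart takes the global snapshot handle reference; both argument forms, the PID, and the handle name are assumptions for illustration, so confirm them against orte-checkpoint(1) and orte-restart(1).

```python
from typing import List


def checkpoint_argv(mpirun_pid: int) -> List[str]:
    """Build an orte-checkpoint command line.

    Assumption: the command takes the PID of the running mpirun process.
    """
    if mpirun_pid <= 0:
        raise ValueError("mpirun_pid must be a positive process ID")
    return ["orte-checkpoint", str(mpirun_pid)]


def restart_argv(global_snapshot_ref: str) -> List[str]:
    """Build an orte-restart command line.

    Assumption: the command takes the global snapshot handle reference
    reported when the checkpoint completed.
    """
    if not global_snapshot_ref:
        raise ValueError("a global snapshot handle reference is required")
    return ["orte-restart", global_snapshot_ref]


if __name__ == "__main__":
    # Both values below are hypothetical placeholders.
    print(" ".join(checkpoint_argv(1234)))
    print(" ".join(restart_argv("ompi_global_snapshot_1234.ckpt")))
```

The lists can be passed to subprocess.run once the argument forms have been checked against the installed Open MPI version.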
Open RTE ships with one functional SnapC component, full, in addition to
the none component.
The following MCA parameters apply to all components:
- snapc_base_verbose
  Set the verbosity level for all components. Default is 0, or silent except on error.
- snapc_base_global_snapshot_dir
  The directory to store the checkpoint snapshots. Default is /tmp.
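One way to set these parameters is on the mpirun command line with the --mca option. The sketch below only assembles such a command line; the application name ./my_app, the process count, and the verbosity value 10 are hypothetical, and a real checkpointable run may need further options that this page does not cover.

```python
from typing import Dict, List


def mpirun_with_mca(params: Dict[str, str], app_argv: List[str], nprocs: int) -> List[str]:
    """Build an mpirun argument list that sets each MCA parameter via --mca."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    argv = ["mpirun", "-np", str(nprocs)]
    for name, value in params.items():
        # Each parameter becomes: --mca <name> <value>
        argv += ["--mca", name, str(value)]
    return argv + list(app_argv)


if __name__ == "__main__":
    cmd = mpirun_with_mca(
        {
            "snapc_base_verbose": "10",                # hypothetical verbosity level
            "snapc_base_global_snapshot_dir": "/tmp",  # the documented default
        },
        ["./my_app"],                                  # hypothetical application
        nprocs=4,
    )
    print(" ".join(cmd))
```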
The full component gathers the local snapshots onto the disk local to
the Head Node Process (HNP) before completing the checkpoint of the
process. This component does not currently support replicated HNPs or
timer-based gathering of local snapshot data. The component is organized
as a 3-tiered hierarchy of coordinators.
The full component has the following MCA parameters:
- snapc_full_priority
  The component’s priority to use when selecting the most appropriate component for a run.
- snapc_full_verbose
  Set the verbosity level for this component. Default is 0, or silent except on error.
The none component simply selects no SnapC component. All
of the SnapC function calls return immediately with ORTE_SUCCESS.
This component is the last component to be selected by default. This
means that if another component is available and the none component was
not explicitly requested, ORTE will attempt to activate all of the
available components before falling back to this component.
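The fallback behavior described above can be pictured with a small toy model. This is not ORTE's selection code: the Component type, the activate callback, and the priority value 20 are all hypothetical, and the model only mirrors the rule that every other available component is tried before none unless none is requested explicitly.

```python
from typing import Callable, List, NamedTuple, Optional


class Component(NamedTuple):
    name: str
    priority: int                 # hypothetical value; higher is tried first
    activate: Callable[[], bool]  # returns True if the component can be used


def select_component(components: List[Component], requested: Optional[str] = None) -> str:
    """Toy model of SnapC selection with 'none' as the last resort."""
    if requested == "none":
        # An explicit request for 'none' skips all other components.
        return "none"
    if requested is not None:
        for comp in components:
            if comp.name == requested and comp.activate():
                return comp.name
        raise LookupError(f"requested component {requested!r} is not usable")
    # Default: try every real component, highest priority first.
    ordered = sorted(
        (c for c in components if c.name != "none"),
        key=lambda c: c.priority,
        reverse=True,
    )
    for comp in ordered:
        if comp.activate():
            return comp.name
    return "none"


if __name__ == "__main__":
    full = Component("full", 20, lambda: True)
    print(select_component([full]))
    print(select_component([full], requested="none"))
```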
orte-checkpoint(1),
orte-restart(1), opal-checkpoint(1), opal-restart(1),
orte_filem(7), opal_crs(7)