| 1 | .\" |
---|
| 2 | .\" Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana |
---|
| 3 | .\" University Research and Technology |
---|
| 4 | .\" Corporation. All rights reserved. |
---|
| 5 | .\" Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. |
---|
| 6 | .\" |
---|
| 7 | .\" Man page for OPAL's CRS Functionality |
---|
| 8 | .\" |
---|
| 9 | .\" .TH name section center-footer left-footer center-header |
---|
| 10 | .TH OPAL_CRS 7 "Dec 08, 2009" "1.4" "Open MPI" |
---|
| 11 | |
---|
| 12 | .\" ************************** |
---|
| 13 | .\" Name Section |
---|
| 14 | .\" ************************** |
---|
| 15 | .SH NAME |
---|
| 16 | . |
---|
| 17 | Open PAL MCA Checkpoint/Restart Service (CRS) \- Overview of Open PAL's CRS |
---|
| 18 | framework, and selected modules. Open MPI 1.4. |
---|
| 19 | . |
---|
| 20 | .\" ************************** |
---|
| 21 | .\" Description Section |
---|
| 22 | .\" ************************** |
---|
| 23 | .SH DESCRIPTION |
---|
| 24 | . |
---|
| 25 | .PP |
---|
| 26 | Open PAL can involuntarily checkpoint and restart sequential programs. |
---|
| 27 | Doing so requires that Open PAL was compiled with thread support and |
---|
| 28 | that the back-end checkpointing systems are available at run-time. |
---|
| 29 | . |
---|
| 30 | .SS Phases of Checkpoint / Restart |
---|
| 31 | .PP |
---|
| 32 | Open PAL defines three phases for checkpoint / restart support in a |
---|
| 33 | process: |
---|
| 34 | . |
---|
| 35 | .TP 4 |
---|
| 36 | Checkpoint |
---|
| 37 | When the checkpoint request arrives, the process is notified of the |
---|
| 38 | request before the checkpoint is taken. |
---|
| 39 | . |
---|
| 40 | .TP 4 |
---|
| 41 | Continue |
---|
| 42 | After a checkpoint has successfully completed, the process that was |
---|
| 43 | checkpointed is notified of its successful continuation of execution. |
---|
| 44 | . |
---|
| 45 | .TP 4 |
---|
| 46 | Restart |
---|
| 47 | After a checkpoint has successfully completed, a new / restarted |
---|
| 48 | process is notified of its successful restart. |
---|
| 49 | . |
---|
| 50 | .PP |
---|
| 51 | The Continue and Restart phases are identical except for the process |
---|
| 52 | in which they are invoked. The Continue phase is invoked in the same process |
---|
| 53 | in which the Checkpoint phase was invoked. The Restart phase is invoked only in newly |
---|
| 54 | restarted processes. |
---|
| 55 | . |
---|
| 56 | .\" ************************** |
---|
| 57 | .\" General Process Requirements Section |
---|
| 58 | .\" ************************** |
---|
| 59 | .SH GENERAL PROCESS REQUIREMENTS |
---|
| 60 | .PP |
---|
| 61 | In order for a process to use the Open PAL CRS components it must adhere to a |
---|
| 62 | few programmatic requirements. |
---|
| 63 | .PP |
---|
| 64 | First, the program must call \fIOPAL_INIT\fR early in its execution. This |
---|
| 65 | should only be called once, and it is not possible to checkpoint the process |
---|
| 66 | without it first having called this function. |
---|
| 67 | .PP |
---|
| 68 | The program must call \fIOPAL_FINALIZE\fR before termination. This does a |
---|
| 69 | significant amount of cleanup. If it is not called, remnants are very likely |
---|
| 70 | to be left in the filesystem. |
---|
| 71 | .PP |
---|
| 72 | To checkpoint and restart a process you must use the Open PAL tools to do |
---|
| 73 | so. Using the backend checkpointer's checkpoint and restart tools will lead |
---|
| 74 | to undefined behavior. |
---|
| 75 | To checkpoint a process use \fIopal_checkpoint\fR (opal_checkpoint(1)). |
---|
| 76 | To restart a process use \fIopal_restart\fR (opal_restart(1)). |
---|
| 77 | . |
---|
| 78 | .\" ********************************** |
---|
| 79 | .\" Available Components Section |
---|
| 80 | .\" ********************************** |
---|
| 81 | .SH AVAILABLE COMPONENTS |
---|
| 82 | .PP |
---|
| 83 | Open PAL ships with two CRS components: \fIself\fR and \fIblcr\fR. |
---|
| 84 | . |
---|
| 85 | .PP |
---|
| 86 | The following MCA parameters apply to all components: |
---|
| 87 | . |
---|
| 88 | .TP 4 |
---|
| 89 | crs_base_verbose |
---|
| 90 | Set the verbosity level for all components. Default is 0, or silent except on error. |
---|
| 91 | . |
---|
| 92 | .TP |
---|
| 93 | crs_base_snapshot_dir |
---|
| 94 | The directory to store the checkpoint snapshots. Default is \fB/tmp\fP. |
---|
| 95 | . |
---|
| 96 | .\" Self Component |
---|
| 97 | .\" ****************** |
---|
| 98 | .SS self CRS Component |
---|
| 99 | .PP |
---|
| 100 | The \fIself\fR component invokes user-defined functions to save and restore |
---|
| 101 | checkpoints. It is simply a mechanism for user-defined functions to be invoked |
---|
| 102 | at Open PAL's Checkpoint, Continue, and Restart phases. Hence, the only data |
---|
| 103 | that is saved during the checkpoint is what is written in the user's checkpoint |
---|
| 104 | function. No library state is saved at all. |
---|
| 105 | . |
---|
| 106 | .PP |
---|
| 107 | As such, the model for the \fIself\fR component is slightly different from that of |
---|
| 108 | other components. Specifically, the Restart function is not invoked in the same |
---|
| 109 | process image of the process that was checkpointed. The Restart phase is |
---|
| 110 | invoked during \fBOPAL_INIT\fR of the new instance of the application (i.e., it |
---|
| 111 | starts over from main()). |
---|
| 112 | . |
---|
| 113 | .PP |
---|
| 114 | The \fIself\fR component has the following MCA parameters: |
---|
| 115 | .TP 4 |
---|
| 116 | crs_self_prefix |
---|
| 117 | Specify a string prefix for the name of the checkpoint, continue, and restart |
---|
| 118 | functions that Open PAL will invoke during the respective stages. That is, |
---|
| 119 | specifying "-mca crs_self_prefix foo" means that Open PAL expects to find |
---|
| 120 | three functions at run-time: |
---|
| 121 | |
---|
| 122 | int foo_checkpoint() |
---|
| 123 | |
---|
| 124 | int foo_continue() |
---|
| 125 | |
---|
| 126 | int foo_restart() |
---|
| 127 | |
---|
| 128 | By default, the prefix is set to "opal_crs_self_user". |
---|
| 129 | . |
---|
| 130 | .TP 4 |
---|
| 131 | crs_self_priority |
---|
| 132 | Set the \fIself\fR component's default priority. |
---|
| 133 | . |
---|
| 134 | .TP 4 |
---|
| 135 | crs_self_verbose |
---|
| 136 | Set the verbosity level. Default is 0, or silent except on error. |
---|
| 137 | . |
---|
| 138 | .TP 4 |
---|
| 139 | crs_self_do_restart |
---|
| 140 | This is mostly internally used. A general user should never need to set this |
---|
| 141 | value. This is set to non-0 when the new process should invoke the restart |
---|
| 142 | callback in \fIOPAL_INIT\fR. Default is 0, or normal execution. |
---|
| 143 | . |
---|
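As an illustration of the crs_self_prefix naming convention described above, here is a minimal C sketch of the three callbacks for a hypothetical prefix my_app (that is, "-mca crs_self_prefix my_app"). The state variable, the state file name, and the convention that a return value of 0 signals success are assumptions made for this example; they are not taken from Open PAL.

```c
#include <stdio.h>

/* Hypothetical application state and state file (illustration only). */
static long my_app_iteration = 0;
static const char *my_app_state_file = "my_app_state.txt";

/* Checkpoint phase: write everything the application needs to resume.
 * The self component saves nothing on its own. */
int my_app_checkpoint(void)
{
    FILE *fp = fopen(my_app_state_file, "w");
    int ok;

    if (fp == NULL) {
        return -1;
    }
    ok = fprintf(fp, "%ld\n", my_app_iteration) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1; /* assumption: 0 means success */
}

/* Continue phase: runs in the process that was checkpointed, so the
 * in-memory state is still valid and nothing has to be reloaded. */
int my_app_continue(void)
{
    return 0;
}

/* Restart phase: runs in a new process, so the state must be read
 * back from what the checkpoint function wrote. */
int my_app_restart(void)
{
    FILE *fp = fopen(my_app_state_file, "r");
    int fields;

    if (fp == NULL) {
        return -1;
    }
    fields = fscanf(fp, "%ld", &my_app_iteration);
    fclose(fp);
    return fields == 1 ? 0 : -1;
}
```

Because the self component saves no library state, anything the application needs after a restart has to be written by my_app_checkpoint and read back by my_app_restart.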
| 144 | .\" BLCR Component |
---|
| 145 | .\" ****************** |
---|
| 146 | .SS blcr CRS Component |
---|
| 147 | .PP |
---|
| 148 | The Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpointer is a |
---|
| 149 | software system developed at Lawrence Berkeley National Laboratory. See the |
---|
| 150 | project website for more details: |
---|
| 151 | |
---|
| 152 | \fI http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml \fR |
---|
| 153 | . |
---|
| 154 | .PP |
---|
| 155 | The \fIblcr\fR component has the following MCA parameters: |
---|
| 156 | .TP 4 |
---|
| 157 | crs_blcr_priority |
---|
| 158 | Set the \fIblcr\fR component's default priority. |
---|
| 159 | . |
---|
| 160 | .TP 4 |
---|
| 161 | crs_blcr_verbose |
---|
| 162 | Set the verbosity level. Default is 0, or silent except on error. |
---|
| 163 | . |
---|
| 164 | .\" Special 'none' option |
---|
| 165 | .\" ************************ |
---|
| 166 | .SS none CRS Component |
---|
| 167 | .PP |
---|
| 168 | The \fInone\fP component simply selects no CRS component. All of the CRS |
---|
| 169 | function calls return immediately with OPAL_SUCCESS. |
---|
| 170 | . |
---|
| 171 | .PP |
---|
| 172 | This component is the last component to be selected by default. This means that if |
---|
| 173 | another component is available, and the \fInone\fP component was not explicitly |
---|
| 174 | requested, then Open PAL will attempt to activate all of the available components |
---|
| 175 | before falling back to this component. |
---|
| 176 | . |
---|
| 177 | .\" ************************** |
---|
| 178 | .\" See Also Section |
---|
| 179 | .\" ************************** |
---|
| 180 | . |
---|
| 181 | .SH SEE ALSO |
---|
| 182 | opal_checkpoint(1), opal_restart(1) |
---|
| 183 | .\", orte_crs(7), ompi_crs(7) |
---|