1 | .\" |
---|
2 | .\" Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana |
---|
3 | .\" University Research and Technology |
---|
4 | .\" Corporation. All rights reserved. |
---|
5 | .\" Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. |
---|
6 | .\" |
---|
7 | .\" Man page for OPAL's CRS Functionality |
---|
8 | .\" |
---|
9 | .\" .TH name section center-footer left-footer center-header |
---|
10 | .TH OPAL_CRS 7 "Dec 08, 2009" "1.4" "Open MPI" |
---|
11 | |
---|
12 | .\" ************************** |
---|
13 | .\" Name Section |
---|
14 | .\" ************************** |
---|
15 | .SH NAME |
---|
16 | . |
---|
17 | Open PAL MCA Checkpoint/Restart Service (CRS) \- Overview of Open PAL's CRS |
---|
18 | framework, and selected modules. Open MPI 1.4. |
---|
19 | . |
---|
20 | .\" ************************** |
---|
21 | .\" Description Section |
---|
22 | .\" ************************** |
---|
23 | .SH DESCRIPTION |
---|
24 | . |
---|
25 | .PP |
---|
26 | Open PAL can involuntarily checkpoint and restart sequential programs. |
---|
27 | Doing so requires that Open PAL was compiled with thread support and |
---|
28 | that the back-end checkpointing systems are available at run-time. |
---|
29 | . |
---|
30 | .SS Phases of Checkpoint / Restart |
---|
31 | .PP |
---|
32 | Open PAL defines three phases for checkpoint / restart support in a |
---|
33 | process: |
---|
34 | . |
---|
35 | .TP 4 |
---|
36 | Checkpoint |
---|
37 | When the checkpoint request arrives, the process is notified of the |
---|
38 | request before the checkpoint is taken. |
---|
39 | . |
---|
40 | .TP 4 |
---|
41 | Continue |
---|
42 | After a checkpoint has successfully completed, the process that was |
---|
43 | checkpointed is notified that it is continuing execution. |
---|
44 | . |
---|
45 | .TP 4 |
---|
46 | Restart |
---|
47 | After a checkpoint has successfully completed, a new / restarted |
---|
48 | process is notified of its successful restart. |
---|
49 | . |
---|
50 | .PP |
---|
51 | The Continue and Restart phases are identical except for the process |
---|
52 | in which they are invoked. The Continue phase is invoked in the same process |
---|
53 | in which the Checkpoint phase was invoked. The Restart phase is only invoked in newly |
---|
54 | restarted processes. |
---|
55 | . |
---|
56 | .\" ************************** |
---|
57 | .\" General Process Requirements Section |
---|
58 | .\" ************************** |
---|
59 | .SH GENERAL PROCESS REQUIREMENTS |
---|
60 | .PP |
---|
61 | In order for a process to use the Open PAL CRS components it must adhere to a |
---|
62 | few programmatic requirements. |
---|
63 | .PP |
---|
64 | First, the program must call \fIOPAL_INIT\fR early in its execution. This |
---|
65 | should only be called once, and it is not possible to checkpoint the process |
---|
66 | without it first having called this function. |
---|
67 | .PP |
---|
68 | The program must call \fIOPAL_FINALIZE\fR before termination. This does a |
---|
69 | significant amount of cleanup. If it is not called, then it is very likely that |
---|
70 | remnants are left in the filesystem. |
---|
71 | .PP |
---|
72 | To checkpoint and restart a process, you must use the Open PAL tools. |
---|
73 | Using the backend checkpointer's own checkpoint and restart tools directly leads |
---|
74 | to undefined behavior. |
---|
75 | To checkpoint a process use \fIopal_checkpoint\fR (opal_checkpoint(1)). |
---|
76 | To restart a process use \fIopal_restart\fR (opal_restart(1)). |
---|
77 | . |
---|
78 | .\" ********************************** |
---|
79 | .\" Available Components Section |
---|
80 | .\" ********************************** |
---|
81 | .SH AVAILABLE COMPONENTS |
---|
82 | .PP |
---|
83 | Open PAL ships with two CRS components: \fIself\fR and \fIblcr\fR. |
---|
84 | . |
---|
85 | .PP |
---|
86 | The following MCA parameters apply to all components: |
---|
87 | . |
---|
88 | .TP 4 |
---|
89 | crs_base_verbose |
---|
90 | Set the verbosity level for all components. Default is 0, or silent except on error. |
---|
91 | . |
---|
92 | .TP |
---|
93 | crs_base_snapshot_dir |
---|
94 | The directory to store the checkpoint snapshots. Default is \fB/tmp\fP. |
---|
95 | . |
---|
96 | .\" Self Component |
---|
97 | .\" ****************** |
---|
98 | .SS self CRS Component |
---|
99 | .PP |
---|
100 | The \fIself\fR component invokes user-defined functions to save and restore |
---|
101 | checkpoints. It is simply a mechanism for user-defined functions to be invoked |
---|
102 | at Open PAL's Checkpoint, Continue, and Restart phases. Hence, the only data |
---|
103 | that is saved during the checkpoint is what is written in the user's checkpoint |
---|
104 | function. No library state is saved at all. |
---|
105 | . |
---|
106 | .PP |
---|
107 | As such, the model for the \fIself\fR component is slightly different from that of |
---|
108 | other components. Specifically, the Restart function is not invoked in the same |
---|
109 | process image of the process that was checkpointed. The Restart phase is |
---|
110 | invoked during \fBOPAL_INIT\fR of the new instance of the application (i.e., it |
---|
111 | starts over from main()). |
---|
112 | . |
---|
113 | .PP |
---|
114 | The \fIself\fR component has the following MCA parameters: |
---|
115 | .TP 4 |
---|
116 | crs_self_prefix |
---|
117 | Specify a string prefix for the name of the checkpoint, continue, and restart |
---|
118 | functions that Open PAL will invoke during the respective stages. That is, |
---|
119 | specifying "-mca crs_self_prefix foo" means that Open PAL expects to find |
---|
120 | three functions at run-time: |
---|
121 | |
---|
122 | int foo_checkpoint() |
---|
123 | |
---|
124 | int foo_continue() |
---|
125 | |
---|
126 | int foo_restart() |
---|
127 | |
---|
128 | By default, the prefix is set to "opal_crs_self_user". |
---|
129 | . |
---|
130 | .TP 4 |
---|
131 | crs_self_priority |
---|
132 | Set the \fIself\fR component's default priority. |
---|
133 | . |
---|
134 | .TP 4 |
---|
135 | crs_self_verbose |
---|
136 | Set the verbosity level. Default is 0, or silent except on error. |
---|
137 | . |
---|
138 | .TP 4 |
---|
139 | crs_self_do_restart |
---|
140 | This is mostly internally used. A general user should never need to set this |
---|
141 | value. This is set to non-0 when the new process should invoke the restart |
---|
142 | callback in \fIOPAL_INIT\fR. Default is 0, or normal execution. |
---|
143 | . |
---|
144 | .\" BLCR Component |
---|
145 | .\" ****************** |
---|
146 | .SS blcr CRS Component |
---|
147 | .PP |
---|
148 | The Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpointer is a |
---|
149 | software system developed at Lawrence Berkeley National Laboratory. See the |
---|
150 | project website for more details: |
---|
151 | |
---|
152 | \fI http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml \fR |
---|
153 | . |
---|
154 | .PP |
---|
155 | The \fIblcr\fR component has the following MCA parameters: |
---|
156 | .TP 4 |
---|
157 | crs_blcr_priority |
---|
158 | Set the \fIblcr\fR component's default priority. |
---|
159 | . |
---|
160 | .TP 4 |
---|
161 | crs_blcr_verbose |
---|
162 | Set the verbosity level. Default is 0, or silent except on error. |
---|
163 | . |
---|
164 | .\" Special 'none' option |
---|
165 | .\" ************************ |
---|
166 | .SS none CRS Component |
---|
167 | .PP |
---|
168 | The \fInone\fP component simply selects no CRS component. All of the CRS |
---|
169 | function calls return immediately with OPAL_SUCCESS. |
---|
170 | . |
---|
171 | .PP |
---|
172 | This component is the last component to be selected by default. This means that if |
---|
173 | another component is available, and the \fInone\fP component was not explicitly |
---|
174 | requested, then OPAL will attempt to activate all of the available components |
---|
175 | before falling back to this component. |
---|
176 | . |
---|
177 | .\" ************************** |
---|
178 | .\" See Also Section |
---|
179 | .\" ************************** |
---|
180 | . |
---|
181 | .SH SEE ALSO |
---|
182 | opal_checkpoint(1), opal_restart(1) |
---|
183 | .\", orte_crs(7), ompi_crs(7) |
---|