Checkpointing
This page describes how to checkpoint and restart your parallel process,
and the limitations of checkpointing.
How to checkpoint your parallel process
You can specify the -checkpoint option on the scrun command as follows to
checkpoint your parallel process periodically:
scrun -checkpoint=interval a.out
where a.out is the program name and interval is the checkpointing period,
specified in the format [1-9][0-9]*[sSmMhHdD].
The trailing letter (one of sSmMhHdD) is a suffix specifying the time unit:
s/S, m/M, h/H, and d/D stand for seconds, minutes, hours, and days,
respectively. The default unit is minutes.
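The interval format above can be checked before scrun is invoked. The following is a minimal sketch, not part of SCore: the function name interval_to_seconds is hypothetical, and treating the suffix as optional (falling back to minutes) is an assumption drawn from the stated default unit.

```python
import re

# Seconds per unit for each suffix letter (lower-cased).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Positive integer without leading zeros, then an optional unit suffix.
# Assumption: the suffix may be omitted, in which case minutes are used.
_INTERVAL_RE = re.compile(r"([1-9][0-9]*)([sSmMhHdD]?)")


def interval_to_seconds(interval: str) -> int:
    """Convert a -checkpoint interval string such as '30m' to seconds."""
    match = _INTERVAL_RE.fullmatch(interval)
    if match is None:
        raise ValueError(f"invalid checkpoint interval: {interval!r}")
    count, suffix = match.groups()
    unit = suffix.lower() or "m"  # no suffix: fall back to minutes
    return int(count) * _UNIT_SECONDS[unit]
```

For example, interval_to_seconds("2H") gives 7200, while a string such as "0m" or "5x" is rejected with ValueError.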
After the parallel process is started, the following message is printed to
stderr periodically:
SCORE: Checkpointing ... done.
This message is printed after each successful checkpoint.
Another way to checkpoint your process is to send SIGQUIT to the FEP
(Front-End Process). You can do this by pressing ^\ (Ctrl-\) in the console
window where you executed the program.
A successful checkpoint is reported with the same message as above.
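When the console window is not at hand, the same signal can be sent from a script. This is a minimal sketch: request_checkpoint is a hypothetical helper name, and the PID of the FEP process is assumed to be known already (for example, looked up with ps), since this page does not describe how to obtain it.

```python
import os
import signal


def request_checkpoint(fep_pid: int) -> None:
    """Ask the FEP process with the given PID to take a checkpoint now."""
    # SIGQUIT is the signal a terminal sends for Ctrl-backslash by default.
    os.kill(fep_pid, signal.SIGQUIT)
```

As with the keyboard method, success is reported by the FEP process on its own stderr, not to the caller of this function.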
How your parallel process is restarted from a checkpoint
If SCore-D terminates for some reason, your FEP process is normally
terminated as well, with the following message printed to stderr:
FEP:ERROR SCore-D unexpectedly terminated.
By specifying the -checkpoint option, you can avoid this behavior; that is,
the FEP process stays alive and the following messages are shown:
FEP:WARNING SCore-D unexpectedly terminated.
FEP: [07/Feb/2000 15:28:40] Waiting for SCore-D to be restarted ...
Do not kill this FEP process if you want to restart your parallel
process from a checkpoint. After a system crash, if the system
administrator runs SCore-D with the restart option, your parallel
process will be automatically restarted from the latest checkpoint, as
long as the FEP process is still alive. You will then see the following
message:
FEP: [07/Feb/2000 15:30:43] SCore-D restarted.
and your parallel process will continue from the last checkpoint image.
If no checkpoint image was taken before the system crash, your parallel
process will run again from the beginning.
If the system administrator does not rerun SCore-D with the restart option,
then your FEP process will be killed and the checkpoint for the parallel
process is lost. You will see the following message:
FEP: [07/Feb/2000 16:31:17] restart canceled by SCore-D.
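A script that watches the FEP process's stderr can tell these three outcomes apart by the timestamped messages. The sketch below assumes the lines look exactly like the samples on this page; parse_fep_line and the state labels "waiting", "restarted", and "canceled" are hypothetical names chosen for illustration.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Timestamped FEP lines, e.g.
# "FEP: [07/Feb/2000 15:28:40] Waiting for SCore-D to be restarted ..."
_FEP_LINE = re.compile(r"FEP: \[(?P<stamp>[^\]]+)\] (?P<text>.*)")

# Message text -> state label (labels are hypothetical, not SCore terms).
_STATES = {
    "Waiting for SCore-D to be restarted ...": "waiting",
    "SCore-D restarted.": "restarted",
    "restart canceled by SCore-D.": "canceled",
}


def parse_fep_line(line: str) -> Optional[Tuple[datetime, str]]:
    """Return (timestamp, state) for a recognised FEP line, else None."""
    match = _FEP_LINE.fullmatch(line.strip())
    if match is None:
        return None
    state = _STATES.get(match.group("text"))
    if state is None:
        return None
    # Month abbreviations such as "Feb" are parsed per the current locale.
    stamp = datetime.strptime(match.group("stamp"), "%d/%b/%Y %H:%M:%S")
    return stamp, state
```

A "restarted" result means the parallel process continues from the latest checkpoint; a "canceled" result means the FEP process is being killed and the checkpoint is lost.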
Limitations
The following limitations apply when using the checkpointing facility in
SCore-D:
- A dynamically linked program cannot be checkpointed. Link your program
with the -static option to make it checkpointable.
- Programs using zero-copy communication cannot be restarted.
(This problem will be fixed in a future release.)
- Only the latest checkpoint can be used to restart the parallel
process. A new checkpoint overwrites the previous one.
- File contents cannot be resumed. You must preserve the contents of
files so they remain unchanged from the checkpoint time until the
parallel process is restarted.
- Sockets, pipes, and shared memory cannot be resumed. The only
exception is a shared memory area that is mapped by mmap(2) with the
MAP_SHARED option and whose file descriptor is bound to a persistent
file. Such a memory area can be resumed as long as the file contents
are preserved as they were at the checkpoint.
- The parallel process cannot be restarted on another node; it is
restarted on the same node as at the checkpoint. (A parallel
process migration facility will be provided for this purpose.)
- Kernel and hardware state cannot be resumed. For example, the value
returned by getpid(2) is not the same as at the checkpoint, and
gettimeofday(2) returns the current system clock time, not a time
adjusted for the interval elapsed since the checkpoint.
- Recovery from a disk crash is not possible. Checkpoints are permanently
lost in such a case. (This problem will be fixed in a future
release.)
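Because file contents are not part of a checkpoint, it can help to verify before a restart that the files your program depends on are unchanged. The following is a minimal sketch of one way to do this outside the parallel process; it is not an SCore facility, the helper names snapshot_digests and changed_files are hypothetical, and which files to track is left to you.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def snapshot_digests(paths: Iterable[str]) -> Dict[str, str]:
    """Record a SHA-256 digest for each file, e.g. right after a checkpoint."""
    return {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(snapshot: Dict[str, str]) -> List[str]:
    """List files that are missing or whose contents differ from the snapshot."""
    changed = []
    for path, digest in snapshot.items():
        file = Path(path)
        if not file.is_file():
            changed.append(path)
        elif hashlib.sha256(file.read_bytes()).hexdigest() != digest:
            changed.append(path)
    return changed
```

An empty list from changed_files suggests the tracked files still match the snapshot. The snapshot is not taken atomically with the checkpoint, so a file written between the two can still go unnoticed.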
Messages
The following messages may appear while checkpointing/restarting:
- Warning messages:
  - Initializing checkpointer failed: error-code
    This message is shown while initializing the application
    program. The program is started normally, but requests
    for checkpointing are ignored.
  - Checkpointing failed: error-code
    This message is shown while checkpointing. The
    requested checkpoint is cancelled this time.
    Subsequent checkpoints, however, are executed normally.
  - Restarting failed: error-code
    This message is shown while restarting. The program
    is started normally from the beginning instead of being
    restarted from the checkpoint.
  - Dynamically linked program is uncheckpointable.
    This message is shown when you request checkpointing by
    sending SIGQUIT to the FEP process. The requested
    checkpoint is cancelled.
- Error messages:
  - Dynamically linked program is uncheckpointable
    This message is shown when you specify the
    -checkpoint=number option on your scrun command.
    Execution is cancelled entirely.
  - Fatal error in restarting: error-code
    This message is shown while restarting. Because of a
    serious error, the program is not started at all, not
    even from the beginning. Execution is cancelled entirely.
Bugs
The checkpoint function in the current version of SCore has the following bugs:
- Parallel processes using PM Zero-Copy communication are not checkpointable.
$Id: checkpoint.html,v 1.2 2002/03/08 06:23:14 hirose Exp $