Specify the checkpoint option on the scrun command in the following
manner to periodically checkpoint your parallel process:
scrun -checkpoint=interval a.out
If you want to checkpoint in single-user mode, you must run scrun with
the group option, without the SCOUT environment:
scrun -group=pcc,checkpoint=interval a.out
where a.out is the program name and interval is the checkpointing
period, specified in the format [1-9][0-9]*[sSmMhHdD]. The trailing
[sSmMhHdD] is a suffix specifying the time unit: s/S, m/M, h/H, and d/D
are interpreted as seconds, minutes, hours, and days, respectively. The
default unit is minutes.
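The interval format described above can be checked before it is passed to scrun. The following is a minimal Python sketch, not part of SCore: the function name interval_to_seconds is hypothetical, and it assumes that an interval without a suffix is accepted and counted in minutes, which is how the "default unit" statement is read here.

```python
import re

# Seconds per unit; the suffix letters come from the format quoted above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Assumption: the suffix is optional, because the text names a default unit.
_INTERVAL_RE = re.compile(r"([1-9][0-9]*)([sSmMhHdD]?)")


def interval_to_seconds(text):
    """Convert an interval such as '30', '90s' or '2H' to seconds."""
    match = _INTERVAL_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid checkpoint interval: {text!r}")
    count, suffix = match.groups()
    unit = suffix.lower() or "m"  # assumption: no suffix means minutes
    return int(count) * _UNIT_SECONDS[unit]


if __name__ == "__main__":
    for sample in ("30", "90s", "2H", "1d"):
        print(sample, "->", interval_to_seconds(sample), "seconds")
```

A value such as 0m or 10x is rejected with ValueError, because the format requires a leading non-zero digit and one of the listed suffix letters.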
SCORE: Checkpointing ... done.
This message is printed after each successful checkpoint.
You can also trigger a checkpoint by sending SIGQUIT to the FEP
(Front-End Process). This is done by simply pressing "^\" in the console
window where you executed the program. You will be informed of the
successful checkpoint with the above message.
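When the console is not at hand, the same signal can be sent from a script. The following Python sketch only shows the signal delivery itself. The function name request_checkpoint is hypothetical, and how you obtain the PID of the FEP is left open; the sketch assumes you already know it.

```python
import os
import signal


def request_checkpoint(fep_pid):
    """Send SIGQUIT to the process with the given PID.

    Assumption: fep_pid is the PID of the FEP; this sketch does not
    verify that in any way.
    """
    os.kill(fep_pid, signal.SIGQUIT)


# Usage (12345 is a placeholder PID, not a real value):
# request_checkpoint(12345)
```

This is equivalent to pressing ^\ in the terminal only in the sense that both deliver SIGQUIT. A process that does not handle SIGQUIT is terminated by it, so the PID should be checked carefully before the call.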
installdir/crt.target/etc/init.d/crt start
You can execute scrun with the checkpoint option.
FEP:ERROR SCore-D unexpectedly terminated.
By specifying the -checkpoint option, you can avoid this behavior; i.e.,
the FEP process will stay alive and the following message is shown:
FEP:WARNING SCore-D unexpectedly terminated.
FEP: [07/Feb/2000 15:28:40] Waiting for SCore-D to be restarted ...
Do not kill this FEP process if you want to restart your parallel
process from a checkpoint. After a system crash, if the system
administrator runs SCore-D with the restart option, your parallel
process will be automatically restarted from the latest checkpoint, as
long as the FEP process is still alive. You will then see the following
message:
FEP: [07/Feb/2000 15:30:43] SCore-D restarted.
and your parallel process will continue from the last checkpoint image.
If a checkpoint image was not taken before a system crash, your parallel
process will execute from the beginning again.
FEP: [07/Feb/2000 16:31:17] restart canceled by SCore-D.
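If the FEP output is captured to a file, the messages quoted in this section can be recognised mechanically. The following Python sketch is an illustration only: the function name classify_fep_line and the state labels are hypothetical, the patterns are taken solely from the sample messages shown here, and the timestamp parsing assumes English month abbreviations.

```python
import re
from datetime import datetime

# Matches lines such as "FEP: [07/Feb/2000 15:30:43] SCore-D restarted."
_TIMESTAMPED = re.compile(r"FEP: \[([^\]]+)\] (.*)")


def classify_fep_line(line):
    """Return (state, timestamp) for a FEP line, or (None, None)."""
    line = line.strip()
    # Both the FEP:ERROR and FEP:WARNING variants end with this text.
    if line.startswith("FEP:") and line.endswith(
        "SCore-D unexpectedly terminated."
    ):
        return "terminated", None
    match = _TIMESTAMPED.fullmatch(line)
    if match is None:
        return None, None
    # %b depends on the locale; the sample messages use English names.
    stamp = datetime.strptime(match.group(1), "%d/%b/%Y %H:%M:%S")
    text = match.group(2)
    if text.startswith("Waiting for SCore-D to be restarted"):
        return "waiting", stamp
    if text == "SCore-D restarted.":
        return "restarted", stamp
    if text == "restart canceled by SCore-D.":
        return "canceled", stamp
    return None, None
```

A monitoring script could use the "waiting" state as a reminder that the FEP must be left running, and treat "canceled" as the point where the job has to be resubmitted by hand.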
mmap(2)'ed with the MAP_SHARED option and the file descriptor is bound
to a persistent file. Such a memory area can be resumed as long as the
file contents are preserved at the checkpoint.
gettimeofday(2) will return the current system clock time, not adjusted
by the interval time from the checkpoint.
Initializing checkpointer failed: error-code
Checkpointing failed: error-code
Restarting failed: error-code
Dynamically linked program is uncheckpointable.
Dynamically linked program is uncheckpointable
-checkpoint=number option to your scrun command.
The execution will be totally cancelled.
Fatal error in restarting: error-code
PC Cluster Consortium