Checkpointing


This page describes how to checkpoint and restart your parallel process, and the limitations of checkpointing.

How to checkpoint your parallel process

You can specify the checkpoint option on the scrun command in the following manner to periodically checkpoint your parallel process:
scrun -checkpoint=interval a.out
If you want to checkpoint in single-user mode, you must run the program with the group option and without the SCOUT environment:
scrun -group=pcc,checkpoint=interval a.out
where a.out is the program name and interval is the checkpointing period, specified in the format [1-9][0-9]*[sSmMhHdD]. The trailing letter (one of sSmMhHdD) is a suffix specifying the time unit: s/S, m/M, h/H, and d/D are interpreted as seconds, minutes, hours, and days, respectively. The default unit is minutes.
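
The interval format can be illustrated with a small shell helper that converts an interval to seconds. This is only a sketch and not part of SCore: the function name interval_to_seconds is hypothetical, and it assumes that a number without a suffix is accepted and counted in minutes, as the stated default suggests.

```shell
# Convert a checkpoint interval such as "90s", "30m", "2H" or "1d" to seconds.
# Hypothetical helper, not an SCore command.
interval_to_seconds() {
    spec=$1
    case $spec in
        *[sS]) unit=1;     count=${spec%?} ;;
        *[mM]) unit=60;    count=${spec%?} ;;
        *[hH]) unit=3600;  count=${spec%?} ;;
        *[dD]) unit=86400; count=${spec%?} ;;
        *)     unit=60;    count=$spec ;;   # assumption: no suffix means minutes
    esac
    # The count must match [1-9][0-9]*: digits only, no leading zero.
    case $count in
        ''|0*|*[!0-9]*) echo "invalid interval: $spec" >&2; return 1 ;;
    esac
    echo $((count * unit))
}

interval_to_seconds 30m   # prints 1800
```

Under these assumptions, 2H gives 7200 and 1d gives 86400, while 0m and 5x are rejected with a non-zero exit status.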

After the parallel process is started, you will see the following message printed to stderr periodically:
SCORE: Checkpointing ... done.
This message is printed after each successful checkpoint.
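
Because the message goes to stderr, one way to check how many checkpoints have completed is to capture stderr in a file and count the matching lines. The sketch below assumes that stderr of the run was redirected with ordinary shell redirection (for example, scrun -checkpoint=30m a.out 2> run.err); the file name run.err and the function name count_checkpoints are hypothetical.

```shell
# Count the successful checkpoints recorded in a captured stderr log.
# Assumption: $1 is a file holding the stderr output of the run.
# grep -c prints the number of matching lines; -F matches the text literally,
# so the dots in the message are not treated as regular-expression wildcards.
count_checkpoints() {
    grep -c -F 'SCORE: Checkpointing ... done.' "$1"
}
```

Note that grep exits with a non-zero status when the count is 0, although it still prints 0.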

Another way to checkpoint your process is to send SIGQUIT to the FEP (Front-End Process). This is done by simply pressing "^\" (Ctrl-\) in the console window where you executed the program. You will be informed of the successful checkpoint with the above message.

How your parallel process is restarted from a checkpoint

If SCore-D terminates for some reason, your FEP process will simply be terminated, with the following message printed to stderr:
FEP:ERROR SCore-D unexpectedly terminated.
By specifying the -checkpoint option, you can avoid this behavior; i.e., the FEP process will stay alive and the following message is shown:
FEP:WARNING SCore-D unexpectedly terminated.
FEP: [07/Feb/2000 15:28:40] Waiting for SCore-D to be restarted ...
Do not kill this FEP process if you want to restart your parallel process from a checkpoint. After a system crash, if the system administrator runs SCore-D with the restart option, your parallel process will be automatically restarted from the latest checkpoint, as long as the FEP process is still alive. You will then see the following message:
FEP: [07/Feb/2000 15:30:43] SCore-D restarted.
and your parallel process will continue from the last checkpoint image. If a checkpoint image was not taken before a system crash, your parallel process will execute from the beginning again.

If the system administrator does not rerun SCore-D with the restart option, then your FEP process will be killed and the checkpoint for the parallel process is lost. You will see the following message:
FEP: [07/Feb/2000 16:31:17] restart canceled by SCore-D.
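
The FEP messages quoted in this section can be told apart mechanically, for instance by a wrapper script that watches stderr. The following is a minimal sketch: the function name fep_state and the state names it prints are made up for illustration and are not part of SCore; only the message texts come from this page.

```shell
# Map one line of FEP stderr output to a short state name.
# The state names are hypothetical labels chosen for this sketch.
fep_state() {
    case $1 in
        'FEP:ERROR SCore-D unexpectedly terminated.'*)     echo terminated ;;
        'FEP:WARNING SCore-D unexpectedly terminated.'*)   echo scored-down ;;
        'FEP: ['*'] Waiting for SCore-D to be restarted'*) echo waiting ;;
        'FEP: ['*'] SCore-D restarted.'*)                  echo restarted ;;
        'FEP: ['*'] restart canceled by SCore-D.'*)        echo canceled ;;
        *)                                                 echo other ;;
    esac
}

fep_state 'FEP: [07/Feb/2000 15:30:43] SCore-D restarted.'   # prints restarted
```

The timestamp inside the brackets is matched with a wildcard, so the same patterns apply regardless of when the message was printed.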

Limitations

The following limitations exist when using the checkpointing facility in SCore-D:

Messages

The following messages may appear while checkpointing/restarting:



CREDIT
This document is a part of the SCore cluster system software developed at PC Cluster Consortium, Japan. Copyright (C) 2003 PC Cluster Consortium.