Resuming SCore-D from an Unexpected Failure
SCore-D checkpoints itself every time a user logs in and out.
If SCore-D is re-invoked with the -restart option, it tries to recover
itself from its most recent checkpoint. The command line is as follows:
scored -restart
By resuming SCore-D, restartable user parallel processes are recovered.
User parallel processes that were specified with a restart or checkpoint
option, or that were checkpointed with "^\" (SIGQUIT), are considered
restartable.
If a checkpoint image is found for a restartable parallel process,
SCore-D tries to resume the process from that checkpoint.
The following is an example of a successful restart of SCore-D and a
user parallel process:
# scored -restart
SYSLOG: Timeslice is set to 500[ms]
SYSLOG: Cluster[0]: comp0.trc.rwcp.or.jp@0...comp3.trc.rwcp.or.jp@3
SYSLOG: BIN=linux, CPUGEN=pentium-iii, SMP=1, SPEED=500
SYSLOG: Network[0]: myrinet/myrinet
SYSLOG: SCore-D network: myrinet/myrinet
SYSLOG: Recover: user1@host1.trc.rwcp.or.jp:4681
SYSLOG: SCore-D server: comp3.trc.rwcp.or.jp:9901
If the -restart option is not specified when re-invoking SCore-D,
previously checkpointed user parallel processes are not restarted and
their checkpoint images are lost:
# scored
SYSLOG: Timeslice is set to 500[ms]
SYSLOG: Cluster[0]: comp0.trc.rwcp.or.jp@0...comp3.trc.rwcp.or.jp@3
SYSLOG: BIN=linux, CPUGEN=pentium-iii, SMP=1, SPEED=500
SYSLOG: Network[0]: myrinet/myrinet
SYSLOG: SCore-D network: myrinet/myrinet
SYSLOG: Recover canceled by SCore-D: user1@host1.trc.rwcp.or.jp:4672
SYSLOG: SCore-D server: comp3.trc.rwcp.or.jp:9901
If the -restart option is specified but the user parallel process has
already been killed by the user, the following messages are displayed:
# scored -restart
SYSLOG: Timeslice is set to 500[ms]
SYSLOG: Cluster[0]: comp0.trc.rwcp.or.jp@0...comp3.trc.rwcp.or.jp@3
<7> SCore-D:WARNING connect_fep(host1.trc.rwcp.or.jp:4679)=111 failed !!
SYSLOG: BIN=linux, CPUGEN=pentium-iii, SMP=1, SPEED=500
SYSLOG: Network[0]: myrinet/myrinet
SYSLOG: SCore-D network: myrinet/myrinet
SYSLOG: Recover canceled by user: user1@host1.trc.rwcp.or.jp:4679
SYSLOG: SCore-D server: comp3.trc.rwcp.or.jp:9901
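The three startup examples above differ mainly in their "Recover" line.
The following is a minimal Python sketch, not part of SCore, that sorts
those lines into the three outcomes. It assumes the messages always have
exactly the form shown in the examples; the function names
classify_recovery and summarize and the status labels are hypothetical.

```python
import re

# Matches the three recovery-related SYSLOG lines shown in the examples.
# Assumption: real scored output always follows exactly these formats.
_RECOVER_RE = re.compile(
    r"^SYSLOG: Recover(?: canceled by (?P<by>SCore-D|user))?: "
    r"(?P<user>[^@\s]+)@(?P<host>[^:\s]+):(?P<port>\d+)$"
)


def classify_recovery(line):
    """Return (status, user, host, port) for a recovery line, else None."""
    m = _RECOVER_RE.match(line.strip())
    if m is None:
        return None
    by = m.group("by")
    if by is None:
        status = "recovered"            # "SYSLOG: Recover: ..."
    elif by == "SCore-D":
        status = "canceled_by_scored"   # restart option was not given
    else:
        status = "canceled_by_user"     # process was already killed
    return status, m.group("user"), m.group("host"), int(m.group("port"))


def summarize(output):
    """Collect every recovery result found in a block of startup output."""
    results = []
    for line in output.splitlines():
        item = classify_recovery(line)
        if item is not None:
            results.append(item)
    return results
```

Passing the captured startup output to summarize() yields one tuple per
user parallel process, so a site script could report which processes
were resumed and which were not. Lines that do not match, such as the
connect_fep warning, are ignored.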
If the -restart option does not work, the SCore-D environment must be
reset. Use the reset option in this case. Note that user programs are
not restarted when the reset option is specified.
- CREDIT
- This document is a part of the SCore cluster system software
developed at Real World Computing Partnership, Japan.
Copyright (c) 2000, 1999 Real World Computing Partnership.