Resuming SCore-D from an Unexpected Failure
SCore-D checkpoints itself every time a user logs in or out.
If SCore-D is re-invoked with the restart option, it tries to
recover itself from its most recent checkpoint. The command line is as
follows:
scored -restart
When SCore-D is resumed, restartable user parallel processes are recovered.
User parallel processes that were specified with a restart or
checkpoint option, or that were checkpointed with
"^\" (SIGQUIT), are considered restartable.
If a checkpoint image is found for a restartable parallel process,
SCore-D tries to resume it from that checkpoint.
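The restartability rule above can be restated as a small Python sketch. This is only a toy model of the rule as described in this page, not SCore-D's implementation; the class and all field and function names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserProcess:
    # Hypothetical record; none of these names come from SCore-D itself.
    name: str
    restart_option: bool = False        # specified with a restart option
    checkpoint_option: bool = False     # specified with a checkpoint option
    sigquit_checkpointed: bool = False  # checkpointed via SIGQUIT
    has_checkpoint_image: bool = False  # a checkpoint image was found


def is_restartable(p: UserProcess) -> bool:
    """A process is restartable if any of the three conditions holds."""
    return p.restart_option or p.checkpoint_option or p.sigquit_checkpointed


def will_attempt_resume(p: UserProcess) -> bool:
    """A resume is attempted only for restartable processes with an image."""
    return is_restartable(p) and p.has_checkpoint_image
```

In this model, a process that has a checkpoint image but was never marked restartable is not resumed, and a restartable process without an image is not resumed either.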
The following is an example of a successful restart of SCore-D and a
user parallel process:
# scored -restart
SYSLOG: /opt/score/deploy/scored
SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
SYSLOG: Compile option(s):
SYSLOG: SCore-D network: myrinet/myrinet2k
SYSLOG: Cluster[0]: (0..15)x2.i386-redhat7-linux2_4.i686.800
SYSLOG: Memory: 501[MB], Swap: 259[MB], Disk: 3027[MB]
SYSLOG: Network[0]: myrinet/myrinet2k
SYSLOG: Network[1]: ethernet/ethernet
SYSLOG: Scheduler initiated: Timeslice = 500 [msec]
SYSLOG: Queue[0] activated, exclusive scheduling
SYSLOG: Queue[1] activated, time-sharing scheduling
SYSLOG: Queue[2] activated, time-sharing scheduling
SYSLOG: Session ID: 0
SYSLOG: Server Host: comp00.pccluster.org
SYSLOG: Backup Host: comp0f.pccluster.org
SYSLOG: Operated by: albus
SYSLOG: Recovery: harry@host1.pccluster.org:4670, JID-ID: 194
SYSLOG: --------- SCore-D (4.1) bootup --------
If the restart option is not specified when re-invoking SCore-D,
previously checkpointed user parallel processes are not restarted and the
checkpoint images are lost:
# scored
SYSLOG: /opt/score/deploy/scored
SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
SYSLOG: Compile option(s):
SYSLOG: SCore-D network: myrinet/myrinet2k
SYSLOG: Cluster[0]: (0..15)x2.i386-redhat7-linux2_4.i686.800
SYSLOG: Memory: 501[MB], Swap: 259[MB], Disk: 3027[MB]
SYSLOG: Network[0]: myrinet/myrinet2k
SYSLOG: Network[1]: ethernet/ethernet
SYSLOG: Scheduler initiated: Timeslice = 500 [msec]
SYSLOG: Queue[0] activated, exclusive scheduling
SYSLOG: Queue[1] activated, time-sharing scheduling
SYSLOG: Queue[2] activated, time-sharing scheduling
SYSLOG: Session ID: 0
SYSLOG: Server Host: comp00.pccluster.org
SYSLOG: Backup Host: comp0f.pccluster.org
SYSLOG: Operated by: albus
SYSLOG: Recover canceled by SCore-D: tom@host1.pccluster.org:4672
SYSLOG: --------- SCore-D (4.1) bootup --------
If the restart option is specified but the user parallel process
has already been killed by the user, then the following messages will be
observed:
# scored -restart
SYSLOG: /opt/score/deploy/scored
SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
SYSLOG: Compile option(s):
SYSLOG: SCore-D network: myrinet/myrinet2k
SYSLOG: Cluster[0]: (0..15)x2.i386-redhat7-linux2_4.i686.800
SYSLOG: Memory: 501[MB], Swap: 259[MB], Disk: 3027[MB]
SYSLOG: Network[0]: myrinet/myrinet2k
SYSLOG: Network[1]: ethernet/ethernet
SYSLOG: Scheduler initiated: Timeslice = 500 [msec]
SYSLOG: Queue[0] activated, exclusive scheduling
SYSLOG: Queue[1] activated, time-sharing scheduling
SYSLOG: Queue[2] activated, time-sharing scheduling
SYSLOG: Session ID: 0
SYSLOG: Server Host: comp00.pccluster.org
SYSLOG: Backup Host: comp0f.pccluster.org
SYSLOG: Operated by: albus
SYSLOG: Recover canceled by user: wormtail@host1.pccluster.org:4679
SYSLOG: --------- SCore-D (4.1) bootup --------
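The three recovery outcomes shown in the bootup logs above can be told apart mechanically. The following Python sketch classifies captured SYSLOG lines; it assumes the line formats are exactly those printed in the examples on this page, and the function name and outcome labels are hypothetical.

```python
import re

# Patterns are derived only from the example SYSLOG lines on this page;
# other SCore-D versions may print these messages differently.
_PATTERNS = [
    ("recovered",
     re.compile(r"^SYSLOG: Recovery: (?P<job>[^,\s]+), JID-ID: (?P<jid>\d+)$")),
    ("canceled_by_scored",
     re.compile(r"^SYSLOG: Recover canceled by SCore-D: (?P<job>\S+)$")),
    ("canceled_by_user",
     re.compile(r"^SYSLOG: Recover canceled by user: (?P<job>\S+)$")),
]


def classify_recovery(lines):
    """Return (outcome, job, jid) tuples for recovery-related lines.

    jid is an int for successful recoveries and None otherwise.
    Lines that match no pattern are ignored.
    """
    results = []
    for raw in lines:
        line = raw.strip()
        for outcome, pattern in _PATTERNS:
            m = pattern.match(line)
            if m:
                jid = m.groupdict().get("jid")
                results.append(
                    (outcome, m.group("job"), int(jid) if jid else None))
                break
    return results
```

The function takes any iterable of strings, so it can be fed a saved copy of the console output or a list of lines collected by a wrapper script.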
If the restart option does not work, the SCore-D environment
must be reset. Use the reset option in this
case. Note that user programs will not be restarted when the
reset option is specified.
See also
scout(1),
scrun(1),
sctop(1),
scorehosts.db(5),
scoreboard(8),
scbcast(8)