Resuming SCore-D from an Unexpected Failure

SCore-D checkpoints itself every time a user logs in or out. If SCore-D is re-invoked with the restart option, it tries to recover itself from its checkpoint. The command line is as follows:
scored -restart
When SCore-D is resumed, restartable user parallel processes are recovered as well. A user parallel process is considered restartable if it was started with a restart or checkpoint option, or if it was checkpointed with "^\" (SIGQUIT). If a checkpoint image is found for a restartable parallel process, SCore-D tries to resume the process from that checkpoint.

The following is an example of a successful restart of SCore-D and a user parallel process:
# scored -restart
SYSLOG: /opt/score/deploy/scored
SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
SYSLOG: Compile option(s): 
SYSLOG: SCore-D network: myrinet/myrinet2k
SYSLOG: Cluster[0]: (0..15)x2.i386-redhat7-linux2_4.i686.800
SYSLOG:   Memory: 501[MB], Swap: 259[MB], Disk: 3027[MB]
SYSLOG:   Network[0]: myrinet/myrinet2k
SYSLOG:   Network[1]: ethernet/ethernet
SYSLOG: Scheduler initiated: Timeslice = 500 [msec]
SYSLOG:   Queue[0] activated,  exclusive scheduling
SYSLOG:   Queue[1] activated,  time-sharing scheduling
SYSLOG:   Queue[2] activated,  time-sharing scheduling
SYSLOG: Session ID: 0
SYSLOG: Server Host: comp00.pccluster.org
SYSLOG: Backup Host: comp0f.pccluster.org
SYSLOG: Operated by: albus
SYSLOG: Recovery: harry@host1.pccluster.org:4670, JID-ID: 194
SYSLOG: --------- SCore-D (4.1) bootup --------
If the restart option is not specified when re-invoking SCore-D, previously checkpointed user parallel processes are not restarted and the checkpoint images are lost:
# scored
SYSLOG: /opt/score/deploy/scored
SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
SYSLOG: Compile option(s): 
SYSLOG: SCore-D network: myrinet/myrinet2k
SYSLOG: Cluster[0]: (0..15)x2.i386-redhat7-linux2_4.i686.800
SYSLOG:   Memory: 501[MB], Swap: 259[MB], Disk: 3027[MB]
SYSLOG:   Network[0]: myrinet/myrinet2k
SYSLOG:   Network[1]: ethernet/ethernet
SYSLOG: Scheduler initiated: Timeslice = 500 [msec]
SYSLOG:   Queue[0] activated,  exclusive scheduling
SYSLOG:   Queue[1] activated,  time-sharing scheduling
SYSLOG:   Queue[2] activated,  time-sharing scheduling
SYSLOG: Session ID: 0
SYSLOG: Server Host: comp00.pccluster.org
SYSLOG: Backup Host: comp0f.pccluster.org
SYSLOG: Operated by: albus
SYSLOG: Recover canceled by SCore-D: tom@host1.pccluster.org:4672
SYSLOG: --------- SCore-D (4.1) bootup --------
If the restart option is specified but the user parallel process has already been killed by the user, the following messages appear:
# scored -restart
SYSLOG: /opt/score/deploy/scored
SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
SYSLOG: Compile option(s): 
SYSLOG: SCore-D network: myrinet/myrinet2k
SYSLOG: Cluster[0]: (0..15)x2.i386-redhat7-linux2_4.i686.800
SYSLOG:   Memory: 501[MB], Swap: 259[MB], Disk: 3027[MB]
SYSLOG:   Network[0]: myrinet/myrinet2k
SYSLOG:   Network[1]: ethernet/ethernet
SYSLOG: Scheduler initiated: Timeslice = 500 [msec]
SYSLOG:   Queue[0] activated,  exclusive scheduling
SYSLOG:   Queue[1] activated,  time-sharing scheduling
SYSLOG:   Queue[2] activated,  time-sharing scheduling
SYSLOG: Session ID: 0
SYSLOG: Server Host: comp00.pccluster.org
SYSLOG: Backup Host: comp0f.pccluster.org
SYSLOG: Operated by: albus
SYSLOG: Recover canceled by user: wormtail@host1.pccluster.org:4679
SYSLOG: --------- SCore-D (4.1) bootup --------
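The three outcomes shown above differ only in the recovery line printed just before the bootup banner. As a rough illustration, the following Python sketch classifies those lines from captured scored output. The function name, the status labels, and the regular expressions are assumptions made for this example; the patterns are derived only from the three sample logs in this section, and other SCore-D versions may word these messages differently.

```python
import re

# Patterns derived from the sample logs in this section only (an assumption;
# they are not taken from the SCore-D source code).
_PATTERNS = [
    ("recovered",
     re.compile(r"^SYSLOG: Recovery: (?P<proc>[^\s,]+), JID-ID: (?P<jid>\d+)$")),
    ("canceled_by_scored",
     re.compile(r"^SYSLOG: Recover canceled by SCore-D: (?P<proc>\S+)$")),
    ("canceled_by_user",
     re.compile(r"^SYSLOG: Recover canceled by user: (?P<proc>\S+)$")),
]


def classify_recovery_lines(log_text):
    """Return (status, process, jid) tuples for recovery-related lines.

    jid is a string for recovered processes and None otherwise.
    Lines that match none of the patterns are ignored.
    """
    results = []
    for line in log_text.splitlines():
        line = line.strip()
        for status, pattern in _PATTERNS:
            match = pattern.match(line)
            if match:
                jid = match.groupdict().get("jid")
                results.append((status, match.group("proc"), jid))
                break
    return results


if __name__ == "__main__":
    sample = (
        "SYSLOG: Operated by: albus\n"
        "SYSLOG: Recovery: harry@host1.pccluster.org:4670, JID-ID: 194\n"
        "SYSLOG: --------- SCore-D (4.1) bootup --------\n"
    )
    for status, proc, jid in classify_recovery_lines(sample):
        print(status, proc, jid)
```

A script of this kind could be run over saved scored output to check which user parallel processes were resumed after a restart and which were canceled.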
If the restart option does not work, the SCore-D environment must be reset. Use the reset option in this case. Note that user programs are not restarted when the reset option is specified.

See also

scout(1), scrun(1), sctop(1), scorehosts.db(5), scoreboard(8), scbcast(8)

$Id: resume.html,v 1.3 2002/03/07 12:03:44 kameyama Exp $