The sc_watch command enables automatic operation of SCore-D. First, sc_watch creates a scout environment, then invokes SCore-D in that environment. sc_watch then keeps watching SCore-D's responses in a watchdog fashion. When SCore-D does not respond for more than a few minutes, sc_watch assumes that something has happened on the cluster and tries to reboot the system from the beginning.
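The watchdog behaviour described above can be pictured with a small shell sketch. This is not the sc_watch source: the function names, the use of epoch seconds, and the "abort" outcome once the retry budget is used up are all assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative watchdog decision logic; NOT the actual sc_watch implementation.

# Succeeds when more than $3 seconds have passed between the last
# response ($1) and the current time ($2), both given in epoch seconds.
is_stale() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# Prints the action a watchdog would take on one poll:
#   wait    the daemon answered recently enough
#   reboot  the daemon is silent and boot retries remain
#   abort   the daemon is silent and the retry budget is used up (assumed)
# Arguments: last_response now interval retries_so_far max_retries
watch_decision() {
    if ! is_stale "$1" "$2" "$3"; then
        echo wait
    elif [ "$4" -lt "$5" ]; then
        echo reboot
    else
        echo abort
    fi
}

# Example: 600 s interval, no retries used yet, at most 10 retries.
watch_decision 1000 1700 600 0 10   # prints "reboot"
```

Keeping the decision separate from the polling loop makes the timeout and retry limits easy to check in isolation.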
The sc_watch program can invoke a Unix command when it detects a failure. If the Unix command is a shell script that uses the Unix mail command to send mail to the administrator, then the administrator will get an e-mail when the system goes down.
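A notification script of this kind might look like the following minimal sketch. The script name, the recipient address, the message wording, and the SEND_MAIL guard are assumptions; only the standard `mail -s subject address` invocation is relied upon.

```shell
#!/bin/sh
# notify_admin.sh: hypothetical script for sc_watch to run on failure.
# The name, recipient, and message text are assumptions for illustration.

ADMIN_ADDR="${ADMIN_ADDR:-root}"   # assumed recipient
MAILER="${MAILER:-mail}"           # command used to send mail

# Print the message body for a failure seen on host $1 at time $2.
failure_message() {
    echo "sc_watch detected a SCore-D failure."
    echo "Host: $1"
    echo "Time: $2"
}

# Pipe the message body to the mailer with a descriptive subject line.
notify_admin() {
    failure_message "$1" "$2" | "$MAILER" -s "SCore-D failure on $1" "$ADMIN_ADDR"
}

# Send only when SEND_MAIL=1, so the file can be sourced without side effects.
if [ "${SEND_MAIL:-0}" = 1 ]; then
    notify_admin "$(hostname)" "$(date)"
fi
```

In a real deployment the guard would normally be dropped so that the script sends mail whenever it is run.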
When sc_watch detects a system failure, it can also invoke a scout command to clean up side effects.
The following is an example of sc_watch execution:

    # sc_watch -g pcc scored
    [14/Sep/2001,16:31:43] SC_WATCH (4.1) started.
    [14/Sep/2001,16:31:43] Interval is set to 10 minutes.
    [14/Sep/2001,16:31:43] Local Action = (none)
    [14/Sep/2001,16:31:43] Remote Action = (none)
    [14/Sep/2001,16:31:43] Abort action = (none)
    [14/Sep/2001,16:31:43] Boot Retry Max. = 10
    [14/Sep/2001,16:31:43] Booting System: scored
    SCOUT: Spawning done.
    14/Sep/2001 16:31:51 SYSLOG: /opt/score/deploy/scored
    14/Sep/2001 16:31:51 SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
    14/Sep/2001 16:31:51 SYSLOG: Compile option(s):
    14/Sep/2001 16:31:51 SYSLOG: SCore-D network: myrinet/myrinet2k
    14/Sep/2001 16:31:51 SYSLOG: Cluster[0]: (0..15)x2.i386-redhat7-linux2_4.i686.800
    14/Sep/2001 16:31:51 SYSLOG: Memory: 501[MB], Swap: 259[MB], Disk: 3027[MB]
    14/Sep/2001 16:31:51 SYSLOG: Network[0]: myrinet/myrinet2k
    14/Sep/2001 16:31:51 SYSLOG: Network[1]: ethernet/ethernet
    14/Sep/2001 16:31:51 SYSLOG: Scheduler initiated: Timeslice = 500 [msec]
    14/Sep/2001 16:31:51 SYSLOG: Queue[0] activated, exclusive scheduling
    14/Sep/2001 16:31:51 SYSLOG: Queue[1] activated, time-sharing scheduling
    14/Sep/2001 16:31:51 SYSLOG: Queue[2] activated, time-sharing scheduling
    14/Sep/2001 16:31:51 SYSLOG: Session ID: 0
    14/Sep/2001 16:31:51 SYSLOG: Server Host: comp00.pccluster.org
    14/Sep/2001 16:31:51 SYSLOG: Backup Host: comp0f.pccluster.org
    14/Sep/2001 16:31:51 SYSLOG: Operated by: root
    14/Sep/2001 16:31:51 SYSLOG: SCore-D Watcher (server.pccluster.orgf:46514)
    14/Sep/2001 16:31:51 SYSLOG: --------- SCore-D (4.1) bootup --------
    ...

When a system failure is detected, sc_watch tries to terminate SCore-D and then reboot the system.
    [14/Sep/2001 16:41:22] System failure detected.
    SCOUT: session done
    [14/Sep/2001 16:41:24] System has been shutdown.
    [14/Sep/2001 16:41:30] Booting System: scored
    SCOUT: Spawn done.
    14/Sep/2001 16:41:51 SYSLOG: /opt/score/deploy/scored
    14/Sep/2001 16:41:51 SYSLOG: SCore-D 4.1 $Id: init.cc,v 1.63 2001/09/07 09:10:26 hori Exp $
    ...

Unlike most of the other SCore commands, sc_watch must be invoked OUTSIDE of the scout environment, because it kills SCore-D processes running on a cluster via scout.
In this example, sc_watch is invoked with a host group option, similar to the scout command. In the last SYSLOG output before the bootup message, there is a message that SCore-D has successfully connected to the SCore-D watcher, which is the sc_watch process invoked by a user. Through this TCP connection, sc_watch watches SCore-D. The sc_watch process terminates when SCore-D is shut down normally or when it receives ^C (SIGINT).
    host_group=pcc
    install_root=/opt/score
    #
    $install_root/bin/sceptic -g $host_group >> /opt/score/etc/scorehosts.defects
    echo defected hosts
    cat /opt/score/etc/scorehosts.defects
    echo new host list
    $install_root/bin/scorehosts -r $host_group
    /etc/rc.d/init.d/scoreboard stop
    /etc/rc.d/init.d/scoreboard start
    echo scoreboard is restarted.

The sceptic command investigates the hosts in the host group specified in the host_group shell variable and writes the list of defective hosts to the scorehosts.defects file.
The output of the sceptic command must be appended to the /opt/score/etc/scorehosts.defects file. Otherwise, when a defective host is repaired and comes back, and then another host goes down, two hosts end up being replaced simultaneously. In SCore 4.1, a checkpoint file has parity blocks within the file so that the loss of the file on one host can be recovered. When two hosts are replaced at once, restarting from a checkpoint may fail if the parallel process was running on the replaced hosts.
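The difference between appending and overwriting can be seen in a small sketch. It uses a temporary file rather than the real scorehosts.defects file, and the host names comp03 and comp07 and the helper record_defects are hypothetical.

```shell
#!/bin/sh
# Illustration of why the defects list is appended to (>>) rather than
# overwritten (>). Uses a temporary stand-in file and made-up host names.

# Append host names ($2...) to the defects file $1, keeping earlier entries.
record_defects() {
    file=$1
    shift
    for host in "$@"; do
        echo "$host" >> "$file"
    done
}

defects=$(mktemp)

record_defects "$defects" comp03     # first failure
record_defects "$defects" comp07     # a later, separate failure

# Both hosts are still listed, so the earlier replacement stays in effect
# and only one host changes at the time of the second failure.
cat "$defects"

rm -f "$defects"
```

With `>` in place of `>>`, the second call would leave only comp07 in the file, and the earlier entry would be lost.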
The next thing you have to do for high availability is to modify the /etc/rc.d/init.d/scoreboard script file. You will find the following function in the script.
    startsccoreboard() {
        pid=`pidofproc scoreboard`
        [ -n "$pid" ] && ps h $pid >/dev/null 2>&1 && return
        ulimit -c 0
        su nobody -c "$INSTALL_ROOT/sbin/scoreboard -file /opt/score/etc/scorehosts.db -pid" > /var/run/scoreboard.pid && success
    }

This function must be modified as follows so that the scoreboard command can locate the file listing the defective host names.
    startsccoreboard() {
        pid=`pidofproc scoreboard`
        [ -n "$pid" ] && ps h $pid >/dev/null 2>&1 && return
        ulimit -c 0
        su nobody -c "$INSTALL_ROOT/sbin/scoreboard -file /opt/score/etc/scorehosts.db -defects /opt/score/etc/scorehosts.defects -pid" > /var/run/scoreboard.pid && success
    }

Finally, the sc_watch command is invoked on the server host where the scoreboard process is running.
    # sc_watch -g pcc -l replace.sh scored
    [14/Sep/2001,16:31:43] SC_WATCH (4.1) started.
    [14/Sep/2001,16:31:43] Interval is set to 10 minutes.
    [14/Sep/2001,16:31:43] Local Action = replace.sh
    [14/Sep/2001,16:31:43] Remote Action = (none)
    [14/Sep/2001,16:31:43] Abort action = (none)
    [14/Sep/2001,16:31:43] Boot Retry Max. = 10
    [14/Sep/2001,16:31:43] Booting System: scored
    SCOUT: Spawning done.
    ...

Every time SCore-D crashes, the sceptic command checks the cluster hosts. If there is a defective host, its name is recorded in the defects file. When the scoreboard process is restarted by the local action script, the defective host is replaced by the host specified by the spare attribute in the scorehosts.db file. Finally, SCore-D is restarted by the sc_watch command. If some user parallel processes have been checkpointed, the lost checkpoint file on the defective host is recovered using the parity blocks in the checkpoint files on the other hosts. In this way, user program execution is fully recovered.