The sc_watch command enables automatic operation of SCore-D. First,
sc_watch creates a scout environment and then invokes SCore-D in it.
sc_watch then keeps watching SCore-D's responses in a watchdog fashion.
When SCore-D does not respond for more than a few minutes, sc_watch
assumes that something has gone wrong on the cluster and tries to reboot
the system from the beginning.

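The watchdog decision described above can be sketched in a few lines of shell. This is only an illustration of the idea, not the sc_watch implementation: the function names, the TIMEOUT_SECS variable and its 180-second default are all assumptions standing in for "a few minutes".

```shell
#!/bin/sh
# Generic watchdog sketch; NOT the actual sc_watch logic.
# TIMEOUT_SECS is a hypothetical setting standing in for "a few minutes".
TIMEOUT_SECS="${TIMEOUT_SECS:-180}"

# Succeed (exit status 0) when the last response, given as a Unix
# timestamp in $1, is more than TIMEOUT_SECS older than "now" in $2.
is_stale() {
    last="$1"
    now="$2"
    [ $((now - last)) -gt "$TIMEOUT_SECS" ]
}

# Print the action a watchdog would take for one check:
# "reboot" if the watched process has been silent too long, else "wait".
watchdog_action() {
    if is_stale "$1" "$2"; then
        echo reboot
    else
        echo wait
    fi
}
```

For instance, watchdog_action "$last_seen" "$(date +%s)" would be called periodically, where last_seen holds the time of the most recent response.
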
The sc_watch program can invoke a Unix command when it detects a
failure. If that command is a shell script that uses the Unix mail
command to send mail to the administrator, the administrator will get
an e-mail when the system goes down.

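A notification script of this kind might look like the following sketch. The ADMIN_ADDR and MAILER variables, the function names and the message text are hypothetical, and the sketch assumes nothing about any arguments sc_watch may pass to the command.

```shell
#!/bin/sh
# inform-admin: hypothetical notification script, shown as a sketch only.
ADMIN_ADDR="${ADMIN_ADDR:-root}"   # assumed recipient; adjust for the site
MAILER="${MAILER:-mail}"           # any mailx-compatible command

# Build the subject line for the host named in $1.
failure_subject() {
    printf 'SCore-D failure detected on %s' "$1"
}

# Send a short failure report for host $1 to the administrator.
notify_admin() {
    host="$1"
    printf 'sc_watch detected a system failure at %s.\n' "$(date)" |
        "$MAILER" -s "$(failure_subject "$host")" "$ADMIN_ADDR"
}

# Send the mail only when this file is run under the name inform-admin,
# so the functions above can also be sourced and tried out safely.
case "$0" in
    *inform-admin) notify_admin "$(hostname)" ;;
esac
```

Saved as an executable file named inform-admin, the script sends one mail per invocation.
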
When sc_watch detects a system failure, it can also invoke a scout
command to clean up side effects. For example, System V shared memory
regions might be left behind when SCore-D crashes for some reason. This
remote command invocation can be used to remove that kind of side
effect.

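As an illustration of such a cleanup, the sketch below removes the System V shared memory segments owned by one user. It is not the cleanup script shipped with SCore: the script name shm-cleanup and the function names are hypothetical, and the parsing assumes the column layout printed by the Linux (util-linux) ipcs -m command, which differs on other systems.

```shell
#!/bin/sh
# shm-cleanup: hypothetical cleanup sketch, not a script shipped with SCore.
# Assumes the Linux "ipcs -m" layout: key shmid owner perms bytes nattch ...

# Read "ipcs -m" output on stdin and print the shmid of every segment
# owned by user $1. Header and blank lines are skipped because their
# second column is not a number.
list_shm_ids() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Remove every shared memory segment owned by user $1.
remove_shm_of() {
    ipcs -m | list_shm_ids "$1" | while read -r id; do
        ipcrm -m "$id"
    done
}

# Clean up only when this file is run under the name shm-cleanup,
# so the functions above can be sourced without side effects.
case "$0" in
    *shm-cleanup) remove_shm_of "$(id -un)" ;;
esac
```

Because the cleanup command runs as a scout command, a script like this would need to be available on every host of the cluster.
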
The following is an example of sc_watch execution:

    # sc_watch -g pcc scored
    [19/Apr/2000,17:43:50] Booting System: scored
    SCOUT: Spawn done.
    SYSLOG: Cluster[0]: comp0.trc.rwcp.or.jp@0...comp3.trc.rwcp.or.jp@3
    SYSLOG: BIN=linux, CPUGEN=pentium-iii, SMP=2, SPEED=500
    SYSLOG: Network[0]: myrinet/myrinet
    SYSLOG: SCore-D network: myrinet/myrinet
    SYSLOG: Timeslice is set to 200[ms]
    SYSLOG: SCore-D: $Id: init.cc,v 1.58 2001/04/05 08:27:37 hori Exp $
    SYSLOG: SCore-D server: comp3.trc.rwcp.or.jp:9901
    SYSLOG: Operated by user: hori
    SYSLOG: --------- SCore-D (3.2.0) bootup --------
    SYSLOG: SCore-D Watcher (server.trc.rwcp.or.jp:9990)
    ...

When a system failure is detected, sc_watch tries to terminate SCore-D
and then reboot the system.

    [19/Apr/2000,18:10:52] System failure detected.
    SCOUT: session done
    [19/Apr/2000,18:10:53] System has been shutdown.
    [19/Apr/2000,18:10:57] Booting System: scored
    SCOUT: Spawn done.
    SYSLOG: Cluster[0]: comp0.trc.rwcp.or.jp@0...comp3.trc.rwcp.or.jp@3
    ...

Unlike most of the other SCore commands, sc_watch must be invoked
OUTSIDE of the scout environment, because it kills SCore-D processes
running on a cluster via scout. In this example, sc_watch is invoked
with a host group option, as with the scout command, followed by scored.

The last SYSLOG line of the boot output reports that SCore-D has
successfully connected to the SCore-D watcher, which is the sc_watch
process invoked by the user. sc_watch watches SCore-D through this TCP
connection.

The next example is more practical than the first:

    # sc_watch -g pcc -l inform-admin -r ipcrmm -f /var/spool/logfile scored -restart

In this example, when a system crash happens, the inform-admin script is
executed first, then the ipcrmm script (possibly installed in
/opt/score/deploy/) is executed to clean up shared memory regions, and
finally SCore-D is rebooted. All output of SCore-D is logged to the file
specified by the "-f" option. When the "-f" option is specified,
sc_watch becomes a daemon process.

The sc_watch process terminates when SCore-D is shut down normally.