Automatic Operation of SCore-D

The sc_watch command enables automatic operation of SCore-D. First, sc_watch creates a scout environment and invokes SCore-D in it. It then keeps monitoring SCore-D's responses as a watchdog. When SCore-D does not respond for more than a few minutes, sc_watch assumes that something has gone wrong on the cluster and tries to reboot the system from the beginning.
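
The watchdog decision described above can be illustrated with a minimal shell sketch. This is not how sc_watch is implemented: the function names is_stale and watch_step and the 180-second TIMEOUT are assumptions chosen for illustration, since the text only says "a few minutes".

```shell
#!/bin/sh
# Conceptual watchdog sketch; NOT the actual sc_watch implementation.
# TIMEOUT is an assumed value in seconds.
TIMEOUT=${TIMEOUT:-180}

# Succeeds (exit status 0) when the last response is older than TIMEOUT.
# $1 = time of the last response, $2 = current time (seconds since the epoch).
is_stale() {
    [ $(( $2 - $1 )) -gt "$TIMEOUT" ]
}

# One watchdog step: print the action a watcher would take next.
watch_step() {
    if is_stale "$1" "$2"; then
        echo "reboot"
    else
        echo "wait"
    fi
}
```

A real watcher would call such a step periodically, recording the time whenever the monitored process answers.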

The sc_watch program can invoke a Unix command when it detects a failure. If that command is a shell script that uses a Unix mail command to send a message to the administrator, the administrator will receive an e-mail when the system goes down.
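
A notification script of this kind might look like the following sketch. Everything in it is hypothetical: the ADMIN recipient, the function names, and the message wording are assumptions, and it presumes that a mail command accepting "-s subject" is installed. It also assumes nothing about arguments that sc_watch might pass to the script.

```shell
#!/bin/sh
# Hypothetical notification script; the names and wording are illustrative only.
ADMIN=${ADMIN:-root}   # assumed recipient address

# Print the body of the report.
# $1 = host name, $2 = timestamp
failure_report() {
    printf 'SCore-D failure detected on %s at %s\n' "$1" "$2"
}

# Mail the report to the administrator (requires a working mail command).
notify_admin() {
    failure_report "$(uname -n)" "$(date)" | mail -s "SCore-D down" "$ADMIN"
}

# A real script would end by calling notify_admin. The call is left
# commented out so the functions can be tried without sending mail.
# notify_admin
```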

When sc_watch detects a system failure, it can also invoke a remote command via scout to clean up side effects. For example, System V shared memory regions might be left behind when SCore-D crashes for some reason. The remote command invocation can be used to remove that kind of side effect.
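
As an illustration, a cleanup script could remove leftover shared memory segments with the standard ipcs and ipcrm utilities. The sketch below is not the script shipped with SCore; the function names are hypothetical, and the parsing assumes the Linux "ipcs -m" column layout (key, shmid, owner, ...), which differs on other systems.

```shell
#!/bin/sh
# Hypothetical cleanup sketch; not the cleanup script distributed with SCore.

# Read "ipcs -m" output on stdin and print the ids of the segments owned
# by user $1. Assumes the Linux layout: key, shmid, owner, perms, ...
shm_ids_of() {
    awk -v owner="$1" '$1 ~ /^0x/ && $3 == owner { print $2 }'
}

# Remove every shared memory segment owned by user $1 on this host.
remove_shm_of() {
    for id in $(ipcs -m | shm_ids_of "$1"); do
        ipcrm -m "$id"
    done
}
```

Calling remove_shm_of with the name of the user who operates SCore-D would remove only that user's segments on the host where it runs.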

The following is an example of sc_watch execution:

# sc_watch -g pcc scored
[19/Apr/2000,17:43:50] Booting System: scored
SCOUT: Spawn done.    
SYSLOG: Cluster[0]: comp0.trc.rwcp.or.jp@0...comp3.trc.rwcp.or.jp@3
SYSLOG:   BIN=linux, CPUGEN=pentium-iii, SMP=2, SPEED=500
SYSLOG:   Network[0]: myrinet/myrinet
SYSLOG: SCore-D network: myrinet/myrinet
SYSLOG: Timeslice is set to 200[ms]
SYSLOG: SCore-D: $Id: init.cc,v 1.58 2001/04/05 08:27:37 hori Exp $
SYSLOG: SCore-D server: comp3.trc.rwcp.or.jp:9901
SYSLOG: Operated by user: hori
SYSLOG: --------- SCore-D (3.2.0) bootup --------
SYSLOG: SCore-D Watcher (server.trc.rwcp.or.jp:9990)
...

When a system failure is detected, sc_watch tries to terminate SCore-D and then reboot the system.

[19/Apr/2000,18:10:52] System failure detected.
SCOUT: session done
[19/Apr/2000,18:10:53] System has been shutdown.
[19/Apr/2000,18:10:57] Booting System: scored
SCOUT: Spawn done.    
SYSLOG: Cluster[0]: comp0.trc.rwcp.or.jp@0...comp3.trc.rwcp.or.jp@3
...

Unlike most other SCore commands, sc_watch must be invoked OUTSIDE of the scout environment, because it kills the SCore-D processes running on the cluster via scout. In this example, sc_watch is invoked with a host group option, as with the scout command, followed by scored.

The last SYSLOG line of the boot output reports that SCore-D has successfully connected to the SCore-D watcher, which is the sc_watch process invoked by the user. sc_watch watches SCore-D through this TCP connection.

The next example is more practical:

# sc_watch -g pcc -l inform-admin -r ipcrmm -f /var/spool/logfile scored -restart

In this example, when a system crash happens, the inform-admin script is executed first, then the ipcrmm script (possibly installed in /opt/score/deploy/) is executed to clean up shared memory regions, and finally SCore-D is rebooted. All SCore-D output is logged to the file specified by the "-f" option. When the "-f" option is specified, sc_watch becomes a daemon process. The sc_watch process terminates when SCore-D shuts down normally.

CREDIT
This document is a part of the SCore cluster system software developed at Real World Computing Partnership, Japan. Copyright (c) 2000, 1999 Real World Computing Partnership.