[SCore-users] sc_watch problem

Jure Jerman jure.jerman at rzs-hm.si
Thu Dec 4 19:15:31 JST 2003


Dear SCore users,

We have a 14-node cluster running SCore 5.4. The cluster and SCore work well
and are very stable, except for one annoying detail:

Under heavy load (possibly I/O) from the cluster members, sc_watch seems to hit
some kind of timeout and decides to restart itself. This usually happens once or
twice per night, when we are running many jobs.


We run scored from sc_watch with the following options:

$INSTALL_ROOT/bin/sc_watch -g $SCORED_GROUP -f $LOGFILE $ACTION -pid \
	$INSTALL_ROOT/deploy/scored -restart -syslog $SCBCAST_HOST \
	-nomemlimit -server tuba0 \
	-sysmon $SCBCAST_HOST $NETOPTION -operator sms


The messages in the log file are:
[03/Dec/2003,20:28:40] System failure detected.
[03/Dec/2003,20:28:40] System has been shutdown.
[03/Dec/2003,20:28:40] Local Action: /etc/scorereset
[03/Dec/2003,20:28:45] Rebooting System [10 times, first retry]: /opt/score5.4.0/deploy/scored
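
To check whether these restarts line up with the heavy-load periods, the failure timestamps can be pulled out of the log. This is only a rough sketch: the function name failure_times is made up for illustration, and it assumes every entry has the "[date,time] message" layout shown above.

```shell
# Hypothetical helper: print the timestamp of every failure sc_watch logged.
# Assumes the log line layout shown above: [DD/Mon/YYYY,HH:MM:SS] message
failure_times() {
    # $1: path to the sc_watch log file (the file given to sc_watch with -f)
    grep -F 'System failure detected.' "$1" |
        sed -e 's/^\[\([^]]*\)].*/\1/'   # keep only the text between [ and ]
}
```

Running failure_times "$LOGFILE" prints one timestamp per detected failure, which can then be compared with the job start times or with the load records of the nodes.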

I do not know what makes sc_watch believe that there is a system failure. Is it
a timeout, or some other diagnostic?

What would be the best way to tackle the problem?


Thank you very much for any hints,


Jure Jerman



