[SCore-users-jp] [SCore-users] sc_watch problem

Jure Jerman jure.jerman @ rzs-hm.si
Thu Dec 4 19:15:31 JST 2003


Dear SCore users,

We have a 14-node SCore 5.4 cluster. The cluster and SCore work well and are
very stable, except for one annoying detail:

Under heavy (I/O?) load from the cluster members, sc_watch seems to hit some
kind of timeout and decides to rerun itself. This usually happens once or twice
per night, when we are running many jobs.


We run scored from sc_watch with the following options:

$INSTALL_ROOT/bin/sc_watch -g $SCORED_GROUP -f $LOGFILE $ACTION -pid \
	$INSTALL_ROOT/deploy/scored -restart -syslog $SCBCAST_HOST \
	-nomemlimit -server tuba0 \
	-sysmon $SCBCAST_HOST $NETOPTION -operator sms


The message in the log file is:
[03/Dec/2003,20:28:40] System failure detected.
[03/Dec/2003,20:28:40] System has been shutdown.
[03/Dec/2003,20:28:40] Local Action: /etc/scorereset
[03/Dec/2003,20:28:45] Rebooting System [10 times, first retry]: /opt/score5.4.0/deploy/scored

I do not know what makes sc_watch believe that there is a system failure. Is it
a timeout, or some other diagnostic?

What would be the best way to tackle the problem?
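
One way to narrow this down might be to pull the failure timestamps out of the
sc_watch log and compare them with the times the heavy jobs run. The following
is only a rough Python sketch, not part of SCore: it assumes the log lines look
exactly like the excerpt above, and the names failure_times and
failures_per_hour are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout, taken from the excerpt: [03/Dec/2003,20:28:40] message
LINE_RE = re.compile(r"^\[(\d{2}/[A-Za-z]{3}/\d{4},\d{2}:\d{2}:\d{2})\]\s+(.*)$")


def failure_times(lines):
    """Return the timestamp of every 'System failure detected.' line."""
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2).startswith("System failure detected"):
            # %b parses English month abbreviations in the default C locale.
            times.append(datetime.strptime(match.group(1), "%d/%b/%Y,%H:%M:%S"))
    return times


def failures_per_hour(times):
    """Count failures per clock hour, to line up against job or I/O activity."""
    return Counter(t.strftime("%Y-%m-%d %H:00") for t in times)


# The four lines quoted in the message above, used as sample input.
sample = [
    "[03/Dec/2003,20:28:40] System failure detected.",
    "[03/Dec/2003,20:28:40] System has been shutdown.",
    "[03/Dec/2003,20:28:40] Local Action: /etc/scorereset",
    "[03/Dec/2003,20:28:45] Rebooting System [10 times, first retry]: "
    "/opt/score5.4.0/deploy/scored",
]

for hour, count in sorted(failures_per_hour(failure_times(sample)).items()):
    print(hour, count)
```

To use it on the real log, the sample list would be replaced by the lines read
from the file passed to sc_watch with -f.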


Thank you very much for any hint,


Jure Jerman

_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


