[SCore-users-jp] Re: [SCore-users] sc_watch did'nt recognize crash of scored.exe on one node
Atsushi HORI
hori @ swimmy-soft.com
Wed Nov 13 14:03:57 JST 2002
Hi.
>But sc_watch didn't notice that error - it simply does nothing.
>
>So the question is: How is sc_watch monitoring all the
>score daemons, that it didn't notice the immediate crash of one of them ?
Well, sc_watch periodically samples scored activity to check whether
it is still working. The default sampling interval is 10 minutes.
>The node is now in the defected list, but if I get some hints
>how to determine why the scored.exe is crashing immediately on that
>node or somebody wants to look at it, I'll try to debug that further.
The difficulty with a watchdog timer, in general, is that a higher
sampling frequency does not necessarily give higher accuracy in the
time domain. Consider the case where the Linux kernel is very busy
(swapping memory, for example) and does not have enough time to
schedule SCore processes. The sc_watch process may not get any
response by the time of the next sampling, and this results in
rebooting the SCore processes, even though the OS kernel is simply,
but really, heavily loaded.
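
The trade-off above can be illustrated with a small simulation. This is
only a sketch and not how sc_watch is implemented: the heartbeat model,
the function names (run_watchdog, first_alarm) and the timings other than
the 10-minute default interval are assumptions made for illustration. A
daemon is treated as "alive" at a sample if at least one heartbeat
arrived since the previous sample.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    time_s: int   # time at which the watchdog sampled (seconds)
    alive: bool   # True if a heartbeat arrived since the previous sample


def run_watchdog(heartbeats: List[int], interval_s: int, end_s: int) -> List[Sample]:
    """Sample every interval_s seconds up to end_s (hypothetical model).

    A sample counts as alive if any heartbeat time t satisfies
    previous_sample < t <= this_sample.
    """
    samples = []
    prev = 0
    for now in range(interval_s, end_s + 1, interval_s):
        alive = any(prev < t <= now for t in heartbeats)
        samples.append(Sample(now, alive))
        prev = now
    return samples


def first_alarm(samples: List[Sample]) -> Optional[int]:
    """Return the time of the first sample with no heartbeat, or None."""
    for s in samples:
        if not s.alive:
            return s.time_s
    return None


# Assumed scenario 1: a healthy daemon that normally reports every 60 s,
# but is starved of CPU between t=600 s and t=900 s (e.g. heavy swapping).
busy = [t for t in range(60, 1801, 60) if not (600 < t < 900)]

# Assumed scenario 2: a daemon that crashed at start and never reports.
crashed: List[int] = []

for name, beats in (("busy", busy), ("crashed", crashed)):
    for interval in (600, 60):  # 600 s is the 10-minute default interval
        alarm = first_alarm(run_watchdog(beats, interval, 1800))
        print(f"{name:8s} interval={interval:3d}s first alarm at: {alarm}")
```

Under these assumptions, the 10-minute interval never raises an alarm
for the busy-but-healthy daemon, while the 1-minute interval raises a
false alarm at 660 s. The price is detection latency: the crashed daemon
is noticed only at 600 s with the 10-minute interval, compared with 60 s
with the 1-minute interval.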
>On the node there are no related messages in the syslog file and
>on the server in "scored.messages" I didn't find anything interesting,
>too. But just tell me what additional info is needed.
The SCore-D syslog is sent over the network. If the network has a
problem, no syslog output appears.
The current SCore high availability features can only recover from
host (PC) failure. There is no network error recovery mechanism.
----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.
_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users