[SCore-users] sc_watch didn't recognize crash of scored.exe on one node
Atsushi HORI
hori at swimmy-soft.com
Wed Nov 13 14:03:57 JST 2002
Hi.
>But sc_watch didn't notice that error - it simply does nothing.
>
>So the question is: How is sc_watch monitoring all the
>score daemons, that it didn't notice the immediate crash of one of them ?
Well, sc_watch only samples scored activity periodically to check
whether it is still working. The default sampling interval is 10
minutes, so a crash may go unnoticed until the next sample.
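
To illustrate what periodic sampling means for detection delay, here is a rough sketch. It is not sc_watch code: the function name and the simple "ticks at fixed multiples of the interval" model are assumptions made only for this example.

```python
import math

def detection_delay(crash_time, interval):
    """Seconds between a crash and the next sampling tick.

    Hypothetical model: the watcher samples at t = interval, 2*interval, ...
    and can notice a dead daemon only at one of those ticks.
    """
    next_tick = math.ceil(crash_time / interval) * interval
    if next_tick == crash_time:
        # A crash exactly on a tick is assumed to be seen at the following tick.
        next_tick += interval
    return next_tick - crash_time

INTERVAL = 10 * 60  # a 10-minute sampling interval, in seconds

print(detection_delay(5, INTERVAL))    # daemon dies 5 s after start
print(detection_delay(599, INTERVAL))  # daemon dies just before the first tick
```

In this model a daemon that dies right after startup is not noticed for almost the whole interval, which would look like the watcher "doing nothing" for a while.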
>The node is now in the defected list, but if I get some hints
>how to determine why the scored.exe is crashing immediately on that
>node or somebody wants to look at it, I'll try to debug that further.
The difficulty with a watchdog timer, in general, is that a higher
sampling frequency does not necessarily give higher accuracy in the
time domain. Think about the case where the Linux kernel is very busy
(swapping memory, for example) and does not have enough time to
schedule the SCore processes. The sc_watch process may not get any
response by the time of the next sampling, and this results in
rebooting the SCore processes, although the OS kernel is simply
heavily loaded and nothing has actually failed.
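
The trade-off can be sketched with a toy model. This is only an illustration with made-up numbers; the function name, the delay values, and the timeouts are assumptions and have nothing to do with the actual sc_watch implementation.

```python
def false_alarms(response_delays, timeout):
    """Count probes that a live process answered too late.

    response_delays: seconds each probe took to be answered (hypothetical data).
    timeout: how long the watchdog waits before declaring the process dead.
    """
    return sum(1 for d in response_delays if d > timeout)

# Made-up response delays for a node that is alive but swapping heavily at times.
delays = [0.1, 0.2, 45.0, 0.1, 120.0, 0.3]

for timeout in (10, 60, 600):
    print(timeout, false_alarms(delays, timeout))
```

With these numbers the short timeouts report failures on a node that is merely slow, while the long one does not. A shorter interval finds real crashes sooner but also restarts healthy, overloaded processes more often.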
>On the node there are no related messages in the syslog file and
>on the server in "scored.messages" I didn't find anything interesting,
>too. But just tell me what additional info is needed.
The SCore-D syslog is output via the network. If the network has a
problem, no syslog is output.
The current SCore high-availability features can only recover from
host (PC) failures. There is no recovery mechanism for network errors.
----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.