[SCore-users] sc_watch didn't recognize crash of scored.exe on one node

Atsushi HORI hori at swimmy-soft.com
Wed Nov 13 14:03:57 JST 2002


Hi.

>But sc_watch didn't notice that error - it simply does nothing.
>
>So the question is: How is sc_watch monitoring all the
>score daemons, that it didn't notice the immediate crash of one of them ? 

Well, sc_watch periodically samples scored activity to check whether 
it is still working. The default interval is 10 minutes, so a crash 
can go unnoticed until the next sample is taken.
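
To illustrate what a 10-minute sampling interval means for detection 
time, here is a minimal Python sketch. It is not the sc_watch 
implementation; the function name and the assumption that samples are 
taken at exact multiples of the interval are hypothetical.

```python
def detection_delay(crash_time, interval):
    """Seconds between a crash and the first sample taken after it.

    Assumption (hypothetical): samples happen at t = interval,
    2 * interval, ... and a crash is only noticed at a sample.
    """
    samples_before = crash_time // interval
    next_sample = (samples_before + 1) * interval
    return next_sample - crash_time


INTERVAL = 10 * 60  # the 10-minute default, in seconds

# A crash right after a sample waits almost the full interval.
print(detection_delay(1, INTERVAL))
# A crash right before a sample is noticed almost at once.
print(detection_delay(599, INTERVAL))
```

Under these assumptions, a crash that happens just after a sample is 
not seen for nearly 10 minutes.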

>The node is now in the defected list, but if I get some hints
>how to determine why the scored.exe is crashing immediately on that
>node or somebody wants to look at it, I'll try to debug that further.

The difficulty with a watchdog timer, in general, is that a higher 
sampling frequency does not necessarily give higher accuracy in the 
time domain. Think about the case where the Linux kernel is very busy 
(swapping memory, for example) and does not have enough time to 
schedule SCore processes. The sc_watch process may not get any 
response by the time of the next sampling, and this results in 
rebooting the SCore processes, although the OS kernel is actually 
alive and simply very heavily loaded.
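
The trade-off can be sketched in a few lines of Python. This is only 
an illustration under stated assumptions, not how sc_watch is written: 
the function name, the rule "no reply before the next sample means 
restart", and the reply delays are all hypothetical.

```python
def watchdog_verdicts(response_delays, interval):
    """Return one verdict per sample.

    Assumption (hypothetical): a node is restarted whenever its
    reply has not arrived by the time of the next sample.
    """
    return ["ok" if delay < interval else "restart"
            for delay in response_delays]


# Hypothetical reply delays (seconds) of a node that is alive but
# swapping heavily during the third sample.
delays = [0.1, 0.2, 45.0, 0.3]

# Long interval: the slow reply still arrives in time.
print(watchdog_verdicts(delays, interval=600))
# Short interval: the busy but healthy node gets restarted.
print(watchdog_verdicts(delays, interval=30))
```

With the shorter interval the watchdog reacts faster, but it also 
restarts a node whose only problem was a temporary heavy load.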

>On the node there are no related messages in the syslog file and
>on the server in "scored.messages" I didn't find anything interesting, 
>too. But just tell me what additional info is needed.

The SCore-D syslog is sent over the network. If the network has a 
problem, no syslog messages are output.

The current SCore high availability features can only recover from 
host (PC) failures. There is no recovery mechanism for network errors.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



