[SCore-users] sc_watch didn't recognize crash of scored.exe on one node

Hermann Lauer hermann.lauer at iwr.uni-heidelberg.de
Tue Nov 12 02:00:33 JST 2002


Dear Score Users,

score-5.0.1 stopped with the following error in sc_watch:

<194> ULT: Exception Signal (11)

SCOUT: Session done.
[10/Nov/2002,00:37:45] System failure detected.
[10/Nov/2002,00:37:45] System has been shutdown.
[10/Nov/2002,00:37:45] Local Action: /opt/score/etc/replace.sh
[10/Nov/2002,00:37:57] Rebooting System [3 times, second retry]: /opt/score/deploy/bin.i386-debian-linux2_4/scored.exe 

As you can see, sc_watch restarted scored.

But immediately afterwards the same error message appeared again in sc_watch:

<194> ULT: Exception Signal (11)

I checked the node with the score number <194>, and the
only score-related processes there are "scoutd.exe" and "scremote.exe".
So "scored.exe" indeed seems to have received "signal 11, segmentation fault"
 - if that's the right interpretation of "ULT: Exception Signal (11)".
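
As a quick cross-check of the signal number, here is a minimal Python sketch
that maps a number to its symbolic name on the machine where it runs. The
helper name signal_name is only for illustration, and this says nothing about
how ULT itself encodes the number in its message; it only confirms that
signal 11 is SIGSEGV on Linux.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # On Linux, signal 11 is SIGSEGV (segmentation fault).
    print(signal_name(11))
```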

But sc_watch didn't notice this second crash - it simply did nothing.
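
The manual check on the node (looking for the score daemons among the running
processes) could be scripted roughly as follows. This is only a sketch of an
external check, not a description of how sc_watch monitors anything. It
assumes a Linux /proc filesystem and that each daemon shows up under its
executable name as the first entry of /proc/<pid>/cmdline; the function names
are hypothetical.

```python
import os


def running_commands(proc_root="/proc"):
    """Collect the basename of argv[0] for every process under proc_root."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are process IDs
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited meanwhile, or entry is not readable
        # cmdline is NUL-separated; the first field is argv[0].
        argv0 = raw.split(b"\0", 1)[0].decode(errors="replace")
        if argv0:  # kernel threads have an empty cmdline
            names.add(os.path.basename(argv0))
    return names


def missing_daemons(expected, proc_root="/proc"):
    """Return the expected daemon names that are not currently running."""
    found = running_commands(proc_root)
    return sorted(name for name in expected if name not in found)


if __name__ == "__main__":
    # Daemon names taken from the report above.
    print(missing_daemons(["scored.exe", "scoutd.exe", "scremote.exe"]))
```

Run on node <194> in its current state, this would be expected to list
"scored.exe" as missing, provided the naming assumption above holds.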

So the question is: how does sc_watch monitor the score daemons,
given that it didn't notice the immediate crash of one of them?

The node is now in the defected list, but if I get some hints on
how to determine why scored.exe is crashing immediately on that
node, or if somebody wants to look at it, I'll try to debug that further.

On the node there are no related messages in the syslog file, and
on the server I didn't find anything interesting in "scored.messages"
either. Just tell me what additional info is needed.

Many thanks,

  greetings
    Hermann
-- 
Netzwerkadministration/Zentrale Dienste, Interdiziplinaeres 
Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
IWR; INF 368; 69120 Heidelberg; Tel: (06221)54-8236 Fax: -5224
Email: Hermann.Lauer at iwr.uni-heidelberg.de


