[SCore-users-jp] [SCore-users] Any issues with downgrading kernel?

Jure Jerman jure.jerman @ rzs-hm.si
Thu Feb 17 00:15:15 JST 2005


Hi,

I have one major question and a few minor ones for the list:

1. We upgraded our 14-node Xeon cluster as follows:
SCore 5.4 -> 5.8.2
RedHat 7.2 -> FedoraCore 1
Linux 2.4.18 -> Linux 2.4.21 (vanilla with SCore patch).

What we are observing now is that some of the machines freeze at
random (the network stack is still alive, so we can still ping the
machines, but nothing else works). A freeze always happens while a SCore
job is running on that node. The kernel usually stops in kswapd. The
SCore jobs sometimes end up in state T, but sometimes they remain in S.
I would say the problem is not hardware-related, because almost all
nodes freeze (although the first one does so much more often).

The next thing I want to rule out is the kernel, and here comes the
question: Is it possible to use older kernel patches (= older kernels)
with a recent SCore distribution? With 5.4 (= 2.4.18 kernel) we did not
have a single freeze.

And now one minor question and one minor suggestion:

Question: Is it normal that msgbserv has to be restarted before
scout can be run?

tuba0:/home/jure# scout -g tuba
SCOUT: Failed to lock MessageBoard.
tuba0:/home/jure# /etc/rc.d/init.d/msgbserv restart
tuba0:/home/jure# scout -g tuba
SCOUT: Spawning done.
SCOUT: session started.


Suggestion:
Sceptic doesn't cope well with our frozen machines because it never
returns from a check. I wrote my own version that uses the Nagios
ssh check, which exits after a timeout period. It would be
nice to have something like a timeout implemented in sceptic.
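
To illustrate the kind of timeout I mean, here is a minimal sketch in
Python. It is neither sceptic's code nor the Nagios plugin; the function
name node_responds, the port, and the timeout value are my own
assumptions. The idea is that a frozen node may still answer ping, and
perhaps even accept a TCP connection, but sshd will never send its
banner, so the check waits for the banner and gives up after a fixed
time.

```python
import socket


def node_responds(host, port=22, timeout=5.0):
    """Return True if `host` sends an SSH banner within the timeout.

    Hypothetical helper for illustration only. The timeout applies to the
    connect and to the read separately, so the worst case is roughly
    twice `timeout`.
    """
    try:
        # create_connection() sets the timeout on the socket, so the
        # recv() below is bounded as well.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(64)
    except OSError:
        # Refused connection, unreachable host, or timeout: all count
        # as "node not usable".
        return False
    # An SSH server identifies itself with a line starting with "SSH-".
    return banner.startswith(b"SSH-")
```

Calling node_responds("tuba0") would then return False after a few
seconds on a frozen node instead of hanging forever.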

Sorry for this long post,

best greetings, Jure Jerman
_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


